Thread: [HACKERS] Group clear xid can leak semaphore count

[HACKERS] Group clear xid can leak semaphore count

From
Amit Kapila
Date:
During the review of Group update Clog patch [1], Dilip noticed an
issue with the patch where it can leak the semaphore count in one of
the corner case.  I have checked and found that similar issue exists
for Group clear xid (ProcArrayGroupClearXid) as well.  I think the
reason why this problem is not noticed by anyone till now is that it
can happen only in a rare scenario when the backend waiting for xid
clear is woken up due to some unrelated signal.  This problem didn't
exist in the original commit
(0e141c0fbb211bdd23783afa731e3eef95c9ad7a) of the patch, but later
while fixing some issues in the committed patch, it got introduced in
commit 4aec49899e5782247e134f94ce1c6ee926f88e1c. Patch to fix the
issue is attached.


[1] -
https://www.postgresql.org/message-id/CAA4eK1J%2B67edo_Wnrfx8oJ%2BrWM_BAr%2Bv6JqvQjKPdLOxR%3D0d5g%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Group clear xid can leak semaphore count

From
Robert Haas
Date:
On Sat, Dec 31, 2016 at 12:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> During the review of Group update Clog patch [1], Dilip noticed an
> issue with the patch where it can leak the semaphore count in one of
> the corner case.  I have checked and found that similar issue exists
> for Group clear xid (ProcArrayGroupClearXid) as well.  I think the
> reason why this problem is not noticed by anyone till now is that it
> can happen only in a rare scenario when the backend waiting for xid
> clear is woken up due to some unrelated signal.  This problem didn't
> exist in the original commit
> (0e141c0fbb211bdd23783afa731e3eef95c9ad7a) of the patch, but later
> while fixing some issues in the committed patch, it got introduced in
> commit 4aec49899e5782247e134f94ce1c6ee926f88e1c. Patch to fix the
> issue is attached.

Yeah, that looks like a bug.  Thanks for the detailed analysis;
committed and back-patched to 9.6.

I suppose in the worst case it's possible that we'd leak a semaphore
count and then every future time we enter a PGSemaphoreLock using that
PGPROC we have to eat up the leaked count (or counts) and then put it
(or them) back after we really wait.  That would suck.  But I wasn't
able to observe any leaks in a high-concurrency pgbench test on hydra,
so it's either very unlikely or requires some additional condition to
trigger the problem.

I think we have run into this kind of issue before.  I wonder if
there's any way to insert some kind of a guard - e.g. detect at
backend startup time that the semaphore has a non-zero value and fix
it, issuing a warning along the way...  maybe something like:

while (sem_trywait(sem) == 0)   ++bogus;
if (bogus > 0)   elog(WARNING, "uh oh");

I'm not sure if that would be prone to false positives though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Group clear xid can leak semaphore count

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think we have run into this kind of issue before.  I wonder if
> there's any way to insert some kind of a guard - e.g. detect at
> backend startup time that the semaphore has a non-zero value and fix
> it, issuing a warning along the way...  maybe something like:

See the PGSemaphoreReset near the end of InitProcess.
        regards, tom lane



Re: [HACKERS] Group clear xid can leak semaphore count

From
Robert Haas
Date:
On Thu, Jan 5, 2017 at 4:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think we have run into this kind of issue before.  I wonder if
>> there's any way to insert some kind of a guard - e.g. detect at
>> backend startup time that the semaphore has a non-zero value and fix
>> it, issuing a warning along the way...  maybe something like:
>
> See the PGSemaphoreReset near the end of InitProcess.

Oh ho.  So the worst consequence of this is a backend-lifetime leak,
not a cluster-lifetime leak.  That's good, at least.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Group clear xid can leak semaphore count

From
Amit Kapila
Date:
On Fri, Jan 6, 2017 at 1:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Dec 31, 2016 at 12:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> During the review of Group update Clog patch [1], Dilip noticed an
>> issue with the patch where it can leak the semaphore count in one of
>> the corner case.  I have checked and found that similar issue exists
>> for Group clear xid (ProcArrayGroupClearXid) as well.  I think the
>> reason why this problem is not noticed by anyone till now is that it
>> can happen only in a rare scenario when the backend waiting for xid
>> clear is woken up due to some unrelated signal.  This problem didn't
>> exist in the original commit
>> (0e141c0fbb211bdd23783afa731e3eef95c9ad7a) of the patch, but later
>> while fixing some issues in the committed patch, it got introduced in
>> commit 4aec49899e5782247e134f94ce1c6ee926f88e1c. Patch to fix the
>> issue is attached.
>
> Yeah, that looks like a bug.  Thanks for the detailed analysis;
> committed and back-patched to 9.6.
>

Thanks for committing the fix.  I have updated the CF app entry for this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com