Thread: [HACKERS] Group clear xid can leak semaphore count
During the review of Group update Clog patch [1], Dilip noticed an issue with the patch where it can leak the semaphore count in one of the corner case. I have checked and found that similar issue exists for Group clear xid (ProcArrayGroupClearXid) as well. I think the reason why this problem is not noticed by anyone till now is that it can happen only in a rare scenario when the backend waiting for xid clear is woken up due to some unrelated signal. This problem didn't exist in the original commit (0e141c0fbb211bdd23783afa731e3eef95c9ad7a) of the patch, but later while fixing some issues in the committed patch, it got introduced in commit 4aec49899e5782247e134f94ce1c6ee926f88e1c. Patch to fix the issue is attached. [1] - https://www.postgresql.org/message-id/CAA4eK1J%2B67edo_Wnrfx8oJ%2BrWM_BAr%2Bv6JqvQjKPdLOxR%3D0d5g%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Sat, Dec 31, 2016 at 12:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > During the review of Group update Clog patch [1], Dilip noticed an > issue with the patch where it can leak the semaphore count in one of > the corner case. I have checked and found that similar issue exists > for Group clear xid (ProcArrayGroupClearXid) as well. I think the > reason why this problem is not noticed by anyone till now is that it > can happen only in a rare scenario when the backend waiting for xid > clear is woken up due to some unrelated signal. This problem didn't > exist in the original commit > (0e141c0fbb211bdd23783afa731e3eef95c9ad7a) of the patch, but later > while fixing some issues in the committed patch, it got introduced in > commit 4aec49899e5782247e134f94ce1c6ee926f88e1c. Patch to fix the > issue is attached. Yeah, that looks like a bug. Thanks for the detailed analysis; committed and back-patched to 9.6. I suppose in the worst case it's possible that we'd leak a semaphore count and then every future time we enter a PGSemaphoreLock using that PGPROC we have to eat up the leaked count (or counts) and then put it (or them) back after we really wait. That would suck. But I wasn't able to observe any leaks in a high-concurrency pgbench test on hydra, so it's either very unlikely or requires some additional condition to trigger the problem. I think we have run into this kind of issue before. I wonder if there's any way to insert some kind of a guard - e.g. detect at backend startup time that the semaphore has a non-zero value and fix it, issuing a warning along the way... maybe something like: while (sem_trywait(sem) == 0) ++bogus; if (bogus > 0) elog(WARNING, "uh oh"); I'm not sure if that would be prone to false positives though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I think we have run into this kind of issue before. I wonder if > there's any way to insert some kind of a guard - e.g. detect at > backend startup time that the semaphore has a non-zero value and fix > it, issuing a warning along the way... maybe something like: See the PGSemaphoreReset near the end of InitProcess. regards, tom lane
On Thu, Jan 5, 2017 at 4:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I think we have run into this kind of issue before. I wonder if >> there's any way to insert some kind of a guard - e.g. detect at >> backend startup time that the semaphore has a non-zero value and fix >> it, issuing a warning along the way... maybe something like: > > See the PGSemaphoreReset near the end of InitProcess. Oh ho. So the worst consequence of this is a backend-lifetime leak, not a cluster-lifetime leak. That's good, at least. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 6, 2017 at 1:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Dec 31, 2016 at 12:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> During the review of Group update Clog patch [1], Dilip noticed an >> issue with the patch where it can leak the semaphore count in one of >> the corner case. I have checked and found that similar issue exists >> for Group clear xid (ProcArrayGroupClearXid) as well. I think the >> reason why this problem is not noticed by anyone till now is that it >> can happen only in a rare scenario when the backend waiting for xid >> clear is woken up due to some unrelated signal. This problem didn't >> exist in the original commit >> (0e141c0fbb211bdd23783afa731e3eef95c9ad7a) of the patch, but later >> while fixing some issues in the committed patch, it got introduced in >> commit 4aec49899e5782247e134f94ce1c6ee926f88e1c. Patch to fix the >> issue is attached. > > Yeah, that looks like a bug. Thanks for the detailed analysis; > committed and back-patched to 9.6. > Thanks for committing the fix. I have updated the CF app entry for this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com