Thread: Re: SV: Problem with pg_notify / listen

Re: SV: Problem with pg_notify / listen

From: Tom Lane

[ redirecting to pgsql-bugs ]

Gustavsson Mikael <mikael.gustavsson@smhi.se> writes:
> Clarification, we upgraded from PG 11.9 to PG 11.10.

Hmm ... so the only commit that touched async.c in that interval was
d4031d784.  That changed asyncQueueAdvanceTail() so that, rather
than updating QUEUE_TAIL atomically, it would first compute the new
tail pointer, then release AsyncQueueLock for a while, then finally
update QUEUE_TAIL with the previously-computed value.
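To make the ordering change concrete, here is a simplified paraphrase of
asyncQueueAdvanceTail (not the literal code; the segment-boundary check
before truncation is elided):

/* Before d4031d784: tail update is atomic with the computation */
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
min = /* smallest queue position among listening backends */ ;
QUEUE_TAIL = min;
LWLockRelease(AsyncQueueLock);
SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));

/* After d4031d784: the lock is dropped between computation and update */
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
min = /* smallest queue position among listening backends */ ;
LWLockRelease(AsyncQueueLock);
SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));  /* others run here */
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
QUEUE_TAIL = min;               /* possibly stale by now */
LWLockRelease(AsyncQueueLock);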

I think that's all right in HEAD and v13, because of their limited
usage of the QUEUE_TAIL pointer.  But I now realize that it's *not*
all right in older branches, because in those branches
asyncQueueAdvanceTail is only called in one of two ways:

1. If ProcessCompletedNotifies realizes that nobody is there to
read the notification it just sent, it'll call asyncQueueAdvanceTail.
In a system actively using notifications, that likely won't ever
happen.

2. Otherwise, asyncQueueAdvanceTail is called from asyncQueueUnregister
or asyncQueueReadAllNotifications, but only if the backend believes
itself to be the hindmost backend, ie it saw that QUEUE_TAIL was
equal to its own queue pointer.
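
In pre-v13 code terms, both call sites gate the call roughly like this
(a paraphrase, not the literal code):

/* in asyncQueueReadAllNotifications (similarly asyncQueueUnregister) */
pos = QUEUE_BACKEND_POS(MyBackendId);
advanceTail = QUEUE_POS_EQUAL(pos, QUEUE_TAIL);  /* am I the hindmost? */
/* ... process queue entries, advancing our own pointer ... */
if (advanceTail)
    asyncQueueAdvanceTail();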

Thus, we now have the possibility for the following race condition:

1. Backend A performs asyncQueueReadAllNotifications and thence
asyncQueueAdvanceTail (hence, A was the hindmost to start with).

2. A computes a new tail pointer, which no longer matches its own
queue pointer, but does match backend B's (ie, B is currently
hindmost).  A now releases AsyncQueueLock, allowing a small
interval in which ...

3. B performs asyncQueueReadAllNotifications.  It sees that its
queue pointer does not match QUEUE_TAIL, so it doesn't need to
call asyncQueueAdvanceTail.  But it does advance its own pointer.

4. A re-acquires AsyncQueueLock and updates QUEUE_TAIL with what
is now a stale value that matches no backend's queue pointer.

5. After this, no execution of asyncQueueUnregister or
asyncQueueReadAllNotifications will call asyncQueueAdvanceTail,
so unless we get to a case where a notify is sent with no
listeners to hear it, the queue will never be emptied.  Ooops.
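
The same interleaving at a glance (schematic, not code from the tree):

/*  Backend A (hindmost)              Backend B
 *  lock; min = B's pointer; unlock
 *                                    reads notifications; own pointer
 *                                    != QUEUE_TAIL, so it skips
 *                                    asyncQueueAdvanceTail, but advances
 *                                    its own pointer past min
 *  SimpleLruTruncate(...)
 *  lock; QUEUE_TAIL = min; unlock    <-- min matches no backend now
 */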


So the question is what to do about this.  We could dodge the
problem by back-patching 51004c717, but that's a considerably
larger change than I want to stick into the older branches.
More practical possibilities seem to include:

* Change the code back to being atomic, ie go ahead and update
QUEUE_TAIL immediately, and truncate the SLRU only afterward.
Why is it necessary, or even safe, to perform the SLRU truncation
before changing QUEUE_TAIL?  (IOW, I wonder if there isn't a bug
there in HEAD too.)

* Revert d4031d784's effects in async.c in the pre-v13 branches,
on the grounds that the cure was worse than the disease.

* Change asyncQueueAdvanceTail so that it does a whole fresh
computation of the tail pointer after it re-acquires the lock
(sketched after this list).  I think this is OK; it might mean
that we miss an opportunity to truncate the SLRU, but there'll
be another one.

* Find another pathway in which to call asyncQueueAdvanceTail
occasionally.  While the obvious place would be "if we're
about to complain about the queue being full", that's probably
not good enough, because it'd mean that the queue grows quite
large before we recover from the race condition --- and a very
stale tail pointer has bad implications for the cost of
Exec_ListenPreCommit.  I'm not sure about good places otherwise.
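
The third option would be confined to asyncQueueAdvanceTail, roughly
(an untested sketch against the pre-v13 code):

/* after SimpleLruTruncate: re-derive the tail rather than reusing "min" */
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
min = QUEUE_HEAD;
for (i = 1; i <= MaxBackends; i++)
{
    if (QUEUE_BACKEND_PID(i) != InvalidPid)
        min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
}
QUEUE_TAIL = min;
LWLockRelease(AsyncQueueLock);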

Thoughts?

            regards, tom lane



Re: SV: Problem with pg_notify / listen

From: Tom Lane

I wrote:
> * Change the code back to being atomic, ie go ahead and update
> QUEUE_TAIL immediately, and truncate the SLRU only afterward.
> Why is it necessary, or even safe, to perform the SLRU truncation
> before changing QUEUE_TAIL?  (IOW, I wonder if there isn't a bug
> there in HEAD too.)

After thinking more about that, I'm pretty sure there is a bug there:
a newly-arriving backend could attempt to scan the queue starting at
QUEUE_TAIL, concurrently with SimpleLruTruncate removing the page(s)
it wants to scan.  In typical cases no failure will occur because
Exec_ListenPreCommit could advance its queue pointer to a safe place
by observing the pointers of other backends.  However, if the new
listener were the only one in its database, we'd have trouble.
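
Schematically, the dangerous interleaving looks like this (hypothetical
timing; the new listener is the only one in its database):

/* asyncQueueAdvanceTail                new listener, Exec_ListenPreCommit
 *
 *                                      reads QUEUE_TAIL as its starting
 *                                      queue position
 * SimpleLruTruncate() unlinks the
 * segment containing that position
 *                                      attempts to read that page of the
 *                                      SLRU --> error, the file is gone
 */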

What seems like the most expedient way to fix this is to separate
QUEUE_TAIL into two variables, one that is the "logical" tail
position from which new backends may start to scan the queue,
and one that is the "physical" tail position, ie, the oldest
page we have not yet truncated.  The physical tail need only be
tracked to page accuracy, so it can be a plain int not a QUEUE_POS.
asyncQueueAdvanceTail should update QUEUE_TAIL immediately, which
restores correct behavior pre-v13 and also dodges the race condition
described above.  But we don't update the physical tail pointer
until we've completed SimpleLruTruncate, keeping things honest
with respect to asyncQueueIsFull.

As a minor side benefit, asyncQueueAdvanceTail doesn't have to take
the NotifyQueueLock twice unless it actually does an SLRU truncation.

In short, I propose the attached patch (against HEAD, but the
same logic changes should work in the back branches).

            regards, tom lane

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8dbcace3f9..786e3ee1ca 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -255,7 +255,7 @@ typedef struct QueueBackendStatus
  * When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
  * entries of other backends and also change the head pointer. When holding
  * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointer.
+ * can change the tail pointers.
  *
  * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
  * In order to avoid deadlocks, whenever we need multiple locks, we first get
@@ -276,6 +276,8 @@ typedef struct AsyncQueueControl
     QueuePosition head;            /* head points to the next free location */
     QueuePosition tail;            /* tail must be <= the queue position of every
                                  * listening backend */
+    int            tailPage;        /* oldest unrecycled page; must be <=
+                                 * tail.page */
     BackendId    firstListener;    /* id of first listener, or InvalidBackendId */
     TimestampTz lastQueueFillWarn;    /* time of last queue-full msg */
     QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
@@ -286,6 +288,7 @@ static AsyncQueueControl *asyncQueueControl;
 
 #define QUEUE_HEAD                    (asyncQueueControl->head)
 #define QUEUE_TAIL                    (asyncQueueControl->tail)
+#define QUEUE_TAIL_PAGE                (asyncQueueControl->tailPage)
 #define QUEUE_FIRST_LISTENER        (asyncQueueControl->firstListener)
 #define QUEUE_BACKEND_PID(i)        (asyncQueueControl->backend[i].pid)
 #define QUEUE_BACKEND_DBOID(i)        (asyncQueueControl->backend[i].dboid)
@@ -537,6 +540,7 @@ AsyncShmemInit(void)
         /* First time through, so initialize it */
         SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
         SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+        QUEUE_TAIL_PAGE = 0;
         QUEUE_FIRST_LISTENER = InvalidBackendId;
         asyncQueueControl->lastQueueFillWarn = 0;
         /* zero'th entry won't be used, but let's initialize it anyway */
@@ -1358,7 +1362,7 @@ asyncQueueIsFull(void)
     nexthead = QUEUE_POS_PAGE(QUEUE_HEAD) + 1;
     if (nexthead > QUEUE_MAX_PAGE)
         nexthead = 0;            /* wrap around */
-    boundary = QUEUE_POS_PAGE(QUEUE_TAIL);
+    boundary = QUEUE_TAIL_PAGE;
     boundary -= boundary % SLRU_PAGES_PER_SEGMENT;
     return asyncQueuePagePrecedes(nexthead, boundary);
 }
@@ -1577,7 +1581,7 @@ static double
 asyncQueueUsage(void)
 {
     int            headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
-    int            tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
+    int            tailPage = QUEUE_TAIL_PAGE;
     int            occupied;
 
     occupied = headPage - tailPage;
@@ -2178,7 +2182,23 @@ asyncQueueAdvanceTail(void)
     /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
     LWLockAcquire(NotifyQueueTailLock, LW_EXCLUSIVE);
 
-    /* Compute the new tail. */
+    /*
+     * Compute the new tail.  Pre-v13, it's essential that QUEUE_TAIL be exact
+     * (ie, exactly match at least one backend's queue position), so it must
+     * be updated atomically with the actual computation.  Since v13, we could
+     * get away with not doing it like that, but it seems prudent to keep it
+     * so.
+     *
+     * Also, because incoming backends will scan forward from QUEUE_TAIL, that
+     * must be advanced before we can truncate any data.  Thus, QUEUE_TAIL is
+     * the logical tail, while QUEUE_TAIL_PAGE is the oldest not-yet-truncated
+     * page.  When QUEUE_TAIL_PAGE != QUEUE_POS_PAGE(QUEUE_TAIL), there are
+     * pages we can truncate but haven't yet finished doing so.
+     *
+     * For concurrency's sake, we don't want to hold NotifyQueueLock while
+     * performing SimpleLruTruncate.  This is OK because no backend will try
+     * to access the pages we are in the midst of truncating.
+     */
     LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
     min = QUEUE_HEAD;
     for (BackendId i = QUEUE_FIRST_LISTENER; i > 0; i = QUEUE_NEXT_LISTENER(i))
@@ -2186,7 +2206,8 @@ asyncQueueAdvanceTail(void)
         Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
         min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
     }
-    oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
+    QUEUE_TAIL = min;
+    oldtailpage = QUEUE_TAIL_PAGE;
     LWLockRelease(NotifyQueueLock);
 
     /*
@@ -2205,16 +2226,16 @@ asyncQueueAdvanceTail(void)
          * release the lock again.
          */
         SimpleLruTruncate(NotifyCtl, newtailpage);
-    }
 
-    /*
-     * Advertise the new tail.  This changes asyncQueueIsFull()'s verdict for
-     * the segment immediately prior to the new tail, allowing fresh data into
-     * that segment.
-     */
-    LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
-    QUEUE_TAIL = min;
-    LWLockRelease(NotifyQueueLock);
+        /*
+         * Update QUEUE_TAIL_PAGE.  This changes asyncQueueIsFull()'s verdict
+         * for the segment immediately prior to the new tail, allowing fresh
+         * data into that segment.
+         */
+        LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+        QUEUE_TAIL_PAGE = newtailpage;
+        LWLockRelease(NotifyQueueLock);
+    }
 
     LWLockRelease(NotifyQueueTailLock);
 }

Re: SV: Problem with pg_notify / listen

From: Tom Lane

Noah Misch <noah@leadboat.com> writes:
> ... agreed.  In general, recycling SLRU space entails three steps that shall
> not overlap:

> 1. Stop reading data in the space, regulated by some "logical tail".
> 2. Unlink files wholly within the bounds of the space.
> 3. Start writing data into the space, regulated by some "physical tail" (most
>    often called a "stop limit").

Check.
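
In async.c terms, those steps map onto the two tail variables like this
(a schematic of the patched asyncQueueAdvanceTail, not the literal code):

/* step 1: advance the logical tail; new readers start at or above it */
QUEUE_TAIL = min;                   /* under NotifyQueueLock */

/* step 2: unlink segments wholly below the logical tail */
SimpleLruTruncate(NotifyCtl, newtailpage);

/* step 3: advance the physical tail; asyncQueueIsFull() may now admit
 * writes into the recycled space */
QUEUE_TAIL_PAGE = newtailpage;      /* under NotifyQueueLock */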

> Commit d4031d7 fixed overlap of (2) and (3).  For pg_notify, though, it
> introduced overlap of (1) and (2).  I've now checked the other SLRUs for
> similar problems, but I found nothing urgent:

Good, I was wondering if we had any similar issues elsewhere.

> I think we don't yet have the right name here, seeing QUEUE_TAIL_PAGE !=
> QUEUE_POS_PAGE(QUEUE_TAIL) sounds paradoxical, yet happens regularly.  How
> about naming it QUEUE_STOP_PAGE?

Hmm, it's not very clear what "stop" means here.  What do you think of
QUEUE_OLDEST_PAGE?

            regards, tom lane



Re: SV: Problem with pg_notify / listen

From: Tom Lane

Noah Misch <noah@leadboat.com> writes:
> On Fri, Nov 27, 2020 at 11:03:40PM -0500, Tom Lane wrote:
>> Hmm, it's not very clear what "stop" means here.  What do you think of
>> QUEUE_OLDEST_PAGE?

> QUEUE_OLDEST_PAGE is fine.  I like it a little less than QUEUE_STOP_PAGE,
> because oldestClogXid is a logical tail, and QUEUE_OLDEST_PAGE would be a
> physical tail.

I went with QUEUE_STOP_PAGE.

In further testing, I noted that the patch as I had it re-introduced the
symptom that 8b7ae5a82 fixed: running "make installcheck" twice
in a row causes the async-notify isolation test to fail.  That's because
I'd changed asyncQueueUsage() to measure the distance back to the physical
tail, which isn't stable because we only truncate after crossing an
SLRU segment boundary.  So I reverted it to measuring the distance to
QUEUE_TAIL.  You could argue either way about which definition is more
useful to end users, perhaps; but in practice the difference should
usually be too small to matter, for everyone except regression tests
that are looking for distance exactly zero.
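
To make the instability concrete (a hypothetical state; SLRU segments
are SLRU_PAGES_PER_SEGMENT = 32 pages):

/* All notifications consumed: every listener's pointer == QUEUE_HEAD.
 * Measuring to QUEUE_TAIL:      usage is exactly 0, deterministically.
 * Measuring to QUEUE_STOP_PAGE: the stop page can lag QUEUE_TAIL by up
 * to a full segment until the tail crosses a segment boundary and
 * SimpleLruTruncate runs, so usage may read as nonzero.
 */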

Anyhow, pushed with that fix.

Mikael, it appears that you have three options: revert to 11.9 until 11.11
is out, restart your server every time it approaches notify-queue-full, or
apply this patch:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=40f2fbe71ad615a2bcaaf5b840ccb9329e4378aa

            regards, tom lane



SV: SV: Problem with pg_notify / listen

From: Gustavsson Mikael


Great!

I will talk to the rest of the team on which path to take.

And thank you for finding and fixing this so quickly!

kr
Mikael Gustavsson
SMHI


Re: SV: Problem with pg_notify / listen

From: Piotr Włodarczyk

Hi,

I have a question: is this problem present in Postgres 12.5? We have a very similar problem after upgrading from 12.4 to 12.5: the queue size grew to 15% in 20 hours and is still rising.


--
Regards,
Piotr Włodarczyk

Re: SV: Problem with pg_notify / listen

From: Tom Lane

Piotr Włodarczyk <piotrwlodarczyk89@gmail.com> writes:
> I have a question: is this problem present in Postgres 12.5?

Yes.

            regards, tom lane



Re: SV: Problem with pg_notify / listen

From: Piotr Włodarczyk

OK. And will it be fixed in 12.6?


Re: SV: Problem with pg_notify / listen

From: Tom Lane

Piotr Włodarczyk <piotrwlodarczyk89@gmail.com> writes:
> OK. And will it be fixed in 12.6?

Yes.

            regards, tom lane