Thread: Re: SV: Problem with pg_notify / listen
[ redirecting to pgsql-bugs ]

Gustavsson Mikael <mikael.gustavsson@smhi.se> writes:
> Clarification, we upgraded from PG 11.9 to PG 11.10.

Hmm ... so the only commit that touched async.c in that interval was
d4031d784. That changed asyncQueueAdvanceTail() so that, rather than
updating QUEUE_TAIL atomically, it would first compute the new tail
pointer, then release AsyncQueueLock for awhile, then finally update
QUEUE_TAIL with the previously-computed value.

I think that's all right in HEAD and v13, because of their limited usage
of the QUEUE_TAIL pointer. But I now realize that it's *not* all right in
older branches, because in those branches asyncQueueAdvanceTail is only
called in one of two ways:

1. If ProcessCompletedNotifies realizes that nobody is there to read the
notification it just sent, it'll call asyncQueueAdvanceTail. In a system
actively using notifications, that likely won't ever happen.

2. Otherwise, asyncQueueAdvanceTail is called from asyncQueueUnregister
or asyncQueueReadAllNotifications, but only if the backend believes
itself to be the hindmost backend, ie it saw that QUEUE_TAIL was equal to
its own queue pointer.

Thus, we now have the possibility for the following race condition:

1. Backend A performs asyncQueueReadAllNotifications and thence
asyncQueueAdvanceTail (hence, A was the hindmost to start with).

2. A computes a new tail pointer, which no longer matches its own queue
pointer, but does match backend B's (ie, B is currently hindmost). A now
releases AsyncQueueLock, allowing a small interval in which ...

3. B performs asyncQueueReadAllNotifications. It sees that its queue
pointer does not match QUEUE_TAIL, so it doesn't need to call
asyncQueueAdvanceTail. But it does advance its own pointer.

4. A re-acquires AsyncQueueLock and updates QUEUE_TAIL with what is now a
stale value that matches no backend's queue pointer.

5. After this, no execution of asyncQueueUnregister or
asyncQueueReadAllNotifications will call asyncQueueAdvanceTail, so unless
we get to a case where a notify is sent with no listeners to hear it, the
queue will never be emptied. Ooops.

So the question is what to do about this. We could dodge the problem by
back-patching 51004c717, but that's a considerably larger change than I
want to stick into the older branches. More practical possibilities seem
to include:

* Change the code back to being atomic, ie go ahead and update QUEUE_TAIL
immediately, and truncate the SLRU only afterward. Why is it necessary,
or even safe, to perform the SLRU truncation before changing QUEUE_TAIL?
(IOW, I wonder if there isn't a bug there in HEAD too.)

* Revert d4031d784's effects in async.c in the pre-v13 branches, on the
grounds that the cure was worse than the disease.

* Change asyncQueueAdvanceTail so that it does a whole fresh computation
of the tail pointer after it re-acquires the lock. I think this is OK; it
might mean that we miss an opportunity to truncate the SLRU, but there'll
be another one.

* Find another pathway in which to call asyncQueueAdvanceTail
occasionally. While the obvious place would be "if we're about to
complain about the queue being full", that's probably not good enough,
because it'd mean that the queue grows quite large before we recover from
the race condition --- and a very stale tail pointer has bad implications
for the cost of Exec_ListenPreCommit. I'm not sure about good places
otherwise.

Thoughts?

regards, tom lane
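The five-step race above can be sketched as a toy model. The names below (`QUEUE_TAIL`, `pos`, `computed_tail`) are illustrative stand-ins, not the actual async.c data structures; the point is only to show how the tail ends up matching no backend's position:

```python
# Toy model of the pre-v13 race: names are invented for illustration.

QUEUE_TAIL = 0                      # shared tail pointer
pos = {"A": 0, "B": 3}              # each backend's queue read position

# Step 1: backend A reads all notifications (say the head is at 10).
pos["A"] = 10
# A was hindmost (its old position == QUEUE_TAIL), so it runs
# asyncQueueAdvanceTail, which computes the new tail ...
computed_tail = min(pos.values())   # == 3, B's position; B is now hindmost
# ... and then releases AsyncQueueLock (the window opened by d4031d784).

# Step 3: backend B reads all notifications inside that window.
# B's position (3) != QUEUE_TAIL (0), so B does NOT call
# asyncQueueAdvanceTail; it only advances its own pointer.
pos["B"] = 10

# Step 4: A re-acquires the lock and stores its stale computation.
QUEUE_TAIL = computed_tail          # == 3: matches no backend's position

# Step 5: from now on no backend ever observes QUEUE_TAIL equal to its
# own position, so asyncQueueAdvanceTail is never called again and the
# queue is never truncated.
assert QUEUE_TAIL not in pos.values()
```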
I wrote:
> * Change the code back to being atomic, ie go ahead and update
> QUEUE_TAIL immediately, and truncate the SLRU only afterward.
> Why is it necessary, or even safe, to perform the SLRU truncation
> before changing QUEUE_TAIL? (IOW, I wonder if there isn't a bug
> there in HEAD too.)

After thinking more about that, I'm pretty sure there is a bug there: a
newly-arriving backend could attempt to scan the queue starting at
QUEUE_TAIL, concurrently with SimpleLruTruncate removing the page(s) it
wants to scan. In typical cases no failure will occur because
Exec_ListenPreCommit could advance its queue pointer to a safe place by
observing the pointers of other backends. However, if the new listener
were the only one in its database, we'd have trouble.

What seems like the most expedient way to fix this is to separate
QUEUE_TAIL into two variables, one that is the "logical" tail position
from which new backends may start to scan the queue, and one that is the
"physical" tail position, ie, the oldest page we have not yet truncated.
The physical tail need only be tracked to page accuracy, so it can be a
plain int not a QUEUE_POS.

asyncQueueAdvanceTail should update QUEUE_TAIL immediately, which
restores correct behavior pre-v13 and also dodges the race condition
described above. But we don't update the physical tail pointer until
we've completed SimpleLruTruncate, keeping things honest with respect to
asyncQueueIsFull.

As a minor side benefit, asyncQueueAdvanceTail doesn't have to take the
NotifyQueueLock twice unless it actually does an SLRU truncation.

In short, I propose the attached patch (against HEAD, but the same logic
changes should work in the back branches).
regards, tom lane

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8dbcace3f9..786e3ee1ca 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -255,7 +255,7 @@ typedef struct QueueBackendStatus
  * When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
  * entries of other backends and also change the head pointer. When holding
  * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointer.
+ * can change the tail pointers.
  *
  * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
  * In order to avoid deadlocks, whenever we need multiple locks, we first get
@@ -276,6 +276,8 @@ typedef struct AsyncQueueControl
 	QueuePosition head;			/* head points to the next free location */
 	QueuePosition tail;			/* tail must be <= the queue position of every
 								 * listening backend */
+	int			tailPage;		/* oldest unrecycled page; must be <=
+								 * tail.page */
 	BackendId	firstListener;	/* id of first listener, or InvalidBackendId */
 	TimestampTz lastQueueFillWarn;	/* time of last queue-full msg */
 	QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
@@ -286,6 +288,7 @@ static AsyncQueueControl *asyncQueueControl;
 
 #define QUEUE_HEAD			(asyncQueueControl->head)
 #define QUEUE_TAIL			(asyncQueueControl->tail)
+#define QUEUE_TAIL_PAGE		(asyncQueueControl->tailPage)
 #define QUEUE_FIRST_LISTENER	(asyncQueueControl->firstListener)
 #define QUEUE_BACKEND_PID(i)	(asyncQueueControl->backend[i].pid)
 #define QUEUE_BACKEND_DBOID(i)	(asyncQueueControl->backend[i].dboid)
@@ -537,6 +540,7 @@ AsyncShmemInit(void)
 		/* First time through, so initialize it */
 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+		QUEUE_TAIL_PAGE = 0;
 		QUEUE_FIRST_LISTENER = InvalidBackendId;
 		asyncQueueControl->lastQueueFillWarn = 0;
 		/* zero'th entry won't be used, but let's initialize it anyway */
@@ -1358,7 +1362,7 @@ asyncQueueIsFull(void)
 	nexthead = QUEUE_POS_PAGE(QUEUE_HEAD) + 1;
 	if (nexthead > QUEUE_MAX_PAGE)
 		nexthead = 0;			/* wrap around */
-	boundary = QUEUE_POS_PAGE(QUEUE_TAIL);
+	boundary = QUEUE_TAIL_PAGE;
 	boundary -= boundary % SLRU_PAGES_PER_SEGMENT;
 	return asyncQueuePagePrecedes(nexthead, boundary);
 }
@@ -1577,7 +1581,7 @@ static double
 asyncQueueUsage(void)
 {
 	int			headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
-	int			tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
+	int			tailPage = QUEUE_TAIL_PAGE;
 	int			occupied;
 
 	occupied = headPage - tailPage;
@@ -2178,7 +2182,23 @@ asyncQueueAdvanceTail(void)
 	/* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
 	LWLockAcquire(NotifyQueueTailLock, LW_EXCLUSIVE);
 
-	/* Compute the new tail. */
+	/*
+	 * Compute the new tail. Pre-v13, it's essential that QUEUE_TAIL be exact
+	 * (ie, exactly match at least one backend's queue position), so it must
+	 * be updated atomically with the actual computation. Since v13, we could
+	 * get away with not doing it like that, but it seems prudent to keep it
+	 * so.
+	 *
+	 * Also, because incoming backends will scan forward from QUEUE_TAIL, that
+	 * must be advanced before we can truncate any data. Thus, QUEUE_TAIL is
+	 * the logical tail, while QUEUE_TAIL_PAGE is the oldest not-yet-truncated
+	 * page. When QUEUE_TAIL_PAGE != QUEUE_POS_PAGE(QUEUE_TAIL), there are
+	 * pages we can truncate but haven't yet finished doing so.
+	 *
+	 * For concurrency's sake, we don't want to hold NotifyQueueLock while
+	 * performing SimpleLruTruncate. This is OK because no backend will try
+	 * to access the pages we are in the midst of truncating.
+	 */
 	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
 	min = QUEUE_HEAD;
 	for (BackendId i = QUEUE_FIRST_LISTENER; i > 0; i = QUEUE_NEXT_LISTENER(i))
@@ -2186,7 +2206,8 @@ asyncQueueAdvanceTail(void)
 		Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
 		min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
 	}
-	oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
+	QUEUE_TAIL = min;
+	oldtailpage = QUEUE_TAIL_PAGE;
 	LWLockRelease(NotifyQueueLock);
 
 	/*
@@ -2205,16 +2226,16 @@ asyncQueueAdvanceTail(void)
 		 * release the lock again.
 		 */
 		SimpleLruTruncate(NotifyCtl, newtailpage);
-	}
 
-	/*
-	 * Advertise the new tail. This changes asyncQueueIsFull()'s verdict for
-	 * the segment immediately prior to the new tail, allowing fresh data into
-	 * that segment.
-	 */
-	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
-	QUEUE_TAIL = min;
-	LWLockRelease(NotifyQueueLock);
+		/*
+		 * Update QUEUE_TAIL_PAGE. This changes asyncQueueIsFull()'s verdict
+		 * for the segment immediately prior to the new tail, allowing fresh
+		 * data into that segment.
+		 */
+		LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+		QUEUE_TAIL_PAGE = newtailpage;
+		LWLockRelease(NotifyQueueLock);
+	}
 
 	LWLockRelease(NotifyQueueTailLock);
 }
Noah Misch <noah@leadboat.com> writes:
> ... agreed. In general, recycling SLRU space entails three steps that
> shall not overlap:
> 1. Stop reading data in the space, regulated by some "logical tail".
> 2. Unlink files wholly within the bounds of the space.
> 3. Start writing data into the space, regulated by some "physical tail"
> (most often called a "stop limit").

Check.

> Commit d4031d7 fixed overlap of (2) and (3). For pg_notify, though, it
> introduced overlap of (1) and (2). I've now checked the other SLRUs for
> similar problems, but I found nothing urgent:

Good, I was wondering if we had any similar issues elsewhere.

> I think we don't yet have the right name here, seeing QUEUE_TAIL_PAGE !=
> QUEUE_POS_PAGE(QUEUE_TAIL) sounds paradoxical, yet happens regularly.
> How about naming it QUEUE_STOP_PAGE?

Hmm, it's not very clear what "stop" means here. What do you think of
QUEUE_OLDEST_PAGE?

regards, tom lane
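Noah's three-step invariant can be sketched as a toy model. The class and field names below are invented for illustration; only the ordering of the three steps reflects the email:

```python
# Toy model of the three non-overlapping SLRU-recycling steps described
# above. The ToySlru class is a stand-in, not PostgreSQL's slru.c.

class ToySlru:
    def __init__(self):
        self.segments = list(range(10))   # pages 0..9 currently on disk
        self.logical_tail = 0             # readers may scan from here
        self.physical_tail = 0            # writers may not reuse past here
        self.reader_positions = [4, 7]    # where each reader has gotten to

    def recycle(self):
        # 1. Stop reading data in the space: advance the logical tail so
        #    no new reader starts inside the region about to be removed.
        self.logical_tail = min(self.reader_positions)
        # 2. Unlink pages wholly within the now-unreadable space.
        self.segments = [s for s in self.segments if s >= self.logical_tail]
        # 3. Start writing into the space: advance the physical tail (the
        #    "stop limit") only after the unlink has completed.
        self.physical_tail = self.logical_tail

slru = ToySlru()
slru.recycle()
assert slru.physical_tail == 4 and min(slru.segments) == 4
```

The pg_notify bug was that d4031d7 performed step 2 before step 1's advance of the logical tail had been published.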
Noah Misch <noah@leadboat.com> writes:
> On Fri, Nov 27, 2020 at 11:03:40PM -0500, Tom Lane wrote:
>> Hmm, it's not very clear what "stop" means here. What do you think of
>> QUEUE_OLDEST_PAGE?

> QUEUE_OLDEST_PAGE is fine. I like it a little less than QUEUE_STOP_PAGE,
> because oldestClogXid is a logical tail, and QUEUE_OLDEST_PAGE would be a
> physical tail.

I went with QUEUE_STOP_PAGE.

In further testing, I noted that the patch as I had it re-introduced the
symptom that 8b7ae5a82 fixed, that running "make installcheck" twice in a
row causes the async-notify isolation test to fail. That's because I'd
changed asyncQueueUsage() to measure the distance back to the physical
tail, which isn't stable because we only truncate after crossing an SLRU
segment boundary. So I reverted it to measuring the distance to
QUEUE_TAIL. You could argue either way about which definition is more
useful to end users, perhaps; but in practice the difference should
usually be too small to matter, for everyone except regression tests that
are looking for distance exactly zero.

Anyhow, pushed with that fix.

Mikael, it appears that you have three options: revert to 11.9 until
11.11 is out, restart your server every time it approaches
notify-queue-full, or apply this patch:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=40f2fbe71ad615a2bcaaf5b840ccb9329e4378aa

regards, tom lane
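The instability described above can be illustrated with a simplified model of the usage computation. The function below is a hedged sketch, not the real asyncQueueUsage() (which divides by a constant derived from QUEUE_MAX_PAGE); the point is only that a stop page which moves in segment-sized jumps reports nonzero usage for a fully caught-up queue:

```python
# Simplified sketch of the queue-usage calculation; constants and the
# divisor are illustrative, not PostgreSQL's exact formula.

SLRU_PAGES_PER_SEGMENT = 32   # truncation happens per whole segment

def usage(head_page, tail_page, max_pages=2048):
    # Fraction of the queue between tail_page and head_page.
    return (head_page - tail_page) / max_pages

# Queue fully caught up: logical tail has reached the head ...
head = logical_tail = 40
# ... but truncation only ran up to the last segment boundary.
stop_page = (head // SLRU_PAGES_PER_SEGMENT) * SLRU_PAGES_PER_SEGMENT  # 32

assert usage(head, logical_tail) == 0.0   # stable: what the tests expect
assert usage(head, stop_page) > 0.0       # unstable "physical" answer
```

This is why the isolation test, which expects usage exactly zero after draining the queue, failed when the distance was measured to the stop page.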
Great!
I will talk to the rest of the team on which path to take.
And thank you for finding and fixing this so quickly!
kr
Mikael Gustavsson
SMHI
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: 28 November 2020 20:15
To: Noah Misch
Cc: Gustavsson Mikael; pgsql-bugs@lists.postgresql.org; Svensson Peter; Almen Anders
Subject: Re: SV: Problem with pg_notify / listen
Noah Misch <noah@leadboat.com> writes:
> On Fri, Nov 27, 2020 at 11:03:40PM -0500, Tom Lane wrote:
>> Hmm, it's not very clear what "stop" means here. What do you think of
>> QUEUE_OLDEST_PAGE?
> QUEUE_OLDEST_PAGE is fine. I like it a little less than QUEUE_STOP_PAGE,
> because oldestClogXid is a logical tail, and QUEUE_OLDEST_PAGE would be a
> physical tail.
I went with QUEUE_STOP_PAGE.
In further testing, I noted that the patch as I had it re-introduced the
symptom that 8b7ae5a82 fixed, that running "make installcheck" twice
in a row causes the async-notify isolation test to fail. That's because
I'd changed asyncQueueUsage() to measure the distance back to the physical
tail, which isn't stable because we only truncate after crossing an
SLRU segment boundary. So I reverted it to measuring the distance to
QUEUE_TAIL. You could argue either way about which definition is more
useful to end users, perhaps; but in practice the difference should
usually be too small to matter, for everyone except regression tests
that are looking for distance exactly zero.
Anyhow, pushed with that fix.
Mikael, it appears that you have three options: revert to 11.9 until 11.11
is out, restart your server every time it approaches notify-queue-full, or
apply this patch:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=40f2fbe71ad615a2bcaaf5b840ccb9329e4378aa
regards, tom lane
Hi,
I have a question: does this problem exist in Postgres 12.5? We have a very similar problem after upgrading from 12.4 to 12.5: the queue size grew to 15% in 20 hours and is still rising.
On Sun, Feb 7, 2021 at 2:15 AM Gustavsson Mikael <mikael.gustavsson@smhi.se> wrote:
Regards,
Piotr Włodarczyk
Piotr Włodarczyk <piotrwlodarczyk89@gmail.com> writes:
> I have a question. Is this problem in postgres 12.5 ?

Yes.

regards, tom lane
OK. And will it be fixed in 12.6?
On Sun, 7 Feb 2021, 2:28 am Tom Lane, <tgl@sss.pgh.pa.us> wrote:
Piotr Włodarczyk <piotrwlodarczyk89@gmail.com> writes:
> I have a question. Is this problem in postgres 12.5 ?
Yes.
regards, tom lane
Piotr Włodarczyk <piotrwlodarczyk89@gmail.com> writes:
> Ok. And will by fixed in 12.6?

Yes.

regards, tom lane