Thread: Checkpointer sync queue fills up / loops around pg_usleep() are bad
Hi,

In two recent investigations into occasional test failures (019_replslot_limit.pl failures, AIO rebase) the problems were somehow tied to the checkpointer. I don't yet know whether they are actually causally related to precisely those failures, but when running e.g. 027_stream_regress.pl, I see phases in which many backends are looping in RegisterSyncRequest() repeatedly, each time sleeping with pg_usleep(10000L).

Without adding instrumentation this is completely invisible at any log level. There are no log messages, no wait events, nothing.

ISTM we should not have any loops around pg_usleep(). And shorter term, we shouldn't have any loops around pg_usleep() that don't emit log messages / set wait events. Therefore I propose that we "prohibit" such loops without at least a DEBUG2 elog() or so. It's just too hard to debug.

The reason the sync queue fills up in 027_stream_regress.pl is actually fairly simple:

1) The test runs with shared_buffers = 1MB, leading to a small sync queue of 128 entries.

2) CheckpointWriteDelay() does pg_usleep(100000L)

ForwardSyncRequest() wakes up the checkpointer using SetLatch() if the sync queue is more than half full. But at least on linux and freebsd that doesn't actually interrupt pg_usleep() anymore (due to using signalfd / kqueue rather than a signal handler). And on all platforms the signal might arrive just before the pg_usleep() rather than during it, also not causing usleep to be interrupted.

If I shorten the sleep in CheckpointWriteDelay(), the problem goes away. This actually reduces the time for a single run of 027_stream_regress.pl on my workstation noticeably: with the default sleep time it's ~32s, with the shortened time ~27s.

I suspect we need to do something about this concrete problem for 14 and master, because it's certainly worse than before on linux / freebsd. I suspect the easiest fix is to just convert that usleep to a WaitLatch(). That'd require adding a new enum value to WaitEventTimeout in 14. Which probably is fine?

Greetings,

Andres Freund
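For illustration, a minimal sketch of what converting that sleep to a latch wait could look like inside CheckpointWriteDelay(); WAIT_EVENT_CHECKPOINT_WRITE_DELAY is used here as a placeholder name for the new WaitEventTimeout member, not necessarily what a committed patch would use:

    /*
     * Sketch only: wait on the process latch instead of pg_usleep(), so that
     * the SetLatch() issued by ForwardSyncRequest() when the sync queue is
     * more than half full wakes the checkpointer immediately.
     */
    WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
              100 /* ms */,
              WAIT_EVENT_CHECKPOINT_WRITE_DELAY);
    ResetLatch(MyLatch);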
On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
> I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
> require adding a new enum value to WaitEventTimeout in 14. Which probably is
> fine?

We've added wait events in back-branches in the past, so this does not strike me as a problem as long as you add the new entry at the end of the enum, while keeping things ordered on HEAD. In recent memory, I think that only some of the extensions published by PostgresPro rely on the enums in this area.
--
Michael
On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:
> On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
> > I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
> > require adding a new enum value to WaitEventTimeout in 14. Which probably is
> > fine?
>
> We've added wait events in back-branches in the past, so this does not
> strike me as a problem as long as you add the new entry at the end of
> the enum, while keeping things ordered on HEAD.

+1

> In recent memory, I
> think that only some of the extensions published by PostgresPro rely
> on the enums in this area.

Indeed, I only know of pg_wait_sampling which uses it. Note that it relies on the pgstat_get_wait_event* functions, so it should only return "???" / "unknown wait event" until you recompile it for a newer minor version, and not report errors or crash. All other extensions I know of simply use whatever pg_stat_activity returns, so no impact.
On Sun, Feb 27, 2022 at 10:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:
> > On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
> > > I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
> > > require adding a new enum value to WaitEventTimeout in 14. Which probably is
> > > fine?
> >
> > We've added wait events in back-branches in the past, so this does not
> > strike me as a problem as long as you add the new entry at the end of
> > the enum, while keeping things ordered on HEAD.
>
> +1

+1

Sleeps like these are also really bad for ProcSignalBarrier, which I was expecting to be the impetus for fixing any remaining loops like this.

With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s on my FreeBSD workstation!

It seems a little strange to introduce a new wait event that will very often appear into a stable branch, but ... it is actually telling the truth, so there is that.

The sleep/poll loop in RegisterSyncRequest() may also have another problem. The comment explains that it was a deliberate choice not to do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't think there's an excuse to ignore postmaster death in a loop that presumably becomes infinite if the checkpointer exits. I guess we could do:

-        pg_usleep(10000L);
+        WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
+                  WAIT_EVENT_SYNC_REQUEST);

But... really, this should be waiting on a condition variable that the checkpointer broadcasts on when the queue goes from full to not full, no? Perhaps for master only?
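To make the condition-variable idea concrete, here is a rough sketch of the checkpointer side, assuming CheckpointerShmemStruct grows a requests_not_full_cv field (the same name used by the 0003 patch quoted later in the thread); the backend side would sleep on it while the queue is full:

    /* In CheckpointerShmemInit(): set up the new condition variable. */
    ConditionVariableInit(&CheckpointerShmem->requests_not_full_cv);

    /*
     * In AbsorbSyncRequests(), once requests have been copied out of shared
     * memory and num_requests reset: wake any backends waiting for space.
     */
    if (CheckpointerShmem->num_requests < CheckpointerShmem->max_requests)
        ConditionVariableBroadcast(&CheckpointerShmem->requests_not_full_cv);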
Hi,

On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
> on my FreeBSD workstation!

That's impressive - wouldn't have guessed it to make that much of a difference. I assume that running the tests on freebsd for an older pg with a similar s_b & max_wal_size doesn't benefit as much?

I wonder how much windows will improve.

> It seems a little strange to introduce a new wait event that will very
> often appear into a stable branch, but ... it is actually telling the
> truth, so there is that.

In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.

I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...

> The sleep/poll loop in RegisterSyncRequest() may also have another
> problem. The comment explains that it was a deliberate choice not to
> do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
> think there's an excuse to ignore postmaster death in a loop that
> presumably becomes infinite if the checkpointer exits. I guess we
> could do:
>
> -        pg_usleep(10000L);
> +        WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
> +                  WAIT_EVENT_SYNC_REQUEST);
>
> But... really, this should be waiting on a condition variable that the
> checkpointer broadcasts on when the queue goes from full to not full,
> no? Perhaps for master only?

Looks worth improving, but yes, I'd not do it in the back branches. I do think it's worth giving that sleep a proper wait event though, even in the back branches.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Feb 28, 2022 at 2:36 PM Andres Freund <andres@anarazel.de> wrote:
> On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> > It seems a little strange to introduce a new wait event that will very
> > often appear into a stable branch, but ... it is actually telling the
> > truth, so there is that.
>
> In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.

Yeah.

> I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...

I might look into this.

> > The sleep/poll loop in RegisterSyncRequest() may also have another
> > problem. The comment explains that it was a deliberate choice not to
> > do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
> > think there's an excuse to ignore postmaster death in a loop that
> > presumably becomes infinite if the checkpointer exits. I guess we
> > could do:
> >
> > -        pg_usleep(10000L);
> > +        WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
> > +                  WAIT_EVENT_SYNC_REQUEST);
> >
> > But... really, this should be waiting on a condition variable that the
> > checkpointer broadcasts on when the queue goes from full to not full,
> > no? Perhaps for master only?
>
> Looks worth improving, but yes, I'd not do it in the back branches.

0003 is a first attempt at that, for master only (on top of 0002 which is the minimal fix). This shaves another second off 027_stream_regress.pl on my workstation. The main thing I realised is that I needed to hold interrupts while waiting, which seems like it should go away with 'tombstone' files as discussed in other threads. That's not a new problem in this patch, it just looks more offensive to the eye when you spell it out, instead of hiding it with an unreported sleep/poll loop...

> I do think it's worth giving that sleep a proper wait event though, even in the back branches.

I'm thinking that 0002 should be back-patched all the way, but 0001 could be limited to 14.
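As an illustration of the interrupt-holding point, a sketch under assumptions (not the 0003 patch itself: the 'wait' argument to ForwardSyncRequest() is taken from the hunk quoted below, and the surrounding structure of RegisterSyncRequest() is simplified), the backend-side wait might be bracketed like this:

    /*
     * Sketch: forward the request, waiting for queue space if necessary.
     * Interrupts are held across the call so a query cancel can't interrupt
     * the condition-variable wait part-way through handing off the request.
     */
    HOLD_INTERRUPTS();
    ret = ForwardSyncRequest(ftag, type, true /* wait for space */ );
    RESUME_INTERRUPTS();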
Hi,

On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
> > I do think it's worth giving that sleep a proper wait event though, even in the back branches.
>
> I'm thinking that 0002 should be back-patched all the way, but 0001
> could be limited to 14.

No strong opinion on back to where to backpatch. It'd be nice to have a proper wait event everywhere, but especially < 12 it looks different enough to be some effort.

> From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Mon, 28 Feb 2022 11:27:05 +1300
> Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

LGTM. Would be nice to have this fixed soon, even if it's just to reduce test times :)

> From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Tue, 1 Mar 2022 11:38:27 +1300
> Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

LGTM.

> From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Tue, 1 Mar 2022 17:34:43 +1300
> Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
>  is full.
>
> Previously, in the (hopefully) rare case that we need to wait for the
> checkpointer to create space in the sync request queue, we'd enter a
> sleep/retry loop. Instead, create a condition variable so the
> checkpointer can wake us up whenever there is a transition from 'full'
> to 'not full'.

> @@ -1076,10 +1078,11 @@ RequestCheckpoint(int flags)
>   * to perform its own fsync, which is far more expensive in practice. It
>   * is theoretically possible a backend fsync might still be necessary, if
>   * the queue is full and contains no duplicate entries. In that case, we
> - * let the backend know by returning false.
> + * let the backend know by returning false, or if 'wait' is true, then we
> + * wait for space to become available.
>   */
>  bool
> -ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> +ForwardSyncRequest(const FileTag *ftag, SyncRequestType type, bool wait)
>  {
>      CheckpointerRequest *request;
>      bool        too_full;

> @@ -1101,9 +1104,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>       * backend will have to perform its own fsync request. But before forcing
>       * that to happen, we can try to compact the request queue.
>       */
> -    if (CheckpointerShmem->checkpointer_pid == 0 ||
> -        (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> -         !CompactCheckpointerRequestQueue()))
> +    if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> +        !CompactCheckpointerRequestQueue() &&
> +        !wait)

Bit confused about the addition of the wait parameter / removal of the CheckpointerShmem->checkpointer_pid check. What's that about?

> +    /*
> +     * If we still don't have enough space and the caller asked us to wait,
> +     * wait for the checkpointer to advertise that there is space.
> +     */
> +    if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests)
> +    {
> +        ConditionVariablePrepareToSleep(&CheckpointerShmem->requests_not_full_cv);
> +        while (CheckpointerShmem->num_requests >=
> +               CheckpointerShmem->max_requests)
> +        {
> +            LWLockRelease(CheckpointerCommLock);
> +            ConditionVariableSleep(&CheckpointerShmem->requests_not_full_cv,
> +                                   WAIT_EVENT_FORWARD_SYNC_REQUEST);
> +            LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
> +        }
> +        ConditionVariableCancelSleep();
> +    }

Could there be a problem with a lot of backends trying to acquire CheckpointerCommLock in exclusive mode? I'm inclined to think it's rare enough to not worry.

Greetings,

Andres Freund
On Wed, Mar 2, 2022 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
> > From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Mon, 28 Feb 2022 11:27:05 +1300
> > Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().
>
> LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
> times :)

Thanks for the review. Pushed to master and 14, with the wait event moved to the end of the enum for the back-patch.

> > From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Tue, 1 Mar 2022 11:38:27 +1300
> > Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().
>
> LGTM.

Pushed as far back as 12. It could be done for 10 & 11, but indeed the code starts getting quite different back there, and since there are no field reports, I think that's OK for now.

A simple repro, for the record: run installcheck with shared_buffers=256kB, and then partway through, kill -STOP $checkpointer to simulate being stalled on IO for a while. Backends will soon start waiting for the checkpointer to drain the queue while dropping relations. This state was invisible to pg_stat_activity, and hangs forever if you kill the postmaster and CONT the checkpointer.

> > From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Tue, 1 Mar 2022 17:34:43 +1300
> > Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
> > is full.
> [review]

I'll come back to 0003 (the condition variable-based improvement) a bit later.