Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Andres Freund

Hi,

In two recent investigations into occasional test failures
(019_replslot_limit.pl failures, AIO rebase) the problems are somehow tied to
the checkpointer.

I don't yet know if it is actually causally related to precisely those
failures, but when running e.g. 027_stream_regress.pl, I see phases in which
many backends are looping in RegisterSyncRequest() repeatedly, each time
sleeping with pg_usleep(10000L).


Without adding instrumentation this is completely invisible at any log
level. There are no log messages, no wait events, nothing.

ISTM we should not have any loops around pg_usleep() at all. And, shorter
term, we shouldn't have any loops around pg_usleep() that don't emit log
messages / set wait events. Therefore I propose that we "prohibit" such loops
unless they have at least a DEBUG2 elog() or so. They're just too hard to
debug.
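
To be concrete, the bare minimum I have in mind for such a loop is something
like this (just a sketch; the retry condition and wait event name are made
up):

    /*
     * Retry loop around a sleep.  If we can't avoid polling, at least make
     * the wait observable: advertise a wait event and log at DEBUG2.
     */
    for (;;)
    {
        if (request_was_forwarded())    /* hypothetical retry condition */
            break;
        elog(DEBUG2, "sync request queue is full, retrying");
        pgstat_report_wait_start(WAIT_EVENT_SOME_SLEEP);    /* invented name */
        pg_usleep(10000L);              /* 10ms */
        pgstat_report_wait_end();
    }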


The reason for the sync queue filling up in 027_stream_regress.pl is actually
fairly simple:

1) The test runs with shared_buffers = 1MB, leading to a small sync queue of
   128 entries.
2) CheckpointWriteDelay() does pg_usleep(100000L)

ForwardSyncRequest() wakes up the checkpointer using SetLatch() if the sync
queue is more than half full.

But at least on Linux and FreeBSD that doesn't actually interrupt pg_usleep()
anymore (due to using signalfd / kqueue rather than a signal handler). And on
all platforms the signal might arrive just before the pg_usleep() rather than
during it, so the sleep isn't cut short in that case either.
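
To spell the interaction out, roughly (simplified from the real code paths):

    /* backend, in ForwardSyncRequest(): poke the checkpointer once the
     * queue is more than half full */
    if (too_full && ProcGlobal->checkpointerLatch)
        SetLatch(ProcGlobal->checkpointerLatch);

    /* checkpointer, in CheckpointWriteDelay(): a plain sleep, which the
     * SetLatch() above doesn't interrupt -- the latch is only noticed the
     * next time we actually wait on it */
    pg_usleep(100000L);        /* 100ms */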

If I shorten the sleep in CheckpointWriteDelay() the problem goes away. This
actually reduces the time for a single run of 027_stream_regress.pl on my
workstation noticeably.  With the default sleep time it's ~32s, with the
shortened time it's ~27s.

I suspect we need to do something about this concrete problem for 14 and
master, because it's certainly worse than before on Linux / FreeBSD.

I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?
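
I.e. roughly this (sketch only; the wait event name is just illustrative):

    -    pg_usleep(100000L);
    +    WaitLatch(MyLatch,
    +              WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
    +              100 /* ms */,
    +              WAIT_EVENT_CHECKPOINT_WRITE_DELAY);
    +    ResetLatch(MyLatch);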



Greetings,

Andres Freund



Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Michael Paquier

On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
> I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
> require adding a new enum value to WaitEventTimeout in 14. Which probably is
> fine?

We've added wait events in back-branches in the past, so this does not
strike me as a problem as long as you add the new entry at the end of
the enum, while keeping things ordered on HEAD.  In recent memory, I
think that only some of the extensions published by PostgresPro rely
on the enums in this area.
--
Michael

Attachment

Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Julien Rouhaud

On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:
> On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
> > I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
> > require adding a new enum value to WaitEventTimeout in 14. Which probably is
> > fine?
> 
> We've added wait events in back-branches in the past, so this does not
> strike me as a problem as long as you add the new entry at the end of
> the enum, while keeping things ordered on HEAD.

+1

> In recent memory, I
> think that only some of the extensions published by PostgresPro rely
> on the enums in this area. 

Indeed, I only know of pg_wait_sampling which uses it.  Note that it relies on
the pgstat_get_wait_event* functions, so it should only return "???" /
"unknown wait event" until you recompile it for the newer minor version,
rather than reporting errors or crashing.  All other extensions I know of
simply use whatever pg_stat_activity returns, so no impact.



Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Thomas Munro

On Sun, Feb 27, 2022 at 10:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Sun, Feb 27, 2022 at 06:10:45PM +0900, Michael Paquier wrote:
> > On Sat, Feb 26, 2022 at 01:39:42PM -0800, Andres Freund wrote:
> > > I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
> > > require adding a new enum value to WaitEventTimeout in 14. Which probably is
> > > fine?
> >
> > We've added wait events in back-branches in the past, so this does not
> > strike me as a problem as long as you add the new entry at the end of
> > the enum, while keeping things ordered on HEAD.
>
> +1

+1

Sleeps like these are also really bad for ProcSignalBarrier, which I
was expecting to be the impetus for fixing any remaining loops like
this.

With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
on my FreeBSD workstation!

It seems a little strange to introduce a new wait event that will very
often appear into a stable branch, but ... it is actually telling the
truth, so there is that.

The sleep/poll loop in RegisterSyncRequest() may also have another
problem.  The comment explains that it was a deliberate choice not to
do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
think there's an excuse to ignore postmaster death in a loop that
presumably becomes infinite if the checkpointer exits.  I guess we
could do:

-               pg_usleep(10000L);
+               WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
WAIT_EVENT_SYNC_REQUEST);

But... really, this should be waiting on a condition variable that the
checkpointer broadcasts on when the queue goes from full to not full,
no?  Perhaps for master only?
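
On the checkpointer side that would presumably amount to something like this
after it drains the queue (sketch only, the CV name is invented here):

    /* in AbsorbSyncRequests(), after copying requests out of shmem */
    if (was_full)
        ConditionVariableBroadcast(&CheckpointerShmem->requests_not_full_cv);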

Attachment

Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Andres Freund

Hi,

On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
>With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
>on my FreeBSD workstation!

That's impressive - I wouldn't have guessed it to make that much of a
difference. I assume that running the tests on FreeBSD for an older pg with a
similar s_b & max_wal_size doesn't benefit as much? I wonder how much Windows
will improve.


>It seems a little strange to introduce a new wait event that will very
>often appear into a stable branch, but ... it is actually telling the
>truth, so there is that.

In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.

I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...


>The sleep/poll loop in RegisterSyncRequest() may also have another
>problem.  The comment explains that it was a deliberate choice not to
>do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
>think there's an excuse to ignore postmaster death in a loop that
>presumably becomes infinite if the checkpointer exits.  I guess we
>could do:
>
>-               pg_usleep(10000L);
>+               WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
>WAIT_EVENT_SYNC_REQUEST);
>
>But... really, this should be waiting on a condition variable that the
>checkpointer broadcasts on when the queue goes from full to not full,
>no?  Perhaps for master only?

Looks worth improving, but yes, I'd not do it in the back branches.

I do think it's worth giving that sleep a proper wait event though, even in the back branches.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Thomas Munro

On Mon, Feb 28, 2022 at 2:36 PM Andres Freund <andres@anarazel.de> wrote:
> On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> >It seems a little strange to introduce a new wait event that will very
> >often appear into a stable branch, but ... it is actually telling the
> >truth, so there is that.
>
> In the back branches it needs to be at the end of the enum - I assume you intended that just to be for HEAD.

Yeah.

> I wonder whether in HEAD we shouldn't make that sleep duration be computed from the calculation in IsOnSchedule...

I might look into this.

> >The sleep/poll loop in RegisterSyncRequest() may also have another
> >problem.  The comment explains that it was a deliberate choice not to
> >do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
> >think there's an excuse to ignore postmaster death in a loop that
> >presumably becomes infinite if the checkpointer exits.  I guess we
> >could do:
> >
> >-               pg_usleep(10000L);
> >+               WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
> >WAIT_EVENT_SYNC_REQUEST);
> >
> >But... really, this should be waiting on a condition variable that the
> >checkpointer broadcasts on when the queue goes from full to not full,
> >no?  Perhaps for master only?
>
> Looks worth improving, but yes, I'd not do it in the back branches.

0003 is a first attempt at that, for master only (on top of 0002 which
is the minimal fix).  This shaves another second off
027_stream_regress.pl on my workstation.  The main thing I realised is
that I needed to hold interrupts while waiting, which seems like it
should go away with 'tombstone' files as discussed in other threads.
That's not a new problem in this patch, it just looks more offensive
to the eye when you spell it out, instead of hiding it with an
unreported sleep/poll loop...
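
(That amounts to bracketing the wait with something like

    HOLD_INTERRUPTS();
    /* ... ConditionVariableSleep() loop until there is space ... */
    RESUME_INTERRUPTS();

since ConditionVariableSleep() does CHECK_FOR_INTERRUPTS() internally, and we
must not throw away the sync request partway through.)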

> I do think it's worth giving that sleep a proper wait event though, even in the back branches.

I'm thinking that 0002 should be back-patched all the way, but 0001
could be limited to 14.

Attachment

Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Andres Freund

Hi,

On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
> > I do think it's worth giving that sleep a proper wait event though, even in the back branches.
> 
> I'm thinking that 0002 should be back-patched all the way, but 0001
> could be limited to 14.

No strong opinion on how far back to backpatch. It'd be nice to have a proper
wait event everywhere, but especially < 12 the code looks different enough to
be some effort.


> From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Mon, 28 Feb 2022 11:27:05 +1300
> Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
times :)



> From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Tue, 1 Mar 2022 11:38:27 +1300
> Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

LGTM.


> From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Tue, 1 Mar 2022 17:34:43 +1300
> Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
>  is full.
> 
> Previously, in the (hopefully) rare case that we need to wait for the
> checkpointer to create space in the sync request queue, we'd enter a
> sleep/retry loop.  Instead, create a condition variable so the
> checkpointer can wake us up whenever there is a transition from 'full'
> to 'not full'.


> @@ -1076,10 +1078,11 @@ RequestCheckpoint(int flags)
>   * to perform its own fsync, which is far more expensive in practice.  It
>   * is theoretically possible a backend fsync might still be necessary, if
>   * the queue is full and contains no duplicate entries.  In that case, we
> - * let the backend know by returning false.
> + * let the backend know by returning false, or if 'wait' is true, then we
> + * wait for space to become available.
>   */
>  bool
> -ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> +ForwardSyncRequest(const FileTag *ftag, SyncRequestType type, bool wait)
>  {
>      CheckpointerRequest *request;
>      bool        too_full;
> @@ -1101,9 +1104,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>       * backend will have to perform its own fsync request.  But before forcing
>       * that to happen, we can try to compact the request queue.
>       */
> -    if (CheckpointerShmem->checkpointer_pid == 0 ||
> -        (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> -         !CompactCheckpointerRequestQueue()))
> +    if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> +        !CompactCheckpointerRequestQueue() &&
> +        !wait)

Bit confused about the addition of the wait parameter / removal of the
CheckpointerShmem->checkpointer_pid check. What's that about?


> +    /*
> +     * If we still don't have enough space and the caller asked us to wait,
> +     * wait for the checkpointer to advertise that there is space.
> +     */
> +    if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests)
> +    {
> +        ConditionVariablePrepareToSleep(&CheckpointerShmem->requests_not_full_cv);
> +        while (CheckpointerShmem->num_requests >=
> +               CheckpointerShmem->max_requests)
> +        {
> +            LWLockRelease(CheckpointerCommLock);
> +            ConditionVariableSleep(&CheckpointerShmem->requests_not_full_cv,
> +                                   WAIT_EVENT_FORWARD_SYNC_REQUEST);
> +            LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
> +        }
> +        ConditionVariableCancelSleep();
> +    }

Could there be a problem with a lot of backends trying to acquire
CheckpointerCommLock in exclusive mode? I'm inclined to think it's rare enough
to not worry.

Greetings,

Andres Freund



Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

From: Thomas Munro

On Wed, Mar 2, 2022 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
> > From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Mon, 28 Feb 2022 11:27:05 +1300
> > Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().
>
> LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
> times :)

Thanks for the review.  Pushed to master and 14, with the wait event
moved to the end of the enum for the back-patch.

> > From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Tue, 1 Mar 2022 11:38:27 +1300
> > Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().
>
> LGTM.

Pushed as far back as 12.  It could be done for 10 & 11, but indeed
the code starts getting quite different back there, and since there
are no field reports, I think that's OK for now.

A simple repro, for the record: run installcheck with
shared_buffers=256kB, and then partway through, kill -STOP
$checkpointer to simulate being stalled on IO for a while.  Backends
will soon start waiting for the checkpointer to drain the queue while
dropping relations.  This state was invisible to pg_stat_activity, and
hangs forever if you kill the postmaster and CONT the checkpointer.

> > From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <thomas.munro@gmail.com>
> > Date: Tue, 1 Mar 2022 17:34:43 +1300
> > Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
> >  is full.

> [review]

I'll come back to 0003 (condition variable-based improvement) a bit later.