Thread: should crash recovery ignore checkpoint_flush_after?
One of our PG12 instances was in crash recovery for an embarrassingly long
time after hitting ENOSPC.  (Note: I first started writing this mail 10
months ago, while running PG11, after having the same experience after OOM.)
Running linux.

As I understand it, the first thing that happens is syncing every file in the
data dir, like in initdb --sync.  These instances were both 5+TB on zfs, with
compression, so that's slow, but tolerable, and at least understandable, and
with visible progress in ps.

The 2nd stage replays WAL.  strace shows it's occasionally running
sync_file_range, and I think recovery might've been several times faster if
we'd just dumped the data at the OS ASAP, fsync once per file.  In fact, I've
just kill -9 the recovery process and edited the config to disable this lest
it spend all night in recovery.

$ sudo strace -p 12564 2>&1 |sed 33q
Process 12564 attached
sync_file_range(0x21, 0x2bba000, 0xa000, 0x2) = 0
sync_file_range(0xb2, 0x2026000, 0x1a000, 0x2) = 0
clock_gettime(CLOCK_MONOTONIC, {7521130, 31376505}) = 0

(gdb) bt
#0  0x00000032b2adfe8a in sync_file_range () from /lib64/libc.so.6
#1  0x00000000007454e2 in pg_flush_data (fd=<value optimized out>, offset=<value optimized out>, nbytes=<value optimized out>) at fd.c:437
#2  0x00000000007456b4 in FileWriteback (file=<value optimized out>, offset=41508864, nbytes=16384, wait_event_info=167772170) at fd.c:1855
#3  0x000000000073dbac in IssuePendingWritebacks (context=0x7ffed45f8530) at bufmgr.c:4381
#4  0x000000000073f1ff in SyncOneBuffer (buf_id=<value optimized out>, skip_recently_used=<value optimized out>, wb_context=0x7ffed45f8530) at bufmgr.c:2409
#5  0x000000000073f549 in BufferSync (flags=6) at bufmgr.c:1991
#6  0x000000000073f5d6 in CheckPointBuffers (flags=6) at bufmgr.c:2585
#7  0x000000000050552c in CheckPointGuts (checkPointRedo=535426125266848, flags=6) at xlog.c:9006
#8  0x000000000050cace in CreateCheckPoint (flags=6) at xlog.c:8795
#9  0x0000000000511740 in StartupXLOG () at xlog.c:7612
#10 0x00000000006faaf1 in StartupProcessMain () at startup.c:207

That GUC is intended to reduce latency spikes caused by checkpoint fsync.
But I think limiting to the default 256kB between syncs is too limiting
during recovery, and at that point it's better to optimize for throughput
anyway, since no other backends are running (in that instance) and cannot run
until recovery finishes.

At least, if this setting is going to apply during recovery, the
documentation should mention it (it's a "recovery checkpoint").

See also:
4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
428b1d6 Allow to trigger kernel writeback after a configurable number of writes.
Hi,

On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> One of our PG12 instances was in crash recovery for an embarrassingly long
> time after hitting ENOSPC.  (Note: I first started writing this mail 10
> months ago, while running PG11, after having the same experience after
> OOM.)  Running linux.
>
> As I understand it, the first thing that happens is syncing every file in
> the data dir, like in initdb --sync.  These instances were both 5+TB on
> zfs, with compression, so that's slow, but tolerable, and at least
> understandable, and with visible progress in ps.
>
> The 2nd stage replays WAL.  strace shows it's occasionally running
> sync_file_range, and I think recovery might've been several times faster
> if we'd just dumped the data at the OS ASAP, fsync once per file.  In
> fact, I've just kill -9 the recovery process and edited the config to
> disable this lest it spend all night in recovery.

I'm not quite sure what you mean here with "fsync once per file".  The
sync_file_range doesn't actually issue an fsync, even if it sounds like it.
In the case of checkpointing, what it basically does is ask the kernel to
please start writing data back immediately, instead of waiting until the
absolute end of the checkpoint, when doing fsyncs.  IOW, the data is going
to be written back *anyway* in short order.

It's possible that ZFS's compression just does broken things here, I don't
know.

> That GUC is intended to reduce latency spikes caused by checkpoint fsync.
> But I think limiting to the default 256kB between syncs is too limiting
> during recovery, and at that point it's better to optimize for throughput
> anyway, since no other backends are running (in that instance) and cannot
> run until recovery finishes.

I don't think that'd be good by default - in my experience the stalls caused
by the kernel writing back massive amounts of data at once are also
problematic during recovery (and can lead to much higher %sys too).
You get the pattern of the fsync at the end taking forever, while IO is idle
before.  And you'd get the latency spikes once recovery is over too.

> At least, if this setting is going to apply during
> recovery, the documentation should mention it (it's a "recovery checkpoint")

That makes sense.

> See also
> 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).

FWIW, I still think this is the wrong default, and that it causes our users
harm.  It only makes sense because the reverse was the default.  But it's
easy to see quite massive stalls even on fast nvme SSDs (as in 10s of
seconds of no transactions committing, in an oltp workload).

Nor do I think it is really comparable with the checkpointing setting,
because there we *know* that we're about to fsync the file, whereas in the
backend case we might just use the fs page cache as an extension of shared
buffers.

Greetings,

Andres Freund
On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> Hi,
>
> On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > One of our PG12 instances was in crash recovery for an embarrassingly
> > long time after hitting ENOSPC.  (Note: I first started writing this
> > mail 10 months ago, while running PG11, after having the same experience
> > after OOM.)  Running linux.
> >
> > As I understand it, the first thing that happens is syncing every file
> > in the data dir, like in initdb --sync.  These instances were both 5+TB
> > on zfs, with compression, so that's slow, but tolerable, and at least
> > understandable, and with visible progress in ps.
> >
> > The 2nd stage replays WAL.  strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster
> > if we'd just dumped the data at the OS ASAP, fsync once per file.  In
> > fact, I've just kill -9 the recovery process and edited the config to
> > disable this lest it spend all night in recovery.
>
> I'm not quite sure what you mean here with "fsync once per file".  The
> sync_file_range doesn't actually issue an fsync, even if it sounds like it.

I mean if we didn't call sync_file_range() and instead let the kernel handle
the writes and then fsync() at the end of the checkpoint, which happens in
any case.

I think I'll increase or maybe disable this GUC on our servers and, if
needed, adjust /proc/sys/vm/dirty_*ratio.

> It's possible that ZFS's compression just does broken things here, I
> don't know.

Or our settings aren't ideal, or recovery is just going to perform poorly
for that.  Which I'm ok with, since it should be rare anyway, and recovery
is unlikely to be a big deal for us.

> > At least, if this setting is going to apply during
> > recovery, the documentation should mention it (it's a "recovery checkpoint")
>
> That makes sense.

Find attached.  I modified a 2nd sentence since "that" was ambiguous, and
could be read to refer to "stalls".
@@ -2994,17 +2994,19 @@ include_dir 'conf.d'
         Whenever more than this amount of data has been written while
         performing a checkpoint, attempt to force the OS to issue these
         writes to the underlying storage.  Doing so will limit the amount
         of dirty data in the kernel's page cache, reducing the likelihood
         of stalls when an <function>fsync</function> is issued at the end of the
         checkpoint, or when the OS writes data back in larger batches in the
-        background.  Often that will result in greatly reduced transaction
+        background.  This feature will often result in greatly reduced transaction
         latency, but there also are some cases, especially with workloads
         that are bigger than <xref linkend="guc-shared-buffers"/>, but smaller
         than the OS's page cache, where performance might degrade.  This
         setting may have no effect on some platforms.
+        This setting also applies to the checkpoint written at the end of crash
+        recovery.
         If this value is specified without units, it is taken as blocks,
         that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
         The valid range is between
         <literal>0</literal>, which disables forced writeback, and
         <literal>2MB</literal>.  The default is <literal>256kB</literal> on
         Linux, <literal>0</literal> elsewhere.  (If <symbol>BLCKSZ</symbol> is not

What about also updating PS following the last xlog replayed?  Otherwise it
shows "recovering <file>" for the duration of the recovery checkpoint.

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7628,3 +7628,6 @@ StartupXLOG(void)
 	else
+	{
+		set_ps_display("recovery checkpoint", false);
 		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+	}

> > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
>
> FWIW, I still think this is the wrong default, and that it causes our
> users harm.

I have no opinion about the default, but the maximum seems low, as a maximum.
Why not INT_MAX, like wal_writer_flush_after?
src/include/pg_config_manual.h:#define WRITEBACK_MAX_PENDING_FLUSHES 256

Thanks,
Justin
On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> As I understand it, the first thing that happens is syncing every file in
> the data dir, like in initdb --sync.  These instances were both 5+TB on
> zfs, with compression, so that's slow, but tolerable, and at least
> understandable, and with visible progress in ps.
>
> The 2nd stage replays WAL.  strace shows it's occasionally running
> sync_file_range, and I think recovery might've been several times faster
> if we'd just dumped the data at the OS ASAP, fsync once per file.  In
> fact, I've just kill -9 the recovery process and edited the config to
> disable this lest it spend all night in recovery.

Does sync_file_range() even do anything for non-mmap'd files on ZFS?
Non-mmap'd ZFS data is not in the Linux page cache, and I think
sync_file_range() works at that level.  At a guess, there'd need to be a new
VFS file_operation so that ZFS could get a callback to handle data in its
ARC.
Hi,

On 2020-01-18 14:11:12 -0600, Justin Pryzby wrote:
> On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> > On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > > One of our PG12 instances was in crash recovery for an embarrassingly
> > > long time after hitting ENOSPC.  (Note: I first started writing this
> > > mail 10 months ago, while running PG11, after having the same
> > > experience after OOM.)  Running linux.
> > >
> > > As I understand it, the first thing that happens is syncing every file
> > > in the data dir, like in initdb --sync.  These instances were both
> > > 5+TB on zfs, with compression, so that's slow, but tolerable, and at
> > > least understandable, and with visible progress in ps.
> > >
> > > The 2nd stage replays WAL.  strace shows it's occasionally running
> > > sync_file_range, and I think recovery might've been several times
> > > faster if we'd just dumped the data at the OS ASAP, fsync once per
> > > file.  In fact, I've just kill -9 the recovery process and edited the
> > > config to disable this lest it spend all night in recovery.
> >
> > I'm not quite sure what you mean here with "fsync once per file".  The
> > sync_file_range doesn't actually issue an fsync, even if it sounds like it.
>
> I mean if we didn't call sync_file_range() and instead let the kernel
> handle the writes and then fsync() at the end of the checkpoint, which
> happens in any case.

Yea, but then more writes have to be done at the end, instead of in parallel
with other work during checkpointing.  The kernel will often end up starting
to write back buffers before that - but without much concern for locality,
so it'll be a lot more random writes.

> > > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
> >
> > FWIW, I still think this is the wrong default, and that it causes our
> > users harm.
>
> I have no opinion about the default, but the maximum seems low, as a
> maximum.  Why not INT_MAX, like wal_writer_flush_after?
Because it requires a static memory allocation, and that'd not be all that
trivial to change (we may be in a critical section, so we can't allocate).
And issuing them in a larger batch will often stall within the kernel
anyway - there's a limited number of writes the kernel can have in progress
at once.

We could make it a PGC_POSTMASTER variable, and allocate at server start,
but that seems like a cure worse than the disease.

wal_writer_flush_after doesn't have that concern, because it works
differently.

Greetings,

Andres Freund
Hi,

On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > As I understand it, the first thing that happens is syncing every file
> > in the data dir, like in initdb --sync.  These instances were both 5+TB
> > on zfs, with compression, so that's slow, but tolerable, and at least
> > understandable, and with visible progress in ps.
> >
> > The 2nd stage replays WAL.  strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times
> > faster if we'd just dumped the data at the OS ASAP, fsync once per
> > file.  In fact, I've just kill -9 the recovery process and edited the
> > config to disable this lest it spend all night in recovery.
>
> Does sync_file_range() even do anything for non-mmap'd files on ZFS?

Good point.  Next time it might be worthwhile to use strace -T to see
whether the sync_file_range calls actually take meaningful time.

> Non-mmap'd ZFS data is not in the Linux page cache, and I think
> sync_file_range() works at that level.  At a guess, there'd need to be
> a new VFS file_operation so that ZFS could get a callback to handle
> data in its ARC.

Yea, it requires the pages to be in the pagecache to do anything:

int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
		    unsigned int flags)
{
	...
	if (flags & SYNC_FILE_RANGE_WRITE) {
		int sync_mode = WB_SYNC_NONE;

		if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
			     SYNC_FILE_RANGE_WRITE_AND_WAIT)
			sync_mode = WB_SYNC_ALL;

		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
						 sync_mode);
		if (ret < 0)
			goto out;
	}

and then

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
			       loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = LONG_MAX,
		.range_start = start,
		.range_end = end,
	};

	if (!mapping_cap_writeback_dirty(mapping) ||
	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		return 0;

which means that if there are no pages in the pagecache for the relevant
range, it'll just finish here.  *Iff* there are some, say because something
else mmap()ed a section, it'd potentially call into the address_space's
->writepages() callback.

So it's possible to emulate enough state for ZFS or such to still get
sync_file_range() to call into it (by setting up a pseudo map tagged as
dirty), but it's not really the normal path.

Greetings,

Andres Freund
On Sat, Jan 18, 2020 at 03:32:02PM -0800, Andres Freund wrote:
> On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> > On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Does sync_file_range() even do anything for non-mmap'd files on ZFS?
>
> Good point.  Next time it might be worthwhile to use strace -T to see
> whether the sync_file_range calls actually take meaningful time.

> Yea, it requires the pages to be in the pagecache to do anything:
> 	if (!mapping_cap_writeback_dirty(mapping) ||
> 	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> 		return 0;

That logic is actually brand new (Sep 23, 2019, linux 5.4):
https://github.com/torvalds/linux/commit/c3aab9a0bd91b696a852169479b7db1ece6cbf8c#diff-fd2d793b8b4760b4887c8c7bbb3451d7

Running a manual CHECKPOINT, I saw stuff like:

sync_file_range(0x15f, 0x1442c000, 0x2000, 0x2) = 0 <2.953956>
sync_file_range(0x15f, 0x14430000, 0x4000, 0x2) = 0 <0.006395>
sync_file_range(0x15f, 0x14436000, 0x4000, 0x2) = 0 <0.003859>
sync_file_range(0x15f, 0x1443e000, 0x2000, 0x2) = 0 <0.027975>
sync_file_range(0x15f, 0x14442000, 0x2000, 0x2) = 0 <0.000048>

And actually, that server had been running its DB instance on a centos6 VM
(kernel-2.6.32-754.23.1.el6.x86_64), shared with the appserver, to mitigate
another issue last year.  I moved the DB back to its own centos7 VM
(kernel-3.10.0-862.14.4.el7.x86_64), and I cannot see that anymore.  It
seems that if there's any issue (with postgres or otherwise), it's vastly
mitigated or much harder to hit under modern kernels.

I also found these:
https://github.com/torvalds/linux/commit/23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 (master v5.5-rc6...v4.4-rc1)
https://github.com/torvalds/linux/commit/ee53a891f47444c53318b98dac947ede963db400 (master v5.5-rc6...v2.6.29-rc1)

The 2nd commit is maybe the cause of the issue.  The first commit is
supposedly too new to explain the difference between the two kernels, but
I'm guessing redhat maybe backpatched it into the 3.10 kernel.
Thanks, Justin