Thread: should crash recovery ignore checkpoint_flush_after?
One of our PG12 instances was in crash recovery for an embarrassingly long
time after hitting ENOSPC.  (Note: I first started writing this mail 10
months ago, while running PG11, after having the same experience after OOM.)
Running linux.

As I understand it, the first thing that happens is syncing every file in the
data dir, like in initdb --sync.  These instances were both 5+TB on zfs, with
compression, so that's slow, but tolerable, and at least understandable, and
with visible progress in ps.

The 2nd stage replays WAL.  strace shows it's occasionally running
sync_file_range, and I think recovery might've been several times faster if
we'd just dumped the data at the OS ASAP, fsync once per file.  In fact, I've
just kill -9 the recovery process and edited the config to disable this lest
it spend all night in recovery.

$ sudo strace -p 12564 2>&1 |sed 33q
Process 12564 attached
sync_file_range(0x21, 0x2bba000, 0xa000, 0x2) = 0
sync_file_range(0xb2, 0x2026000, 0x1a000, 0x2) = 0
clock_gettime(CLOCK_MONOTONIC, {7521130, 31376505}) = 0

(gdb) bt
#0  0x00000032b2adfe8a in sync_file_range () from /lib64/libc.so.6
#1  0x00000000007454e2 in pg_flush_data (fd=<value optimized out>, offset=<value optimized out>, nbytes=<value optimized out>) at fd.c:437
#2  0x00000000007456b4 in FileWriteback (file=<value optimized out>, offset=41508864, nbytes=16384, wait_event_info=167772170) at fd.c:1855
#3  0x000000000073dbac in IssuePendingWritebacks (context=0x7ffed45f8530) at bufmgr.c:4381
#4  0x000000000073f1ff in SyncOneBuffer (buf_id=<value optimized out>, skip_recently_used=<value optimized out>, wb_context=0x7ffed45f8530) at bufmgr.c:2409
#5  0x000000000073f549 in BufferSync (flags=6) at bufmgr.c:1991
#6  0x000000000073f5d6 in CheckPointBuffers (flags=6) at bufmgr.c:2585
#7  0x000000000050552c in CheckPointGuts (checkPointRedo=535426125266848, flags=6) at xlog.c:9006
#8  0x000000000050cace in CreateCheckPoint (flags=6) at xlog.c:8795
#9  0x0000000000511740 in StartupXLOG () at xlog.c:7612
#10 0x00000000006faaf1 in StartupProcessMain () at startup.c:207

That GUC is intended to reduce latency spikes caused by checkpoint fsync.
But I think limiting to the default 256kB between syncs is too limiting
during recovery, and at that point it's better to optimize for throughput
anyway, since no other backends are running (in that instance) and cannot run
until recovery finishes.

At least, if this setting is going to apply during recovery, the
documentation should mention it (it's a "recovery checkpoint").

See also:
4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
428b1d6 Allow to trigger kernel writeback after a configurable number of writes.
Hi,

On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> One of our PG12 instances was in crash recovery for an embarrassingly long
> time after hitting ENOSPC.  (Note: I first started writing this mail 10
> months ago, while running PG11, after having the same experience after
> OOM.)  Running linux.
>
> As I understand it, the first thing that happens is syncing every file in
> the data dir, like in initdb --sync.  These instances were both 5+TB on
> zfs, with compression, so that's slow, but tolerable, and at least
> understandable, and with visible progress in ps.
>
> The 2nd stage replays WAL.  strace shows it's occasionally running
> sync_file_range, and I think recovery might've been several times faster
> if we'd just dumped the data at the OS ASAP, fsync once per file.  In
> fact, I've just kill -9 the recovery process and edited the config to
> disable this lest it spend all night in recovery.

I'm not quite sure what you mean here with "fsync once per file".  The
sync_file_range doesn't actually issue an fsync, even if it sounds like it.
In the case of checkpointing, what it basically does is ask the kernel to
please start writing data back immediately, instead of waiting until the
absolute end of the checkpoint, when doing fsyncs.  IOW, the data is going
to be written back *anyway* in short order.

It's possible that ZFS's compression just does broken things here, I don't
know.

> That GUC is intended to reduce latency spikes caused by checkpoint fsync.
> But I think limiting to the default 256kB between syncs is too limiting
> during recovery, and at that point it's better to optimize for throughput
> anyway, since no other backends are running (in that instance) and cannot
> run until recovery finishes.

I don't think that'd be good by default - in my experience the stalls caused
by the kernel writing back massive amounts of data at once are also
problematic during recovery (and can lead to much higher %sys too).
You get the pattern of the fsync at the end taking forever, while IO is idle
before.  And you'd get the latency spikes once recovery is over too.

> At least, if this setting is going to apply during
> recovery, the documentation should mention it (it's a "recovery checkpoint")

That makes sense.

> See also
> 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).

FWIW, I still think this is the wrong default, and that it causes our users
harm.  It only makes sense because the reverse was the default.  But it's
easy to see quite massive stalls even on fast nvme SSDs (as in 10s of
seconds of no transactions committing, in an oltp workload).

Nor do I think it is really comparable with the checkpointing setting,
because there we *know* that we're about to fsync the file, whereas in the
backend case we might just use the fs page cache as an extension of shared
buffers.

Greetings,

Andres Freund
On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> Hi,
>
> On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > One of our PG12 instances was in crash recovery for an embarrassingly
> > long time after hitting ENOSPC.  (Note: I first started writing this
> > mail 10 months ago, while running PG11, after having the same experience
> > after OOM.)  Running linux.
> >
> > As I understand it, the first thing that happens is syncing every file
> > in the data dir, like in initdb --sync.  These instances were both 5+TB
> > on zfs, with compression, so that's slow, but tolerable, and at least
> > understandable, and with visible progress in ps.
> >
> > The 2nd stage replays WAL.  strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster
> > if we'd just dumped the data at the OS ASAP, fsync once per file.  In
> > fact, I've just kill -9 the recovery process and edited the config to
> > disable this lest it spend all night in recovery.
>
> I'm not quite sure what you mean here with "fsync once per file".  The
> sync_file_range doesn't actually issue an fsync, even if it sounds like it.

I mean if we didn't call sync_file_range() and instead let the kernel handle
the writes and then fsync() at the end of the checkpoint, which happens in
any case.

I think I'll increase or maybe disable this GUC on our servers and, if
needed, adjust /proc/sys/vm/dirty_*ratio.

> It's possible that ZFS's compression just does broken things here, I
> don't know.

Or our settings aren't ideal, or recovery is just going to perform poorly
for that.  Which I'm ok with, since it should be rare anyway, and recovery
is unlikely to be a big deal for us.

> > At least, if this setting is going to apply during
> > recovery, the documentation should mention it (it's a "recovery checkpoint")
>
> That makes sense.

Find attached.  I modified a 2nd sentence since "that" was ambiguous, and
could be read to refer to "stalls".
@@ -2994,17 +2994,19 @@ include_dir 'conf.d'
         Whenever more than this amount of data has been written while
         performing a checkpoint, attempt to force the OS to issue these
         writes to the underlying storage.  Doing so will limit the amount
         of dirty data in the kernel's page cache, reducing the likelihood
         of stalls when an <function>fsync</function> is issued at the end of the
         checkpoint, or when the OS writes data back in larger batches in the
-        background.  Often that will result in greatly reduced transaction
+        background.  This feature will often result in greatly reduced transaction
         latency, but there also are some cases, especially with workloads
         that are bigger than <xref linkend="guc-shared-buffers"/>, but smaller
         than the OS's page cache, where performance might degrade.  This
         setting may have no effect on some platforms.
+        This setting also applies to the checkpoint written at the end of crash
+        recovery.
         If this value is specified without units, it is taken as blocks,
         that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
         The valid range is between
         <literal>0</literal>, which disables forced writeback, and
         <literal>2MB</literal>.  The default is <literal>256kB</literal> on
         Linux, <literal>0</literal> elsewhere.  (If <symbol>BLCKSZ</symbol> is not

What about also updating PS following the last xlog replayed?  Otherwise it
shows "recovering <file>" for the duration of the recovery checkpoint.

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7628,3 +7628,6 @@ StartupXLOG(void)
 	else
+	{
+		set_ps_display("recovery checkpoint", false);
 		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+	}

> > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
>
> FWIW, I still think this is the wrong default, and that it causes our
> users harm.

I have no opinion about the default, but the maximum seems low, as a maximum.
Why not INT_MAX, like wal_writer_flush_after?
src/include/pg_config_manual.h:#define WRITEBACK_MAX_PENDING_FLUSHES 256

Thanks,
Justin
On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> As I understand it, the first thing that happens is syncing every file in
> the data dir, like in initdb --sync.  These instances were both 5+TB on
> zfs, with compression, so that's slow, but tolerable, and at least
> understandable, and with visible progress in ps.
>
> The 2nd stage replays WAL.  strace shows it's occasionally running
> sync_file_range, and I think recovery might've been several times faster
> if we'd just dumped the data at the OS ASAP, fsync once per file.  In
> fact, I've just kill -9 the recovery process and edited the config to
> disable this lest it spend all night in recovery.

Does sync_file_range() even do anything for non-mmap'd files on ZFS?
Non-mmap'd ZFS data is not in the Linux page cache, and I think
sync_file_range() works at that level.  At a guess, there'd need to be a new
VFS file_operation so that ZFS could get a callback to handle data in its
ARC.
Hi,

On 2020-01-18 14:11:12 -0600, Justin Pryzby wrote:
> On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> > On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > > One of our PG12 instances was in crash recovery for an embarrassingly
> > > long time after hitting ENOSPC.  (Note: I first started writing this
> > > mail 10 months ago, while running PG11, after having the same
> > > experience after OOM.)  Running linux.
> > >
> > > As I understand it, the first thing that happens is syncing every file
> > > in the data dir, like in initdb --sync.  These instances were both
> > > 5+TB on zfs, with compression, so that's slow, but tolerable, and at
> > > least understandable, and with visible progress in ps.
> > >
> > > The 2nd stage replays WAL.  strace shows it's occasionally running
> > > sync_file_range, and I think recovery might've been several times
> > > faster if we'd just dumped the data at the OS ASAP, fsync once per
> > > file.  In fact, I've just kill -9 the recovery process and edited the
> > > config to disable this lest it spend all night in recovery.
> >
> > I'm not quite sure what you mean here with "fsync once per file".  The
> > sync_file_range doesn't actually issue an fsync, even if it sounds like it.
>
> I mean if we didn't call sync_file_range() and instead let the kernel
> handle the writes and then fsync() at the end of the checkpoint, which
> happens in any case.

Yea, but then more writes have to be done at the end, instead of in parallel
with other work during checkpointing.  The kernel will often end up starting
to write back buffers before that - but without much concern for locality,
so it'll be a lot more random writes.

> > > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
> >
> > FWIW, I still think this is the wrong default, and that it causes our
> > users harm.
>
> I have no opinion about the default, but the maximum seems low, as a
> maximum.  Why not INT_MAX, like wal_writer_flush_after?
Because it requires a static memory allocation, and that'd not be all that
trivial to change (we may be in a critical section, so we can't allocate).
And issuing them in a larger batch will often stall within the kernel
anyway - there's a limited number of writes the kernel can have in progress
at once.

We could make it a PGC_POSTMASTER variable, and allocate at server start,
but that seems like a cure worse than the disease.

wal_writer_flush_after doesn't have that concern, because it works
differently.

Greetings,

Andres Freund
Hi,

On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > As I understand it, the first thing that happens is syncing every file
> > in the data dir, like in initdb --sync.  These instances were both 5+TB
> > on zfs, with compression, so that's slow, but tolerable, and at least
> > understandable, and with visible progress in ps.
> >
> > The 2nd stage replays WAL.  strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times
> > faster if we'd just dumped the data at the OS ASAP, fsync once per
> > file.  In fact, I've just kill -9 the recovery process and edited the
> > config to disable this lest it spend all night in recovery.
>
> Does sync_file_range() even do anything for non-mmap'd files on ZFS?

Good point.  Next time it might be worthwhile to use strace -T to see
whether the sync_file_range calls actually take meaningful time.

> Non-mmap'd ZFS data is not in the Linux page cache, and I think
> sync_file_range() works at that level.  At a guess, there'd need to be
> a new VFS file_operation so that ZFS could get a callback to handle
> data in its ARC.

Yea, it requires the pages to be in the pagecache to do anything:

int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
		    unsigned int flags)
{
	...
	if (flags & SYNC_FILE_RANGE_WRITE) {
		int sync_mode = WB_SYNC_NONE;

		if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
			     SYNC_FILE_RANGE_WRITE_AND_WAIT)
			sync_mode = WB_SYNC_ALL;

		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
						 sync_mode);
		if (ret < 0)
			goto out;
	}

and then

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
			       loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = LONG_MAX,
		.range_start = start,
		.range_end = end,
	};

	if (!mapping_cap_writeback_dirty(mapping) ||
	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		return 0;

which means that if there are no pages in the pagecache for the relevant
range, it'll just finish here.  *Iff* there are some, say because something
else mmap()ed a section, it'd potentially call into the address_space's
->writepages() callback.

So it's possible to emulate enough state for ZFS or such to still get
sync_file_range() to call into it (by setting up a pseudo map tagged as
dirty), but it's not really the normal path.

Greetings,

Andres Freund
On Sat, Jan 18, 2020 at 03:32:02PM -0800, Andres Freund wrote:
> On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> > On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Does sync_file_range() even do anything for non-mmap'd files on ZFS?
>
> Good point.  Next time it might be worthwhile to use strace -T to see
> whether the sync_file_range calls actually take meaningful time.

> Yea, it requires the pages to be in the pagecache to do anything:
> 	if (!mapping_cap_writeback_dirty(mapping) ||
> 	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> 		return 0;

That logic is actually brand new (Sep 23, 2019, linux 5.4):
https://github.com/torvalds/linux/commit/c3aab9a0bd91b696a852169479b7db1ece6cbf8c#diff-fd2d793b8b4760b4887c8c7bbb3451d7

Running a manual CHECKPOINT, I saw stuff like:

sync_file_range(0x15f, 0x1442c000, 0x2000, 0x2) = 0 <2.953956>
sync_file_range(0x15f, 0x14430000, 0x4000, 0x2) = 0 <0.006395>
sync_file_range(0x15f, 0x14436000, 0x4000, 0x2) = 0 <0.003859>
sync_file_range(0x15f, 0x1443e000, 0x2000, 0x2) = 0 <0.027975>
sync_file_range(0x15f, 0x14442000, 0x2000, 0x2) = 0 <0.000048>

And actually, that server had been running its DB instance on a centos6 VM
(kernel-2.6.32-754.23.1.el6.x86_64), shared with the appserver, to mitigate
another issue last year.  I moved the DB back to its own centos7 VM
(kernel-3.10.0-862.14.4.el7.x86_64), and I cannot see that anymore.  It
seems that if there's any issue (with postgres or otherwise), it's vastly
mitigated or much harder to hit under modern kernels.

I also found these:
https://github.com/torvalds/linux/commit/23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 (master v5.5-rc6...v4.4-rc1)
https://github.com/torvalds/linux/commit/ee53a891f47444c53318b98dac947ede963db400 (master v5.5-rc6...v2.6.29-rc1)

The 2nd commit is maybe the cause of the issue.  The first commit is
supposedly too new to explain the difference between the two kernels, but
I'm guessing redhat maybe backpatched it into the 3.10 kernel.
Thanks, Justin