should crash recovery ignore checkpoint_flush_after ? - Mailing list pgsql-hackers
From: Justin Pryzby
Subject: should crash recovery ignore checkpoint_flush_after ?
Msg-id: 20200118140807.GA10774@telsasoft.com
List: pgsql-hackers
One of our PG12 instances was in crash recovery for an embarrassingly long time after hitting ENOSPC. (Note: I first started writing this mail 10 months ago, while running PG11, after the same experience following an OOM.) Running Linux.

As I understand it, the first thing that happens is syncing every file in the data dir, as in initdb --sync. These instances were both 5+ TB on zfs, with compression, so that's slow, but tolerable, at least understandable, and with visible progress in ps.

The 2nd stage replays WAL. strace shows it's occasionally running sync_file_range, and I think recovery might've been several times faster if we'd just dumped the data at the OS ASAP and fsynced once per file (a standalone sketch contrasting the two patterns is at the end of this mail). In fact, I've just kill -9'd the recovery process and edited the config to disable this, lest it spend all night in recovery.

$ sudo strace -p 12564 2>&1 |sed 33q
Process 12564 attached
sync_file_range(0x21, 0x2bba000, 0xa000, 0x2) = 0
sync_file_range(0xb2, 0x2026000, 0x1a000, 0x2) = 0
clock_gettime(CLOCK_MONOTONIC, {7521130, 31376505}) = 0

(gdb) bt
#0  0x00000032b2adfe8a in sync_file_range () from /lib64/libc.so.6
#1  0x00000000007454e2 in pg_flush_data (fd=<value optimized out>, offset=<value optimized out>, nbytes=<value optimized out>) at fd.c:437
#2  0x00000000007456b4 in FileWriteback (file=<value optimized out>, offset=41508864, nbytes=16384, wait_event_info=167772170) at fd.c:1855
#3  0x000000000073dbac in IssuePendingWritebacks (context=0x7ffed45f8530) at bufmgr.c:4381
#4  0x000000000073f1ff in SyncOneBuffer (buf_id=<value optimized out>, skip_recently_used=<value optimized out>, wb_context=0x7ffed45f8530) at bufmgr.c:2409
#5  0x000000000073f549 in BufferSync (flags=6) at bufmgr.c:1991
#6  0x000000000073f5d6 in CheckPointBuffers (flags=6) at bufmgr.c:2585
#7  0x000000000050552c in CheckPointGuts (checkPointRedo=535426125266848, flags=6) at xlog.c:9006
#8  0x000000000050cace in CreateCheckPoint (flags=6) at xlog.c:8795
#9  0x0000000000511740 in StartupXLOG () at xlog.c:7612
#10 0x00000000006faaf1 in StartupProcessMain () at startup.c:207

That GUC is intended to reduce latency spikes caused by checkpoint fsync. But I think the default of 256 kB between syncs is too limiting during recovery, and at that point it's better to optimize for throughput anyway, since no other backends are running (in that instance) and none can run until recovery finishes.

At least, if this setting is going to apply during recovery, the documentation should mention it (it's a "recovery checkpoint").

See also:
4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
428b1d6 Allow to trigger kernel writeback after a configurable number of writes.
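For reference, the config edit mentioned above amounts to the following (per the docs, 0 disables forced writeback, so the recovery checkpoint just pushes everything to the OS and relies on the per-file fsyncs at the end):

# postgresql.conf: disable incremental writeback during checkpoints
checkpoint_flush_after = 0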
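And here's a minimal standalone sketch contrasting the two write patterns, in case anyone wants to play with it. Illustrative only, not PostgreSQL source; the path, file size, and chunk math are made-up assumptions, and any timing is a single data point that depends entirely on the filesystem and storage underneath.

/*
 * flush_demo.c - illustrative sketch only, not PostgreSQL source.
 *
 *   ./flush_demo incremental  - start writeback every 256 kB with
 *                               sync_file_range(SYNC_FILE_RANGE_WRITE),
 *                               like checkpoint_flush_after=256kB
 *   ./flush_demo fsync-once   - dump everything to the OS, fsync once
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ  8192            /* PostgreSQL's block size */
#define CHUNK   (256 * 1024)    /* checkpoint_flush_after default */
#define NBLOCKS (64 * 1024)     /* 512 MB total; adjust to taste */

static double
elapsed(struct timespec t0)
{
    struct timespec t1;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(int argc, char **argv)
{
    int             incremental = argc > 1 && strcmp(argv[1], "incremental") == 0;
    char            buf[BLCKSZ];
    struct timespec t0;
    int             fd;

    fd = open("/tmp/flush_demo.dat", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < NBLOCKS; i++)
    {
        if (write(fd, buf, BLCKSZ) != BLCKSZ)
        {
            perror("write");
            return 1;
        }

        /*
         * In incremental mode, ask the kernel to start writing back the
         * chunk we just filled, without waiting for completion -- the same
         * call pg_flush_data() makes on Linux (the 0x2 flag in the strace).
         */
        if (incremental && (i + 1) % (CHUNK / BLCKSZ) == 0)
            sync_file_range(fd, (off_t) (i + 1) * BLCKSZ - CHUNK, CHUNK,
                            SYNC_FILE_RANGE_WRITE);
    }

    /* Either way, one fsync at the end makes the data durable. */
    if (fsync(fd) != 0)
    {
        perror("fsync");
        return 1;
    }
    close(fd);
    printf("%s: %.2f s\n", incremental ? "incremental" : "fsync-once",
           elapsed(t0));
    return 0;
}

Run both modes against the filesystem of interest; on the zfs-with-compression setup described above I'd expect the gap to be large, but I haven't measured it there.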