
From: Justin Pryzby
Subject: should crash recovery ignore checkpoint_flush_after ?
Date:
Msg-id: 20200118140807.GA10774@telsasoft.com
Responses: Re: should crash recovery ignore checkpoint_flush_after ?  (Andres Freund <andres@anarazel.de>)
           Re: should crash recovery ignore checkpoint_flush_after ?  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
One of our PG12 instances was in crash recovery for an embarrassingly long time
after hitting ENOSPC.  (Note: I first started writing this mail 10 months ago
while running PG11, after having the same experience following an OOM.)  Running Linux.

As I understand it, the first thing that happens is syncing every file in the
data dir, as in initdb --sync.  These instances were both 5+TB on ZFS with
compression, so that's slow but tolerable, and at least understandable, with
visible progress in ps.
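
(For illustration only: a minimal standalone sketch of what that first pass
amounts to, not PostgreSQL's actual code, and /pgdata is a placeholder path.
It just walks the data directory and fsyncs every regular file.)

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* fsync one entry; nftw() calls this for everything under the tree */
static int
fsync_one(const char *path, const struct stat *sb, int typeflag, struct FTW *ftwbuf)
{
    (void) sb; (void) ftwbuf;

    if (typeflag == FTW_F)      /* regular files only */
    {
        int fd = open(path, O_RDONLY);

        if (fd >= 0)
        {
            if (fsync(fd) != 0)
                perror(path);
            close(fd);
        }
    }
    return 0;                   /* keep walking */
}

int
main(void)
{
    return nftw("/pgdata", fsync_one, 64, FTW_PHYS) != 0;
}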

The 2nd stage replays WAL.  strace shows it's occasionally running
sync_file_range, and I think recovery might have been several times faster if
we'd just dumped the data to the OS ASAP and fsynced once per file.  In fact,
I've just kill -9'd the recovery process and edited the config to disable this
lest it spend all night in recovery.

$ sudo strace -p 12564 2>&1 |sed 33q
Process 12564 attached
sync_file_range(0x21, 0x2bba000, 0xa000, 0x2) = 0
sync_file_range(0xb2, 0x2026000, 0x1a000, 0x2) = 0
clock_gettime(CLOCK_MONOTONIC, {7521130, 31376505}) = 0

(gdb) bt
#0  0x00000032b2adfe8a in sync_file_range () from /lib64/libc.so.6
#1  0x00000000007454e2 in pg_flush_data (fd=<value optimized out>, offset=<value optimized out>, nbytes=<value optimized out>) at fd.c:437
#2  0x00000000007456b4 in FileWriteback (file=<value optimized out>, offset=41508864, nbytes=16384, wait_event_info=167772170) at fd.c:1855
#3  0x000000000073dbac in IssuePendingWritebacks (context=0x7ffed45f8530) at bufmgr.c:4381
#4  0x000000000073f1ff in SyncOneBuffer (buf_id=<value optimized out>, skip_recently_used=<value optimized out>, wb_context=0x7ffed45f8530) at bufmgr.c:2409
#5  0x000000000073f549 in BufferSync (flags=6) at bufmgr.c:1991
#6  0x000000000073f5d6 in CheckPointBuffers (flags=6) at bufmgr.c:2585
#7  0x000000000050552c in CheckPointGuts (checkPointRedo=535426125266848, flags=6) at xlog.c:9006
#8  0x000000000050cace in CreateCheckPoint (flags=6) at xlog.c:8795
#9  0x0000000000511740 in StartupXLOG () at xlog.c:7612
#10 0x00000000006faaf1 in StartupProcessMain () at startup.c:207
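
The 0x2 flag in the strace output is SYNC_FILE_RANGE_WRITE.  As a rough
standalone sketch (my own illustration, not the actual pg_flush_data() code,
and the file name is hypothetical), each writeback hint boils down to:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* hypothetical relation file name, for illustration */
    int fd = open("base/16384/16385", O_WRONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* ... after dirtying pages in this file ... */

    /* Hint the kernel to start writing back 256kB at offset 0 and return
     * immediately; unlike fsync(), this does not wait for completion. */
    if (sync_file_range(fd, 0, 256 * 1024, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    close(fd);
    return 0;
}

That bounds the kernel's dirty-page backlog, which is what you want for
latency, but (I think) it works against throughput during recovery by issuing
many small, eager writebacks instead of letting the kernel batch them.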

That GUC is intended to reduce latency spikes caused by checkpoint fsyncs.  But
I think limiting writeback to the default 256kB between syncs is too
restrictive during recovery, and at that point it's better to optimize for
throughput anyway, since no other backends are running (in that instance) and
none can run until recovery finishes.  At least, if this setting is going to
apply during recovery, the documentation should mention it (it's a "recovery
checkpoint").
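
For anyone else hitting this, disabling the writeback hinting is just (this is
what I set, in postgresql.conf):

# 0 disables writeback hinting at checkpoints; the default is 256kB on Linux
checkpoint_flush_after = 0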

See also
4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
428b1d6 Allow to trigger kernel writeback after a configurable number of writes.


