Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness |
| Date | |
| Msg-id | 20210724230207.brbls2v5ve5zqyti@alap3.anarazel.de |
| In response to | log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
Hi,

On 2021-07-24 15:50:36 -0700, Andres Freund wrote:
> As an example, here's the log output of a workload that has a replication slot
> preventing WAL files from being recycled (and too small max_wal_size):
>
> 2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB
> 2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB
> 2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB
> 2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB
> 2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB
>
> So it's 3 new WAL files in that timeframe, one might think?
>
> A probe instrumenting xlog file creation shows something very different:
> perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a XLogFileInitInternal:39
> (39 is the O_CREAT BasicOpenFile(), not the recycle path).
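For illustration (not part of the original mail), the "added" counter can be tallied mechanically from checkpoint log lines like the five above; the field positions assume exactly the message format shown:

```shell
# Sketch: extract and sum the "N WAL file(s) added" counter from
# checkpoint log lines. Assumes the exact log_checkpoints message
# format shown above; the sample input is abbreviated to that field.
sum_added() {
    grep -o '[0-9]* WAL file(s) added' | awk '{total += $1} END {print total}'
}

printf '%s\n' \
  '0 WAL file(s) added, 0 removed, 0 recycled' \
  '1 WAL file(s) added, 0 removed, 0 recycled' \
  '1 WAL file(s) added, 0 removed, 0 recycled' \
  '0 WAL file(s) added, 0 removed, 0 recycled' \
  '1 WAL file(s) added, 0 removed, 0 recycled' | sum_added
# prints 3
```

So the log claims 3 files added over the whole window, versus the roughly 5-9 actual file creations per second that the probe output below reports.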
>
> perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
>      1.001030943                  9      probe_postgres:XLogFileInitInternal_L39
>      2.002998896                  8      probe_postgres:XLogFileInitInternal_L39
>      3.005098857                  8      probe_postgres:XLogFileInitInternal_L39
>      4.007000426                  6      probe_postgres:XLogFileInitInternal_L39
>      5.008423119                  9      probe_postgres:XLogFileInitInternal_L39
>      6.013427568                  8      probe_postgres:XLogFileInitInternal_L39
>      7.015087613                  8      probe_postgres:XLogFileInitInternal_L39
>      8.017393339                  8      probe_postgres:XLogFileInitInternal_L39
>      9.018425526                  7      probe_postgres:XLogFileInitInternal_L39
>     10.020398520                  5      probe_postgres:XLogFileInitInternal_L39

And even more extreme, the logs can end up suggesting pg_wal is shrinking, when the opposite is the case. Here's the checkpoint output from a parallel copy data load (without a replication slot holding things back):

2021-07-24 15:59:03.215 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22291 buffers (17.0%); 0 WAL file(s) added, 27 removed, 141 recycled; write=9.737 s, sync=4.153 s, total=14.884 s; sync files=108, longest=0.116 s, average=0.039 s; distance=2756904 kB, estimate=2756904 kB
2021-07-24 15:59:12.978 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 21840 buffers (16.7%); 0 WAL file(s) added, 53 removed, 149 recycled; write=5.531 s, sync=3.008 s, total=9.763 s; sync files=81, longest=0.201 s, average=0.037 s; distance=3313885 kB, estimate=3313885 kB
2021-07-24 15:59:23.421 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22326 buffers (17.0%); 0 WAL file(s) added, 56 removed, 149 recycled; write=5.787 s, sync=3.230 s, total=10.436 s; sync files=81, longest=0.099 s, average=0.040 s; distance=3346125 kB, estimate=3346125 kB
2021-07-24 15:59:34.424 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22155 buffers (16.9%); 0 WAL file(s) added, 60 removed, 148 recycled; write=6.096 s, sync=3.432 s, total=10.995 s; sync files=81, longest=0.101 s, average=0.043 s; distance=3409281 kB, estimate=3409281 kB

It does look like WAL space usage is shrinking, but the opposite is true - we're creating so much WAL that the checkpointer can't checkpoint fast enough to keep WAL usage below max_wal_size. So WAL files are constantly created that then need to be removed (hence the non-zero removed counts).

#           time             counts unit events
     277.087990275                 15      probe_postgres:XLogFileInitInternal_L39
     278.098549960                 15      probe_postgres:XLogFileInitInternal_L39
     279.104787575                  6      probe_postgres:XLogFileInitInternal_L39
     280.108980690                  5      probe_postgres:XLogFileInitInternal_L39
     281.111781617                  6      probe_postgres:XLogFileInitInternal_L39
     282.113601958                  2      probe_postgres:XLogFileInitInternal_L39
     283.115711683                  0      probe_postgres:XLogFileInitInternal_L39
     284.121508636                  0      probe_postgres:XLogFileInitInternal_L39
     285.124865325                  0      probe_postgres:XLogFileInitInternal_L39
     286.126932016                  0      probe_postgres:XLogFileInitInternal_L39
     287.129874993                 11      probe_postgres:XLogFileInitInternal_L39
     288.131838429                 15      probe_postgres:XLogFileInitInternal_L39
     289.133609021                 13      probe_postgres:XLogFileInitInternal_L39
     290.136254341                  6      probe_postgres:XLogFileInitInternal_L39
     291.139368485                  5      probe_postgres:XLogFileInitInternal_L39
     292.142728293                  6      probe_postgres:XLogFileInitInternal_L39
     293.148078766                  2      probe_postgres:XLogFileInitInternal_L39
     294.150258476                  0      probe_postgres:XLogFileInitInternal_L39
     295.172398897                  0      probe_postgres:XLogFileInitInternal_L39
     296.174658196                  0      probe_postgres:XLogFileInitInternal_L39
     297.176818943                  0      probe_postgres:XLogFileInitInternal_L39
     298.179003473                 14      probe_postgres:XLogFileInitInternal_L39
     299.181597777                 14      probe_postgres:XLogFileInitInternal_L39
     300.184711566                 14      probe_postgres:XLogFileInitInternal_L39
     301.188919194                  6      probe_postgres:XLogFileInitInternal_L39

Greetings,

Andres Freund