Thread: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness

Hi,

I've been repeatedly confused by the number of WAL files supposedly
added. Even when hundreds of new WAL files are created, the relevant portion
of log_checkpoints will only ever list zero or one added WAL file.

The reason for that is that CheckpointStats.ckpt_segs_added is only
incremented in PreallocXlogFiles(), which has the following comment:
 * XXX this is currently extremely conservative, since it forces only one
 * future log segment to exist, and even that only if we are 75% done with
 * the current one.  This is only appropriate for very low-WAL-volume systems.

Whereas in real workloads WAL files are almost exclusively created via
XLogWrite()->XLogFileInit().

I think we should consider just removing that field. Or, even better, show
something accurate instead.

As an example, here's the log output of a workload that has a replication slot
preventing WAL files from being recycled (and too small max_wal_size):

2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55767 buffers (42.5%); 0 WAL
file(s) added, 0 removed, 0 recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, longest=0.010 s,
average=0.005 s; distance=540578 kB, estimate=540905 kB

2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55806 buffers (42.6%); 1 WAL
file(s) added, 0 removed, 0 recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, longest=0.008 s,
average=0.004 s; distance=540708 kB, estimate=540886 kB

2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55850 buffers (42.6%); 1 WAL
file(s) added, 0 removed, 0 recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, longest=0.009 s,
average=0.004 s; distance=540657 kB, estimate=540863 kB

2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55879 buffers (42.6%); 0 WAL
file(s) added, 0 removed, 0 recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, longest=0.007 s,
average=0.003 s; distance=540733 kB, estimate=540850 kB

2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55847 buffers (42.6%); 1 WAL
file(s) added, 0 removed, 0 recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, longest=0.008 s,
average=0.003 s; distance=540640 kB, estimate=540829 kB
 

So it's 3 new WAL files in that timeframe, one might think?

A probe instrumenting xlog file creation shows something very different:
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a XLogFileInitInternal:39
(39 is the O_CREAT BasicOpenFile(), not the recycle path).

perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
     1.001030943                  9      probe_postgres:XLogFileInitInternal_L39
     2.002998896                  8      probe_postgres:XLogFileInitInternal_L39
     3.005098857                  8      probe_postgres:XLogFileInitInternal_L39
     4.007000426                  6      probe_postgres:XLogFileInitInternal_L39
     5.008423119                  9      probe_postgres:XLogFileInitInternal_L39
     6.013427568                  8      probe_postgres:XLogFileInitInternal_L39
     7.015087613                  8      probe_postgres:XLogFileInitInternal_L39
     8.017393339                  8      probe_postgres:XLogFileInitInternal_L39
     9.018425526                  7      probe_postgres:XLogFileInitInternal_L39
    10.020398520                  5      probe_postgres:XLogFileInitInternal_L39

Greetings,

Andres Freund



Hi,

On 2021-07-24 15:50:36 -0700, Andres Freund wrote:
> As an example, here's the log output of a workload that has a replication slot
> preventing WAL files from being recycled (and too small max_wal_size):
>
> 2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55767 buffers (42.5%); 0 WAL
> file(s) added, 0 removed, 0 recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, longest=0.010 s,
> average=0.005 s; distance=540578 kB, estimate=540905 kB
>
> 2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55806 buffers (42.6%); 1 WAL
> file(s) added, 0 removed, 0 recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, longest=0.008 s,
> average=0.004 s; distance=540708 kB, estimate=540886 kB
>
> 2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55850 buffers (42.6%); 1 WAL
> file(s) added, 0 removed, 0 recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, longest=0.009 s,
> average=0.004 s; distance=540657 kB, estimate=540863 kB
>
> 2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55879 buffers (42.6%); 0 WAL
> file(s) added, 0 removed, 0 recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, longest=0.007 s,
> average=0.003 s; distance=540733 kB, estimate=540850 kB
>
> 2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG:  checkpoint complete: wrote 55847 buffers (42.6%); 1 WAL
> file(s) added, 0 removed, 0 recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, longest=0.008 s,
> average=0.003 s; distance=540640 kB, estimate=540829 kB
 
>
> So it's 3 new WAL files in that timeframe, one might think?
>
> A probe instrumenting xlog file creation shows something very different:
> perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a XLogFileInitInternal:39
> (39 is the O_CREAT BasicOpenFile(), not the recycle path).
>
> perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
>      1.001030943                  9      probe_postgres:XLogFileInitInternal_L39
>      2.002998896                  8      probe_postgres:XLogFileInitInternal_L39
>      3.005098857                  8      probe_postgres:XLogFileInitInternal_L39
>      4.007000426                  6      probe_postgres:XLogFileInitInternal_L39
>      5.008423119                  9      probe_postgres:XLogFileInitInternal_L39
>      6.013427568                  8      probe_postgres:XLogFileInitInternal_L39
>      7.015087613                  8      probe_postgres:XLogFileInitInternal_L39
>      8.017393339                  8      probe_postgres:XLogFileInitInternal_L39
>      9.018425526                  7      probe_postgres:XLogFileInitInternal_L39
>     10.020398520                  5      probe_postgres:XLogFileInitInternal_L39

And even more extreme, the logs can end up suggesting pg_wal is shrinking,
when the opposite is the case. Here's the checkpoint output from a parallel
copy data load (without a replication slot holding things back):

2021-07-24 15:59:03.215 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 22291 buffers (17.0%); 0 WAL
file(s) added, 27 removed, 141 recycled; write=9.737 s, sync=4.153 s, total=14.884 s; sync files=108, longest=0.116 s,
average=0.039 s; distance=2756904 kB, estimate=2756904 kB

2021-07-24 15:59:12.978 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 21840 buffers (16.7%); 0 WAL
file(s) added, 53 removed, 149 recycled; write=5.531 s, sync=3.008 s, total=9.763 s; sync files=81, longest=0.201 s,
average=0.037 s; distance=3313885 kB, estimate=3313885 kB

2021-07-24 15:59:23.421 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 22326 buffers (17.0%); 0 WAL
file(s) added, 56 removed, 149 recycled; write=5.787 s, sync=3.230 s, total=10.436 s; sync files=81, longest=0.099 s,
average=0.040 s; distance=3346125 kB, estimate=3346125 kB

2021-07-24 15:59:34.424 PDT [2253324][checkpointer][:0] LOG:  checkpoint complete: wrote 22155 buffers (16.9%); 0 WAL
file(s) added, 60 removed, 148 recycled; write=6.096 s, sync=3.432 s, total=10.995 s; sync files=81, longest=0.101 s,
average=0.043 s; distance=3409281 kB, estimate=3409281 kB
 

It does look like WAL space usage is shrinking, but the opposite is true -
we're creating so much WAL that the checkpointer can't checkpoint fast enough
to keep WAL usage below max_wal_size. So WAL files are constantly created that
then need to be removed (hence the non-zero removed counts).

#           time             counts unit events
   277.087990275                 15      probe_postgres:XLogFileInitInternal_L39
   278.098549960                 15      probe_postgres:XLogFileInitInternal_L39
   279.104787575                  6      probe_postgres:XLogFileInitInternal_L39
   280.108980690                  5      probe_postgres:XLogFileInitInternal_L39
   281.111781617                  6      probe_postgres:XLogFileInitInternal_L39
   282.113601958                  2      probe_postgres:XLogFileInitInternal_L39
   283.115711683                  0      probe_postgres:XLogFileInitInternal_L39
   284.121508636                  0      probe_postgres:XLogFileInitInternal_L39
   285.124865325                  0      probe_postgres:XLogFileInitInternal_L39
   286.126932016                  0      probe_postgres:XLogFileInitInternal_L39
   287.129874993                 11      probe_postgres:XLogFileInitInternal_L39
   288.131838429                 15      probe_postgres:XLogFileInitInternal_L39
   289.133609021                 13      probe_postgres:XLogFileInitInternal_L39
   290.136254341                  6      probe_postgres:XLogFileInitInternal_L39
   291.139368485                  5      probe_postgres:XLogFileInitInternal_L39
   292.142728293                  6      probe_postgres:XLogFileInitInternal_L39
   293.148078766                  2      probe_postgres:XLogFileInitInternal_L39
   294.150258476                  0      probe_postgres:XLogFileInitInternal_L39
   295.172398897                  0      probe_postgres:XLogFileInitInternal_L39
   296.174658196                  0      probe_postgres:XLogFileInitInternal_L39
   297.176818943                  0      probe_postgres:XLogFileInitInternal_L39
   298.179003473                 14      probe_postgres:XLogFileInitInternal_L39
   299.181597777                 14      probe_postgres:XLogFileInitInternal_L39
   300.184711566                 14      probe_postgres:XLogFileInitInternal_L39
   301.188919194                  6      probe_postgres:XLogFileInitInternal_L39

Greetings,

Andres Freund




On 2021/07/25 7:50, Andres Freund wrote:
> Hi,
> 
> I've been repeatedly confused by the number of WAL files supposedly
> added. Even when hundreds of new WAL files are created, the relevant portion
> of log_checkpoints will only ever list zero or one added WAL file.
> 
> The reason for that is that CheckpointStats.ckpt_segs_added is only
> incremented in PreallocXlogFiles(), which has the following comment:
>   * XXX this is currently extremely conservative, since it forces only one
>   * future log segment to exist, and even that only if we are 75% done with
>   * the current one.  This is only appropriate for very low-WAL-volume systems.
> 
> Whereas in real workloads WAL files are almost exclusively created via
> XLogWrite()->XLogFileInit().
> 
> I think we should consider just removing that field. Or, even better, show
> something accurate instead.

+1 to show something accurate instead.

Isn't it also worth showing them in a monitoring stats view like pg_stat_wal?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness

From: "Bossart, Nathan"
On 7/24/21, 8:10 PM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:
> On 2021/07/25 7:50, Andres Freund wrote:
>> Hi,
>>
>> I've been repeatedly confused by the number of WAL files supposedly
>> added. Even when hundreds of new WAL files are created, the relevant portion
>> of log_checkpoints will only ever list zero or one added WAL file.
>>
>> The reason for that is that CheckpointStats.ckpt_segs_added is only
>> incremented in PreallocXlogFiles(), which has the following comment:
>>   * XXX this is currently extremely conservative, since it forces only one
>>   * future log segment to exist, and even that only if we are 75% done with
>>   * the current one.  This is only appropriate for very low-WAL-volume systems.
>>
>> Whereas in real workloads WAL files are almost exclusively created via
>> XLogWrite()->XLogFileInit().
>>
>> I think we should consider just removing that field. Or, even better, show
>> something accurate instead.
>
> +1 to show something accurate instead.
>
> Isn't it also worth showing them in a monitoring stats view like pg_stat_wal?

+1.  I was confused by this when working on a WAL pre-allocation
patch [0].  Perhaps it could be replaced by a new parameter and a new
field in pg_stat_wal.  How about something like log_wal_init_interval,
where the value is the minimum amount of time between reporting the
number of WAL segments created since the last report?

Nathan

[0] https://www.postgresql.org/message-id/flat/20201225200953.jjkrytlrzojbndh5@alap3.anarazel.de



On 2021/07/27 5:27, Bossart, Nathan wrote:
> +1.  I was confused by this when working on a WAL pre-allocation
> patch [0].  Perhaps it could be replaced by a new parameter and a new
> field in pg_stat_wal.  How about something like log_wal_init_interval,
> where the value is the minimum amount of time between reporting the
> number of WAL segments created since the last report?

You mean to introduce a new GUC like log_wal_init_interval, so that
the number of WAL files created since the last report is logged
at that interval? But isn't it better and simpler to just expose
the accumulated number of WAL files created, in the pg_stat_wal view
or elsewhere? If so, we can easily get the number of WAL files
created in each interval by checking the view and calculating the diff.

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



From: "Bossart, Nathan"
On 7/26/21, 5:23 PM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:
> On 2021/07/27 5:27, Bossart, Nathan wrote:
>> +1.  I was confused by this when working on a WAL pre-allocation
>> patch [0].  Perhaps it could be replaced by a new parameter and a new
>> field in pg_stat_wal.  How about something like log_wal_init_interval,
>> where the value is the minimum amount of time between reporting the
>> number of WAL segments created since the last report?
>
> You mean to introduce a new GUC like log_wal_init_interval, so that
> the number of WAL files created since the last report is logged
> at that interval? But isn't it better and simpler to just expose
> the accumulated number of WAL files created, in the pg_stat_wal view
> or elsewhere? If so, we can easily get the number of WAL files
> created in each interval by checking the view and calculating the diff.

I agree with you about adding a new field to pg_stat_wal.  The
parameter would just be a convenient way of logging this information
for future reference.  I don't feel strongly about the parameter if
you think the pg_stat_wal addition is enough.

Nathan


Hi,

On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote:
> +1.  I was confused by this when working on a WAL pre-allocation
> patch [0].  Perhaps it could be replaced by a new parameter and a new
> field in pg_stat_wal.  How about something like log_wal_init_interval,
> where the value is the minimum amount of time between reporting the
> number of WAL segments created since the last report?

Why not just make the number in log_checkpoints accurate? There's no
point in the current number, so we don't need to preserve it...

Greetings,

Andres Freund



Hi,

On 2021-07-25 12:10:07 +0900, Fujii Masao wrote:
> Isn't it also worth showing them in a monitoring stats view like pg_stat_wal?

I'm not convinced that's all that meaningful. It makes sense to include
it as part of the checkpoint output, because checkpoints determine when
WAL can be recycled etc. It's not that clear to me how to represent that
as part of pg_stat_wal?

I guess we could add columns for the amount of WAL that has been a) newly
created b) recycled c) removed. In combination that *does* seem
useful. But also a mostly independent change...

Greetings,

Andres Freund



From: "Bossart, Nathan"
On 7/26/21, 5:48 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote:
>> +1.  I was confused by this when working on a WAL pre-allocation
>> patch [0].  Perhaps it could be replaced by a new parameter and a new
>> field in pg_stat_wal.  How about something like log_wal_init_interval,
>> where the value is the minimum amount of time between reporting the
>> number of WAL segments created since the last report?
>
> Why not just make the number in log_checkpoints accurate? There's no
> point in the current number, so we don't need to preserve it...

My understanding is that the statistics logged for log_checkpoints are
just for that specific checkpoint.  From that angle, the value for the
number of WAL files added is technically correct.  Checkpoints will
only ever create zero or one new file via PreallocXlogFiles().  If we
also added all the segments created outside of the checkpoint, the
value for "added" would go from meaning "WAL files created by this
checkpoint" to "WAL files created since the last checkpoint."  That's
probably less confusing than what's there today, but it's still
slightly different from all the other log_checkpoints stats.

Nathan