Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size - Mailing list pgsql-bugs

From Kyotaro Horiguchi
Subject Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size
Date
Msg-id 20210729.162038.534808353849568395.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-bugs
At Wed, 28 Jul 2021 11:38:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> This test is still unstable :-(
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-20%2012%3A46%3A11
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-20%2015%3A05%3A39
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
> 
> These all look like
> 
> #   Failed test 'check that segments have been removed'
> #   at t/019_replslot_limit.pl line 226.
> #          got: '000000010000000000000020'
> #     expected: '000000010000000000000024'
> # Looks like you failed 1 test of 16.
> 
> with varying values mentioned.  It looks to me like WAL file cleanup
> is not as synchronous with slot creation as the test thinks.
> Maybe it needs to loop until the oldest WAL file matches what it
> expects?

Sorry for the kludge.

Mmm. In the failure cases, directory scan(@16:52:22.036) runs before
the targetted checkpoint completes(@16:52:22.144).

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
16:52:17.328 LOG:  checkpoint starting: wal
16:52:19.140 LOG:  invalidating slot "rep1" because its restart_lsn 0/1D00000 exceeds max_slot_wal_keep_size
16:52:19.316 019_replslot_limit.pl LOG:  statement: SELECT pg_walfile_name(lsn) FROM
pg_create_physical_replication_slot('s2',true)
 
16:52:22.036 019_replslot_limit.pl LOG:  statement: SELECT pg_ls_dir AS f FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~
'^[0-9A-F]{24}$'ORDER BY 1 LIMIT
 
16:52:22.077 019_replslot_limit.pl LOG:  statement: SELECT pg_drop_replication_slot('s2')
16:52:22.144 LOG:  checkpoint complete: wrote 18 buffers (14.1%); 0 WAL file(s) added, 4 removed, 3 recycled;
write=1.806s, sync=0.001 s, total=4.817 s; sync files=0, longest=0.000 s, average=0.000 s; distance=3072 kB,
estimate=3072kB
 

The reason is the previous checkpoint completed after starting to
advance segments

> my $logstart = get_log_size($node_primary);
> advance_wal($node_primary, 7);   !!!! another checkpoint runs/ends
...
16:52:19.140  # check for "invalidate slots" in log.
              # check for the "lost" state in pg_replication_slots.
             !! checkfor "checkpint complete" in log.
16:52:22.077  # read redo segment and oldest wal.
16:52:22.144 !! The target checkpoint ends.

So it seems to me we need to explicitly prevent unexpected checkpoints
from happening maybe by enlarging max_wal_size temporily.

I'll going that way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-bugs by date:

Previous
From: David Rowley
Date:
Subject: Re: BUG #17127: drop column cann't delete from pg_attribute, so it will up to 1600 limits soon
Next
From: Kämpf, Heiko (OWL-IT)
Date:
Subject: AW: AW: BUG #16858: clang10-devel packages are missing in SLES15 SP2 in general