Re: cleanup patches for incremental backup - Mailing list pgsql-hackers

From Robert Haas
Subject Re: cleanup patches for incremental backup
Date
Msg-id CA+TgmoZusu5g3rM1k=UB29Sf53c1OKjm1uFw9uo-sSXBLFZJiQ@mail.gmail.com
In response to Re: cleanup patches for incremental backup  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Wed, Jan 24, 2024 at 12:08 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> I'm seeing some recent buildfarm failures for pg_walsummary:
>
>         https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58
>         https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36
>         https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49
>         https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57
>         https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36
>
> The signature looks nearly identical in each:
>
>         #   Failed test 'WAL summary file exists'
>         #   at t/002_blocks.pl line 79.
>
>         #   Failed test 'stdout shows block 0 modified'
>         #   at t/002_blocks.pl line 85.
>         #                   ''
>         #     doesn't match '(?^m:FORK main: block 0$)'
>
> I haven't been able to reproduce the issue on my machine, and I haven't
> figured out precisely what is happening yet, but I wanted to make sure
> there is awareness.

This is weird. There's a little more detail in the log file,
regress_log_002_blocks, e.g. from the first failure you linked:

[11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
[11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
[11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
# 1
...
[11:18:21.349](0.000s) #          got: 'pg_walsummary: error: could not open file "/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0000000100000000014E0D0800000000155BAF0
# 1.summary": No such file or directory'

The "examining summary" line is generated based on the output of
pg_available_wal_summaries(). The way that works is that the server
calls readdir(), disassembles the filename into a TLI and two LSNs,
and returns the result. Then, a fraction of a second later, the test
script reassembles those components into a filename and finds the file
missing. If the logic to translate between filenames and TLIs & LSNs
were incorrect, the test would fail consistently. So the only
explanation that seems to fit the facts is the file disappearing out
from under us. But that really shouldn't happen. We do have code to
remove such files in MaybeRemoveOldWalSummaries(), but it's only
supposed to be nuking files more than 10 days old.
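
To make that round trip concrete, here is a rough Python sketch of
disassembling and reassembling a summary file name. The layout used here
(40 hex digits: the TLI, then the high and low halves of the start and end
LSNs, followed by ".summary") is an assumption inferred from the file name
in the log above, and the function names are made up for illustration.
This is not the server's or the test script's actual code.

```python
import re

# Assumed layout: TLI, start LSN (hi, lo), end LSN (hi, lo), 8 hex digits each.
SUMMARY_RE = re.compile(r"([0-9A-F]{8})" * 5 + r"\.summary")


def parse_lsn(text):
    """Convert an LSN in 'hi/lo' hex notation into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn):
    """Convert a 64-bit integer back into 'hi/lo' hex notation."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def summary_filename(tli, start_lsn, end_lsn):
    """Reassemble a file name from a TLI and two LSNs (hypothetical helper)."""
    return "%08X%08X%08X%08X%08X.summary" % (
        tli,
        start_lsn >> 32, start_lsn & 0xFFFFFFFF,
        end_lsn >> 32, end_lsn & 0xFFFFFFFF,
    )


def parse_summary_filename(name):
    """Disassemble a file name into (tli, start_lsn, end_lsn), or None."""
    m = SUMMARY_RE.fullmatch(name)
    if m is None:
        return None  # not a summary file under the assumed naming scheme
    tli, shi, slo, ehi, elo = (int(g, 16) for g in m.groups())
    return tli, (shi << 32) | slo, (ehi << 32) | elo


if __name__ == "__main__":
    # Illustrative values only: the two "summarized through" LSNs from the log.
    name = summary_filename(1, parse_lsn("0/14E09D0"), parse_lsn("0/14E0D08"))
    tli, start, end = parse_summary_filename(name)
    print(name, tli, format_lsn(start), format_lsn(end))
```

If the two directions disagreed about the format, a round trip like this
would fail every time rather than intermittently, which is why a
translation bug does not fit the observed failures.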

So I don't really have a theory here as to what could be happening. :-(

--
Robert Haas
EDB: http://www.enterprisedb.com


