On Wed, Jan 24, 2024 at 12:46:16PM -0500, Robert Haas wrote:
> The "examining summary" line is generated based on the output of
> pg_available_wal_summaries(). The way that works is that the server
> calls readdir(), disassembles the filename into a TLI and two LSNs,
> and returns the result. Then, a fraction of a second later, the test
> script reassembles those components into a filename and finds the file
> missing. If the logic to translate between filenames and TLIs & LSNs
> were incorrect, the test would fail consistently. So the only
> explanation that seems to fit the facts is the file disappearing out
> from under us. But that really shouldn't happen. We do have code to
> remove such files in MaybeRemoveOldWalSummaries(), but it's only
> supposed to be nuking files more than 10 days old.
>
> So I don't really have a theory here as to what could be happening. :-(
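
For reference, the round trip described above can be sketched roughly as
below. This is an illustration only, not the server or test code. I'm
assuming the file name is the TLI followed by the two LSNs, each printed
as 8-digit uppercase hex fields, with a ".summary" suffix, and both
function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a summary file name from a TLI and two LSNs.
 * The five-field hex layout is an assumption made for this sketch.
 */
bool
format_summary_name(char *buf, size_t buflen, uint32_t tli,
                    uint64_t start_lsn, uint64_t end_lsn)
{
    int n = snprintf(buf, buflen, "%08X%08X%08X%08X%08X.summary",
                     (unsigned int) tli,
                     (unsigned int) (uint32_t) (start_lsn >> 32),
                     (unsigned int) (uint32_t) start_lsn,
                     (unsigned int) (uint32_t) (end_lsn >> 32),
                     (unsigned int) (uint32_t) end_lsn);

    /* Fail if snprintf reported an error or the name was truncated. */
    return n > 0 && (size_t) n < buflen;
}

/*
 * Hypothetical helper: split a name of the same assumed layout back into
 * its TLI and LSNs. Returns false if the name does not match.
 */
bool
parse_summary_name(const char *name, uint32_t *tli,
                   uint64_t *start_lsn, uint64_t *end_lsn)
{
    unsigned int f[5];
    int consumed = 0;

    if (sscanf(name, "%8X%8X%8X%8X%8X%n",
               &f[0], &f[1], &f[2], &f[3], &f[4], &consumed) != 5)
        return false;

    /* Exactly 40 hex digits must be followed by the suffix. */
    if (consumed != 40 || strcmp(name + consumed, ".summary") != 0)
        return false;

    *tli = (uint32_t) f[0];
    *start_lsn = ((uint64_t) f[1] << 32) | f[2];
    *end_lsn = ((uint64_t) f[3] << 32) | f[4];
    return true;
}
```

If formatting and parsing disagreed about the layout, a check like this
would fail every time rather than intermittently, which matches the
reasoning above.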

There might be an overflow risk in the cutoff time calculation, but I doubt
that's the root cause of these failures:

    /*
     * Files should only be removed if the last modification time precedes the
     * cutoff time we compute here.
     */
    cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
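
To make the overflow concern concrete, here is a minimal standalone
sketch, not the server code. It assumes wal_summary_keep_time is an int
counted in minutes (which the multiplication by 60 suggests), and both
helper names are made up for illustration. The first reports when
60 * keep_minutes no longer fits in an int; the second does the
subtraction in 64-bit arithmetic so the product cannot wrap.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if 60 * keep_minutes would overflow when
 * evaluated in plain int arithmetic. Assumes keep_minutes >= 0.
 */
bool
keep_time_overflows_int(int keep_minutes)
{
    assert(keep_minutes >= 0);
    return keep_minutes > INT_MAX / 60;
}

/*
 * Hypothetical helper: compute the cutoff in seconds using 64-bit math.
 * "now" is the current time in seconds since the epoch.
 */
int64_t
compute_cutoff(int64_t now, int keep_minutes)
{
    assert(keep_minutes >= 0);

    /* Widen before multiplying so the product is computed as int64_t. */
    return now - (int64_t) 60 * keep_minutes;
}
```

With the 10-day value mentioned above (14400 minutes) the product is only
864000 seconds, so overflow would need a setting above INT_MAX / 60,
which is another reason to doubt that it explains these failures.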

Beyond that, I think we'll probably need to add some additional logging to
figure out what is happening...

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com