Thread: Dubious server log messages after pg_upgrade

Dubious server log messages after pg_upgrade

From
Tom Lane
Date:
I happened to notice these entries in a log file on a
buildfarm member [1]:

2025-03-12 15:39:53.265 UTC [7296] WARNING:  found incorrect redo LSN 0/159FB60 (expected 0/40000028)
2025-03-12 15:39:53.265 UTC [7296] LOG:  corrupted statistics file "pg_stat/pgstat.stat"

(this is near the end of the pg_upgrade_server.log log file).
I don't think these are related to that run's subsequent test failure,
which looks to be good old Windows randomness.  I then looked into the
logs of a local BF instance that also runs xversion-upgrade tests, and
darned if I didn't find

2025-03-12 14:59:15.792 EDT [2216647] LOG:  database system was shut down at 2025-03-12 14:59:13 EDT
2025-03-12 14:59:15.794 EDT [2216647] WARNING:  found incorrect redo LSN 0/46F73F18 (expected 0/47000028)
2025-03-12 14:59:15.794 EDT [2216647] LOG:  corrupted statistics file "pg_stat/pgstat.stat"
2025-03-12 14:59:15.795 EDT [2216644] LOG:  database system is ready to accept connections

despite that run having completed with no report of trouble.
So this may have been going on for quite some time without our
noticing.  The "corrupted statistics file" whine is most likely
caused by pg_upgrade copying the old system's pgstat.stat file
into the new installation --- is that a good idea?  I have
no idea what's causing the redo LSN complaint, but it seems
like that might deserve closer investigation.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-12%2012%3A48%3A39



Re: Dubious server log messages after pg_upgrade

From
Nathan Bossart
Date:
On Wed, Mar 12, 2025 at 08:41:29PM -0400, Tom Lane wrote:
> The "corrupted statistics file" whine is most likely
> caused by pg_upgrade copying the old system's pgstat.stat file
> into the new installation --- is that a good idea?

I don't believe pg_upgrade copies that file to the new cluster.  At least,
that would be news to me.

> I have
> no idea what's causing the redo LSN complaint, but it seems
> like that might deserve closer investigation.

pgstat_read_statsfile() emits a warning before it jumps to the section that
emits the LOG.  The LSN complaint seems to just be providing some context
around what exactly is corrupted in pgstat.stat.  Commit b860848 seems
rather suspect.  I've added Michael here...

-- 
nathan



Re: Dubious server log messages after pg_upgrade

From
Michael Paquier
Date:
On Wed, Mar 12, 2025 at 08:41:29PM -0400, Tom Lane wrote:
> So this may have been going on for quite some time without our
> noticing.  The "corrupted statistics file" whine is most likely
> caused by pg_upgrade copying the old system's pgstat.stat file
> into the new installation --- is that a good idea?  I have
> no idea what's causing the redo LSN complaint, but it seems
> like that might deserve closer investigation.

Playing catch-up with various things, my apologies for the lag.

We do not copy the stats file from the old to the new node AFAIK.
This would not work anyway as the old file would fail to load on the
new node when starting the new server due to PGSTAT_FILE_FORMAT_ID
that does not match.  Note that the LSN check is happening after the
version check.

The complaint is coming from the firect control file manipulations
that pg_upgrade does, which are incompatible with what the stats file
stores, like this one with pg_resetwal that forces the new cluster to
use a redo LSN newer than anything that has been copied from the old
cluster so as the page LSNs can remain consistent:
    exec_prog(UTILITY_LOG_FILE, NULL, true, true,
    /* use timeline 1 to match controldata and no WAL history file */
              "\"%s/pg_resetwal\" -l 00000001%s \"%s\"", new_cluster.bindir,
              old_cluster.controldata.nextxlogfile + 8,
              new_cluster.pgdata);

b860848232aa exists because I've been trying to make the handling of
the stats file more durable by forcing it to be flushed at each
checkpoint, where I've found this check to be independently useful.

Let's remove it for this release.  Perhaps we will not even need this
part if we are able to rebuild the most critical stats from WAL after
a crash.  This itself needs more work, one point mentioned being to
move some table stats at the level of its relfilenode(s) so as we
could catch up on the data in the startup process when recovering.
Note the bump of PGSTAT_FILE_FORMAT_ID that's required by removing
this LSN.

With all that said, please see the attached that I'm planning to do.
--
Michael

Attachment

Re: Dubious server log messages after pg_upgrade

From
Andres Freund
Date:
Hi,

On 2025-03-12 20:41:29 -0400, Tom Lane wrote:
> I happened to notice these entries in a log file on a
> buildfarm member [1]:
> 
> 2025-03-12 15:39:53.265 UTC [7296] WARNING:  found incorrect redo LSN 0/159FB60 (expected 0/40000028)
> 2025-03-12 15:39:53.265 UTC [7296] LOG:  corrupted statistics file "pg_stat/pgstat.stat"
> 
> (this is near the end of the pg_upgrade_server.log log file).
> I don't think these are related to that run's subsequent test failure,
> which looks to be good old Windows randomness.  I then looked into the
> logs of a local BF instance that also runs xversion-upgrade tests, and
> darned if I didn't find
> 
> 2025-03-12 14:59:15.792 EDT [2216647] LOG:  database system was shut down at 2025-03-12 14:59:13 EDT
> 2025-03-12 14:59:15.794 EDT [2216647] WARNING:  found incorrect redo LSN 0/46F73F18 (expected 0/47000028)
> 2025-03-12 14:59:15.794 EDT [2216647] LOG:  corrupted statistics file "pg_stat/pgstat.stat"
> 2025-03-12 14:59:15.795 EDT [2216644] LOG:  database system is ready to accept connections
> 
> despite that run having completed with no report of trouble.
> So this may have been going on for quite some time without our
> noticing.  The "corrupted statistics file" whine is most likely
> caused by pg_upgrade copying the old system's pgstat.stat file
> into the new installation --- is that a good idea?  I have
> no idea what's causing the redo LSN complaint, but it seems
> like that might deserve closer investigation.

I think the two issues are closely related - this is code that was introduced,
in b860848232aa, as part of work to make pgstats somewhat crashsafe.

Greetings,

Andres Freund



Re: Dubious server log messages after pg_upgrade

From
Andres Freund
Date:
Hi,

On 2025-03-15 16:14:10 -0400, Andres Freund wrote:
> On 2025-03-12 20:41:29 -0400, Tom Lane wrote:
> > I happened to notice these entries in a log file on a
> > buildfarm member [1]:
> > 
> > 2025-03-12 15:39:53.265 UTC [7296] WARNING:  found incorrect redo LSN 0/159FB60 (expected 0/40000028)
> > 2025-03-12 15:39:53.265 UTC [7296] LOG:  corrupted statistics file "pg_stat/pgstat.stat"
> > 
> > (this is near the end of the pg_upgrade_server.log log file).
> > I don't think these are related to that run's subsequent test failure,
> > which looks to be good old Windows randomness.  I then looked into the
> > logs of a local BF instance that also runs xversion-upgrade tests, and
> > darned if I didn't find
> > 
> > 2025-03-12 14:59:15.792 EDT [2216647] LOG:  database system was shut down at 2025-03-12 14:59:13 EDT
> > 2025-03-12 14:59:15.794 EDT [2216647] WARNING:  found incorrect redo LSN 0/46F73F18 (expected 0/47000028)
> > 2025-03-12 14:59:15.794 EDT [2216647] LOG:  corrupted statistics file "pg_stat/pgstat.stat"
> > 2025-03-12 14:59:15.795 EDT [2216644] LOG:  database system is ready to accept connections
> > 
> > despite that run having completed with no report of trouble.
> > So this may have been going on for quite some time without our
> > noticing.  The "corrupted statistics file" whine is most likely
> > caused by pg_upgrade copying the old system's pgstat.stat file
> > into the new installation --- is that a good idea?  I have
> > no idea what's causing the redo LSN complaint, but it seems
> > like that might deserve closer investigation.
> 
> I think the two issues are closely related - this is code that was introduced,
> in b860848232aa, as part of work to make pgstats somewhat crashsafe.

Ugh, sorry, that went out unintentionally.

Greetings,

Andres Freund



Re: Dubious server log messages after pg_upgrade

From
Michael Paquier
Date:
On Sat, Mar 15, 2025 at 04:15:26PM -0400, Andres Freund wrote:
> Ugh, sorry, that went out unintentionally.

No problem.  Thanks.
--
Michael

Attachment

Re: Dubious server log messages after pg_upgrade

From
Michael Paquier
Date:
On Sat, Mar 15, 2025 at 11:09:28AM +0900, Michael Paquier wrote:
> Let's remove it for this release.  Perhaps we will not even need this
> part if we are able to rebuild the most critical stats from WAL after
> a crash.  This itself needs more work, one point mentioned being to
> move some table stats at the level of its relfilenode(s) so as we
> could catch up on the data in the startup process when recovering.
> Note the bump of PGSTAT_FILE_FORMAT_ID that's required by removing
> this LSN.
>
> With all that said, please see the attached that I'm planning to do.

And just done that.

FWIW, I've had on my tablets the idea to refactor the pgstats code so
as it would be possible to read and/or write its data on disk through
a frontend.  This thread gives an extra reason to do so.
--
Michael

Attachment