Thread: Dubious server log messages after pg_upgrade
I happened to notice these entries in a log file on a buildfarm member [1]: 2025-03-12 15:39:53.265 UTC [7296] WARNING: found incorrect redo LSN 0/159FB60 (expected 0/40000028) 2025-03-12 15:39:53.265 UTC [7296] LOG: corrupted statistics file "pg_stat/pgstat.stat" (this is near the end of the pg_upgrade_server.log log file). I don't think these are related to that run's subsequent test failure, which looks to be good old Windows randomness. I then looked into the logs of a local BF instance that also runs xversion-upgrade tests, and darned if I didn't find 2025-03-12 14:59:15.792 EDT [2216647] LOG: database system was shut down at 2025-03-12 14:59:13 EDT 2025-03-12 14:59:15.794 EDT [2216647] WARNING: found incorrect redo LSN 0/46F73F18 (expected 0/47000028) 2025-03-12 14:59:15.794 EDT [2216647] LOG: corrupted statistics file "pg_stat/pgstat.stat" 2025-03-12 14:59:15.795 EDT [2216644] LOG: database system is ready to accept connections despite that run having completed with no report of trouble. So this may have been going on for quite some time without our noticing. The "corrupted statistics file" whine is most likely caused by pg_upgrade copying the old system's pgstat.stat file into the new installation --- is that a good idea? I have no idea what's causing the redo LSN complaint, but it seems like that might deserve closer investigation. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-12%2012%3A48%3A39
On Wed, Mar 12, 2025 at 08:41:29PM -0400, Tom Lane wrote: > The "corrupted statistics file" whine is most likely > caused by pg_upgrade copying the old system's pgstat.stat file > into the new installation --- is that a good idea? I don't believe pg_upgrade copies that file to the new cluster. At least, that would be news to me. > I have > no idea what's causing the redo LSN complaint, but it seems > like that might deserve closer investigation. pgstat_read_statsfile() emits a warning before it jumps to the section that emits the LOG. The LSN complaint seems to just be providing some context around what exactly is corrupted in pgstat.stat. Commit b860848 seems rather suspect. I've added Michael here... -- nathan
On Wed, Mar 12, 2025 at 08:41:29PM -0400, Tom Lane wrote: > So this may have been going on for quite some time without our > noticing. The "corrupted statistics file" whine is most likely > caused by pg_upgrade copying the old system's pgstat.stat file > into the new installation --- is that a good idea? I have > no idea what's causing the redo LSN complaint, but it seems > like that might deserve closer investigation. Playing catch-up with various things, my apologies for the lag. We do not copy the stats file from the old to the new node AFAIK. This would not work anyway as the old file would fail to load on the new node when starting the new server due to PGSTAT_FILE_FORMAT_ID that does not match. Note that the LSN check is happening after the version check. The complaint is coming from the firect control file manipulations that pg_upgrade does, which are incompatible with what the stats file stores, like this one with pg_resetwal that forces the new cluster to use a redo LSN newer than anything that has been copied from the old cluster so as the page LSNs can remain consistent: exec_prog(UTILITY_LOG_FILE, NULL, true, true, /* use timeline 1 to match controldata and no WAL history file */ "\"%s/pg_resetwal\" -l 00000001%s \"%s\"", new_cluster.bindir, old_cluster.controldata.nextxlogfile + 8, new_cluster.pgdata); b860848232aa exists because I've been trying to make the handling of the stats file more durable by forcing it to be flushed at each checkpoint, where I've found this check to be independently useful. Let's remove it for this release. Perhaps we will not even need this part if we are able to rebuild the most critical stats from WAL after a crash. This itself needs more work, one point mentioned being to move some table stats at the level of its relfilenode(s) so as we could catch up on the data in the startup process when recovering. Note the bump of PGSTAT_FILE_FORMAT_ID that's required by removing this LSN. With all that said, please see the attached that I'm planning to do. -- Michael
Attachment
Hi, On 2025-03-12 20:41:29 -0400, Tom Lane wrote: > I happened to notice these entries in a log file on a > buildfarm member [1]: > > 2025-03-12 15:39:53.265 UTC [7296] WARNING: found incorrect redo LSN 0/159FB60 (expected 0/40000028) > 2025-03-12 15:39:53.265 UTC [7296] LOG: corrupted statistics file "pg_stat/pgstat.stat" > > (this is near the end of the pg_upgrade_server.log log file). > I don't think these are related to that run's subsequent test failure, > which looks to be good old Windows randomness. I then looked into the > logs of a local BF instance that also runs xversion-upgrade tests, and > darned if I didn't find > > 2025-03-12 14:59:15.792 EDT [2216647] LOG: database system was shut down at 2025-03-12 14:59:13 EDT > 2025-03-12 14:59:15.794 EDT [2216647] WARNING: found incorrect redo LSN 0/46F73F18 (expected 0/47000028) > 2025-03-12 14:59:15.794 EDT [2216647] LOG: corrupted statistics file "pg_stat/pgstat.stat" > 2025-03-12 14:59:15.795 EDT [2216644] LOG: database system is ready to accept connections > > despite that run having completed with no report of trouble. > So this may have been going on for quite some time without our > noticing. The "corrupted statistics file" whine is most likely > caused by pg_upgrade copying the old system's pgstat.stat file > into the new installation --- is that a good idea? I have > no idea what's causing the redo LSN complaint, but it seems > like that might deserve closer investigation. I think the two issues are closely related - this is code that was introduced, in b860848232aa, as part of work to make pgstats somewhat crashsafe. Greetings, Andres Freund
Hi, On 2025-03-15 16:14:10 -0400, Andres Freund wrote: > On 2025-03-12 20:41:29 -0400, Tom Lane wrote: > > I happened to notice these entries in a log file on a > > buildfarm member [1]: > > > > 2025-03-12 15:39:53.265 UTC [7296] WARNING: found incorrect redo LSN 0/159FB60 (expected 0/40000028) > > 2025-03-12 15:39:53.265 UTC [7296] LOG: corrupted statistics file "pg_stat/pgstat.stat" > > > > (this is near the end of the pg_upgrade_server.log log file). > > I don't think these are related to that run's subsequent test failure, > > which looks to be good old Windows randomness. I then looked into the > > logs of a local BF instance that also runs xversion-upgrade tests, and > > darned if I didn't find > > > > 2025-03-12 14:59:15.792 EDT [2216647] LOG: database system was shut down at 2025-03-12 14:59:13 EDT > > 2025-03-12 14:59:15.794 EDT [2216647] WARNING: found incorrect redo LSN 0/46F73F18 (expected 0/47000028) > > 2025-03-12 14:59:15.794 EDT [2216647] LOG: corrupted statistics file "pg_stat/pgstat.stat" > > 2025-03-12 14:59:15.795 EDT [2216644] LOG: database system is ready to accept connections > > > > despite that run having completed with no report of trouble. > > So this may have been going on for quite some time without our > > noticing. The "corrupted statistics file" whine is most likely > > caused by pg_upgrade copying the old system's pgstat.stat file > > into the new installation --- is that a good idea? I have > > no idea what's causing the redo LSN complaint, but it seems > > like that might deserve closer investigation. > > I think the two issues are closely related - this is code that was introduced, > in b860848232aa, as part of work to make pgstats somewhat crashsafe. Ugh, sorry, that went out unintentionally. Greetings, Andres Freund
On Sat, Mar 15, 2025 at 04:15:26PM -0400, Andres Freund wrote: > Ugh, sorry, that went out unintentionally. No problem. Thanks. -- Michael
Attachment
On Sat, Mar 15, 2025 at 11:09:28AM +0900, Michael Paquier wrote: > Let's remove it for this release. Perhaps we will not even need this > part if we are able to rebuild the most critical stats from WAL after > a crash. This itself needs more work, one point mentioned being to > move some table stats at the level of its relfilenode(s) so as we > could catch up on the data in the startup process when recovering. > Note the bump of PGSTAT_FILE_FORMAT_ID that's required by removing > this LSN. > > With all that said, please see the attached that I'm planning to do. And just done that. FWIW, I've had on my tablets the idea to refactor the pgstats code so as it would be possible to read and/or write its data on disk through a frontend. This thread gives an extra reason to do so. -- Michael