On Fri, May 17, 2024 at 18:02, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, May 17, 2024 at 9:13 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> after migrating to PostgreSQL 16 I have seen broken tables on replica nodes 3 times (about once a week). The query fails with the error
>
> ERROR: could not access status of transaction 1442871302
> DETAIL: Could not open file "pg_xact/0560": No such file or directory
You've shown an inconsistency between the primary and standby with respect to the heap tuple infomask bits related to freezing. It looks like a FREEZE WAL record from the primary was never replayed on the standby.
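As a sanity check on the error quoted above, the pg_xact file name can be derived from the transaction ID with plain arithmetic. The sketch below assumes the default build settings (8 kB block size, 2 status bits per transaction, 32 SLRU pages per segment); the function name is made up for illustration and is not a PostgreSQL API.

```python
# Sketch: map an XID to the pg_xact segment file that holds its commit status.
# Assumes default PostgreSQL build constants; pg_xact_segment is a
# hypothetical helper name, not something PostgreSQL provides.
BLCKSZ = 8192                      # default block size in bytes
CLOG_BITS_PER_XACT = 2             # two status bits per transaction
CLOG_XACTS_PER_BYTE = 8 // CLOG_BITS_PER_XACT
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE
SLRU_PAGES_PER_SEGMENT = 32        # pages per segment file


def pg_xact_segment(xid: int) -> str:
    """Return the 4-digit hex segment file name for a 32-bit XID."""
    page = xid // CLOG_XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    return f"{segment:04X}"


print(pg_xact_segment(1442871302))  # prints 0560
```

Under these assumptions the XID from the error maps to segment 0560, which matches the file the standby could not open. That is consistent with the standby really trying to look up the status of that xmin or xmax, rather than reading a garbage XID.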
I think it is possible that the broken tuples were created before the upgrade from Postgres 15 to Postgres 16 (not too long before), so this bug could be a side effect of the upgrade.
It's natural for me to wonder if my Postgres 16 work on page-level freezing might be a factor here. If that really was true, then it would be necessary to explain why the primary and standby are inconsistent (no reason to suspect a problem on the primary here). It'd have to be the kind of issue that could be detected mechanically using wal_consistency_checking, but wasn't detected that way before now -- that seems unlikely.
It's worth considering if the more aggressive behavior around relfrozenxid advancement (in 15) and freezing (in 16) has increased the likelihood of problems like these in setups that were already faulty, in whatever way. The standby database is indeed corrupt, but even on 16 it's fairly isolated corruption in practical terms. The full extent of the problem is clear once amcheck is run, but only one tuple can actually cause the system to error due to the influence of hint bits (for better or worse, hint bits mask the problem quite well, even on 16).
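To illustrate how hint bits can mask a missing pg_xact segment, here is a deliberately simplified model of the xmin check. The flag values are the ones defined in htup_details.h; the function itself is a hypothetical sketch that ignores in-progress transactions, xmax, and multixacts, so it should not be read as the actual visibility code.

```python
# Simplified model: when does checking a tuple's xmin require pg_xact?
# Flag values match htup_details.h; xmin_needs_clog_lookup is a
# hypothetical helper that ignores xmax, multixacts and in-progress XIDs.
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200
HEAP_XMIN_FROZEN = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID


def xmin_needs_clog_lookup(infomask: int) -> bool:
    """True if no hint bit settles xmin's fate, so pg_xact must be read."""
    bits = infomask & HEAP_XMIN_FROZEN
    # Committed, frozen (both bits), or invalid: the answer is in the tuple.
    return bits == 0


# A tuple hinted as committed on the standby never touches pg_xact,
# even if the primary has frozen it and truncated the segment.
print(xmin_needs_clog_lookup(HEAP_XMIN_COMMITTED))  # False
# Only a tuple with no xmin hint bits set forces the lookup that fails.
print(xmin_needs_clog_lookup(0x0000))               # True
```

In this model, a tuple that missed its freeze on the standby but already carries a committed hint bit still reads fine, which would explain why only one tuple produces the error while amcheck reports many more.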