Thread: ERROR: cannot freeze committed xmax
Hi, Few weeks ago at one of the databases we started to observe the following error in Postgresql logs: ERROR: cannot freeze committed xmax 572 For the note it's Postgresql 10.14 running on RHEL: version ---------------------------------------------------------------------------------------------------------- PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit (1 row) Apparently it happens during the autovacuum and leads to the failing autovacuum freeze progress. We have detected two dictionary tables where manual VACUUM FREEZE fails: pg_proc and pg_depend. Below you can see the page_inspect results for the appropriate blocks as well as the status of xid. Am I understand correctly that for some reason the transaction is marked as commited while on the page level it's rolledback so the inconsistency is detected? Could you advise how to fix it and make autovacuum work again? SELECT txid_status(572); txid_status ------------- committed (1 row) SELECT ctid, xmin, xmax FROM pg_proc WHERE xmax=572; ctid | xmin | xmax ---------+------+------ (75,19) | 571 | 572 (1 row) SELECT ctid, xmin, xmax FROM pg_depend WHERE xmax=572; ctid | xmin | xmax ---------+------+------ (55,76) | 571 | 572 (55,77) | 571 | 572 (55,78) | 571 | 572 (55,79) | 571 | 572 (55,80) | 571 | 572 (5 rows) SELECT lp, t_ctid AS ctid, t_xmin AS xmin, t_xmax AS xmax, (t_infomask & 128)::boolean AS xmax_is_lock, (t_infomask & 1024)::boolean AS xmax_committed, (t_infomask & 2048)::boolean AS xmax_rolled_back, (t_infomask & 4096)::boolean AS xmax_multixact FROM heap_page_item_attrs( get_raw_page('pg_proc', 75), 'pg_proc' ); lp | ctid | xmin | xmax | xmax_is_lock | xmax_committed | xmax_rolled_back | xmax_multixact ----+---------+-----------+------+--------------+----------------+------------------+---------------- 1 | (75,1) | 564 | 0 | f | f | t | f 2 | (75,2) | 564 | 0 | f | f | t | f 3 | (75,3) | 564 | 0 | f | f | t | f 4 | (75,4) | 564 | 0 | f | f | t | f 5 | (75,5) | 564 | 0 | f | f | t | f 6 | (75,6) | 564 | 0 | f | f | t | f 7 | (75,7) | 564 | 0 | f | f | t | f 8 | (75,8) | 564 | 0 | f | f | t | f 9 | (75,9) | 564 | 0 | f | f | t | f 10 | (75,10) | 566 | 0 | f | f | t | f 11 | (75,11) | 566 | 0 | f | f | t | f 12 | | | | | | | 13 | | | | | | | 14 | (75,14) | 212156121 | 0 | f | f | t | f 15 | (75,15) | 212156121 | 0 | f | f | t | f 16 | (75,16) | 212156121 | 0 | f | f | t | f 17 | (75,17) | 212156121 | 0 | f | f | t | f 18 | (75,18) | 212156121 | 0 | f | f | t | f 19 | (75,21) | 571 | 572 | f | f | t | f 20 | (75,20) | 212156121 | 0 | f | f | t | f 21 | (75,21) | 572 | 0 | f | f | t | f 22 | (75,22) | 97113057 | 0 | f | f | t | f 23 | (75,23) | 212156121 | 0 | f | f | t | f 24 | (75,24) | 212156121 | 0 | f | f | t | f 25 | (75,25) | 212156121 | 0 | f | f | t | f 26 | (75,26) | 212156121 | 0 | f | f | t | f 27 | (75,27) | 212156121 | 0 | f | f | t | f (27 rows) SELECT lp, t_ctid AS ctid, t_xmin AS xmin, t_xmax AS xmax, (t_infomask & 128)::boolean AS xmax_is_lock, (t_infomask & 1024)::boolean AS xmax_committed, (t_infomask & 2048)::boolean AS xmax_rolled_back, (t_infomask & 4096)::boolean AS xmax_multixact FROM heap_page_item_attrs( get_raw_page('pg_depend', 55), 'pg_depend' ); lp | ctid | xmin | xmax | xmax_is_lock | xmax_committed | xmax_rolled_back | xmax_multixact -----+----------+-----------+------+--------------+----------------+------------------+---------------- ... 75 | | | | | | | 76 | (55,76) | 571 | 572 | f | f | t | f 77 | (55,77) | 571 | 572 | f | f | t | f 78 | (55,78) | 571 | 572 | f | f | t | f 79 | (55,79) | 571 | 572 | f | f | t | f 80 | (55,80) | 571 | 572 | f | f | t | f 81 | | | | | | | ... 85 | | | | | | | 86 | (55,86) | 572 | 0 | f | f | t | f 87 | | | | | | | 88 | (55,88) | 572 | 0 | f | f | t | f 89 | | | | | | | 90 | | | | | | | 91 | (55,91) | 572 | 0 | f | f | t | f 92 | (55,92) | 572 | 0 | f | f | t | f 93 | (55,93) | 572 | 0 | f | f | t | f 94 | (55,94) | 572 | 0 | f | f | t | f ... (169 rows)
Hi Sasha On 2021-Jul-14, Sasha Aliashkevich wrote: > lp | ctid | xmin | xmax | xmax_is_lock | xmax_committed | xmax_rolled_back | xmax_multixact > ----+---------+-----------+------+--------------+----------------+------------------+---------------- > 19 | (75,21) | 571 | 572 | f | f | t | f > 21 | (75,21) | 572 | 0 | f | f | t | f Yeah, row 19 here is dead and should have been removed by vacuum (same with the ones in pg_depend). I don't know why it wasn't. If you query the tables SELECT * FROM pg_proc WHERE ctid = '(75,19)' are they visible? They shouldn't. If they are visible, maybe try to UPDATE them, again by ctid; perhaps that will kill them for good. But since the txid_status() returns committed, that likely won't work. Maybe if you try to VACUUM by setting vacuum_freeze_min_age to a high value (so that it doesn't try to freeze, but only remove them) they are removed? (I didn't verify that the values you &-mask the infomask with are correct.) -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
One thing I forgot is that these XIDs are fairly old, perhaps dating back to when this database was freshly initdb'd if there has been no XID wraparound. In that case you were probably running a version much older than 10.14 when they were written. Do you happen to know when did you initdb this, with what version, when did you upgrade this to 10.14? That may help search the commit log for bugfixes that might explain the bug. I just remembered this one as my favorite candidate: Author: Alvaro Herrera <alvherre@alvh.no-ip.org> Branch: master Release: REL_11_BR [d2599ecfc] 2018-05-04 18:24:45 -0300 Branch: REL_10_STABLE Release: REL_10_4 [e1d634758] 2018-05-04 18:23:58 -0300 Branch: REL9_6_STABLE Release: REL9_6_9 [3a11485a5] 2018-05-04 18:23:30 -0300 Don't mark pages all-visible spuriously Dan Wood diagnosed a long-standing problem that pages containing tuples that are locked by multixacts containing live lockers may spuriously end up as candidates for getting their all-visible flag set. This has the long-term effect that multixacts remain unfrozen; this may previously pass undetected, but since commit XYZ it would be reported as "ERROR: found multixact 134100944 from before relminmxid 192042633" because when a later vacuum tries to freeze the page it detects that a multixact that should have gotten frozen, wasn't. Dan proposed a (correct) patch that simply sets a variable to its correct value, after a bogus initialization. But, per discussion, it seems better coding to avoid the bogus initializations altogether, since they could give rise to more bugs later. Therefore this fix rewrites the logic a little bit to avoid depending on the bogus initializations. This bug was part of a family introduced in 9.6 by commit a892234f830e; later, commit 38e9f90a227d fixed most of them, but this one was unnoticed. Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@amazon.com -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El número de instalaciones de UNIX se ha elevado a 10, y se espera que este número aumente" (UPM, 1972)
Hi Alvaro, glad to hear from you! This database is relatevely young, it was initdb'ed about a year ago or so and it was initially at 10.x. I don't know theexact minor version but the major version was 10 for sure. The problematic row is actually visible: SELECT COUNT(*) FROM pg_proc WHERE ctid = '(75,19)' count ------- 1 (1 row) And it refers to a known custom stored function. Also I'm not sure if I can really skip freezing at this point because this table is already over autovacuum_freeze_max_age(correct me if I'm wrong): SELECT age(c.relfrozenxid), s.setting autovacuum_freeze_max_age FROM pg_class c, pg_settings s WHERE c.relname = 'pg_proc' AND s.name = 'autovacuum_freeze_max_age'; age | autovacuum_freeze_max_age -----------+--------------------------- 213791108 | 200000000 (1 row) This is a production database which I can not restart at any point unfortunately, so I can't change the global settings atm. We can sacrifice that function and recreate it, but unfortunately when I execute DROP it fails with the error: ERROR: could not find tuple for rule 16396 I guess it's because of the broken pages at pg_depend. Do you think it is safe to execute DELETE on the corresponding records in pr_proc and pro_depend? Would it update the brokenxmax? I feel like I'm dealing hear with some sort of data curruption. I recall that few months ago we already had issues with the same function and view based on it. At that point for some reasonit just stopped working. So we had to DROP and RECREATE it. It fixed the issue, but apparently the data inconsistency stayed on disk and now we've got to the situation where AUTOVACUUMstopped working and it makes me really nervous. > On 15 Jul 2021, at 15:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > One thing I forgot is that these XIDs are fairly old, perhaps dating > back to when this database was freshly initdb'd if there has been no XID > wraparound. In that case you were probably running a version much older > than 10.14 when they were written. Do you happen to know when did you > initdb this, with what version, when did you upgrade this to 10.14? > That may help search the commit log for bugfixes that might explain the > bug. I just remembered this one as my favorite candidate: > > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > Branch: master Release: REL_11_BR [d2599ecfc] 2018-05-04 18:24:45 -0300 > Branch: REL_10_STABLE Release: REL_10_4 [e1d634758] 2018-05-04 18:23:58 -0300 > Branch: REL9_6_STABLE Release: REL9_6_9 [3a11485a5] 2018-05-04 18:23:30 -0300 > > Don't mark pages all-visible spuriously > > Dan Wood diagnosed a long-standing problem that pages containing tuples > that are locked by multixacts containing live lockers may spuriously end > up as candidates for getting their all-visible flag set. This has the > long-term effect that multixacts remain unfrozen; this may previously > pass undetected, but since commit XYZ it would be reported as > "ERROR: found multixact 134100944 from before relminmxid 192042633" > because when a later vacuum tries to freeze the page it detects that a > multixact that should have gotten frozen, wasn't. > > Dan proposed a (correct) patch that simply sets a variable to its > correct value, after a bogus initialization. But, per discussion, it > seems better coding to avoid the bogus initializations altogether, since > they could give rise to more bugs later. Therefore this fix rewrites > the logic a little bit to avoid depending on the bogus initializations. > > This bug was part of a family introduced in 9.6 by commit a892234f830e; > later, commit 38e9f90a227d fixed most of them, but this one was > unnoticed. > > Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera > Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera > Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@amazon.com > > > -- > Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ > "El número de instalaciones de UNIX se ha elevado a 10, > y se espera que este número aumente" (UPM, 1972)