Thread: ERROR: cannot freeze committed xmax

ERROR: cannot freeze committed xmax

From
Sasha Aliashkevich
Date:
Hi,

Few weeks ago at one of the databases we started to observe the following error in Postgresql logs:

    ERROR: cannot freeze committed xmax 572

For the note it's Postgresql 10.14 running on RHEL:

                                                     version
    ----------------------------------------------------------------------------------------------------------
     PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
    (1 row)


Apparently it happens during the autovacuum and leads to the failing autovacuum freeze progress.
We have detected two dictionary tables where manual VACUUM FREEZE fails: pg_proc and pg_depend.
Below you can see the page_inspect results for the appropriate blocks as well as the status of xid.
Am I understand correctly that for some reason the transaction is marked as commited while on the page level it's
rolledback so the inconsistency is detected? 
Could you advise how to fix it and make autovacuum work again?


SELECT txid_status(572);

 txid_status
-------------
 committed
(1 row)


SELECT ctid, xmin, xmax FROM pg_proc WHERE xmax=572;

  ctid   | xmin | xmax
---------+------+------
 (75,19) |  571 |  572
(1 row)

SELECT ctid, xmin, xmax FROM pg_depend WHERE xmax=572;

  ctid   | xmin | xmax
---------+------+------
 (55,76) |  571 |  572
 (55,77) |  571 |  572
 (55,78) |  571 |  572
 (55,79) |  571 |  572
 (55,80) |  571 |  572
(5 rows)


SELECT lp,
       t_ctid AS ctid,
       t_xmin AS xmin,
       t_xmax AS xmax,
       (t_infomask & 128)::boolean AS xmax_is_lock,
       (t_infomask & 1024)::boolean AS xmax_committed,
       (t_infomask & 2048)::boolean AS xmax_rolled_back,
       (t_infomask & 4096)::boolean AS xmax_multixact
FROM heap_page_item_attrs(
        get_raw_page('pg_proc', 75),
        'pg_proc'
     );

 lp |  ctid   |   xmin    | xmax | xmax_is_lock | xmax_committed | xmax_rolled_back | xmax_multixact
----+---------+-----------+------+--------------+----------------+------------------+----------------
  1 | (75,1)  |       564 |    0 | f            | f              | t                | f
  2 | (75,2)  |       564 |    0 | f            | f              | t                | f
  3 | (75,3)  |       564 |    0 | f            | f              | t                | f
  4 | (75,4)  |       564 |    0 | f            | f              | t                | f
  5 | (75,5)  |       564 |    0 | f            | f              | t                | f
  6 | (75,6)  |       564 |    0 | f            | f              | t                | f
  7 | (75,7)  |       564 |    0 | f            | f              | t                | f
  8 | (75,8)  |       564 |    0 | f            | f              | t                | f
  9 | (75,9)  |       564 |    0 | f            | f              | t                | f
 10 | (75,10) |       566 |    0 | f            | f              | t                | f
 11 | (75,11) |       566 |    0 | f            | f              | t                | f
 12 |         |           |      |              |                |                  |
 13 |         |           |      |              |                |                  |
 14 | (75,14) | 212156121 |    0 | f            | f              | t                | f
 15 | (75,15) | 212156121 |    0 | f            | f              | t                | f
 16 | (75,16) | 212156121 |    0 | f            | f              | t                | f
 17 | (75,17) | 212156121 |    0 | f            | f              | t                | f
 18 | (75,18) | 212156121 |    0 | f            | f              | t                | f
 19 | (75,21) |       571 |  572 | f            | f              | t                | f
 20 | (75,20) | 212156121 |    0 | f            | f              | t                | f
 21 | (75,21) |       572 |    0 | f            | f              | t                | f
 22 | (75,22) |  97113057 |    0 | f            | f              | t                | f
 23 | (75,23) | 212156121 |    0 | f            | f              | t                | f
 24 | (75,24) | 212156121 |    0 | f            | f              | t                | f
 25 | (75,25) | 212156121 |    0 | f            | f              | t                | f
 26 | (75,26) | 212156121 |    0 | f            | f              | t                | f
 27 | (75,27) | 212156121 |    0 | f            | f              | t                | f
(27 rows)


SELECT lp,
       t_ctid AS ctid,
       t_xmin AS xmin,
       t_xmax AS xmax,
       (t_infomask & 128)::boolean AS xmax_is_lock,
       (t_infomask & 1024)::boolean AS xmax_committed,
       (t_infomask & 2048)::boolean AS xmax_rolled_back,
       (t_infomask & 4096)::boolean AS xmax_multixact
FROM heap_page_item_attrs(
        get_raw_page('pg_depend', 55),
        'pg_depend'
     );


 lp  |   ctid   |   xmin    | xmax | xmax_is_lock | xmax_committed | xmax_rolled_back | xmax_multixact
-----+----------+-----------+------+--------------+----------------+------------------+----------------
...
  75 |          |           |      |              |                |                  |
  76 | (55,76)  |       571 |  572 | f            | f              | t                | f
  77 | (55,77)  |       571 |  572 | f            | f              | t                | f
  78 | (55,78)  |       571 |  572 | f            | f              | t                | f
  79 | (55,79)  |       571 |  572 | f            | f              | t                | f
  80 | (55,80)  |       571 |  572 | f            | f              | t                | f
  81 |          |           |      |              |                |                  |

...
  85 |          |           |      |              |                |                  |
  86 | (55,86)  |       572 |    0 | f            | f              | t                | f
  87 |          |           |      |              |                |                  |
  88 | (55,88)  |       572 |    0 | f            | f              | t                | f
  89 |          |           |      |              |                |                  |
  90 |          |           |      |              |                |                  |
  91 | (55,91)  |       572 |    0 | f            | f              | t                | f
  92 | (55,92)  |       572 |    0 | f            | f              | t                | f
  93 | (55,93)  |       572 |    0 | f            | f              | t                | f
  94 | (55,94)  |       572 |    0 | f            | f              | t                | f
...
(169 rows)




Re: ERROR: cannot freeze committed xmax

From
Alvaro Herrera
Date:
Hi Sasha

On 2021-Jul-14, Sasha Aliashkevich wrote:

>  lp |  ctid   |   xmin    | xmax | xmax_is_lock | xmax_committed | xmax_rolled_back | xmax_multixact 
> ----+---------+-----------+------+--------------+----------------+------------------+----------------
>  19 | (75,21) |       571 |  572 | f            | f              | t                | f              
>  21 | (75,21) |       572 |    0 | f            | f              | t                | f              

Yeah, row 19 here is dead and should have been removed by vacuum (same
with the ones in pg_depend).  I don't know why it wasn't.  If you query
the tables
SELECT * FROM pg_proc WHERE ctid = '(75,19)'
are they visible?  They shouldn't.  If they are visible, maybe try to
UPDATE them, again by ctid; perhaps that will kill them for good.  But
since the txid_status() returns committed, that likely won't work.

Maybe if you try to VACUUM by setting vacuum_freeze_min_age to a high
value (so that it doesn't try to freeze, but only remove them) they are
removed?

(I didn't verify that the values you &-mask the infomask with are
correct.)

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: ERROR: cannot freeze committed xmax

From
Alvaro Herrera
Date:
One thing I forgot is that these XIDs are fairly old, perhaps dating
back to when this database was freshly initdb'd if there has been no XID
wraparound.  In that case you were probably running a version much older
than 10.14 when they were written.  Do you happen to know when did you
initdb this, with what version, when did you upgrade this to 10.14?
That may help search the commit log for bugfixes that might explain the
bug.  I just remembered this one as my favorite candidate:

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master Release: REL_11_BR [d2599ecfc] 2018-05-04 18:24:45 -0300
Branch: REL_10_STABLE Release: REL_10_4 [e1d634758] 2018-05-04 18:23:58 -0300
Branch: REL9_6_STABLE Release: REL9_6_9 [3a11485a5] 2018-05-04 18:23:30 -0300

    Don't mark pages all-visible spuriously
    
    Dan Wood diagnosed a long-standing problem that pages containing tuples
    that are locked by multixacts containing live lockers may spuriously end
    up as candidates for getting their all-visible flag set.  This has the
    long-term effect that multixacts remain unfrozen; this may previously
    pass undetected, but since commit XYZ it would be reported as
      "ERROR: found multixact 134100944 from before relminmxid 192042633"
    because when a later vacuum tries to freeze the page it detects that a
    multixact that should have gotten frozen, wasn't.
    
    Dan proposed a (correct) patch that simply sets a variable to its
    correct value, after a bogus initialization.  But, per discussion, it
    seems better coding to avoid the bogus initializations altogether, since
    they could give rise to more bugs later.  Therefore this fix rewrites
    the logic a little bit to avoid depending on the bogus initializations.
    
    This bug was part of a family introduced in 9.6 by commit a892234f830e;
    later, commit 38e9f90a227d fixed most of them, but this one was
    unnoticed.
    
    Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera
    Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera
    Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@amazon.com


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El número de instalaciones de UNIX se ha elevado a 10,
y se espera que este número aumente" (UPM, 1972)



Re: ERROR: cannot freeze committed xmax

From
Sasha Aliashkevich
Date:
Hi Alvaro, glad to hear from you!

This database is relatevely young, it was initdb'ed about a year ago or so and it was initially at 10.x. I don't know
theexact minor version but the major version was 10 for sure.  

The problematic row is actually visible:

    SELECT COUNT(*) FROM pg_proc WHERE ctid = '(75,19)'

     count
    -------
         1
    (1 row)

And it refers to a known custom stored function.

Also I'm not sure if I can really skip freezing at this point because this table is already over
autovacuum_freeze_max_age(correct me if I'm wrong): 

    SELECT age(c.relfrozenxid), s.setting autovacuum_freeze_max_age
    FROM pg_class c, pg_settings s
    WHERE c.relname = 'pg_proc' AND s.name = 'autovacuum_freeze_max_age';

        age    | autovacuum_freeze_max_age
    -----------+---------------------------
     213791108 | 200000000
    (1 row)

This is a production database which I can not restart at any point unfortunately, so I can't change the global settings
atm.

We can sacrifice that function and recreate it, but unfortunately when I execute DROP it fails with the error:

    ERROR:  could not find tuple for rule 16396

I guess it's because of the broken pages at pg_depend.
Do you think it is safe to execute DELETE on the corresponding records in pr_proc and pro_depend? Would it update the
brokenxmax? 

I feel like I'm dealing hear with some sort of data curruption.
I recall that few months ago we already had issues with the same function and view based on it. At that point for some
reasonit just stopped working. So we had to DROP and RECREATE it. 
It fixed the issue, but apparently the data inconsistency stayed on disk and now we've got to the situation where
AUTOVACUUMstopped working and it makes me really nervous. 



> On 15 Jul 2021, at 15:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> One thing I forgot is that these XIDs are fairly old, perhaps dating
> back to when this database was freshly initdb'd if there has been no XID
> wraparound.  In that case you were probably running a version much older
> than 10.14 when they were written.  Do you happen to know when did you
> initdb this, with what version, when did you upgrade this to 10.14?
> That may help search the commit log for bugfixes that might explain the
> bug.  I just remembered this one as my favorite candidate:
>
> Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Branch: master Release: REL_11_BR [d2599ecfc] 2018-05-04 18:24:45 -0300
> Branch: REL_10_STABLE Release: REL_10_4 [e1d634758] 2018-05-04 18:23:58 -0300
> Branch: REL9_6_STABLE Release: REL9_6_9 [3a11485a5] 2018-05-04 18:23:30 -0300
>
>    Don't mark pages all-visible spuriously
>
>    Dan Wood diagnosed a long-standing problem that pages containing tuples
>    that are locked by multixacts containing live lockers may spuriously end
>    up as candidates for getting their all-visible flag set.  This has the
>    long-term effect that multixacts remain unfrozen; this may previously
>    pass undetected, but since commit XYZ it would be reported as
>      "ERROR: found multixact 134100944 from before relminmxid 192042633"
>    because when a later vacuum tries to freeze the page it detects that a
>    multixact that should have gotten frozen, wasn't.
>
>    Dan proposed a (correct) patch that simply sets a variable to its
>    correct value, after a bogus initialization.  But, per discussion, it
>    seems better coding to avoid the bogus initializations altogether, since
>    they could give rise to more bugs later.  Therefore this fix rewrites
>    the logic a little bit to avoid depending on the bogus initializations.
>
>    This bug was part of a family introduced in 9.6 by commit a892234f830e;
>    later, commit 38e9f90a227d fixed most of them, but this one was
>    unnoticed.
>
>    Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera
>    Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera
>    Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@amazon.com
>
>
> --
> Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
> "El número de instalaciones de UNIX se ha elevado a 10,
> y se espera que este número aumente" (UPM, 1972)