Re: VACUUM unable to accomplish because of a non-existent MultiXactId - Mailing list pgsql-admin

From Kouber Saparev
Subject Re: VACUUM unable to accomplish because of a non-existent MultiXactId
Date
Msg-id 99572764-62FE-4C23-8667-53B648C8DFD4@gmail.com
In response to Re: VACUUM unable to accomplish because of a non-existent MultiXactId  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: VACUUM unable to accomplish because of a non-existent MultiXactId  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-admin

> The state of your data is probably caused by some weird corner case of
> the upgrade.  Can you see in the log files that the toast table has been
> failing vacuum since the upgrade, or is it more recent than that?  (In
> other words, is there any working vacuum after the upgrade?)

We upgraded to 9.4.5 on 19 October, and there was a successful automatic vacuum of pg_toast_376621 just 3 days later, on 22 October:

Oct 22 08:16:49 db-master postgres[10589]: [3-1] []: LOG:  automatic vacuum of table "db.pg_toast.pg_toast_376621": index scans: 1
Oct 22 08:16:49 db-master postgres[10589]: [3-2]      pages: 0 removed, 784361 remain
Oct 22 08:16:49 db-master postgres[10589]: [3-3]      tuples: 110 removed, 3768496 remain, 0 are dead but not yet removable
Oct 22 08:16:49 db-master postgres[10589]: [3-4]      buffer usage: 37193 hits, 44891 misses, 32311 dirtied
Oct 22 08:16:49 db-master postgres[10589]: [3-5]      avg read rate: 0.954 MB/s, avg write rate: 0.686 MB/s
Oct 22 08:16:49 db-master postgres[10589]: [3-6]      system usage: CPU 1.10s/1.67u sec elapsed 367.73 sec
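
For completeness, the last successful autovacuum could probably also be read from the statistics view rather than from syslog; a minimal query, assuming the statistics collector is enabled:

-- last autovacuum timestamp and count for the toast table
SELECT schemaname, relname, last_autovacuum, autovacuum_count
FROM pg_stat_all_tables
WHERE schemaname = 'pg_toast'
  AND relname = 'pg_toast_376621';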

The next automatic vacuum came 8 days later, on 30 October; it failed, and it has been failing ever since:

Oct 30 14:22:01 db-master postgres[16160]: [3-1] []: ERROR:  MultiXactId 2915905228 does no longer exist -- apparent wraparound
Oct 30 14:22:01 db-master postgres[16160]: [3-2] []: CONTEXT:  automatic vacuum of table "db.pg_toast.pg_toast_376621"

So I guess something happened between 22 and 30 October, and it is not related to the pg_upgrade we did on 19 October.
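
If it would help, I could also compare the failing MultiXactId (2915905228) with the table's and the database's minimum MultiXactId; a rough sketch of what I have in mind (relminmxid/datminmxid exist since 9.3):

-- minimum MultiXactId still considered live for the toast table
SELECT relminmxid
FROM pg_class
WHERE oid = 'pg_toast.pg_toast_376621'::regclass;

-- and for the database as a whole
SELECT datname, datminmxid
FROM pg_database
WHERE datname = current_database();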

> To debug this, it would be useful if you attached gdb to a backend, set
> a breakpoint on errfinish, and then ran vacuum on that table.  Then you
> can extract the page number from the backtrace.  With the page number we
> can try pageinspect and heap_page_items until we find the culprit and
> perhaps identify how it got in that state.

I will try to obtain the page number and will then send you the results. Thank you.
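
Just to make sure I follow the procedure correctly, I am assuming something along these lines (the PID and block number are placeholders, and pageinspect needs to be installed by a superuser):

-- session 1: note the backend PID we will attach to
SELECT pg_backend_pid();

# shell: attach gdb to that backend and break on errfinish
gdb -p <pid>
(gdb) break errfinish
(gdb) continue

-- session 1: trigger the error
VACUUM pg_toast.pg_toast_376621;

# gdb: when the breakpoint fires, take the backtrace; the block number
# should show up in one of the vacuum frames (e.g. lazy_scan_heap's blkno)
(gdb) bt
(gdb) detach

-- any superuser session: inspect that page
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, t_xmin, t_xmax, t_infomask, t_infomask2, t_ctid
FROM heap_page_items(get_raw_page('pg_toast.pg_toast_376621', <blkno>));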

Could we somehow do this on one of our replicas (after detaching it), i.e. is the corrupted record propagated through the replication channel? That way we could fix the table on the master in the meantime.

Thanks!

Kouber Saparev
