Re: Error on vacuum: xmin before relfrozenxid - Mailing list pgsql-general

From Paolo Crosato
Subject Re: Error on vacuum: xmin before relfrozenxid
Date
Msg-id CAKMFJucLdKK+Js1+o+z-aJ8qRuWMSivmpzPK_8zQNLLpysRsAA@mail.gmail.com
Whole thread Raw
In response to Re: Error on vacuum: xmin before relfrozenxid  (Andres Freund <andres@anarazel.de>)
Responses Re: Error on vacuum: xmin before relfrozenxid
Re: Error on vacuum: xmin before relfrozenxid
List pgsql-general
Good evening,

2018-05-22 23:19 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,

On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;

>  (0,16) | 3031994631 |    0
>  16 |   6496 |        1 |    144 | 3031994631 |      0 |        0 | (0,16)
> |       32779 |      10507 |     32 | 1111111111000000 |  675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639

> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>     400011439

That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid.  Decoding infomask (10507 / 0x290B), if I did it right,
shows:

HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

so it's not frozen.  That suggests there's either a bug, or you had
corruption in your cluster.


Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues?  Was the cluster ever
used with fsync=off or full_page_writes=off?

Greetings,

Andres Freund

The cluster is made of a primary master instance, and a secondary slave in hot standby, with streaming replication. There is a barman server for the backups. 
The first version it ran on was 10.2, the cluster was promoted in production in mid february. It was never pg_upgraded, we upgraded it yesterday to 10.4 and restarted, in hope that this would resolve the issue. The restart was performed by the upgrading process. There was never any crash or hardware issue, the instance run without any problem. we never enabled fsync=off or full_page_writes=off. This is the only real issue so far. We have many nagios alerts monitoring the instance, the only problem is we are creating many temporary files, most of them are caused by a query that doesn't overflow work_mem when run alone with explain analyze buffers, which is weird. I should mention that we had a major creation of temp files in the first hours of the cluster history, however that was quickly resolved.

I managed to recover the log of the first time we run into the issue, the error was the same but on template1:

May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:  found xmin 2600758304 from before relfrozenxid 400011439
May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:  automatic vacuum of table "template1.pg_catalog.pg_authid"

Deleting the row worked, but, as you see, the problem with  400011439 was there already. This happened when we added a user, shortly afterwards these errors started popping up in the log. We found the issue because the autovacuums stopped working, they were blocked on the pg_authid table. The same time we found out the long running transaction, it was a multi session kill that someway got stuck. Killing the transaction and deleting the offending row was enough for the autovacuums to restart. We also had a materialized view that was created months before and nobody used. Since i found in the past there was a similar error related to materialized views, I dropped it. 

The instance is doing a lot of transactions every day, probably around 100 millions. I would need to do more checks to provide an accurate measure.

I still have the logs since the first time the error appeared, so I can provide further details if needed.

Thanks,
Best Regards

Paolo Crosato


pgsql-general by date:

Previous
From: Andres Freund
Date:
Subject: Re: Error on vacuum: xmin before relfrozenxid
Next
From: Thomas Munro
Date:
Subject: Re: pg_multixact/members growing