Thread: Error on vacuum: xmin before relfrozenxid
Hi,
we have an error happening on a catalog table on one of the dbs in the instance.
This is the error report form:
A description of what you are trying to achieve and what results you expect.:
Any kind of vacuum fails on the pg_authid table; I would expect it to succeed. This occasionally blocks autovacuum, so we must resort to manual vacuums of busy tables, which is bearable but inconvenient.
PostgreSQL version number you are running:
PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
How you installed PostgreSQL:
From the pgdg yum repositories.
Changes made to the settings in the postgresql.conf file: see Server Configuration for a quick way to list them all.
name | current_setting | source
------------------------------+------------------------------------------------------------------------------+----------------------
application_name | psql | client
archive_command | rsync -a %p barman@x.x.x.x:/data/backup/barman/ubipgsql-10/incoming/%f | configuration file
archive_mode | on | configuration file
checkpoint_completion_target | 0.9 | configuration file
checkpoint_timeout | 25min | configuration file
client_encoding | UTF8 | client
DateStyle | ISO, MDY | configuration file
default_text_search_config | pg_catalog.english | configuration file
dynamic_shared_memory_type | posix | configuration file
effective_cache_size | 64GB | configuration file
hot_standby | on | configuration file
hot_standby_feedback | on | configuration file
lc_messages | en_US.UTF-8 | configuration file
lc_monetary | en_US.UTF-8 | configuration file
lc_numeric | en_US.UTF-8 | configuration file
lc_time | en_US.UTF-8 | configuration file
listen_addresses | * | configuration file
log_autovacuum_min_duration | 1s | configuration file
log_checkpoints | on | configuration file
log_destination | syslog | configuration file
log_directory | log | configuration file
log_filename | postgresql-%a.log | configuration file
log_line_prefix | user=%u,db=%d,client=%h | configuration file
log_lock_waits | on | configuration file
log_min_duration_statement | 1s | configuration file
log_min_messages | info | configuration file
log_rotation_age | 1d | configuration file
log_rotation_size | 0 | configuration file
log_statement | ddl | configuration file
log_temp_files | 1MB | configuration file
log_timezone | Europe/Rome | configuration file
log_truncate_on_rotation | on | configuration file
logging_collector | on | configuration file
maintenance_work_mem | 1GB | configuration file
max_connections | 1000 | configuration file
max_stack_depth | 2MB | environment variable
max_wal_size | 12GB | configuration file
min_wal_size | 80MB | configuration file
password_encryption | scram-sha-256 | configuration file
pg_stat_statements.max | 10000 | configuration file
pg_stat_statements.track | all | configuration file
shared_buffers | 32GB | configuration file
shared_preload_libraries | pg_stat_statements | configuration file
syslog_facility | local0 | configuration file
syslog_ident | postgres | configuration file
TimeZone | Europe/Rome | configuration file
track_activity_query_size | 2048 | configuration file
track_functions | all | configuration file
track_io_timing | on | configuration file
wal_keep_segments | 200 | configuration file
wal_level | logical | configuration file
(51 rows)
Operating system and version:
CentOS Linux release 7.4.1708 (Core)
Linux xx.xx.com 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
What program you're using to connect to PostgreSQL:
psql and the JDBC driver. It's not an application-side error.
Is there anything relevant or unusual in the PostgreSQL server logs?:
Yes, we see this kind of error:
May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client= ERROR: found xmin 2889675859 from before relfrozenxid 400011439
May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client= CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
When these errors pop up, autovacuum repeatedly fails on this very same table, and we have to resort to manual vacuums.
For questions about any kind of error:
What you were doing when the error happened / how to cause the error:
The first time we saw the error, we found an idle transaction that had been left open for several days. The user was not monitored for long-running transactions. We killed the session, deleted the offending row (the one with the reported xmin value), and then vacuumed the table.
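For anyone hitting the same situation, a minimal sketch of how such an idle session can be found and terminated (this queries pg_stat_activity on PostgreSQL 10; the one-hour threshold and the pid 12345 are only placeholders):
-- sessions that have been idle in transaction for a long time
SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '1 hour'
ORDER BY xact_start;
-- terminate the offending backend (replace 12345 with the pid found above)
SELECT pg_terminate_backend(12345);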
However, the same error pops up every time we try to add a new user. We would like to fix this error once and for all, so we can add new users.
The EXACT TEXT of the error message you're getting, if there is one: (Copy and paste the message to the email, do not send a screenshot)
This is the error:
psql (10.4)
Type "help" for help.
postgres=# vacuum pg_authid;
ERROR: found xmin 3031994631 from before relfrozenxid 400011439
Some data:
postgres=# select xmin from pg_authid ;
xmin
------------
1
1
1
1
1
557
7216348
110077819
110511334
3031994631
3032044199
3032044199
3032044199
3032070282
(14 rows)
postgres=# select relfrozenxid from pg_class where relname='pg_authid';
relfrozenxid
--------------
400011439
(1 row)
postgres=#
Is this a symptom of data corruption, or of transaction wraparound due to the long-running transaction that we killed weeks ago? This is the only table in the whole cluster that has this error. We are monitoring transaction wraparound with the check_postgres.pl script; the check is still running fine and no alert has ever been raised since the cluster has been in operation.
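For reference, a minimal sketch of the ad-hoc queries that show the freeze ages involved here, using only the standard catalogs:
-- per-database age of the oldest unfrozen xid; shared catalogs such as pg_authid
-- count against every database
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
-- the same figure for the problem table itself
SELECT relname, relfrozenxid, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relname = 'pg_authid';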
Thank you in advance for any answer.
Paolo Crosato
Hi,
On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
>
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
>
> How you installed PostgreSQL:
>
> From the pgdg yum repositories.
>
> Changes made to the settings in the postgresql.conf file: see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR: found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
>
> postgres=# select xmin from pg_authid ;
> xmin
> ------------
> 1
> 1
> 1
> 1
> 1
> 557
> 7216348
> 110077819
> 110511334
> 3031994631
> 3032044199
> 3032044199
> 3032044199
> 3032070282
> (14 rows)
>
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
> (1 row)
>
> postgres=#
>
> Is this a sympthom of data corruption or transaction wraparound due to the
> long running transaction that we killed weeks ago?
That shouldn't be the sole cause.
> This is the only table in the whole cluster that has this error. We
> are monitoring transactions wraparound with the the check_postgres.pl
> script, the check is still running fine and no alert was given at all
> since the cluster has been running.
>
> Thank you in advance for any answer.
Could you report the result of
select ctid, xmin, xmax from pg_authid ;
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
Thanks.
Greetings,
Andres Freund
Hello,
2018-05-22 18:49 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,
On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
>
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
>
> How you installed PostgreSQL:
>
> From the pgdg yum repositories.
>
> Changes made to the settings in the postgresql.conf file: see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR: found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
> postgres=# select xmin from pg_authid ;
> xmin
> ------------
> 1
> 1
> 1
> 1
> 1
> 557
> 7216348
> 110077819
> 110511334
> 3031994631
> 3032044199
> 3032044199
> 3032044199
> 3032070282
> (14 rows)
>
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
> (1 row)
>
> postgres=#
>
> Is this a sympthom of data corruption or transaction wraparound due to the
> long running transaction that we killed weeks ago?
That shouldn't be the sole cause.
> This is the only table in the whole cluster that has this error. We
> are monitoring transactions wraparound with the the check_postgres.pl
> script, the check is still running fine and no alert was given at all
> since the cluster has been running.
>
> Thank you in advance for any answer.
Could you report the result of
select ctid, xmin, xmax from pg_authid ;
This is the result:
postgres=# select ctid, xmin, xmax from pg_authid ;
ctid | xmin | xmax
--------+------------+------
(0,1) | 1 | 0
(0,2) | 1 | 0
(0,3) | 1 | 0
(0,4) | 1 | 0
(0,5) | 1 | 0
(0,6) | 557 | 0
(0,11) | 7216348 | 0
(0,12) | 110077819 | 0
(0,13) | 110511334 | 0
(0,16) | 3031994631 | 0
(0,17) | 3032044199 | 0
(0,18) | 3032044199 | 0
(0,19) | 3032044199 | 0
(0,20) | 3032070282 | 0
(14 rows)
postgres=#
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
This is the output of the second query:
postgres=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+------------+--------+----------+--------+-------------+------------+--------+------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | 8080 | 1 | 108 | 1 | 0 | 0 | (0,1) | 11 | 2313 | 32 | 1111111110000000 | 3373 | \x70675f6d6f6e69746f720000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
2 | 7968 | 1 | 108 | 1 | 0 | 0 | (0,2) | 11 | 2313 | 32 | 1111111110000000 | 3374 | \x70675f726561645f616c6c5f73657474696e677300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
3 | 7856 | 1 | 108 | 1 | 0 | 0 | (0,3) | 11 | 2313 | 32 | 1111111110000000 | 3375 | \x70675f726561645f616c6c5f737461747300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
4 | 7744 | 1 | 108 | 1 | 0 | 0 | (0,4) | 11 | 2313 | 32 | 1111111110000000 | 3377 | \x70675f737461745f7363616e5f7461626c65730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
5 | 7632 | 1 | 108 | 1 | 0 | 0 | (0,5) | 11 | 2313 | 32 | 1111111110000000 | 4200 | \x70675f7369676e616c5f6261636b656e6400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
6 | 7384 | 1 | 245 | 557 | 0 | 0 | (0,6) | 11 | 11019 | 32 | 1111111111000000 | 24576 | \x7265706c6963610000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a305a6a336a39676d31314e473166455a6c75774a52413d3d2448394d446b2b445a4a723856544955702f684933664e4933376c777675696e49336731444a6672447863383d3a312f485442645265694c54306341724c47526f4d6d7739415a7351674c365a4c454a324e4e55306b5775413d
7 | 16 | 2 | 0 | | | | | | | | | |
8 | 17 | 2 | 0 | | | | | | | | | |
9 | 18 | 2 | 0 | | | | | | | | | |
10 | 19 | 2 | 0 | | | | | | | | | |
11 | 7136 | 1 | 245 | 7216348 | 0 | 0 | (0,11) | 11 | 11019 | 32 | 1111111111000000 | 10 | \x706f73746772657300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101010101010100ffffffff24020000534352414d2d5348412d32353624343039363a72532f36303674706f523975465558797566392f6f413d3d2435412b4e745a444265333365397143556570422b324748664236567564717a65434b56427067576c2f41773d3a5a5863594e662f394f436744423345756f64514245564a526d39496361596d5a4b7856584e6a6d636554513d
12 | 6888 | 1 | 245 | 110077819 | 0 | 0 | (0,12) | 11 | 11019 | 32 | 1111111111000000 | 432115 | \x6261726d616e000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7a716d737472687231416e376e39456f367a735677773d3d244d6351307473424e743466593169334d456e4c6a754866306e6c38774c44386f594236704630464e302b493d3a2b6336764a32372b68327832616d626c393554465168474867387270456a5a4a4952376d513044594e33553d
13 | 6640 | 1 | 245 | 110511334 | 0 | 0 | (0,13) | 11 | 11019 | 32 | 1111111111000000 | 432127 | \x73747265616d696e675f6261726d616e0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a4e45496938767a714d7076752b594c585952377458413d3d245a39364b744d32316479687658716d3074616b7669717568794671574f45543647384174363954326e52413d3a5a4973674e71305537485344525168784f732b617364336d7559314d7365333373456956352b78694f52413d
14 | 20 | 2 | 0 | | | | | | | | | |
15 | 0 | 3 | 0 | | | | | | | | | |
16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16) | 32779 | 10507 | 32 | 1111111111000000 | 675851 | \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
17 | 6384 | 1 | 108 | 3032044199 | 0 | 0 | (0,17) | 32779 | 10505 | 32 | 1111111110000000 | 774033 | \x726561646163636573730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
18 | 6136 | 1 | 245 | 3032044199 | 0 | 0 | (0,18) | 32779 | 10507 | 32 | 1111111111000000 | 774134 | \x74696572676173745f726f00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a43335379382b507849324d4e4f4a67306e6f467a5a773d3d24373575643339325154422b46672f364949396a32576e41326e544b62535636666572447662657142432b343d3a5074736777516278566a7336352f4833554e6b484e58664335793578422f552f6973636b426269746159633d
19 | 5888 | 1 | 245 | 3032044199 | 0 | 0 | (0,19) | 32779 | 10507 | 32 | 1111111111000000 | 8608367 | \x666c6565745f6d6f62696c655f6170700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a5069564870613054594247703075746f6a47365061413d3d247951577151776e64574d317a4651787050644936505662516a4246534f5131424632374f7a6162616d36413d3a64504a5144572b4258454f632f526554566c4a7678654541714e4d365878314b77744c667942674745334d3d
20 | 5640 | 1 | 245 | 3032070282 | 0 | 0 | (0,20) | 32779 | 10507 | 32 | 1111111111000000 | 1661808 | \x746965726761737400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7373427448544b574b68674232623265757634664e673d3d2453575a5253613578644a624951796649634244674875796e55684b436848676841475643767742504c50303d3a31656938464d354b30632b674a5a3233716e37314e46767978766d6978756a754c41666c4d445a676c63413d
(20 rows)
Thanks.
Greetings,
Andres Freund
Thank you very much!
Best Regards
Paolo Crosato
Hi,
On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;
> (0,16) | 3031994631 | 0
> 16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16)
> | 32779 | 10507 | 32 | 1111111111000000 | 675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
That tuple looks, to me, like it indeed shouldn't exist, given the relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right, shows:
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED
so it's not frozen. That suggests there's either a bug, or you had corruption in your cluster.
Could you give a bit of "history" about that postgres instance? What version of postgres was it run on earliest? Was it ever pg_upgraded? Were there any OS crashes? Other hardware issues? Was the cluster ever used with fsync=off or full_page_writes=off?
Greetings,
Andres Freund
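For readers following along, the same decoding can be reproduced in SQL from the heap_page_items() output; the bit values below are assumed from src/include/access/htup_details.h in PostgreSQL 10, and a frozen tuple would show both XMIN_COMMITTED and XMIN_INVALID set:
-- decode the relevant t_infomask bits for line pointer 16
-- (0x0001 HASNULL, 0x0002 HASVARWIDTH, 0x0008 HASOID, 0x0100 XMIN_COMMITTED,
--  0x0200 XMIN_INVALID, 0x0800 XMAX_INVALID, 0x2000 UPDATED)
SELECT lp, t_infomask,
       (t_infomask & 1)    <> 0 AS hasnull,
       (t_infomask & 2)    <> 0 AS hasvarwidth,
       (t_infomask & 8)    <> 0 AS hasoid,
       (t_infomask & 256)  <> 0 AS xmin_committed,
       (t_infomask & 512)  <> 0 AS xmin_invalid,
       (t_infomask & 2048) <> 0 AS xmax_invalid,
       (t_infomask & 8192) <> 0 AS updated,
       (t_infomask & 768)  = 768 AS xmin_frozen  -- committed + invalid together mean frozen
FROM heap_page_items(get_raw_page('pg_authid', 0))
WHERE lp = 16;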
Good evening,
2018-05-22 23:19 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,
On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;
> (0,16) | 3031994631 | 0
> 16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16)
> | 32779 | 10507 | 32 | 1111111111000000 | 675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right,
shows:
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED
so it's not frozen. That suggests there's either a bug, or you had
corruption in your cluster.
Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues? Was the cluster ever
used with fsync=off or full_page_writes=off?
Greetings,
Andres Freund
The cluster is made up of a primary instance and a secondary in hot standby, with streaming replication. There is a barman server for the backups.
The first version it ran on was 10.2; the cluster was promoted to production in mid-February. It was never pg_upgraded; we upgraded it to 10.4 yesterday and restarted, in the hope that this would resolve the issue. The restart was performed by the upgrade process. There was never any crash or hardware issue; the instance has run without any problem. We never enabled fsync=off or full_page_writes=off. This is the only real issue so far. We have many Nagios alerts monitoring the instance; the only other problem is that we are creating many temporary files, most of them caused by a query that doesn't overflow work_mem when run alone with EXPLAIN (ANALYZE, BUFFERS), which is weird. I should mention that we had a burst of temp file creation in the first hours of the cluster's history, but that was quickly resolved.
I managed to recover the log from the first time we ran into the issue; the error was the same but on template1:
May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR: found xmin 2600758304 from before relfrozenxid 400011439
May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT: automatic vacuum of table "template1.pg_catalog.pg_authid"
Deleting the row worked but, as you can see, the problem with 400011439 was already there. This happened when we added a user; shortly afterwards these errors started popping up in the log. We found the issue because autovacuum stopped working, blocked on the pg_authid table. At the same time we found the long-running transaction: it was a multi-session kill that somehow got stuck. Killing the transaction and deleting the offending row was enough for the autovacuums to restart. We also had a materialized view that had been created months before and that nobody used; since I found that in the past there was a similar error related to materialized views, I dropped it.
The instance handles a lot of transactions every day, probably around 100 million. I would need to do more checks to provide an accurate figure.
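A rough way to measure that without extra tooling is to sample txid_current() twice and take the difference (a simple sketch; note that calling txid_current() itself consumes an xid):
-- run once and note the value
SELECT txid_current();
-- run again after a known interval, e.g. one hour;
-- (second - first) * 24 then approximates xids consumed per day
SELECT txid_current();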
I still have the logs since the first time the error appeared, so I can provide further details if needed.
Thanks,
Best Regards
Paolo Crosato
Hi,
On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
>
> May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"
pg_authid (along with a few other tables) is shared between databases. So that's just the same error. At which rate are you creating / updating database users / roles?
Greetings,
Andres Freund
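As a quick check of which catalogs are shared in this sense, pg_class.relisshared can be queried directly (a trivial sketch):
-- shared relations live in the global/ directory and are common to all databases
SELECT relname
FROM pg_class
WHERE relisshared AND relkind = 'r'
ORDER BY relname;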
2018-05-23 1:39 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,
On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
>
> May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"
pg_authid (along with a few other tables) is shared between
databases. So that's just hte same error. At which rate are you
creating / updating database users / roles?
Once or twice a month. Yesterday I updated some rows on the table with a
update pg_authid set rolname=rolname where rolname=...
for some users, in the hope of working around the issue, but it didn't work. That's why some rows have recent xmin values.
Best Regards,
Paolo Crosato
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just the same error. At which rate are you
> creating / updating database users / roles?
Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all tables)?
- Did you have any failovers?
- Do you use logical replication?
Greetings,
Andres Freund
2018-05-23 20:32 GMT+02:00 Andres Freund <andres@anarazel.de>:
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just the same error. At which rate are you
> creating / updating database users / roles?
Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
tables)?
- Did you have any failovers?
- Do you use logical replication?
1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
2) Several failover tests were performed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
3) No; even though wal_level is set to logical in postgresql.conf, we only use streaming replication.
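For what it's worth, a quick way to confirm that only physical replication is in use is to look at the replication slots and publications; a minimal sketch (both views exist in PostgreSQL 10):
-- logical slots would show slot_type = 'logical' and a plugin name
SELECT slot_name, slot_type, plugin, active
FROM pg_replication_slots;
-- logical replication publications, if any
SELECT pubname FROM pg_publication;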
Best Regards,
Paolo Crosato
Hi!
> On 24 May 2018, at 0:55, Paolo Crosato <paolo.crosato@gmail.com> wrote:
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.
I've encountered a seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR: found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_database"
Table pg_database probably was not changed in any way for a long period of the database's operation.
Unfortunately, I found this out only when there were about a million xids left and had to VACUUM FREEZE the db in single-user mode asap. But, probably, I will be able to restore the database from backups and inspect it, if necessary. Though the first occurrence of this error was beyond the recovery window.
Best regards, Andrey Borodin.
2018-05-24 8:30 GMT+02:00 Andrey Borodin <x4mmm@yandex-team.ru>:
Hi!
> On 24 May 2018, at 0:55, Paolo Crosato <paolo.crosato@gmail.com> wrote:
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.
I've encountered seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR: found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_database"
Table pg_database probably was not changed in any way for a long period of the database's operation.
Unfortunately, I found this out only when there were about a million xids left and had to VACUUM FREEZE the db in single-user mode asap. But, probably, I will be able to restore the database from backups and inspect it, if necessary. Though the first occurrence of this error was beyond the recovery window.
Best regards, Andrey Borodin.
I could build a mirror instance with barman and see if the issue is present there as well, then try to VACUUM FREEZE it in single-user mode and see if it disappears; but I would like to know why it happened in the first place. I wonder if the autovacuum settings played a role: we kept the defaults, even though the instance has a very heavy update workload.
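If it comes to that, the recovery step itself is short; a sketch only, where /path/to/data is a placeholder and single-user mode is needed only if the server already refuses connections because of wraparound protection:
-- from a normal connection, freezing the shared catalog is just:
VACUUM FREEZE pg_authid;
-- if the instance is already shut down for wraparound protection, the same thing
-- has to be done from single-user mode, started from the shell as:
--   postgres --single -D /path/to/data postgres
--   backend> VACUUM FREEZE;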
Best Regards,
Paolo Crosato
Hello,
We have also seen this issue ourselves and are actively working on it. We do not have a definite proposed solution or full RCA yet, but we have managed to narrow down the cause to stale values of relfrozenxid in the relcache for shared relations. We are in the process of coming up with a repro as well.
A proposed workaround to this issue is to delete global/pg_internal.init. That will allow a vacuum of the shared relation to proceed.
Thanks,
Yi Wen Wong
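A minimal sketch of applying that workaround (the path is relative to the data directory; as far as I know, the relcache init file is rebuilt automatically by the next backend that starts):
-- from the shell, on the affected instance:
--   rm $PGDATA/global/pg_internal.init
-- then retry the vacuum that was failing:
VACUUM pg_authid;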