Thread: Error on vacuum: xmin before relfrozenxid
Hi,
we have an error happening on a catalog table on one of the dbs in the instance.
This is the error report form:
A description of what you are trying to achieve and what results you expect.:
Any kind of vacuum fails on the pg_authid table; I would expect it to succeed. This occasionally blocks autovacuum, so we must resort to manual vacuums of busy tables, which is bearable but inconvenient.
PostgreSQL version number you are running:
PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
How you installed PostgreSQL:
From the pgdg yum repositories.
Changes made to the settings in the postgresql.conf file: see Server Configuration for a quick way to list them all.
name | current_setting | source
------------------------------+------------------------------------------------------------------------------+----------------------
application_name | psql | client
archive_command | rsync -a %p barman@x.x.x.x:/data/backup/barman/ubipgsql-10/incoming/%f | configuration file
archive_mode | on | configuration file
checkpoint_completion_target | 0.9 | configuration file
checkpoint_timeout | 25min | configuration file
client_encoding | UTF8 | client
DateStyle | ISO, MDY | configuration file
default_text_search_config | pg_catalog.english | configuration file
dynamic_shared_memory_type | posix | configuration file
effective_cache_size | 64GB | configuration file
hot_standby | on | configuration file
hot_standby_feedback | on | configuration file
lc_messages | en_US.UTF-8 | configuration file
lc_monetary | en_US.UTF-8 | configuration file
lc_numeric | en_US.UTF-8 | configuration file
lc_time | en_US.UTF-8 | configuration file
listen_addresses | * | configuration file
log_autovacuum_min_duration | 1s | configuration file
log_checkpoints | on | configuration file
log_destination | syslog | configuration file
log_directory | log | configuration file
log_filename | postgresql-%a.log | configuration file
log_line_prefix | user=%u,db=%d,client=%h | configuration file
log_lock_waits | on | configuration file
log_min_duration_statement | 1s | configuration file
log_min_messages | info | configuration file
log_rotation_age | 1d | configuration file
log_rotation_size | 0 | configuration file
log_statement | ddl | configuration file
log_temp_files | 1MB | configuration file
log_timezone | Europe/Rome | configuration file
log_truncate_on_rotation | on | configuration file
logging_collector | on | configuration file
maintenance_work_mem | 1GB | configuration file
max_connections | 1000 | configuration file
max_stack_depth | 2MB | environment variable
max_wal_size | 12GB | configuration file
min_wal_size | 80MB | configuration file
password_encryption | scram-sha-256 | configuration file
pg_stat_statements.max | 10000 | configuration file
pg_stat_statements.track | all | configuration file
shared_buffers | 32GB | configuration file
shared_preload_libraries | pg_stat_statements | configuration file
syslog_facility | local0 | configuration file
syslog_ident | postgres | configuration file
TimeZone | Europe/Rome | configuration file
track_activity_query_size | 2048 | configuration file
track_functions | all | configuration file
track_io_timing | on | configuration file
wal_keep_segments | 200 | configuration file
wal_level | logical | configuration file
(51 rows)
Operating system and version:
CentOS Linux release 7.4.1708 (Core)
Linux xx.xx.com 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
What program you're using to connect to PostgreSQL:
psql and the JDBC driver. It's not an application-side error.
Is there anything relevant or unusual in the PostgreSQL server logs?:
Yes, we see this kind of error:
May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client= ERROR: found xmin 2889675859 from before relfrozenxid 400011439
May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client= CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
When these errors pop up, autovacuum repeatedly fails on this very same table, and we have to resort to manual vacuums.
For questions about any kind of error:
What you were doing when the error happened / how to cause the error:
The first time we saw the error, we found an idle transaction that had been left open for several days. The user was not monitored for long-running transactions. We killed the session, deleted the offending row (the one with the reported xmin value), and then vacuumed the table.
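For anyone hitting the same situation, a minimal sketch of how such an idle session can be found and terminated (this queries pg_stat_activity on PostgreSQL 10; the one-hour threshold and the pid 12345 are only placeholders):
-- sessions that have been idle in transaction for a long time
SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '1 hour'
ORDER BY xact_start;
-- terminate the offending backend (replace 12345 with the pid found above)
SELECT pg_terminate_backend(12345);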
However, the same error pops up every time we try to add a new user. We would like to fix this error once and for all, so we can add new users.
The EXACT TEXT of the error message you're getting, if there is one: (Copy and paste the message to the email, do not send a screenshot)
This is the error:
psql (10.4)
Type "help" for help.
postgres=# vacuum pg_authid;
ERROR: found xmin 3031994631 from before relfrozenxid 400011439
Some data:
postgres=# select xmin from pg_authid ;
xmin
------------
1
1
1
1
1
557
7216348
110077819
110511334
3031994631
3032044199
3032044199
3032044199
3032070282
(14 rows)
postgres=# select relfrozenxid from pg_class where relname='pg_authid';
relfrozenxid
--------------
400011439
(1 row)
postgres=#
Is this a symptom of data corruption, or of transaction wraparound due to the long-running transaction that we killed weeks ago? This is the only table in the whole cluster that has this error. We are monitoring transaction wraparound with the check_postgres.pl script; the check is still running fine and no alert has ever been raised since the cluster has been in operation.
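For reference, a minimal sketch of the ad-hoc queries that show the freeze ages involved here, using only the standard catalogs:
-- per-database age of the oldest unfrozen xid; shared catalogs such as pg_authid
-- count against every database
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
-- the same figure for the problem table itself
SELECT relname, relfrozenxid, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relname = 'pg_authid';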
Thank you in advance for any answer.
Paolo Crosato
Hi,
On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
>
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
>
> How you installed PostgreSQL:
>
> From the pgdg yum repositories.
>
> Changes made to the settings in the postgresql.conf file: see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR: found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
>
> postgres=# select xmin from pg_authid ;
> xmin
> ------------
> 1
> 1
> 1
> 1
> 1
> 557
> 7216348
> 110077819
> 110511334
> 3031994631
> 3032044199
> 3032044199
> 3032044199
> 3032070282
> (14 rows)
>
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
> (1 row)
>
> postgres=#
>
> Is this a sympthom of data corruption or transaction wraparound due to the
> long running transaction that we killed weeks ago?
That shouldn't be the sole cause.
> This is the only table in the whole cluster that has this error. We
> are monitoring transactions wraparound with the the check_postgres.pl
> script, the check is still running fine and no alert was given at all
> since the cluster has been running.
>
> Thank you in advance for any answer.
Could you report the result of
select ctid, xmin, xmax from pg_authid ;
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
Thanks.
Greetings,
Andres Freund
Hello,
2018-05-22 18:49 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,
On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
>
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
>
> How you installed PostgreSQL:
>
> From the pgdg yum repositories.
>
> Changes made to the settings in the postgresql.conf file: see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR: found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
> postgres=# select xmin from pg_authid ;
> xmin
> ------------
> 1
> 1
> 1
> 1
> 1
> 557
> 7216348
> 110077819
> 110511334
> 3031994631
> 3032044199
> 3032044199
> 3032044199
> 3032070282
> (14 rows)
>
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
> (1 row)
>
> postgres=#
>
> Is this a sympthom of data corruption or transaction wraparound due to the
> long running transaction that we killed weeks ago?
That shouldn't be the sole cause.
> This is the only table in the whole cluster that has this error. We
> are monitoring transactions wraparound with the the check_postgres.pl
> script, the check is still running fine and no alert was given at all
> since the cluster has been running.
>
> Thank you in advance for any answer.
Could you report the result of
select ctid, xmin, xmax from pg_authid ;
This is the result:
postgres=# select ctid, xmin, xmax from pg_authid ;
ctid | xmin | xmax
--------+------------+------
(0,1) | 1 | 0
(0,2) | 1 | 0
(0,3) | 1 | 0
(0,4) | 1 | 0
(0,5) | 1 | 0
(0,6) | 557 | 0
(0,11) | 7216348 | 0
(0,12) | 110077819 | 0
(0,13) | 110511334 | 0
(0,16) | 3031994631 | 0
(0,17) | 3032044199 | 0
(0,18) | 3032044199 | 0
(0,19) | 3032044199 | 0
(0,20) | 3032070282 | 0
(14 rows)
postgres=#
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
This is the output of the second query:
postgres=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+------------+--------+----------+--------+-------------+------------+--------+------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | 8080 | 1 | 108 | 1 | 0 | 0 | (0,1) | 11 | 2313 | 32 | 1111111110000000 | 3373 | \x70675f6d6f6e69746f720000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
2 | 7968 | 1 | 108 | 1 | 0 | 0 | (0,2) | 11 | 2313 | 32 | 1111111110000000 | 3374 | \x70675f726561645f616c6c5f73657474696e677300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
3 | 7856 | 1 | 108 | 1 | 0 | 0 | (0,3) | 11 | 2313 | 32 | 1111111110000000 | 3375 | \x70675f726561645f616c6c5f737461747300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
4 | 7744 | 1 | 108 | 1 | 0 | 0 | (0,4) | 11 | 2313 | 32 | 1111111110000000 | 3377 | \x70675f737461745f7363616e5f7461626c65730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
5 | 7632 | 1 | 108 | 1 | 0 | 0 | (0,5) | 11 | 2313 | 32 | 1111111110000000 | 4200 | \x70675f7369676e616c5f6261636b656e6400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
6 | 7384 | 1 | 245 | 557 | 0 | 0 | (0,6) | 11 | 11019 | 32 | 1111111111000000 | 24576 | \x7265706c6963610000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a305a6a336a39676d31314e473166455a6c75774a52413d3d2448394d446b2b445a4a723856544955702f684933664e4933376c777675696e49336731444a6672447863383d3a312f485442645265694c54306341724c47526f4d6d7739415a7351674c365a4c454a324e4e55306b5775413d
7 | 16 | 2 | 0 | | | | | | | | | |
8 | 17 | 2 | 0 | | | | | | | | | |
9 | 18 | 2 | 0 | | | | | | | | | |
10 | 19 | 2 | 0 | | | | | | | | | |
11 | 7136 | 1 | 245 | 7216348 | 0 | 0 | (0,11) | 11 | 11019 | 32 | 1111111111000000 | 10 | \x706f73746772657300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101010101010100ffffffff24020000534352414d2d5348412d32353624343039363a72532f36303674706f523975465558797566392f6f413d3d2435412b4e745a444265333365397143556570422b324748664236567564717a65434b56427067576c2f41773d3a5a5863594e662f394f436744423345756f64514245564a526d39496361596d5a4b7856584e6a6d636554513d
12 | 6888 | 1 | 245 | 110077819 | 0 | 0 | (0,12) | 11 | 11019 | 32 | 1111111111000000 | 432115 | \x6261726d616e000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7a716d737472687231416e376e39456f367a735677773d3d244d6351307473424e743466593169334d456e4c6a754866306e6c38774c44386f594236704630464e302b493d3a2b6336764a32372b68327832616d626c393554465168474867387270456a5a4a4952376d513044594e33553d
13 | 6640 | 1 | 245 | 110511334 | 0 | 0 | (0,13) | 11 | 11019 | 32 | 1111111111000000 | 432127 | \x73747265616d696e675f6261726d616e0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a4e45496938767a714d7076752b594c585952377458413d3d245a39364b744d32316479687658716d3074616b7669717568794671574f45543647384174363954326e52413d3a5a4973674e71305537485344525168784f732b617364336d7559314d7365333373456956352b78694f52413d
14 | 20 | 2 | 0 | | | | | | | | | |
15 | 0 | 3 | 0 | | | | | | | | | |
16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16) | 32779 | 10507 | 32 | 1111111111000000 | 675851 | \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
17 | 6384 | 1 | 108 | 3032044199 | 0 | 0 | (0,17) | 32779 | 10505 | 32 | 1111111110000000 | 774033 | \x726561646163636573730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
18 | 6136 | 1 | 245 | 3032044199 | 0 | 0 | (0,18) | 32779 | 10507 | 32 | 1111111111000000 | 774134 | \x74696572676173745f726f00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a43335379382b507849324d4e4f4a67306e6f467a5a773d3d24373575643339325154422b46672f364949396a32576e41326e544b62535636666572447662657142432b343d3a5074736777516278566a7336352f4833554e6b484e58664335793578422f552f6973636b426269746159633d
19 | 5888 | 1 | 245 | 3032044199 | 0 | 0 | (0,19) | 32779 | 10507 | 32 | 1111111111000000 | 8608367 | \x666c6565745f6d6f62696c655f6170700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a5069564870613054594247703075746f6a47365061413d3d247951577151776e64574d317a4651787050644936505662516a4246534f5131424632374f7a6162616d36413d3a64504a5144572b4258454f632f526554566c4a7678654541714e4d365878314b77744c667942674745334d3d
20 | 5640 | 1 | 245 | 3032070282 | 0 | 0 | (0,20) | 32779 | 10507 | 32 | 1111111111000000 | 1661808 | \x746965726761737400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7373427448544b574b68674232623265757634664e673d3d2453575a5253613578644a624951796649634244674875796e55684b436848676841475643767742504c50303d3a31656938464d354b30632b674a5a3233716e37314e46767978766d6978756a754c41666c4d445a676c63413d
(20 rows)
Thanks.
Greetings,
Andres Freund
Thank you very much!
Best Regards
Paolo Crosato
Hi,
On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;
> (0,16) | 3031994631 | 0
> 16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16)
> | 32779 | 10507 | 32 | 1111111111000000 | 675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
That tuple looks, to me, like it indeed shouldn't exist, given the relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right, shows:
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED
so it's not frozen. That suggests there's either a bug, or you had corruption in your cluster.
Could you give a bit of "history" about that postgres instance? What version of postgres was it run on earliest? Was it ever pg_upgraded? Were there any OS crashes? Other hardware issues? Was the cluster ever used with fsync=off or full_page_writes=off?
Greetings,
Andres Freund
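For readers following along, the same decoding can be reproduced in SQL from the heap_page_items() output; the bit values below are assumed from src/include/access/htup_details.h in PostgreSQL 10, and a frozen tuple would show both XMIN_COMMITTED and XMIN_INVALID set:
-- decode the relevant t_infomask bits for line pointer 16
-- (0x0001 HASNULL, 0x0002 HASVARWIDTH, 0x0008 HASOID, 0x0100 XMIN_COMMITTED,
--  0x0200 XMIN_INVALID, 0x0800 XMAX_INVALID, 0x2000 UPDATED)
SELECT lp, t_infomask,
       (t_infomask & 1)    <> 0 AS hasnull,
       (t_infomask & 2)    <> 0 AS hasvarwidth,
       (t_infomask & 8)    <> 0 AS hasoid,
       (t_infomask & 256)  <> 0 AS xmin_committed,
       (t_infomask & 512)  <> 0 AS xmin_invalid,
       (t_infomask & 2048) <> 0 AS xmax_invalid,
       (t_infomask & 8192) <> 0 AS updated,
       (t_infomask & 768)  = 768 AS xmin_frozen  -- committed + invalid together mean frozen
FROM heap_page_items(get_raw_page('pg_authid', 0))
WHERE lp = 16;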
Good evening,
2018-05-22 23:19 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,
On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;
> (0,16) | 3031994631 | 0
> 16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16)
> | 32779 | 10507 | 32 | 1111111111000000 | 675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right,
shows:
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED
so it's not frozen. That suggests there's either a bug, or you had
corruption in your cluster.
Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues? Was the cluster ever
used with fsync=off or full_page_writes=off?
Greetings,
Andres Freund
The cluster is made up of a primary instance and a secondary in hot standby, with streaming replication. There is a barman server for the backups.
The first version it ran on was 10.2; the cluster was promoted to production in mid-February. It was never pg_upgraded; we upgraded it to 10.4 yesterday and restarted, in the hope that this would resolve the issue. The restart was performed by the upgrade process. There was never any crash or hardware issue; the instance has run without any problem. We never enabled fsync=off or full_page_writes=off. This is the only real issue so far. We have many Nagios alerts monitoring the instance; the only other problem is that we are creating many temporary files, most of them caused by a query that doesn't overflow work_mem when run alone with EXPLAIN (ANALYZE, BUFFERS), which is weird. I should mention that we had a burst of temp file creation in the first hours of the cluster's history, but that was quickly resolved.
I managed to recover the log from the first time we ran into the issue; the error was the same but on template1:
May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR: found xmin 2600758304 from before relfrozenxid 400011439
May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT: automatic vacuum of table "template1.pg_catalog.pg_authid"
Deleting the row worked but, as you can see, the problem with 400011439 was already there. This happened when we added a user; shortly afterwards these errors started popping up in the log. We found the issue because autovacuum stopped working, blocked on the pg_authid table. At the same time we found the long-running transaction: it was a multi-session kill that somehow got stuck. Killing the transaction and deleting the offending row was enough for the autovacuums to restart. We also had a materialized view that had been created months before and that nobody used; since I found that in the past there was a similar error related to materialized views, I dropped it.
The instance handles a lot of transactions every day, probably around 100 million. I would need to do more checks to provide an accurate figure.
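A rough way to measure that without extra tooling is to sample txid_current() twice and take the difference (a simple sketch; note that calling txid_current() itself consumes an xid):
-- run once and note the value
SELECT txid_current();
-- run again after a known interval, e.g. one hour;
-- (second - first) * 24 then approximates xids consumed per day
SELECT txid_current();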
I still have the logs since the first time the error appeared, so I can provide further details if needed.
Thanks,
Best Regards
Paolo Crosato
Hi,
On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
>
> May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"
pg_authid (along with a few other tables) is shared between databases. So that's just the same error. At which rate are you creating / updating database users / roles?
Greetings,
Andres Freund
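As a quick check of which catalogs are shared in this sense, pg_class.relisshared can be queried directly (a trivial sketch):
-- shared relations live in the global/ directory and are common to all databases
SELECT relname
FROM pg_class
WHERE relisshared AND relkind = 'r'
ORDER BY relname;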
2018-05-23 1:39 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,
On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
>
> May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"
pg_authid (along with a few other tables) is shared between
databases. So that's just hte same error. At which rate are you
creating / updating database users / roles?
Once or twice a month. Yesterday I updated some rows on the table with a
update pg_authid set rolname=rolname where rolname=...
for some users, in the hope of working around the issue, but it didn't work. That's why some rows have recent xmin values.
Best Regards,
Paolo Crosato
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just the same error. At which rate are you
> creating / updating database users / roles?
Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all tables)?
- Did you have any failovers?
- Do you use logical replication?
Greetings,
Andres Freund
2018-05-23 20:32 GMT+02:00 Andres Freund <andres@anarazel.de>:
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just the same error. At which rate are you
> creating / updating database users / roles?
Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
tables)?
- Did you have any failovers?
- Do you use logical replication?
1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
2) Several failover tests were performed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
3) No; even though wal_level is set to logical in postgresql.conf, we only use streaming replication.
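For what it's worth, a quick way to confirm that only physical replication is in use is to look at the replication slots and publications; a minimal sketch (both views exist in PostgreSQL 10):
-- logical slots would show slot_type = 'logical' and a plugin name
SELECT slot_name, slot_type, plugin, active
FROM pg_replication_slots;
-- logical replication publications, if any
SELECT pubname FROM pg_publication;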
Best Regards,
Paolo Crosato
Hi!
> On 24 May 2018, at 0:55, Paolo Crosato <paolo.crosato@gmail.com> wrote:
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.
I've encountered a seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR: found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_database"
Table pg_database probably was not changed in any way for a long period of the database's operation.
Unfortunately, I found this out only when there were about a million xids left and had to VACUUM FREEZE the db in single-user mode asap. But, probably, I will be able to restore the database from backups and inspect it, if necessary. Though the first occurrence of this error was beyond the recovery window.
Best regards, Andrey Borodin.
2018-05-24 8:30 GMT+02:00 Andrey Borodin <x4mmm@yandex-team.ru>:
Hi!
> On 24 May 2018, at 0:55, Paolo Crosato <paolo.crosato@gmail.com> wrote:
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.
I've encountered seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR: found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_database"
Table pg_database probably was not changed in any way for a long period of the database's operation.
Unfortunately, I found this out only when there were about a million xids left and had to VACUUM FREEZE the db in single-user mode asap. But, probably, I will be able to restore the database from backups and inspect it, if necessary. Though the first occurrence of this error was beyond the recovery window.
Best regards, Andrey Borodin.
I could build a mirror instance with barman and see if the issue is present there as well, then try to VACUUM FREEZE it in single-user mode and see if it disappears; but I would like to know why it happened in the first place. I wonder if the autovacuum settings played a role: we kept the defaults, even though the instance has a very heavy update workload.
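If it comes to that, the recovery step itself is short; a sketch only, where /path/to/data is a placeholder and single-user mode is needed only if the server already refuses connections because of wraparound protection:
-- from a normal connection, freezing the shared catalog is just:
VACUUM FREEZE pg_authid;
-- if the instance is already shut down for wraparound protection, the same thing
-- has to be done from single-user mode, started from the shell as:
--   postgres --single -D /path/to/data postgres
--   backend> VACUUM FREEZE;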
Best Regards,
Paolo Crosato
Hello,
We have also seen this issue ourselves and are actively working on it. We do not have a definite proposed solution or full RCA yet, but we have managed to narrow down the cause to stale values of relfrozenxid in the relcache for shared relations. We are in the process of coming up with a repro as well.
A proposed workaround to this issue is to delete global/pg_internal.init. That will allow a vacuum of the shared relation to proceed.
Thanks,
Yi Wen Wong
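A minimal sketch of applying that workaround (the path is relative to the data directory; as far as I know, the relcache init file is rebuilt automatically by the next backend that starts):
-- from the shell, on the affected instance:
--   rm $PGDATA/global/pg_internal.init
-- then retry the vacuum that was failing:
VACUUM pg_authid;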