Thread: Error on vacuum: xmin before relfrozenxid

Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:
Hi,

we have an error happening on a catalog table on one of the dbs in the instance. 
This is the error report form:

A description of what you are trying to achieve and what results you expect.:

Any kind of vacuum fails on pg_authid table, I would expect it to succeed. This is occasionaly blocking autovacuums, so we must resort to manual vacuum of busy tables, which is bearable but inconvenient.

PostgreSQL version number you are running:

PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit

How you installed PostgreSQL:

From the pgdg yum repositories.

Changes made to the settings in the postgresql.conf file:  see Server Configuration for a quick way to list them all.

             name             |                               current_setting                                |        source
------------------------------+------------------------------------------------------------------------------+----------------------
 application_name             | psql                                                                         | client
 archive_command              | rsync -a %p barman@x.x.x.x:/data/backup/barman/ubipgsql-10/incoming/%f | configuration file
 archive_mode                 | on                                                                           | configuration file
 checkpoint_completion_target | 0.9                                                                          | configuration file
 checkpoint_timeout           | 25min                                                                        | configuration file
 client_encoding              | UTF8                                                                         | client
 DateStyle                    | ISO, MDY                                                                     | configuration file
 default_text_search_config   | pg_catalog.english                                                           | configuration file
 dynamic_shared_memory_type   | posix                                                                        | configuration file
 effective_cache_size         | 64GB                                                                         | configuration file
 hot_standby                  | on                                                                           | configuration file
 hot_standby_feedback         | on                                                                           | configuration file
 lc_messages                  | en_US.UTF-8                                                                  | configuration file
 lc_monetary                  | en_US.UTF-8                                                                  | configuration file
 lc_numeric                   | en_US.UTF-8                                                                  | configuration file
 lc_time                      | en_US.UTF-8                                                                  | configuration file
 listen_addresses             | *                                                                            | configuration file
 log_autovacuum_min_duration  | 1s                                                                           | configuration file
 log_checkpoints              | on                                                                           | configuration file
 log_destination              | syslog                                                                       | configuration file
 log_directory                | log                                                                          | configuration file
 log_filename                 | postgresql-%a.log                                                            | configuration file
 log_line_prefix              | user=%u,db=%d,client=%h                                                      | configuration file
 log_lock_waits               | on                                                                           | configuration file
 log_min_duration_statement   | 1s                                                                           | configuration file
 log_min_messages             | info                                                                         | configuration file
 log_rotation_age             | 1d                                                                           | configuration file
 log_rotation_size            | 0                                                                            | configuration file
 log_statement                | ddl                                                                          | configuration file
 log_temp_files               | 1MB                                                                          | configuration file
 log_timezone                 | Europe/Rome                                                                  | configuration file
 log_truncate_on_rotation     | on                                                                           | configuration file
 logging_collector            | on                                                                           | configuration file
 maintenance_work_mem         | 1GB                                                                          | configuration file
 max_connections              | 1000                                                                         | configuration file
 max_stack_depth              | 2MB                                                                          | environment variable
 max_wal_size                 | 12GB                                                                         | configuration file
 min_wal_size                 | 80MB                                                                         | configuration file
 password_encryption          | scram-sha-256                                                                | configuration file
 pg_stat_statements.max       | 10000                                                                        | configuration file
 pg_stat_statements.track     | all                                                                          | configuration file
 shared_buffers               | 32GB                                                                         | configuration file
 shared_preload_libraries     | pg_stat_statements                                                           | configuration file
 syslog_facility              | local0                                                                       | configuration file
 syslog_ident                 | postgres                                                                     | configuration file
 TimeZone                     | Europe/Rome                                                                  | configuration file
 track_activity_query_size    | 2048                                                                         | configuration file
 track_functions              | all                                                                          | configuration file
 track_io_timing              | on                                                                           | configuration file
 wal_keep_segments            | 200                                                                          | configuration file
 wal_level                    | logical                                                                      | configuration file
(51 rows)



Operating system and version:

CentOS Linux release 7.4.1708 (Core)
Linux xx.xx.com 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

What program you're using to connect to PostgreSQL:

psql, jdbc driver. It's not an application side error.

Is there anything relevant or unusual in the PostgreSQL server logs?:

Yes, we see these kind of errors:

May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client= ERROR:  found xmin 2889675859 from before relfrozenxid 400011439
May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client= CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_authid"

When these errors pop up, autovacuum repeatedly fails on this very same table, and we have to resort to manual vacuums
 
For questions about any kind of error:
What you were doing when the error happened / how to cause the error:

The first time we saw the error, we found an idle transaction was left open for several days. The user was not monitored for long running transactions. We killed the session, deleted the offending row (the one with the xmin value reported) and then vacuumed the table.
However, teh same error pops up every time we try to add a new user. We would like to fix these error once and for all, so we can add new users.

The EXACT TEXT of the error message you're getting, if there is one: (Copy and paste the message to the email, do not send a screenshot)

This is the error: 

psql (10.4)
Type "help" for help.

postgres=# vacuum pg_authid;
ERROR:  found xmin 3031994631 from before relfrozenxid 400011439

Some datas:

postgres=# select xmin from pg_authid ;
    xmin
------------
          1
          1
          1
          1
          1
        557
    7216348
  110077819
  110511334
 3031994631
 3032044199
 3032044199
 3032044199
 3032070282
(14 rows)

postgres=# select relfrozenxid from pg_class where relname='pg_authid';
 relfrozenxid
--------------
    400011439
(1 row)

postgres=#

Is this a sympthom of data corruption or transaction wraparound due to the long running transaction that we killed weeks ago? This is the only table in the whole cluster that has this error. We are monitoring transactions wraparound with the the check_postgres.pl script, the check is still running fine and no alert was given at all since the cluster has been running.

Thank you in advance for any answer.

Paolo Crosato

Re: Error on vacuum: xmin before relfrozenxid

From
Andres Freund
Date:
Hi,

On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
> 
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
> 
> How you installed PostgreSQL:
> 
> From the pgdg yum repositories.
> 
> Changes made to the settings in the postgresql.conf file:  see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR:  found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_authid"
> 
> postgres=# select xmin from pg_authid ;
>     xmin
> ------------
>           1
>           1
>           1
>           1
>           1
>         557
>     7216348
>   110077819
>   110511334
>  3031994631
>  3032044199
>  3032044199
>  3032044199
>  3032070282
> (14 rows)
> 
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>     400011439
> (1 row)
> 
> postgres=#
> 
> Is this a sympthom of data corruption or transaction wraparound due to the
> long running transaction that we killed weeks ago?

That shouldn't be the sole cause.


> This is the only table in the whole cluster that has this error. We
> are monitoring transactions wraparound with the the check_postgres.pl
> script, the check is still running fine and no alert was given at all
> since the cluster has been running.
> 
> Thank you in advance for any answer.

Could you report the result of
select ctid, xmin, xmax from pg_authid ;
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));

Thanks.
Greetings,

Andres Freund


Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:
Hello,


2018-05-22 18:49 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,

On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
>
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
>
> How you installed PostgreSQL:
>
> From the pgdg yum repositories.
>
> Changes made to the settings in the postgresql.conf file:  see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR:  found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_authid"
>
> postgres=# select xmin from pg_authid ;
>     xmin
> ------------
>           1
>           1
>           1
>           1
>           1
>         557
>     7216348
>   110077819
>   110511334
>  3031994631
>  3032044199
>  3032044199
>  3032044199
>  3032070282
> (14 rows)
>
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>     400011439
> (1 row)
>
> postgres=#
>
> Is this a sympthom of data corruption or transaction wraparound due to the
> long running transaction that we killed weeks ago?

That shouldn't be the sole cause.


> This is the only table in the whole cluster that has this error. We
> are monitoring transactions wraparound with the the check_postgres.pl
> script, the check is still running fine and no alert was given at all
> since the cluster has been running.
>
> Thank you in advance for any answer.

Could you report the result of
select ctid, xmin, xmax from pg_authid ;

This is the result:

postgres=# select ctid, xmin, xmax from pg_authid ;
  ctid  |    xmin    | xmax
--------+------------+------
 (0,1)  |          1 |    0
 (0,2)  |          1 |    0
 (0,3)  |          1 |    0
 (0,4)  |          1 |    0
 (0,5)  |          1 |    0
 (0,6)  |        557 |    0
 (0,11) |    7216348 |    0
 (0,12) |  110077819 |    0
 (0,13) |  110511334 |    0
 (0,16) | 3031994631 |    0
 (0,17) | 3032044199 |    0
 (0,18) | 3032044199 |    0
 (0,19) | 3032044199 |    0
 (0,20) | 3032070282 |    0
(14 rows)

postgres=#
 
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));

This is the output of the second query:

postgres=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
 lp | lp_off | lp_flags | lp_len |   t_xmin   | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |      t_bits      |  t_oid  |                 t_data
----+--------+----------+--------+------------+--------+----------+--------+-------------+------------+--------+------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  1 |   8080 |        1 |    108 |          1 |      0 |        0 | (0,1)  |          11 |       2313 |     32 | 1111111110000000 |    3373 | \x70675f6d6f6e69746f720000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
  2 |   7968 |        1 |    108 |          1 |      0 |        0 | (0,2)  |          11 |       2313 |     32 | 1111111110000000 |    3374 | \x70675f726561645f616c6c5f73657474696e677300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
  3 |   7856 |        1 |    108 |          1 |      0 |        0 | (0,3)  |          11 |       2313 |     32 | 1111111110000000 |    3375 | \x70675f726561645f616c6c5f737461747300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
  4 |   7744 |        1 |    108 |          1 |      0 |        0 | (0,4)  |          11 |       2313 |     32 | 1111111110000000 |    3377 | \x70675f737461745f7363616e5f7461626c65730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
  5 |   7632 |        1 |    108 |          1 |      0 |        0 | (0,5)  |          11 |       2313 |     32 | 1111111110000000 |    4200 | \x70675f7369676e616c5f6261636b656e6400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
  6 |   7384 |        1 |    245 |        557 |      0 |        0 | (0,6)  |          11 |      11019 |     32 | 1111111111000000 |   24576 | \x7265706c6963610000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a305a6a336a39676d31314e473166455a6c75774a52413d3d2448394d446b2b445a4a723856544955702f684933664e4933376c777675696e49336731444a6672447863383d3a312f485442645265694c54306341724c47526f4d6d7739415a7351674c365a4c454a324e4e55306b5775413d
  7 |     16 |        2 |      0 |            |        |          |        |             |            |        |                  |         |
  8 |     17 |        2 |      0 |            |        |          |        |             |            |        |                  |         |
  9 |     18 |        2 |      0 |            |        |          |        |             |            |        |                  |         |
 10 |     19 |        2 |      0 |            |        |          |        |             |            |        |                  |         |
 11 |   7136 |        1 |    245 |    7216348 |      0 |        0 | (0,11) |          11 |      11019 |     32 | 1111111111000000 |      10 | \x706f73746772657300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101010101010100ffffffff24020000534352414d2d5348412d32353624343039363a72532f36303674706f523975465558797566392f6f413d3d2435412b4e745a444265333365397143556570422b324748664236567564717a65434b56427067576c2f41773d3a5a5863594e662f394f436744423345756f64514245564a526d39496361596d5a4b7856584e6a6d636554513d
 12 |   6888 |        1 |    245 |  110077819 |      0 |        0 | (0,12) |          11 |      11019 |     32 | 1111111111000000 |  432115 | \x6261726d616e000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7a716d737472687231416e376e39456f367a735677773d3d244d6351307473424e743466593169334d456e4c6a754866306e6c38774c44386f594236704630464e302b493d3a2b6336764a32372b68327832616d626c393554465168474867387270456a5a4a4952376d513044594e33553d
 13 |   6640 |        1 |    245 |  110511334 |      0 |        0 | (0,13) |          11 |      11019 |     32 | 1111111111000000 |  432127 | \x73747265616d696e675f6261726d616e0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a4e45496938767a714d7076752b594c585952377458413d3d245a39364b744d32316479687658716d3074616b7669717568794671574f45543647384174363954326e52413d3a5a4973674e71305537485344525168784f732b617364336d7559314d7365333373456956352b78694f52413d
 14 |     20 |        2 |      0 |            |        |          |        |             |            |        |                  |         |
 15 |      0 |        3 |      0 |            |        |          |        |             |            |        |                  |         |
 16 |   6496 |        1 |    144 | 3031994631 |      0 |        0 | (0,16) |       32779 |      10507 |     32 | 1111111111000000 |  675851 | \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
 17 |   6384 |        1 |    108 | 3032044199 |      0 |        0 | (0,17) |       32779 |      10505 |     32 | 1111111110000000 |  774033 | \x726561646163636573730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
 18 |   6136 |        1 |    245 | 3032044199 |      0 |        0 | (0,18) |       32779 |      10507 |     32 | 1111111111000000 |  774134 | \x74696572676173745f726f00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a43335379382b507849324d4e4f4a67306e6f467a5a773d3d24373575643339325154422b46672f364949396a32576e41326e544b62535636666572447662657142432b343d3a5074736777516278566a7336352f4833554e6b484e58664335793578422f552f6973636b426269746159633d
 19 |   5888 |        1 |    245 | 3032044199 |      0 |        0 | (0,19) |       32779 |      10507 |     32 | 1111111111000000 | 8608367 | \x666c6565745f6d6f62696c655f6170700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a5069564870613054594247703075746f6a47365061413d3d247951577151776e64574d317a4651787050644936505662516a4246534f5131424632374f7a6162616d36413d3a64504a5144572b4258454f632f526554566c4a7678654541714e4d365878314b77744c667942674745334d3d
 20 |   5640 |        1 |    245 | 3032070282 |      0 |        0 | (0,20) |       32779 |      10507 |     32 | 1111111111000000 | 1661808 | \x746965726761737400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7373427448544b574b68674232623265757634664e673d3d2453575a5253613578644a624951796649634244674875796e55684b436848676841475643767742504c50303d3a31656938464d354b30632b674a5a3233716e37314e46767978766d6978756a754c41666c4d445a676c63413d
(20 rows)
 

 

Thanks.
Greetings,

Andres Freund

Thank you very much!

Best Regards

Paolo Crosato 

Re: Error on vacuum: xmin before relfrozenxid

From
Andres Freund
Date:
Hi,

On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
> 
> This is the result:
> 
> postgres=# select ctid, xmin, xmax from pg_authid ;

>  (0,16) | 3031994631 |    0
>  16 |   6496 |        1 |    144 | 3031994631 |      0 |        0 | (0,16)
> |       32779 |      10507 |     32 | 1111111111000000 |  675851 |
>
\x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639

> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>     400011439

That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid.  Decoding infomask (10507 / 0x290B), if I did it right,
shows:

HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

so it's not frozen.  That suggests there's either a bug, or you had
corruption in your cluster.


Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues?  Was the cluster ever
used with fsync=off or full_page_writes=off?

Greetings,

Andres Freund


Re: Error on vacuum: xmin before relfrozenxid

From
Andres Freund
Date:
Hi,

On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
> 
> This is the result:
> 
> postgres=# select ctid, xmin, xmax from pg_authid ;

>  (0,16) | 3031994631 |    0
>  16 |   6496 |        1 |    144 | 3031994631 |      0 |        0 | (0,16)
> |       32779 |      10507 |     32 | 1111111111000000 |  675851 |
>
\x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639

> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>     400011439

That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid.  Decoding infomask (10507 / 0x290B), if I did it right,
shows:

HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

so it's not frozen.  That suggests there's either a bug, or you had
corruption in your cluster.


Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues?  Was the cluster ever
used with fsync=off or full_page_writes=off?

Greetings,

Andres Freund


Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:
Good evening,

2018-05-22 23:19 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,

On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;

>  (0,16) | 3031994631 |    0
>  16 |   6496 |        1 |    144 | 3031994631 |      0 |        0 | (0,16)
> |       32779 |      10507 |     32 | 1111111111000000 |  675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639

> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>     400011439

That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid.  Decoding infomask (10507 / 0x290B), if I did it right,
shows:

HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

so it's not frozen.  That suggests there's either a bug, or you had
corruption in your cluster.


Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues?  Was the cluster ever
used with fsync=off or full_page_writes=off?

Greetings,

Andres Freund

The cluster is made of a primary master instance, and a secondary slave in hot standby, with streaming replication. There is a barman server for the backups. 
The first version it ran on was 10.2, the cluster was promoted in production in mid february. It was never pg_upgraded, we upgraded it yesterday to 10.4 and restarted, in hope that this would resolve the issue. The restart was performed by the upgrading process. There was never any crash or hardware issue, the instance run without any problem. we never enabled fsync=off or full_page_writes=off. This is the only real issue so far. We have many nagios alerts monitoring the instance, the only problem is we are creating many temporary files, most of them are caused by a query that doesn't overflow work_mem when run alone with explain analyze buffers, which is weird. I should mention that we had a major creation of temp files in the first hours of the cluster history, however that was quickly resolved.

I managed to recover the log of the first time we run into the issue, the error was the same but on template1:

May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:  found xmin 2600758304 from before relfrozenxid 400011439
May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:  automatic vacuum of table "template1.pg_catalog.pg_authid"

Deleting the row worked, but, as you see, the problem with  400011439 was there already. This happened when we added a user, shortly afterwards these errors started popping up in the log. We found the issue because the autovacuums stopped working, they were blocked on the pg_authid table. The same time we found out the long running transaction, it was a multi session kill that someway got stuck. Killing the transaction and deleting the offending row was enough for the autovacuums to restart. We also had a materialized view that was created months before and nobody used. Since i found in the past there was a similar error related to materialized views, I dropped it. 

The instance is doing a lot of transactions every day, probably around 100 millions. I would need to do more checks to provide an accurate measure.

I still have the logs since the first time the error appeared, so I can provide further details if needed.

Thanks,
Best Regards

Paolo Crosato


Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:
Good evening,

2018-05-22 23:19 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,

On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;

>  (0,16) | 3031994631 |    0
>  16 |   6496 |        1 |    144 | 3031994631 |      0 |        0 | (0,16)
> |       32779 |      10507 |     32 | 1111111111000000 |  675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639

> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>     400011439

That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid.  Decoding infomask (10507 / 0x290B), if I did it right,
shows:

HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

so it's not frozen.  That suggests there's either a bug, or you had
corruption in your cluster.


Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues?  Was the cluster ever
used with fsync=off or full_page_writes=off?

Greetings,

Andres Freund

The cluster is made of a primary master instance, and a secondary slave in hot standby, with streaming replication. There is a barman server for the backups. 
The first version it ran on was 10.2, the cluster was promoted in production in mid february. It was never pg_upgraded, we upgraded it yesterday to 10.4 and restarted, in hope that this would resolve the issue. The restart was performed by the upgrading process. There was never any crash or hardware issue, the instance run without any problem. we never enabled fsync=off or full_page_writes=off. This is the only real issue so far. We have many nagios alerts monitoring the instance, the only problem is we are creating many temporary files, most of them are caused by a query that doesn't overflow work_mem when run alone with explain analyze buffers, which is weird. I should mention that we had a major creation of temp files in the first hours of the cluster history, however that was quickly resolved.

I managed to recover the log of the first time we run into the issue, the error was the same but on template1:

May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:  found xmin 2600758304 from before relfrozenxid 400011439
May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:  automatic vacuum of table "template1.pg_catalog.pg_authid"

Deleting the row worked, but, as you see, the problem with  400011439 was there already. This happened when we added a user, shortly afterwards these errors started popping up in the log. We found the issue because the autovacuums stopped working, they were blocked on the pg_authid table. The same time we found out the long running transaction, it was a multi session kill that someway got stuck. Killing the transaction and deleting the offending row was enough for the autovacuums to restart. We also had a materialized view that was created months before and nobody used. Since i found in the past there was a similar error related to materialized views, I dropped it. 

The instance is doing a lot of transactions every day, probably around 100 millions. I would need to do more checks to provide an accurate measure.

I still have the logs since the first time the error appeared, so I can provide further details if needed.

Thanks,
Best Regards

Paolo Crosato


Re: Error on vacuum: xmin before relfrozenxid

From
Andres Freund
Date:
Hi,

On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
> 
> May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"

pg_authid (along with a few other tables) is shared between
databases. So that's just hte same error.  At which rate are you
creating / updating database users / roles?

Greetings,

Andres Freund


Re: Error on vacuum: xmin before relfrozenxid

From
Andres Freund
Date:
Hi,

On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
> 
> May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"

pg_authid (along with a few other tables) is shared between
databases. So that's just hte same error.  At which rate are you
creating / updating database users / roles?

Greetings,

Andres Freund


Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:


2018-05-23 1:39 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,

On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
>
> May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"

pg_authid (along with a few other tables) is shared between
databases. So that's just hte same error.  At which rate are you
creating / updating database users / roles?

Once or twice a month. Yesterday I updated some rows on the table with a 

update pg_auth_id set rolname=rolname where rolname=...

for some users in the hope to work around the issue, but it didn't work. That's why there are recent xmin on some rows.
 
Best Regards,

Paolo Crosato

Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:


2018-05-23 1:39 GMT+02:00 Andres Freund <andres@anarazel.de>:
Hi,

On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
>
> May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"

pg_authid (along with a few other tables) is shared between
databases. So that's just hte same error.  At which rate are you
creating / updating database users / roles?

Once or twice a month. Yesterday I updated some rows on the table with a 

update pg_auth_id set rolname=rolname where rolname=...

for some users in the hope to work around the issue, but it didn't work. That's why there are recent xmin on some rows.
 
Best Regards,

Paolo Crosato

Re: Error on vacuum: xmin before relfrozenxid

From
Andres Freund
Date:
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> > 
> > May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
> 
> pg_authid (along with a few other tables) is shared between
> databases. So that's just hte same error.  At which rate are you
> creating / updating database users / roles?

Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
  tables)?
- Did you have any failovers?
- Do you use logical replication?

Greetings,

Andres Freund


Re: Error on vacuum: xmin before relfrozenxid

From
Andres Freund
Date:
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> > 
> > May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
> 
> pg_authid (along with a few other tables) is shared between
> databases. So that's just hte same error.  At which rate are you
> creating / updating database users / roles?

Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
  tables)?
- Did you have any failovers?
- Do you use logical replication?

Greetings,

Andres Freund


Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:


2018-05-23 20:32 GMT+02:00 Andres Freund <andres@anarazel.de>:
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just hte same error.  At which rate are you
> creating / updating database users / roles?

Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
  tables)?
- Did you have any failovers?
- Do you use logical replication?

1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.

Best Regards,

Paolo Crosato
 

Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:


2018-05-23 20:32 GMT+02:00 Andres Freund <andres@anarazel.de>:
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just hte same error.  At which rate are you
> creating / updating database users / roles?

Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
  tables)?
- Did you have any failovers?
- Do you use logical replication?

1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.

Best Regards,

Paolo Crosato
 

Re: Error on vacuum: xmin before relfrozenxid

From
Andrey Borodin
Date:
Hi!

> 24 мая 2018 г., в 0:55, Paolo Crosato <paolo.crosato@gmail.com> написал(а):
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never
run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the
wholecluster was wiped, including all the application and monitoring users. After the db was moved to production, a
coupleof users were added without any problem. 
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.

I've encountered seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR:  found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_database"

Table pg_database, probably, was not changed anyhow for a long period of database exploitation.
Unfortunately, I've found out this only there were million of xids left and had to vacuum freeze db in single user mode
asap.But, probably, I will be able to restore database from backups and inspect it, if necessary. Though first
occurrenceof this error was beyond recovery window. 

Best regards, Andrey Borodin.

Re: Error on vacuum: xmin before relfrozenxid

From
Andrey Borodin
Date:
Hi!

> 24 мая 2018 г., в 0:55, Paolo Crosato <paolo.crosato@gmail.com> написал(а):
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never
run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the
wholecluster was wiped, including all the application and monitoring users. After the db was moved to production, a
coupleof users were added without any problem. 
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.

I've encountered seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR:  found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_database"

Table pg_database, probably, was not changed anyhow for a long period of database exploitation.
Unfortunately, I've found out this only there were million of xids left and had to vacuum freeze db in single user mode
asap.But, probably, I will be able to restore database from backups and inspect it, if necessary. Though first
occurrenceof this error was beyond recovery window. 

Best regards, Andrey Borodin.

Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:


2018-05-24 8:30 GMT+02:00 Andrey Borodin <x4mmm@yandex-team.ru>:
Hi!

> 24 мая 2018 г., в 0:55, Paolo Crosato <paolo.crosato@gmail.com> написал(а):
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.

I've encountered seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR:  found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_database"

Table pg_database, probably, was not changed anyhow for a long period of database exploitation.
Unfortunately, I've found out this only there were million of xids left and had to vacuum freeze db in single user mode asap. But, probably, I will be able to restore database from backups and inspect it, if necessary. Though first occurrence of this error was beyond recovery window.

Best regards, Andrey Borodin.

I could build a mirror instance with barman and see if the issue is present as well, then try to vacuum freeze it in single mode, and see if it disappears; but I would like to know why it happened in the first time. I wonder if the autovacuum settings played a role, we kept the defaults, even if the instance has a very heavy update workload.

Best Regards,

Paolo Crosato 

Re: Error on vacuum: xmin before relfrozenxid

From
Paolo Crosato
Date:


2018-05-24 8:30 GMT+02:00 Andrey Borodin <x4mmm@yandex-team.ru>:
Hi!

> 24 мая 2018 г., в 0:55, Paolo Crosato <paolo.crosato@gmail.com> написал(а):
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.

I've encountered seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR:  found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_database"

Table pg_database, probably, was not changed anyhow for a long period of database exploitation.
Unfortunately, I've found out this only there were million of xids left and had to vacuum freeze db in single user mode asap. But, probably, I will be able to restore database from backups and inspect it, if necessary. Though first occurrence of this error was beyond recovery window.

Best regards, Andrey Borodin.

I could build a mirror instance with barman and see if the issue is present as well, then try to vacuum freeze it in single mode, and see if it disappears; but I would like to know why it happened in the first time. I wonder if the autovacuum settings played a role, we kept the defaults, even if the instance has a very heavy update workload.

Best Regards,

Paolo Crosato 

Re: Error on vacuum: xmin before relfrozenxid

From
"Wong, Yi Wen"
Date:
Hello,

We have also seen this issue ourselves and are actively working on it. We do not have a definite proposed solution or
fullRCA yet, but we have managed to narrow down the cause to stale values of relfrozenxid in the relcache for shared
relations.

We are in the process of coming up with a repro as well.

A proposed workaround to this issue is to delete global/pg_internal.init. That will allow a vacuum of the shared
relationto proceed. 

Thanks,

Yi Wen Wong



Re: Error on vacuum: xmin before relfrozenxid

From
"Wong, Yi Wen"
Date:
Hello,

We have also seen this issue ourselves and are actively working on it. We do not have a definite proposed solution or
fullRCA yet, but we have managed to narrow down the cause to stale values of relfrozenxid in the relcache for shared
relations.

We are in the process of coming up with a repro as well.

A proposed workaround to this issue is to delete global/pg_internal.init. That will allow a vacuum of the shared
relationto proceed. 

Thanks,

Yi Wen Wong