datfrozenxid not dropping after vacuum - Mailing list pgsql-general

From Matthew Tice
Subject datfrozenxid not dropping after vacuum
Msg-id CA+taBv9cXJkHUxeDC_rvd9BwB70kTCqxN8tcwq=zft9XV_--cw@mail.gmail.com
Responses Re: datfrozenxid not dropping after vacuum  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Hi,

Starting this morning at 0830 local time I noticed that the age of my
datfrozenxid is moving past the `autovacuum_freeze_max_age` value
of 200000000.  When we've encountered this in the past, the fix has
been one of the following:

1. An error similar to
```
found xmin 2675436435 from before relfrozenxid 321165377
```
where the fix has been to move the `pg_internal.init` file out of
the way and let PostgreSQL recreate it.  Or:

2. A long-running transaction.  Typically I'll just find the `idle in
transaction` sessions with a `query_start` around the time my alarm
went off (it fires when the age of `datfrozenxid` breaches
`autovacuum_freeze_max_age`), using a query similar to
```
SELECT pid, query_start, datname, usename, state, backend_xmin,
age(backend_xmin)
FROM pg_stat_activity
WHERE state = 'idle in transaction';
```

3. An autovacuum process that seemed to be "stuck" on a particular
table, in which case we would kill the pid of that autovacuum worker
(a rough sketch of both checks follows the list).
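
For cases 2 and 3, the follow-up is along these lines (a sketch only; the
pid is hypothetical, and `pg_terminate_backend` works on autovacuum
workers as well as regular backends):
```
-- List autovacuum workers and how long they have been running.
SELECT pid, query_start, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%'
ORDER BY query_start;

-- Terminate a stuck worker (or an idle-in-transaction session found
-- with the earlier query); 12345 is a hypothetical pid.
SELECT pg_terminate_backend(12345);
```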

The problem is that this time none of these solutions has dropped the
`datfrozenxid` age back down, and there is one specific database in
this cluster that's holding onto it.

Using these queries from CrunchyData:

```
-- Show oldest current xid
WITH max_age AS (
    SELECT 2000000000 as max_old_xid
        , setting AS autovacuum_freeze_max_age
        FROM pg_catalog.pg_settings
        WHERE name = 'autovacuum_freeze_max_age' )
, per_database_stats AS (
    SELECT datname
        , m.max_old_xid::int
        , m.autovacuum_freeze_max_age::int
        , age(d.datfrozenxid) AS oldest_current_xid
    FROM pg_catalog.pg_database d
    JOIN max_age m ON (true)
    WHERE d.datallowconn )
SELECT max(oldest_current_xid) AS oldest_current_xid
    , max(ROUND(100*(oldest_current_xid/max_old_xid::float))) AS percent_towards_wraparound
    , max(ROUND(100*(oldest_current_xid/autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac
FROM per_database_stats;
```

 oldest_current_xid | percent_towards_wraparound | percent_towards_emergency_autovac
--------------------+----------------------------+-----------------------------------
          230935699 |                         12 |                               115

```
-- Top 8 individual databases
SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database
ORDER BY 2 DESC LIMIT 8;
```

       datname       |    age    | current_setting
---------------------+-----------+-----------------
 siteservice         | 230935699 | 200000000
 coupon-ws           | 217613246 | 200000000
 contextchangestore  | 211343280 | 200000000
 template0           | 210351639 | 200000000
 productmanager      | 207876167 | 200000000
 rhistory            | 207876167 | 200000000
 smsservice          | 207876167 | 200000000
 techservice         | 205432524 | 200000000

That top database, `siteservice`, is the "problematic" one: a manual
vacuum won't coerce it into advancing its `datfrozenxid`.
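
For reference, the kind of manual vacuum that isn't helping here would be
something like this (a sketch, assuming a database-wide freeze run while
connected to `siteservice`; the exact invocation may have differed):
```
-- Run inside siteservice; FREEZE makes VACUUM freeze tuples
-- aggressively so relfrozenxid (and hence datfrozenxid) can advance.
VACUUM (FREEZE, VERBOSE);
```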

Looking at the tables in that database:

```
SELECT c.oid::regclass
    , age(c.relfrozenxid)
    , pg_size_pretty(pg_total_relation_size(c.oid))
FROM pg_class c
JOIN pg_namespace n on c.relnamespace = n.oid
WHERE relkind IN ('r', 't', 'm')
AND n.nspname NOT IN ('pg_toast')
ORDER BY 2 DESC LIMIT 10;
```

        oid         |    age    | pg_size_pretty
--------------------+-----------+----------------
 pg_database        | 230935699 | 4264 kB
 pg_proc            |  93543215 | 976 kB
 pg_collation       |  93543215 | 560 kB
 pg_attribute       |  93543215 | 600 kB
 pg_shdepend        |  59515320 | 15 MB
 pg_statistic       |  53828900 | 464 kB
 pg_subscription    |  53172718 | 16 kB
 pg_pltemplate      |  53172718 | 56 kB
 pg_authid          |  53172718 | 8616 kB
 pg_db_role_setting |  53172718 | 64 kB
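
Since `pg_database` itself is the relation pinning that 230935699, a
targeted freeze of just that catalog seems like the obvious check (a
sketch; `pg_database` is a shared catalog, so this can be run from any
database):
```
-- Freeze only the catalog that is holding datfrozenxid back.
VACUUM (FREEZE, VERBOSE) pg_catalog.pg_database;

-- Then re-check its relfrozenxid age.
SELECT age(relfrozenxid) FROM pg_class WHERE relname = 'pg_database';
```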

I thought maybe it had to do with my replication slots somehow:

```
SELECT slot_name, slot_type, database, active, catalog_xmin,
       restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
```

        slot_name        | slot_type |        database        | active | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
-------------------------+-----------+------------------------+--------+--------------+---------------+---------------------
 dbs1db02                | physical  |                        | f      |              |               |
 dbs1db01                | physical  |                        | t      |              | 4D25/ACE6EE08 |
 dbs1db03                | physical  |                        | t      |              | 4D25/ACE6EE08 |
 dbs2db01                | physical  |                        | t      |              | 4D25/ACE6EE08 |
 debezium_cmanager       | logical   | campaign-manager       | t      |   2152258063 | 4D25/A421A6C8 | 4D25/ABC18C88
 debezium_rservice       | logical   | retail-content-service | t      |   2152238060 | 4D25/8EC403B0 | 4D25/A6105DF8
 debezium_partnerservice | logical   | partnerservice         | t      |   2152238060 | 4D25/8EC403B0 | 4D25/A5446630
 dbs1_dr                 | physical  |                        | f      |              |               |
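
Since logical slots hold back the xmin horizon through `catalog_xmin`, a
quick way to see how old those values are is something like this (a
sketch using the standard `age()` function):
```
SELECT slot_name, catalog_xmin, age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
WHERE catalog_xmin IS NOT NULL
ORDER BY 3 DESC;
```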

So, depending on the slot, I either restarted the physical standbys or
restarted the logical consumers (Debezium -> Kafka).

I'm also running:
```
vacuumdb --all --jobs=10 --echo --analyze
```
But this is still going (and honestly I'm not really hopeful it's going
to do the trick).
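
In the meantime the running vacuums can at least be watched through the
progress view (a sketch; `pg_stat_progress_vacuum` is available from 9.6
on, and `relid::regclass` only resolves to a name for tables in the
current database):
```
SELECT pid, datname, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
```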

My primary, read/write database is PostgreSQL 10.4 (CentOS 7), while my
standby databases have been patched to 10.17.

Any thoughts on this?

Thanks,
Matt


