Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation -DB Corruption? - Mailing list pgsql-admin

From Arjun Ranade
Subject Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation -DB Corruption?
Date
Msg-id CANrrCRy1_mu7A31Tas+-oKWcYK4kpNK8wTFW7VWVq+5yj2uApQ@mail.gmail.com
Whole thread Raw
In response to Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation -DB Corruption?  (Arjun Ranade <ranade@nodalexchange.com>)
Responses Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
List pgsql-admin
So we had a maintenance window scheduled today and I restarted the DB.  When it came back up, it reported the same toast error when vacuum'ing pg_statistic.  It also reported the "oldest xmin is far in the past" error.

I even brought the DB into single user mode and tried to manually vacuum pg_statistic but got the same error.  I also tried to TRUNCATE pg_statistic in single user mode which failed (see below).

At this point, I am not sure what to make of the issue.  The DB is still complaining about old XMIN even when there are no users connected to it after a fresh restart:

2017-11-01 21:45:26.838 EDT pid=14654 database= user= rhost= tid=11/39 sessionid=59fa7847.393e WARNING:  oldest xmin is far in the past
2017-11-01 21:45:26.838 EDT pid=14654 database= user= rhost= tid=11/39 sessionid=59fa7847.393e HINT:  Close open transactions soon to avoid wraparound problems.

If we cannot understand what is happening, we will likely restore from a backup, rebuild all of our standbys/logical replicas and start fresh.  It would be too costly for the db to shutdown during a workday due to the wraparound issue.

Any insight you can provide would be appreciated.

Thanks,
AR

Relevant log output from single user mode:

$ /usr/pgsql-9.4/bin/postmaster --single -D /var/lib/pgsql/9.4/data prod

PostgreSQL stand-alone backend 9.4.14
backend> vacuum full pg_statistic;
2017-11-01 21:35:38.781 EDT pid=12732 database= user= rhost= tid=1/3 sessionid=59fa765f.31bc ERROR:  missing chunk number 0 for toast value 30382746 in pg_toast_2619
2017-11-01 21:35:38.783 EDT pid=12732 database= user= rhost= tid=1/3 sessionid=59fa765f.31bc STATEMENT:  vacuum full pg_statistic;

backend> reindex table pg_statistic;
backend> vacuum full pg_statistic;
2017-11-01 21:36:43.602 EDT pid=12732 database= user= rhost= tid=1/7 sessionid=59fa765f.31bc ERROR:  missing chunk number 0 for toast value 30382746 in pg_toast_2619
2017-11-01 21:36:43.602 EDT pid=12732 database= user= rhost= tid=1/7 sessionid=59fa765f.31bc STATEMENT:  vacuum full pg_statistic;

backend> truncate table pg_statistic;
2017-11-01 21:37:44.867 EDT pid=12732 database= user= rhost= tid=1/14 sessionid=59fa765f.31bc ERROR:  permission denied: "pg_statistic" is a system catalog
2017-11-01 21:37:44.867 EDT pid=12732 database= user= rhost= tid=1/14 sessionid=59fa765f.31bc STATEMENT:  truncate table pg_statistic;




On Tue, Oct 31, 2017 at 10:37 PM, Arjun Ranade <ranade@nodalexchange.com> wrote:
You might be onto something.  I force quit the pglogical supervisor and other session that had a backend_start of a few months ago.  That didn't fix the issue.  Additionally, those processes didn't start back up which is making me worry about the status of my logical replication.  The subscribers are reporting that they are still replicating but I won't know. 

However... when I was checking on the subscribers, I dropped their replication links and recreated them (to see if that would get the pglogical processes to start back up).  I checked the pg_log on the master (same server having the vacuum problem) to see if there was anything about the subscriptions resuming and I found these messages in the log:

2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517 sessionid=59f92ded.6f1d WARNING:  oldest xmin is far in the past
2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517 sessionid=59f92ded.6f1d HINT:  Close open transactions soon to avoid wraparound problems.
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521 sessionid=59f92e8d.711b WARNING:  oldest xmin is far in the past
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521 sessionid=59f92e8d.711b HINT:  Close open transactions soon to avoid wraparound problems

However, I am now in a situation where I cannot find any old transactions from pg_stat_activity or pg_prepared_xacts. 

It may also be worth noting that the last time our db was restarted (~6 months ago, same date as the old processes that I killed) was due to the datafile partition filling up.

FWIW, here are my vacuum related settings on this db:

name    setting          
autovacuum on
autovacuum_analyze_scale_factor 0.1
autovacuum_analyze_threshold 50
autovacuum_freeze_max_age 200000000
autovacuum_max_workers 3
autovacuum_multixact_freeze_max_age 400000000
autovacuum_naptime 60
autovacuum_vacuum_cost_delay 20
autovacuum_vacuum_cost_limit -1
autovacuum_vacuum_scale_factor 0.2
autovacuum_vacuum_threshold 50
autovacuum_work_mem -1
log_autovacuum_min_duration -1
vacuum_cost_delay 0
vacuum_cost_limit 200
vacuum_cost_page_dirty 20
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10
vacuum_defer_cleanup_age 0
vacuum_freeze_min_age 50000000
vacuum_freeze_table_age 150000000
vacuum_multixact_freeze_min_age 5000000
vacuum_multixact_freeze_table_age 150000000

On Tue, Oct 31, 2017 at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Arjun Ranade <ranade@nodalexchange.com> writes:
> There is nothing in pg_prepared_xacts, however, in pg_stat_activity there
> are two pglogical processes that have a "backend_start" of "2017-06-17"
> when the last time we restarted this server.  Both are not waiting and have
> null for "state."   This might be expected behavior for pglogical.

I don't know much about pglogical, but if it's holding back the xmin
horizon (as it appears to be doing) that is a really bad pglogical bug.

If you try vacuum verbose (doesn't have to be FULL) on some table that
gets lots of update traffic, does it report a large number of dead-but-
not-removable rows?

> I did try TRUNCATE before but even as the postgres user, but for some
> reason it didn't allow me to truncate because it was a system table.

Mmm.  You could get around that, but possibly not without restarting
in single-user mode, which is probably more interruption in service
than you want.  It might be easier to force-quit the pglogical sessions.

                        regards, tom lane


pgsql-admin by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [ADMIN] Bad recovery: no pg_xlog/RECOVERYXLOG
Next
From: Arjun Ranade
Date:
Subject: Re: [ADMIN] Missing Chunk Error when doing a VACUUM FULL operation -DB Corruption?