"missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup. - Mailing list pgsql-general

From
Subject "missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.
Date
Msg-id 1456919678340.31300.116900@webmail2
Whole thread Raw
List pgsql-general
Hi All,

we are running postgresql 9.1.15 on Debian.

we are, basically, running a postgresql cluster with two nodes. We are using synchronous streaming replication to make sure that the slave is always fully in sync (using a recovery.conf that points out the master). The slave is mainly used to maintain high availability. We also run backups on the slave to reduce load on the master.

On both nodes, postgresqls data is stored on a SAN.

At some point, we restarted the slave. The first thing it does is to do a pg_basebackup and then start streaming changes based on that. This seemed to work fine, until 16 hrs later when we made a backup on the slave, using pg_dump. pg_dump was unable to complete, due to a pg_toast error (ERROR:  missing chunk number 0 for toast value 2753291 in pg_toast_22066).

The issue remained until we ran a full vacuum analyze on the cluster.

We have been running in this way on several clusters for some years now, basically since 9.1 was released, without seeing this issue. Since we upgraded to postgresql 9.1.15, we have seen it twice. This does not necessarily mean anything, but I thought I would mention it.

The issue is resolved now, but I would still like to understand what happened.
I have logfiles from the incident, but I cannot see anything out of the ordinary (despite having a fair amount of experience investigating postgresql logs).

I have read that this kind of issues are most frequently due to hardware issues or bugs in postgresql.

I have reviewed the release notes from the subsequent 9.1 releases (9.1.15-9.1.19),
but I have been unable to identify anything that hints at the issue that we have experienced.

the closest one would be:
(9.1.16,9.4.2,9.3.7,9.2.11,9.0.20) Fix possible failure during hash index bucket split, if other processes are modifying the index concurrently (Tom Lane)

but that seems like a long shot.

My question is:
- does anyone know of bug reports (fixed or otherwise) that could shed light on our issue?
- does anyone have experience with this kind of behavior?
- our setup is fairly automated; would it be a good idea to always run vacuum analyze after a pg_restore (or before pg_dump)?

pgsql-general by date:

Previous
From: Albe Laurenz
Date:
Subject: Re: Does RAISE EXCEPTION rollback previous commands in a stored function?
Next
From: Alexander Farber
Date:
Subject: Re: How to ensure that a stored function always returns TRUE or FALSE?