Re: Fwd: Annoying corruption in PostgreSQL. - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Fwd: Annoying corruption in PostgreSQL.
Date
Msg-id 73c212d5-acb1-b018-27f8-f3dfb01b91a2@enterprisedb.com
Whole thread Raw
In response to Fwd: Annoying corruption in PostgreSQL.  (Kirill Reshke <reshkekirill@gmail.com>)
List pgsql-hackers
On 10/27/23 23:10, Kirill Reshke wrote:
> 
> Sorry, seems that i replied only to Tomas, so forwarding message.
> ---------- Forwarded message ---------
> From: *Kirill Reshke* <reshkekirill@gmail.com
> <mailto:reshkekirill@gmail.com>>
> Date: Sat, 28 Oct 2023 at 02:06
> Subject: Re: Annoying corruption in PostgreSQL.
> To: Tomas Vondra <tomas.vondra@enterprisedb.com
> <mailto:tomas.vondra@enterprisedb.com>>
> 
> 
> Hi Tomas!
> 
> Thanks for the explanation!
> 
> 1) 11 to 15. This week there were 14.9 and 12.16 reproductions. Two
> weeks ago there was 15.4 and 11.21 repro. Unfortunately, there is no
> info about repro which were month old or more, but I found in our work
> chats that there was repro on PostgreSQL 13 in April, a minor version
> unknown. Overall, we observed this issue for over a year on all pgdg
> supported versions.
> 
> 2) Searching out bug tracker i have found:
> 
> 1. missing chunk number 0 for toast value 592966012 in
> pg_toast_563953150 (some user relation)
> |2. missing chunk number 0 for toast value 18019714 in
> pg_toast_17706963| (some user relation)
> 3. missing chunk number 0 for toast value 52677740 in pg_toast_247794
> 
> So, this is not always pg_catalog. There toast tables were toast to some
> user relations.
> 

OK.

> 3) It is always about VACUUM FULL (FREEZE/VERBOSE/ANALYZE) / autovacuum.
> 

Hmm, so it's always one of these VACUUM processes complaining?

> We have physical backups and we can PITR. But restoring a cluster to
> some point in the past is a bit of a different task: we need our
> client's approval for these operations, since we are a Managed DBs Cloud
> Provider. Will try to ask someone.
> 

That's what I'd try, to get some sense of what state the vacuum saw,
what were the transactions modifying the TOAST + parent table doing,
etc, how much stuff the transactions did, if maybe there are some
aborts, that sort of thing. Hard to try reproducing this without any
knowledge of the workload. The WAL might tell us if

How often do you actually see this issue? Once of twice a week?

Are you using some extensions that might interfere with this?

And you mentioned you're running large number of clusters - are those
running similar workloads, or are they unrelated?

Actually, can you elaborate why are you running VACUUM FULL etc? That
generally should not be necessary, so maybe we can learn something about
that about your workload.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Pre-proposal: unicode normalized text
Next
From: Noah Misch
Date:
Subject: Re: race condition in pg_class