Re: ERROR: unexpected chunk number 0 (expected 1) for toast value76753264 in pg_toast_10920100 - Mailing list pgsql-general

From Pavan Deolasee
Subject Re: ERROR: unexpected chunk number 0 (expected 1) for toast value76753264 in pg_toast_10920100
Date
Msg-id CABOikdPJUv5EyxRZnJY2oA7EKi=bZdoryXNmUGK3dFAQ08tZdw@mail.gmail.com
Whole thread Raw
In response to Re: ERROR: unexpected chunk number 0 (expected 1) for toast value76753264 in pg_toast_10920100  (Pavan Deolasee <pavan.deolasee@gmail.com>)
List pgsql-general


On Fri, Apr 6, 2018 at 8:55 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:


On Fri, Apr 6, 2018 at 2:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
adsj@novozymes.com (Adam Sjøgren) writes:
>> [... still waiting for the result, I will return with what it said
>> when the server does ...]

> It did eventually finish, with the same result:

Huh.  So what we have here, apparently, is that regular MVCC snapshots
think there is exactly one copy of the 1698936148/0 row, but TOAST fetches
think there is more than one.  This is darn odd, not least because we
never do UPDATEs in toast tables, only inserts and deletes, so there
certainly shouldn't be update chains there.

It seems like you've got some corner case wherein SnapshotToast sees a row
that isn't visible according to MVCC --- probably a row left over from
some previous cycle of life.  That is, I'm imagining the OID counter
wrapped around and we've reused a toast OID, but for some reason there's
still a row in the table with that OID.  I'm not sure offhand how we could
get into such a state.  Alvaro, does this ring any bells (remembering that
this is 9.3)?

FWIW one of our support customers reported a very similar TOAST table corruption issue last week which nearly caused an outage. After a lot of analysis, I think I've now fully understood the reasons behind the corruption, the underlying bug(s) and possible remedy. I am currently working on writing a reproducible test case to demonstrate the problem and writing the fix. More details on that soon.


I've posted a reproducer and a proposed fix to -hackers [1]

In the particular case that I investigated, a database crash/recovery was involved. But I think we should be able to create a scenario where OID wrap-around or a standby promotion triggers the problem.  I don't know if any of that was involved in the cases reported on this thread, but I've a strong suspicion that the underlying bug is probably the same.

Thanks,
Pavan


--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-general by date:

Previous
From: hmidi slim
Date:
Subject: Using enum instead of join tables
Next
From: Alexandre Arruda
Date:
Subject: Re: ERROR: found multixact from before relminmxid