Pavan writes:
>> FWIW one of our support customers reported a very similar TOAST table
>> corruption issue last week which nearly caused an outage. After a lot of
>> analysis, I think I've now fully understood the reasons behind the
>> corruption, the underlying bug(s) and possible remedy. I am currently
>> working on writing a reproducible test case to demonstrate the problem and
>> writing the fix. More details on that soon.
>
> I've posted a reproducer and a proposed fix to -hackers [1]
>
> In the particular case that I investigated, a database crash/recovery was
> involved. But I think we should be able to create a scenario where OID
> wrap-around or a standby promotion triggers the problem. I don't know if
> any of that was involved in the cases reported on this thread, but I've a
> strong suspicion that the underlying bug is probably the same.
In our case there was no crash+recovery, but we do have a high write
load (and the problem occurring quite seldom), so it sounds like it
fits.
> [1] https://www.postgresql.org/message-id/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-%3Dh4OLosXHf9w%40mail.gmail.com
Impressive investigation!
Just a couple of data points: your script to reproduce the problem does
so for all three versions of PostgreSQL we have in use: 9.3.22, 9.6.8
and 10.3.
And I specifically tested on our production machine which exhibits the
problem, running 9.3.22, and your script reproduced the problem there as
well.
I.e. ending with:
ERROR: unexpected chunk number 0 (expected 1) for toast value 16396 in pg_toast_16384
REINDEX
ERROR: unexpected chunk number 0 (expected 1) for toast value 16396 in pg_toast_16384
VACUUM
ERROR: unexpected chunk number 0 (expected 1) for toast value 16396 in pg_toast_16384
Best regards,
Adam
--
"No more than that, but very powerful all the Adam Sjøgren
same; simple things are good." adsj@novozymes.com