Thread: Bugs in TOAST handling, OID assignment and redo recovery

Bugs in TOAST handling, OID assignment and redo recovery

From

Pavan Deolasee

Date:

10 April 2018, 13:29:05

Hello,

One of our 2ndQuadrant support customers recently reported a sudden rush of TOAST errors post a crash recovery, nearly causing an outage. Most errors read like this:

ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn

While we could bring back the cluster to normal quickly using some workarounds, I investigated this in more detail and identified two long standing bugs in TOAST as well as redo recovery.

Immediately after the crash recovery, server started issuing OIDs that conflicted with the OIDs issued just prior to the crash. The XLOG_NEXTOID and buffer LSN interlock is supposed to guard against that, but a bug in the redo recovery breaks that protection. We shall see that in a minute.

When duplicate OIDs are issued, either because of a bug in redo recovery or because of OID counter wraparound, GetNewOidWithIndex() is supposed to protect us against that by first scanning the toast table to check if a duplicate chunk_id or toast value already exists. If so, a new OID is requested and we repeat until a non-conflicting OID is found.

This mostly works, except in one corner case. GetNewOidWithIndex() uses SnapshotDirty to look for conflicting OID. But this fails if the existing toast tuple was recently deleted by another committed transaction. If the OldestXmin has not advanced enough, such tuples are reported as RECENTLY_DEAD and fail the SnapshotDirty qualifications. So toast_save_datum() happily goes ahead and inserts a conflicting toast tuple. The UNIQUE index on <chunk_id, chunk_seq> does not complain because it seems the other tuple as RECENTLY_DEAD too. At this point, we have two toast tuples with the same <chunk_id, chunk_seq> - one of these is RECENTLY_DEAD and the other is LIVE. Later when toast_fetch_datum() and friends scan the toast table for the chunk_id (toast value), it uses SnapshotToast for retrieving the tuple. This special snapshot doesn't do any visibility checks on the toast tuple, assuming that since the main heap tuple is visible, the toast tuple(s) should be visible too. This snapshot unfortunately also sees the RECENTLY_DEAD toast tuple and thus returns duplicate toast tuples, leading to the errors such as above. This same issue can also lead to other kinds of errors in tuptoaster.c.

ISTM that the right fix is to teach toast_save_datum() to check for existence of duplicate chunk_id by scanning the table with the same SnapshotToast that it later uses to fetch the tuples. We already do that in case of toast rewrite, but not for regular inserts. I propose to do that for regular path too, as done in the attached patch.

Duplicate OIDs may be generated when the counter wraps around and it may cause this problem.

- OID counter wraps around

- Old toast tuple is deleted and the deleting transaction commits

- OldestXmin is held back by an open transaction

- The same OID if then inserted again into the toast table

- The toasted value is fetched before the table is vacuumed or the index pointer is killed. This leads to duplicates being returned.

This is probably quite a corner case, but not impossible given that we already have reports for such problems. But the problem gets accentuated in case of a crash recovery or on a Hot standby. This is because of another somewhat unrelated bug which manifested in our case.

An interesting case to consider is the following sequence of events:

1. ONLINE checkpoint starts. We note down the nextOid counter, after rightly accounting for the cached values too. CreateCheckPoint() does this:

LWLockAcquire(OidGenLock, LW_SHARED);

checkPoint.nextOid = ShmemVariableCache->nextOid;

if (!shutdown)

checkPoint.nextOid += ShmemVariableCache->oidCount;

LWLockRelease(OidGenLock);

2. The cached OIDs are then consumed, and a new set of OIDs are allocated, generating XLOG_NEXTOID WAL record.

3. More OIDs are consumed from the newly allocated range. Those OIDs make to the disk after following buffer LSN interlock. So XLOG_NEXTOID gets flushed to disk as well.

4. Checkpoint finishes, it writes a ONLINE checkpoint record with the nextOid value saved in step 1

5. CRASH

The redo recovery starts at some LSN written before step 1. It initialises the ShmemVariableCache->nextOid with the value stored in the checkpoint record and starts replaying WAL records. When it sees the XLOG_NEXTOID WAL record generated at step 2, it correctly advances the OID counter to cover the next range.

But while applying ONLINE checkpoint written at step 4, we do this:

/* ... but still treat OID counter as exact */

LWLockAcquire(OidGenLock, LW_EXCLUSIVE);

ShmemVariableCache->nextOid = checkPoint.nextOid;

ShmemVariableCache->oidCount = 0;

LWLockRelease(OidGenLock);

So effectively we overwrite the nextOid counter and drag it backwards again. If we don't see another XLOG_NEXTOID record before the crash recovery ends, then we shall start the server with a stale value, generating duplicate OIDs immediately after the restart.

I don't think this is an unusual case and I'm surprised that we didn't notice this before. Most likely the other protections in the system to guard against wrapped around OIDs masked the bug. The TOAST table though escapes those protections in the specific case we discussed above and brought out the bug in open.

I think the right fix is to simply ignore the nextOid counter while replaying ONLINE checkpoint record. We must have already initialised ShmemVariableCache->nextOid to the value stored in this (or some previous) checkpoint record when redo recovery is started. As and when we see XLOG_NEXTOID record, we would maintain the shared memory counter correctly. If we don't see any XLOG_NEXTOID record, the value we started with must be the correct value. I see no problem even when OID wraps around during redo recovery. XLOG_NEXTOID should record that correctly.

Curiously neither REINDEX nor VACUUM would fix it until the OldestXmin counter advances enough so that RECENTLY_DEAD tuples are ignored by the REINDEX or removed by VACUUM. But after OldestXmin progresses, a REINDEX or a VACUUM should fix these transient errors. Also once manifested, the errors would stay until a REINDEX or VACUUM is performed.

We tested this at least upto 9.1 and the bugs exist there too. In fact, these bugs probably existed forever, though I did not check very old releases.

Attached is a simple reproducer and a proposed fix to address both the bugs. We should consider backpatching it all supported releases.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Bernd Helmle

Date:

10 April 2018, 15:34:39

Am Dienstag, den 10.04.2018, 15:59 +0530 schrieb Pavan Deolasee:
> One of our 2ndQuadrant support customers recently reported a sudden
> rush of
> TOAST errors post a crash recovery, nearly causing an outage. Most
> errors
> read like this:
> 
> ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
> 
> While we could bring back the cluster to normal quickly using some
> workarounds, I investigated this in more detail and identified two
> long
> standing bugs in TOAST as well as redo recovery.

Wow! I'm currently also investigating issues on a customer system, too,
where suddenly TOAST errors arised after shutdown immediate.

I haven't dug into your findings yet, but it seems to perfectly fit
into the reports i got.

    Thanks,
        Bernd

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Heikki Linnakangas

Date:

10 April 2018, 16:37:55

On 10/04/18 13:29, Pavan Deolasee wrote:
> Hello,
> 
> One of our 2ndQuadrant support customers recently reported a sudden rush of
> TOAST errors post a crash recovery, nearly causing an outage. Most errors
> read like this:
> 
> ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
> 
> While we could bring back the cluster to normal quickly using some
> workarounds, I investigated this in more detail and identified two long
> standing bugs in TOAST as well as redo recovery.

Great detective work!

> ISTM that the right fix is to teach toast_save_datum() to check for
> existence of duplicate chunk_id by scanning the table with the same
> SnapshotToast that it later uses to fetch the tuples. We already do that in
> case of toast rewrite, but not for regular inserts. I propose to do that
> for regular path too, as done in the attached patch.

It would seem more straightforward to add a snapshot parameter to 
GetNewOidWithIndex(), so that the this one caller could pass 
SnapshotToast, while others pass SnapshotDirty. With your patch, you 
check the index twice: first with SnapshotDirty, in 
GetNewOidWithIndex(), and then with SnapshotToast, in the caller.

If I'm reading the rewrite case correctly, it's a bit different and 
quite special. In the loop with GetNewOidWithIndex(), it needs to check 
that the OID is unused in two tables, the old TOAST table, and the new 
one. You can only pass one index to GetNewOidWithIndex(), so it needs to 
check the second index manually. It's not because of the snapshot issue. 
Although I wonder if we should be using SnapshotToast in that 
GetNewOidWithIndex() call, too. I.e. if we should be checking both the 
old and the new toast table with SnapshotToast.

> I think the right fix is to simply ignore the nextOid counter while
> replaying ONLINE checkpoint record. We must have already initialised
> ShmemVariableCache->nextOid
> to the value stored in this (or some previous) checkpoint record when redo
> recovery is started. As and when we see XLOG_NEXTOID record, we would
> maintain the shared memory counter correctly. If we don't see any
> XLOG_NEXTOID record, the value we started with must be the correct value. I
> see no problem even when OID wraps around during redo recovery.
> XLOG_NEXTOID should record that correctly.

Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the 
value in the online checkpoint record is greater than 
ShmemVariableCache->nextXid. But we don't have such a wraparound-aware 
concept of "greater than" for OIDs. Ignoring the online checkpoint 
record's nextOid value seem fine to me.

- Heikki

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Pavan Deolasee

Date:

10 April 2018, 16:54:17

Hi Heikki,

On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It would seem more straightforward to add a snapshot parameter to GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast, while others pass SnapshotDirty. With your patch, you check the index twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with SnapshotToast, in the caller.

Hmm. I actually wrote my first patch exactly like that. I am trying to remember why I discarded that approach. Was that to do with the fact that SnapshotToast can't see all toast tuples either? Yeah, I think so. For example, it won't see tuples with uncommitted-xmin, leading to different issues. Now it's unlikely that we will have a OID conflict where the old tuple has uncommitted-xmin, but not sure if we can completely rule that out. For example, if we did not uncover the redo recovery bug, we could have had a prepared transaction having inserted the old tuple, which now conflicts with new tuple and not detected by SnapshotToast.

If I'm reading the rewrite case correctly, it's a bit different and quite special. In the loop with GetNewOidWithIndex(), it needs to check that the OID is unused in two tables, the old TOAST table, and the new one. You can only pass one index to GetNewOidWithIndex(), so it needs to check the second index manually. It's not because of the snapshot issue. Although I wonder if we should be using SnapshotToast in that GetNewOidWithIndex() call, too. I.e. if we should be checking both the old and the new toast table with SnapshotToast.

As I said, I am not sure if checking with just SnapshotToast is enough because it can't see "dirty" tuples.

Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the value in the online checkpoint record is greater than ShmemVariableCache->nextXid. But we don't have such a wraparound-aware concept of "greater than" for OIDs. Ignoring the online checkpoint record's nextOid value seem fine to me.

Ok. Thanks for checking.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Pavan Deolasee

Date:

11 April 2018, 14:51:28

On Tue, Apr 10, 2018 at 7:24 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

Hi Heikki,

On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It would seem more straightforward to add a snapshot parameter to GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast, while others pass SnapshotDirty. With your patch, you check the index twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with SnapshotToast, in the caller.

Hmm. I actually wrote my first patch exactly like that. I am trying to remember why I discarded that approach. Was that to do with the fact that SnapshotToast can't see all toast tuples either? Yeah, I think so. For example, it won't see tuples with uncommitted-xmin, leading to different issues. Now it's unlikely that we will have a OID conflict where the old tuple has uncommitted-xmin, but not sure if we can completely rule that out.

Or may be we simply err on the side of caution and scan the toast table with SnapshotAny while looking for a duplicate? That might prevent us from reusing an OID for a known-dead tuple, but should save us a second index scan and still work.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Tom Lane

Date:

11 April 2018, 17:50:47

Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> Or may be we simply err on the side of caution and scan the toast table
> with SnapshotAny while looking for a duplicate? That might prevent us from
> reusing an OID for a known-dead tuple, but should save us a second index
> scan and still work.

+1.  We really don't want to expend two indexscans on this.

I was worried about changing the signature of GetNewOidWithIndex in
a back-patched fix, but after looking around I think that's probably
safe.  External callers really shouldn't be using anything lower-level
than GetNewOid.

            regards, tom lane

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Pavan Deolasee

Date:

11 April 2018, 21:43:18

On Wed, Apr 11, 2018 at 8:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> Or may be we simply err on the side of caution and scan the toast table
> with SnapshotAny while looking for a duplicate? That might prevent us from
> reusing an OID for a known-dead tuple, but should save us a second index
> scan and still work.

+1. We really don't want to expend two indexscans on this.

signature.asc

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Tom Lane

Date:

12 April 2018, 02:51:35

Michael Paquier <michael@paquier.xyz> writes:
> I have not really checked this thread in details, but one thing that
> strikes me is that it would be rather easy to add a TAP test based on
> the initial script that Pavan has sent.  Would that be worth testing
> cycles or not?

I doubt it --- that test is so specialized that it'd be unlikely to
catch any other bug.

An example of the sort of bug that could exist in the same area would
be if we'd failed to note the places in tuptoaster.c that also need to
be switched to SnapshotAny.  Only a test covering heap-rewrite could
have found that, which this test case surely doesn't.

What I think would actually be worth spending some time on is taking
a hard look at the remaining uses of SnapshotDirty to see if any of
them look squishy.

            regards, tom lane

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Peter Eisentraut

Date:

12 April 2018, 03:23:31

On 4/10/18 06:29, Pavan Deolasee wrote:
> One of our 2ndQuadrant support customers recently reported a sudden rush
> of TOAST errors post a crash recovery, nearly causing an outage. Most
> errors read like this:
> 
> ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn

While researching this, I found that the terminology in this code is
quite inconsistent.  It talks about chunks ids, chunk indexes, chunk
numbers, etc. seemingly interchangeably.  The above error is actually
about the chunk_seq, not about the chunk_id, as one might think.

The attached patch is my attempt to clean this up a bit.  Thoughts?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

0001-Clarify-TOAST-terminology-in-variables-and-error-mes.patch

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Pavan Deolasee

Date:

12 April 2018, 16:09:38

On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So while looking at this, it suddenly occurred to me that probing with
SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
either.

Yeah it occurred to me as well, but when I looked at the code, I couldn't find a case that is broken. I even tried a few test cases with DDLs etc. But I think what you did is fine and more bullet proof. So +1 to that.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Pavan Deolasee

Date:

12 April 2018, 16:12:46

On Thu, Apr 12, 2018 at 5:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael@paquier.xyz> writes:
> I have not really checked this thread in details, but one thing that
> strikes me is that it would be rather easy to add a TAP test based on
> the initial script that Pavan has sent. Would that be worth testing
> cycles or not?

I doubt it --- that test is so specialized that it'd be unlikely to
catch any other bug.

I agree; it's not worth adding a TAP test except may be as a demonstration to write future tests for catching such issues. This was a very specialised test case written after getting a full grasp on the bug. And it just tests the thing that I knew is broken based on code reading. Also, with OID duplicate issue fixed, hitting more bugs in this area is going to be even more difficult.

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Tom Lane

Date:

12 April 2018, 16:57:26

Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So while looking at this, it suddenly occurred to me that probing with
>> SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
>> either.

> Yeah it occurred to me as well, but when I looked at the code, I couldn't
> find a case that is broken. I even tried a few test cases with DDLs etc.

I think it probably can't happen for catalog MVCC scans, because we
generally take full new snapshots for those.  The situation that would be
hazardous is where we use an existing snapshot (so that it doesn't see
the just-committed Other Transaction) but advance its command counter
(so that it can see the new object we just made).  So the sort of failure
I'd predict is that a user query just after an object creation could see
duplicate OIDs in the catalogs.  To get to that point, you might need a
stable function (using the troublesome snapshot) calling a volatile one
(which contains the actual DDL).

> But I think what you did is fine and more bullet proof. So +1 to that.

Yeah, even if this isn't actually a reachable case, it's hard to argue
that we should complicate the code and take API-break risks in the back
branches to make a tiny optimization on the assumption that it can never
happen.

            regards, tom lane

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Pavan Deolasee

Date:

18 April 2018, 09:37:52

On Thu, Apr 12, 2018 at 5:53 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 4/10/18 06:29, Pavan Deolasee wrote:
> One of our 2ndQuadrant support customers recently reported a sudden rush
> of TOAST errors post a crash recovery, nearly causing an outage. Most
> errors read like this:
>
> ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn

While researching this, I found that the terminology in this code is
quite inconsistent. It talks about chunks ids, chunk indexes, chunk
numbers, etc. seemingly interchangeably. The above error is actually
about the chunk_seq, not about the chunk_id, as one might think.

The attached patch is my attempt to clean this up a bit. Thoughts?

While I agree that we should clean it up, I wonder if changing error text would be a good idea. These errors are being reported by a very long time and if we change the text, we might forget the knowledge about the past reports.

Also, "toast value" is same as "chunk_id". Should we clean up something there too? "chunk_seq number" -- should that be just "chunk_seq"?

Thanks,

Pavan

Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Bruce Momjian

Date:

24 April 2018, 18:16:33

On Wed, Apr 18, 2018 at 12:07:52PM +0530, Pavan Deolasee wrote:
> 
> 
> On Thu, Apr 12, 2018 at 5:53 AM, Peter Eisentraut <
> peter.eisentraut@2ndquadrant.com> wrote:
> 
>     On 4/10/18 06:29, Pavan Deolasee wrote:
>     > One of our 2ndQuadrant support customers recently reported a sudden rush
>     > of TOAST errors post a crash recovery, nearly causing an outage. Most
>     > errors read like this:
>     >
>     > ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
> 
>     While researching this, I found that the terminology in this code is
>     quite inconsistent.  It talks about chunks ids, chunk indexes, chunk
>     numbers, etc. seemingly interchangeably.  The above error is actually
>     about the chunk_seq, not about the chunk_id, as one might think.
> 
>     The attached patch is my attempt to clean this up a bit.  Thoughts?
> 
> 
> While I agree that we should clean it up, I wonder if changing error text would
> be a good idea. These errors are being reported by a very long time and if we
> change the text, we might forget the knowledge about the past reports.
> 
> Also, "toast value" is same as "chunk_id". Should we clean up something there
> too? "chunk_seq number" -- should that be just "chunk_seq"?

We can put a comment next to the error message C string if we want to
keep historical knowledge.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Peter Eisentraut

Date:

24 April 2018, 18:49:25

On 4/18/18 02:37, Pavan Deolasee wrote:
> While I agree that we should clean it up, I wonder if changing error
> text would be a good idea. These errors are being reported by a very
> long time and if we change the text, we might forget the knowledge about
> the past reports.

It's presumably fixed now, so if we see the new error message, we'll
know it's a different issue.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Bugs in TOAST handling, OID assignment and redo recovery

From

Alvaro Herrera

Date:

23 July 2018, 22:13:15

On 2018-Apr-11, Peter Eisentraut wrote:

> On 4/10/18 06:29, Pavan Deolasee wrote:
> > One of our 2ndQuadrant support customers recently reported a sudden rush
> > of TOAST errors post a crash recovery, nearly causing an outage. Most
> > errors read like this:
> > 
> > ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
> 
> While researching this, I found that the terminology in this code is
> quite inconsistent.  It talks about chunks ids, chunk indexes, chunk
> numbers, etc. seemingly interchangeably.  The above error is actually
> about the chunk_seq, not about the chunk_id, as one might think.
> 
> The attached patch is my attempt to clean this up a bit.  Thoughts?

I didn't review your patch carefully, but +1 on the idea.  By all means
let's not forget this for pg12.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services