Re: logical decoding bug: segfault in ReorderBufferToastReplace() - Mailing list pgsql-hackers

From Jeremy Schneider
Subject Re: logical decoding bug: segfault in ReorderBufferToastReplace()
Msg-id 444215b4-8fb5-6a82-a534-645abafbffb4@amazon.com
In response to Re: logical decoding bug: segfault in ReorderBufferToastReplace()  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On 6/4/21 23:42, Amit Kapila wrote:
> On 2021-Jun-04, Jeremy Schneider wrote:
>> ERROR: XX000: could not open relation with OID 0
>> LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
>
> Even, if this fixes the issue, I guess it is better to find why this
> happens? I think the reason why the code is giving an error is that
> after toast insertions we always expect the insert on the main table
> of toast table, but if there happens to be a case where after toast
> insertion, instead of getting the insertion on the main table we get
> an insert in some other table then you will see this error. I think
> this can happen for speculative insertions where insertions lead to a
> toast table insert, then we get a speculative abort record, and then
> insertion on some other table. The main thing is currently decoding
> code ignores speculative aborts due to which such a problem can occur.
> Now, there could be other cases where such a problem can happen but if
> my theory is correct then the patch we are discussing in the thread
> [1] should solve this problem.
>
> Jeremy, is this problem reproducible? Can we get a testcase or
> pg_waldump output of previous WAL records?
>
> [1] - https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com

It's unclear to me whether we'll be able to capture a repro on the actual production system. We are hitting this fairly consistently, but at irregular and infrequent intervals. If we do catch it and can walk the WAL records, I'll post back here. FYI, Bertrand was able to reproduce the exact error message with essentially the same repro as in the other thread linked above [1].

Separately, would there be any harm in adding the relation OID to the error message? Personally, I just think the error message is generally more useful if it shows the main relation OID (since we know that the toast OID can be 0). Not a big deal though.
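
To make the suggestion concrete, here is a minimal standalone sketch of what such a message could look like. It is not a patch against reorderbuffer.c: the helper name format_toast_open_error and the exact message wording are hypothetical, and the Oid typedef is only a stand-in for PostgreSQL's own type so that the snippet compiles by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's Oid type (an unsigned integer there). */
typedef unsigned int Oid;

/*
 * Hypothetical helper, not part of PostgreSQL: builds an error string that
 * reports both the toast relation OID that could not be opened and the OID
 * of the main relation it was looked up from.
 *
 * Returns snprintf()'s result: the length the full message needs, which is
 * >= len if the output was truncated.
 */
int
format_toast_open_error(char *buf, size_t len, Oid toast_oid, Oid main_oid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "could not open toast relation with OID %u (main relation OID %u)",
                    toast_oid, main_oid);
}
```

Called with a toast OID of 0 and a main relation OID of 16384 (an arbitrary example value), this yields "could not open toast relation with OID 0 (main relation OID 16384)", which points straight at the table whose change was being decoded.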

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services
