Thread: Re: logical decoding bug: segfault in ReorderBufferToastReplace()

Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
Jeremy Schneider
Date:
On 12/13/19 16:13, Jeremy Schneider wrote:
On 12/11/19 08:35, Andres Freund wrote:
Seems like we clearly should add an elog(ERROR) here, so we error out,
rather than crash.
done - in the commit that I replied to when I started this thread :)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=69f883fef14a3fc5849126799278abcc43f40f56


Another PostgreSQL user ran into this issue. This time on version 12.5 - so instead of a crash they got the error message from the commit.

ERROR: XX000: could not open relation with OID 0
LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305

Upon seeing this error message, I realized that the base relation OID would be very useful when the toast relation OID is "0".

Would this patch work to show that?

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e1279bb..b90603b051 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -4598,8 +4598,8 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,

        toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
        if (!RelationIsValid(toast_rel))
-               elog(ERROR, "could not open relation with OID %u",
-                        relation->rd_rel->reltoastrelid);
+               elog(ERROR, "could not open toast relation with OID %u (base relation with OID %u)",
+                        relation->rd_rel->reltoastrelid, relation->rd_rel->oid);

        toast_desc = RelationGetDescr(toast_rel);

Thoughts?

-Jeremy

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
Alvaro Herrera
Date:
On 2021-Jun-04, Jeremy Schneider wrote:

> ERROR: XX000: could not open relation with OID 0
> LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305

Hah.

It seems to me that this code should silently return if
rd_rel->reltoastrelid == 0; just like in the case of
txn->toast_hash == NULL.  It evidently means that no datum can be
toasted, and therefor no toast replacement is needed.

(As far as I recall, a table cannot go from having a toast table to not
having one.)

-- 
Álvaro Herrera       Valdivia, Chile
"¿Qué importan los años?  Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo"  (Mafalda)



Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
Amit Kapila
Date:
On Sat, Jun 5, 2021 at 5:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Jun-04, Jeremy Schneider wrote:
>
> > ERROR: XX000: could not open relation with OID 0
> > LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
>
> Hah.
>
> It seems to me that this code should silently return if
> rd_rel->reltoastrelid == 0; just like in the case of
> txn->toast_hash == NULL.  It evidently means that no datum can be
> toasted, and therefor no toast replacement is needed.
>

Even, if this fixes the issue, I guess it is better to find why this
happens? I think the reason why the code is giving an error is that
after toast insertions we always expect the insert on the main table
of toast table, but if there happens to be a case where after toast
insertion, instead of getting the insertion on the main table we get
an insert in some other table then you will see this error. I think
this can happen for speculative insertions where insertions lead to a
toast table insert, then we get a speculative abort record, and then
insertion on some other table. The main thing is currently decoding
code ignores speculative aborts due to which such a problem can occur.
Now, there could be other cases where such a problem can happen but if
my theory is correct then the patch we are discussing in the thread
[1] should solve this problem.

Jeremy, is this problem reproducible? Can we get a testcase or
pg_waldump output of previous WAL records?

[1] - https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
Jeremy Schneider
Date:
On 6/4/21 23:42, Amit Kapila wrote:
On 2021-Jun-04, Jeremy Schneider wrote:
ERROR: XX000: could not open relation with OID 0
LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
Even, if this fixes the issue, I guess it is better to find why this
happens? I think the reason why the code is giving an error is that
after toast insertions we always expect the insert on the main table
of toast table, but if there happens to be a case where after toast
insertion, instead of getting the insertion on the main table we get
an insert in some other table then you will see this error. I think
this can happen for speculative insertions where insertions lead to a
toast table insert, then we get a speculative abort record, and then
insertion on some other table. The main thing is currently decoding
code ignores speculative aborts due to which such a problem can occur.
Now, there could be other cases where such a problem can happen but if
my theory is correct then the patch we are discussing in the thread
[1] should solve this problem.

Jeremy, is this problem reproducible? Can we get a testcase or
pg_waldump output of previous WAL records?

[1] - https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com

It's unclear to me whether or not we'll be able to catch the repro on the actual production system. It seems that we are hitting this somewhat consistently, but at irregular and infrequent intervals. If we are able to catch it and walk the WAL records then I'll post back here. FYI, Bertrand was able to replicate the exact error message with pretty much the same repro that's in the other email thread which is linked above.

Separately, would there be any harm in adding the relation OID to the error message? Personally, I just think the error message is generally more useful if it shows the main relation OID (since we know that the toast OID can be 0). Not a big deal though.

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
Amit Kapila
Date:
On Wed, Jun 9, 2021 at 12:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>
> On 6/4/21 23:42, Amit Kapila wrote:
>
> On 2021-Jun-04, Jeremy Schneider wrote:
>
> ERROR: XX000: could not open relation with OID 0
> LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
>
> Even, if this fixes the issue, I guess it is better to find why this
> happens? I think the reason why the code is giving an error is that
> after toast insertions we always expect the insert on the main table
> of toast table, but if there happens to be a case where after toast
> insertion, instead of getting the insertion on the main table we get
> an insert in some other table then you will see this error. I think
> this can happen for speculative insertions where insertions lead to a
> toast table insert, then we get a speculative abort record, and then
> insertion on some other table. The main thing is currently decoding
> code ignores speculative aborts due to which such a problem can occur.
> Now, there could be other cases where such a problem can happen but if
> my theory is correct then the patch we are discussing in the thread
> [1] should solve this problem.
>
> Jeremy, is this problem reproducible? Can we get a testcase or
> pg_waldump output of previous WAL records?
>
> [1] - https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com
>
>
> It's unclear to me whether or not we'll be able to catch the repro on the actual production system. It seems that we
arehitting this somewhat consistently, but at irregular and infrequent intervals. If we are able to catch it and walk
theWAL records then I'll post back here. 
>

Okay, one thing you can check is if there is a usage of Insert .. On
Conflict .. statement in the actual production system?

> FYI, Bertrand was able to replicate the exact error message with pretty much the same repro that's in the other email
threadwhich is linked above. 
>
> Separately, would there be any harm in adding the relation OID to the error message? Personally, I just think the
errormessage is generally more useful if it shows the main relation OID (since we know that the toast OID can be 0).
Nota big deal though. 
>

I don't think that is a bad idea. However, I think it might be better
to propose that as a separate patch in a new thread.

--
With Regards,
Amit Kapila.



Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
"Drouvot, Bertrand"
Date:
Hi Amit,

On 6/9/21 5:33 AM, Amit Kapila wrote:
> On Wed, Jun 9, 2021 at 12:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>> On 6/4/21 23:42, Amit Kapila wrote:
>>
>> On 2021-Jun-04, Jeremy Schneider wrote:
>>
>> ERROR: XX000: could not open relation with OID 0
>> LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
>>
>> Even, if this fixes the issue, I guess it is better to find why this
>> happens? I think the reason why the code is giving an error is that
>> after toast insertions we always expect the insert on the main table
>> of toast table, but if there happens to be a case where after toast
>> insertion, instead of getting the insertion on the main table we get
>> an insert in some other table then you will see this error. I think
>> this can happen for speculative insertions where insertions lead to a
>> toast table insert, then we get a speculative abort record, and then
>> insertion on some other table. The main thing is currently decoding
>> code ignores speculative aborts due to which such a problem can occur.
>> Now, there could be other cases where such a problem can happen but if
>> my theory is correct then the patch we are discussing in the thread
>> [1] should solve this problem.
>>
>> Jeremy, is this problem reproducible? Can we get a testcase or
>> pg_waldump output of previous WAL records?
>>
>> [1] - https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com
>>
>>
>> It's unclear to me whether or not we'll be able to catch the repro on the actual production system. It seems that we
arehitting this somewhat consistently, but at irregular and infrequent intervals. If we are able to catch it and walk
theWAL records then I'll post back here.
 
>>
> Okay, one thing you can check is if there is a usage of Insert .. On
> Conflict .. statement in the actual production system?

Yes that's the case, so that a speculative abort record followed by an 
insert on some other table looks a perfect valid scenario regarding this 
current issue.

Bertrand



Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
Amit Kapila
Date:
On Wed, Jun 9, 2021 at 11:37 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> On 6/9/21 5:33 AM, Amit Kapila wrote:
> > On Wed, Jun 9, 2021 at 12:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
> >> On 6/4/21 23:42, Amit Kapila wrote:
> >>
> >> On 2021-Jun-04, Jeremy Schneider wrote:
> >>
> >> ERROR: XX000: could not open relation with OID 0
> >> LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
> >>
> >> Even, if this fixes the issue, I guess it is better to find why this
> >> happens? I think the reason why the code is giving an error is that
> >> after toast insertions we always expect the insert on the main table
> >> of toast table, but if there happens to be a case where after toast
> >> insertion, instead of getting the insertion on the main table we get
> >> an insert in some other table then you will see this error. I think
> >> this can happen for speculative insertions where insertions lead to a
> >> toast table insert, then we get a speculative abort record, and then
> >> insertion on some other table. The main thing is currently decoding
> >> code ignores speculative aborts due to which such a problem can occur.
> >> Now, there could be other cases where such a problem can happen but if
> >> my theory is correct then the patch we are discussing in the thread
> >> [1] should solve this problem.
> >>
> >> Jeremy, is this problem reproducible? Can we get a testcase or
> >> pg_waldump output of previous WAL records?
> >>
> >> [1] - https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com
> >>
> >>
> >> It's unclear to me whether or not we'll be able to catch the repro on the actual production system. It seems that
weare hitting this somewhat consistently, but at irregular and infrequent intervals. If we are able to catch it and
walkthe WAL records then I'll post back here. 
> >>
> > Okay, one thing you can check is if there is a usage of Insert .. On
> > Conflict .. statement in the actual production system?
>
> Yes that's the case, so that a speculative abort record followed by an
> insert on some other table looks a perfect valid scenario regarding this
> current issue.
>

Okay, thanks for the confirmation. So the patch being discussed in
that thread will fix your problem.

--
With Regards,
Amit Kapila.



Re: logical decoding bug: segfault in ReorderBufferToastReplace()

From
"Drouvot, Bertrand"
Date:
Hi,

On 6/9/21 8:10 AM, Amit Kapila wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you
canconfirm the sender and know the content is safe.
 
>
>
>
> On Wed, Jun 9, 2021 at 11:37 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>> On 6/9/21 5:33 AM, Amit Kapila wrote:
>>> On Wed, Jun 9, 2021 at 12:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>>>> On 6/4/21 23:42, Amit Kapila wrote:
>>>>
>>>> On 2021-Jun-04, Jeremy Schneider wrote:
>>>>
>>>> ERROR: XX000: could not open relation with OID 0
>>>> LOCATION: ReorderBufferToastReplace, reorderbuffer.c:305
>>>>
>>>> Even, if this fixes the issue, I guess it is better to find why this
>>>> happens? I think the reason why the code is giving an error is that
>>>> after toast insertions we always expect the insert on the main table
>>>> of toast table, but if there happens to be a case where after toast
>>>> insertion, instead of getting the insertion on the main table we get
>>>> an insert in some other table then you will see this error. I think
>>>> this can happen for speculative insertions where insertions lead to a
>>>> toast table insert, then we get a speculative abort record, and then
>>>> insertion on some other table. The main thing is currently decoding
>>>> code ignores speculative aborts due to which such a problem can occur.
>>>> Now, there could be other cases where such a problem can happen but if
>>>> my theory is correct then the patch we are discussing in the thread
>>>> [1] should solve this problem.
>>>>
>>>> Jeremy, is this problem reproducible? Can we get a testcase or
>>>> pg_waldump output of previous WAL records?
>>>>
>>>> [1] -
https://www.postgresql.org/message-id/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P%2BXgsNAViug15Fm99jA%40mail.gmail.com
>>>>
>>>>
>>>> It's unclear to me whether or not we'll be able to catch the repro on the actual production system. It seems that
weare hitting this somewhat consistently, but at irregular and infrequent intervals. If we are able to catch it and
walkthe WAL records then I'll post back here.
 
>>>>
>>> Okay, one thing you can check is if there is a usage of Insert .. On
>>> Conflict .. statement in the actual production system?
>> Yes that's the case, so that a speculative abort record followed by an
>> insert on some other table looks a perfect valid scenario regarding this
>> current issue.
>>
> Okay, thanks for the confirmation. So the patch being discussed in
> that thread will fix your problem.

Yes, thanks a lot!

Bertrand