Thread: relation OID in ReorderBufferToastReplace error message

relation OID in ReorderBufferToastReplace error message

From
Jeremy Schneider
Date:
I've recently been working with a PostgreSQL database where we saw ReorderBufferToastReplace() fail due to a reltoastrelid value of 0.  In the original thread, Amit pointed out a connection to the speculative insert decoding memory leak bug.  Bertrand built a test case confirming that a speculative abort record followed by an insert on another table would in fact produce the error message from ReorderBufferToastReplace() that we had seen, and accordingly we're pretty sure that was the root cause in this particular database.

https://www.postgresql.org/message-id/5f9a118e-86c5-d4f1-b584-199a33757cd4%40amazon.com

Nonetheless, in the process of troubleshooting it occurred to me that this error message would be more useful in general if it included the base relation OID in the error message. Amit suggested a separate thread - so here we are.  :)

https://www.postgresql.org/message-id/CAA4eK1%2BdeBBnMVPxfLuv1Aa7tuh-7e3FvnSvWTaCy4-_HPcBLg%40mail.gmail.com

Anyone have thoughts?  Would I need a commitfest entry to propose a two-line tweak like this?

-Jeremy



diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e1279bb..b90603b051 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -4598,8 +4598,8 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,

        toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
        if (!RelationIsValid(toast_rel))
-               elog(ERROR, "could not open relation with OID %u",
-                        relation->rd_rel->reltoastrelid);
+               elog(ERROR, "could not open toast relation with OID %u (base relation with OID %u)",
+                        relation->rd_rel->reltoastrelid, relation->rd_rel->oid);

        toast_desc = RelationGetDescr(toast_rel);



-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

Re: relation OID in ReorderBufferToastReplace error message

From
Tom Lane
Date:
Jeremy Schneider <schnjere@amazon.com> writes:
> Nonetheless, in the process of troubleshooting it occurred to me that
> this error message would be more useful in general if it included the
> base relation OID in the error message. Amit suggested a separate thread
> - so here we are.  :)
> Anyone have thoughts?  Would I need a commitfest entry to propose a
> two-line tweak like this?

Probably not, unless it slips through the cracks.  But I wonder why
print the parent's OID, when we have access to its name.

            regards, tom lane



Re: relation OID in ReorderBufferToastReplace error message

From
"Schneider (AWS), Jeremy"
Date:
> On Jul 1, 2021, at 18:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> But I wonder why
> print the parent's OID, when we have access to its name.

Seems like a few people do schema-based multi-tenancy with similarly named relations in different namespaces, so I’d
havea preference for OID over an unqualified relation name. Also the error message shows the OID for the toast relation
sothis is consistent.
 

Relation name could work too though, especially if the schema was included

-Jeremy


Sent from my TI-83


Re: relation OID in ReorderBufferToastReplace error message

From
Tom Lane
Date:
"Schneider (AWS), Jeremy" <schnjere@amazon.com> writes:
>> On Jul 1, 2021, at 18:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But I wonder why
>> print the parent's OID, when we have access to its name.

> Seems like a few people do schema-based multi-tenancy with similarly named relations in different namespaces, so I’d
havea preference for OID over an unqualified relation name. Also the error message shows the OID for the toast relation
sothis is consistent. 

Um, well, the reason we're printing the OID for the toast rel is exactly
that we've not been able to resolve an associated relation, so the OID is
the *only* thing that we have.  But we do have the parent rel's relcache
entry at hand.

> Relation name could work too though, especially if the schema was included

I'm definitely -1 on trying to print the schema name, because that would
require an additional catalog fetch.  If we're in this situation at all,
catalog fetches are suspect; we could easily end up hitting an additional
failure that will probably print a *totally* uninformative message.

I understand your point about similarly-named tables in different schemas,
but Postgres' error messages are uniformly unfriendly to that case, and
I'm not seeing a good reason why this one needs to be different.  On the
other side, I think there are good reasons not to print an OID when we
don't have to: there are plenty of situations where you can't readily
trace the OID to anything at all.  For example, if this error showed up
in a buildfarm run, what do you think the odds would be of identifying
which table the OID referred to?  Maybe you could do it given knowledge
that the error could only be referencing one of a few specific tables,
but it'd still be questionable.

So I think the relation name is what to print here.  That's generally
what we do, and there's not much argument for this case to be different.

(I'm not unsympathetic to the idea that printing schema names
would be helpful.  Just that here is not where to start with
that.  Maybe we could consider sucking in the schema name
during relcache entry build, and then print from that copy so we
don't need an additional catalog fetch under error conditions?)

            regards, tom lane



Re: relation OID in ReorderBufferToastReplace error message

From
Jeremy Schneider
Date:
On 7/1/21 20:44, Tom Lane wrote:
So I think the relation name is what to print here.  That's generally
what we do, and there's not much argument for this case to be different.

Works for me.  Anything in the error message is quickly and easily visible to users, without attaching debuggers or decoding WAL. A lot of PostgreSQL users won't know how to do the advanced troubleshooting, but knowing the table name might give a clue for better identifying SQL that might have been related, and could help produce better bug reports to the mailing lists in the future.

Related to that, I do think it could be useful to backpatch this - we know that users are hitting this error on older versions.  Even though it's most likely that the speculative insert decoding memory leak bug will address the problem, this update still seems low-risk and useful to me just in the off-chance that someone hits the error again.

(I'm not unsympathetic to the idea that printing schema names
would be helpful.  Just that here is not where to start with
that.  Maybe we could consider sucking in the schema name
during relcache entry build, and then print from that copy so we
don't need an additional catalog fetch under error conditions?)

Agreed not to tackle that in this thread; I wouldn't want to slow down the simple tweak that could be useful.

-Jeremy


diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b8c5e2a44e..6a3a35d05d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -4625,8 +4625,8 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,

        toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
        if (!RelationIsValid(toast_rel))
-               elog(ERROR, "could not open relation with OID %u",
-                        relation->rd_rel->reltoastrelid);
+               elog(ERROR, "could not open toast relation with OID %u (base relation \"%s\")",
+                        relation->rd_rel->reltoastrelid, relation->rd_rel->relname);

        toast_desc = RelationGetDescr(toast_rel);




-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

Re: relation OID in ReorderBufferToastReplace error message

From
Jeremy Schneider
Date:
On 7/2/21 18:57, Jeremy Schneider wrote:
On 7/1/21 20:44, Tom Lane wrote:
So I think the relation name is what to print here.  That's generally
what we do, and there's not much argument for this case to be different.

Works for me.  Anything in the error message is quickly and easily visible to users, without attaching debuggers or decoding WAL. A lot of PostgreSQL users won't know how to do the advanced troubleshooting, but knowing the table name might give a clue for better identifying SQL that might have been related, and could help produce better bug reports to the mailing lists in the future.

Related to that, I do think it could be useful to backpatch this - we know that users are hitting this error on older versions.  Even though it's most likely that the speculative insert decoding memory leak bug will address the problem, this update still seems low-risk and useful to me just in the off-chance that someone hits the error again.

I have a few new thoughts, related to some new recent developments here.

Back in the first thread, Amit had asked if this was reproducible.  The user we know of who recently ran into this error hasn't been able to come up with a series of steps that reliably reproduces it, but they did encounter the error again last week on a test system after they left a test workload running for some time. (This apparently took a bit of effort to do and isn't easily repeatable.)

We investigated that system and we verified specifically for the incident last week that it definitely did *not* hit the speculative insert abort bug. We attached a debugger and checked the XLH_DELETE_IS_SUPER flag just before the decoding process error-ed out, and the flag was not set.

We looked a bit further at that system and while I'm not sure everything is solved, it did help Bertrand find one clear bug with an obvious reproduction. We have proposed that fix for the next commitfest:

https://commitfest.postgresql.org/34/3241/

The process of trying to understand this recent incident has given me some new insight about what information would be helpful up front in this error message for faster resolution.

First off, and most importantly, the current WAL record we're processing when the error is encountered. I wonder if it could easily print the LSN?

Secondly, the transaction ID. In the specific bug Bertrand found, the problem is actually not with the actual WAL record that's being processed - but rather because previous WAL records in the same transaction left the decoder process in a state where the current WAL record [a commit] generated an error.  So it's the entire transaction that needs to be examined to reproduce the error.  (Andres actually pointed this out on the original thread back in December 2019.)  I realize that once you know the LSN you can easily get the XID with pg_waldump, but personally I'd just as soon include the XID in the error message since I think it will usually be a first step for debugging any problems with WAL decoding. The I can go straight to filtering that XID on my first pg_waldump run.

Thoughts?

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

Re: relation OID in ReorderBufferToastReplace error message

From
Amit Kapila
Date:
On Thu, Jul 15, 2021 at 6:14 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>
> On 7/2/21 18:57, Jeremy Schneider wrote:
>
> The process of trying to understand this recent incident has given me some new insight about what information would
behelpful up front in this error message for faster resolution. 
>
> First off, and most importantly, the current WAL record we're processing when the error is encountered. I wonder if
itcould easily print the LSN? 
>
> Secondly, the transaction ID. In the specific bug Bertrand found, the problem is actually not with the actual WAL
recordthat's being processed - but rather because previous WAL records in the same transaction left the decoder process
ina state where the current WAL record [a commit] generated an error.  So it's the entire transaction that needs to be
examinedto reproduce the error.  (Andres actually pointed this out on the original thread back in December 2019.)  I
realizethat once you know the LSN you can easily get the XID with pg_waldump, but personally I'd just as soon include
theXID in the error message since I think it will usually be a first step for debugging any problems with WAL decoding.
TheI can go straight to filtering that XID on my first pg_waldump run. 
>

I don't think it is a bad idea to print additional information as you
are suggesting but why only for this error? It could be useful to
investigate any other error we get during decoding. I think normally
we add such additional information via error_context. We have recently
added/enhanced it for apply-workers, see commit [1].

I think here we should just print the relation name in the error
message you pointed out and then work on adding additional information
via error context as a separate patch. What do you think?

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=abc0910e2e0adfc5a17e035465ee31242e32c4fc


--
With Regards,
Amit Kapila.



Re: relation OID in ReorderBufferToastReplace error message

From
Amit Kapila
Date:
On Fri, Sep 17, 2021 at 10:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I don't think it is a bad idea to print additional information as you
> are suggesting but why only for this error? It could be useful to
> investigate any other error we get during decoding. I think normally
> we add such additional information via error_context. We have recently
> added/enhanced it for apply-workers, see commit [1].
>
> I think here we should just print the relation name in the error
> message you pointed out and then work on adding additional information
> via error context as a separate patch. What do you think?
>

Attached please find the patch which just modifies the current error
message as proposed by you. I am planning to commit it in a day or two
unless there are comments or any other suggestions.

-- 
With Regards,
Amit Kapila.

Attachment

Re: relation OID in ReorderBufferToastReplace error message

From
Jeremy Schneider
Date:
On 9/20/21 22:14, Amit Kapila wrote:
> On Fri, Sep 17, 2021 at 10:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I don't think it is a bad idea to print additional information as you
>> are suggesting but why only for this error? It could be useful to
>> investigate any other error we get during decoding. I think normally
>> we add such additional information via error_context. We have recently
>> added/enhanced it for apply-workers, see commit [1].
>>
>> I think here we should just print the relation name in the error
>> message you pointed out and then work on adding additional information
>> via error context as a separate patch. What do you think?
> 
> Attached please find the patch which just modifies the current error
> message as proposed by you. I am planning to commit it in a day or two
> unless there are comments or any other suggestions.

Looks good to me. I see that I hadn't used the macro for getting the
relation name, thanks for fixing that!

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services



Re: relation OID in ReorderBufferToastReplace error message

From
Amit Kapila
Date:
On Wed, Sep 22, 2021 at 2:17 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>
> On 9/20/21 22:14, Amit Kapila wrote:
> > On Fri, Sep 17, 2021 at 10:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> I don't think it is a bad idea to print additional information as you
> >> are suggesting but why only for this error? It could be useful to
> >> investigate any other error we get during decoding. I think normally
> >> we add such additional information via error_context. We have recently
> >> added/enhanced it for apply-workers, see commit [1].
> >>
> >> I think here we should just print the relation name in the error
> >> message you pointed out and then work on adding additional information
> >> via error context as a separate patch. What do you think?
> >
> > Attached please find the patch which just modifies the current error
> > message as proposed by you. I am planning to commit it in a day or two
> > unless there are comments or any other suggestions.
>
> Looks good to me. I see that I hadn't used the macro for getting the
> relation name, thanks for fixing that!
>

Pushed.

-- 
With Regards,
Amit Kapila.



Re: relation OID in ReorderBufferToastReplace error message

From
Jeremy Schneider
Date:
On 9/21/21 20:58, Amit Kapila wrote:
On 9/20/21 22:14, Amit Kapila wrote:
Attached please find the patch which just modifies the current error
message as proposed by you. I am planning to commit it in a day or two
unless there are comments or any other suggestions.

Pushed.

Any chance of back-patching this? That would get it much sooner into the particular PG system whose encounter with a logical replication bug inspired it.

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

Re: relation OID in ReorderBufferToastReplace error message

From
Amit Kapila
Date:
On Thu, Sep 23, 2021 at 3:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>
> On 9/21/21 20:58, Amit Kapila wrote:
>
> On 9/20/21 22:14, Amit Kapila wrote:
>
> Attached please find the patch which just modifies the current error
> message as proposed by you. I am planning to commit it in a day or two
> unless there are comments or any other suggestions.
>
> Pushed.
>
>
> Any chance of back-patching this?
>

Normally, we don't back-patch code improvements unless they fix some
bug or avoid future back-patch efforts. So, I am not inclined to
back-patch this but if others also feel strongly about this then we
can consider it.

-- 
With Regards,
Amit Kapila.



Re: relation OID in ReorderBufferToastReplace error message

From
Jeremy Schneider
Date:
On 9/22/21 20:11, Amit Kapila wrote:
> 
> On Thu, Sep 23, 2021 at 3:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>>
>> Any chance of back-patching this?
> 
> Normally, we don't back-patch code improvements unless they fix some
> bug or avoid future back-patch efforts. So, I am not inclined to
> back-patch this but if others also feel strongly about this then we
> can consider it.

The original thread about the logical replication bugs spawned a few
different threads and code changes. The other code changes coming out of
those threads were all back-patched, but I guess I can see arguments
both ways on this one.

https://www.postgresql.org/message-id/flat/E1i74JE-00024V-9J%40gemulon.postgresql.org

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b5905470285bf730f6fe7cc5ddb3513d0e6945

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=df3640e5293dccbf964508babfc067282ea7a2fc


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services



Re: relation OID in ReorderBufferToastReplace error message

From
Alvaro Herrera
Date:
On 2021-Sep-23, Jeremy Schneider wrote:

> On 9/22/21 20:11, Amit Kapila wrote:
> > 
> > On Thu, Sep 23, 2021 at 3:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
> >>
> >> Any chance of back-patching this?
> > 
> > Normally, we don't back-patch code improvements unless they fix some
> > bug or avoid future back-patch efforts. So, I am not inclined to
> > back-patch this but if others also feel strongly about this then we
> > can consider it.
> 
> The original thread about the logical replication bugs spawned a few
> different threads and code changes. The other code changes coming out of
> those threads were all back-patched, but I guess I can see arguments
> both ways on this one.

I think that for patches that are simple debugging aids we do
backpatch, with the intent to get them deployed in users' systems as
soon and as widely possible.  I did that in this one, for example

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master [961dd7565] 2021-08-30 16:29:12 -0400
Branch: REL_14_STABLE [eae08e216] 2021-08-30 16:29:12 -0400
Branch: REL_13_STABLE [6197d7b53] 2021-08-30 16:29:12 -0400
Branch: REL_12_STABLE [fa8ae19be] 2021-08-30 16:29:12 -0400
Branch: REL_11_STABLE [0105b7aaa] 2021-08-30 16:29:12 -0400
Branch: REL_10_STABLE [02797ffa9] 2021-08-30 16:29:12 -0400
Branch: REL9_6_STABLE [37e468252] 2021-08-30 16:29:12 -0400

    Report tuple address in data-corruption error message

    Most data-corruption reports mention the location of the problem, but
    this one failed to.  Add it.
    
    Backpatch all the way back.  In 12 and older, also assign the
    ERRCODE_DATA_CORRUPTED error code as was done in commit fd6ec93bf890 for
    13 and later.
    
    Discussion: https://postgr.es/m/202108191637.oqyzrdtnheir@alvherre.pgsql

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php



Re: relation OID in ReorderBufferToastReplace error message

From
"Bossart, Nathan"
Date:
On 9/23/21, 11:26 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Sep-23, Jeremy Schneider wrote:
>
>> On 9/22/21 20:11, Amit Kapila wrote:
>> >
>> > On Thu, Sep 23, 2021 at 3:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
>> >>
>> >> Any chance of back-patching this?
>> >
>> > Normally, we don't back-patch code improvements unless they fix some
>> > bug or avoid future back-patch efforts. So, I am not inclined to
>> > back-patch this but if others also feel strongly about this then we
>> > can consider it.
>>
>> The original thread about the logical replication bugs spawned a few
>> different threads and code changes. The other code changes coming out of
>> those threads were all back-patched, but I guess I can see arguments
>> both ways on this one.
>
> I think that for patches that are simple debugging aids we do
> backpatch, with the intent to get them deployed in users' systems as
> soon and as widely possible.  I did that in this one, for example

+1 for back-patching

Nathan


Re: relation OID in ReorderBufferToastReplace error message

From
Amit Kapila
Date:
On Fri, Oct 15, 2021 at 3:40 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> On 9/23/21, 11:26 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > On 2021-Sep-23, Jeremy Schneider wrote:
> >
> >> On 9/22/21 20:11, Amit Kapila wrote:
> >> >
> >> > On Thu, Sep 23, 2021 at 3:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
> >> >>
> >> >> Any chance of back-patching this?
> >> >
> >> > Normally, we don't back-patch code improvements unless they fix some
> >> > bug or avoid future back-patch efforts. So, I am not inclined to
> >> > back-patch this but if others also feel strongly about this then we
> >> > can consider it.
> >>
> >> The original thread about the logical replication bugs spawned a few
> >> different threads and code changes. The other code changes coming out of
> >> those threads were all back-patched, but I guess I can see arguments
> >> both ways on this one.
> >
> > I think that for patches that are simple debugging aids we do
> > backpatch, with the intent to get them deployed in users' systems as
> > soon and as widely possible.  I did that in this one, for example
>
> +1 for back-patching
>

I can take care of backpatching this in the next few days unless there
is any objection.

-- 
With Regards,
Amit Kapila.



Re: relation OID in ReorderBufferToastReplace error message

From
"Bossart, Nathan"
Date:
On 10/18/21, 3:31 AM, "Amit Kapila" <amit.kapila16@gmail.com> wrote:
> I can take care of backpatching this in the next few days unless there
> is any objection.

Thanks!

Nathan


Re: relation OID in ReorderBufferToastReplace error message

From
Amit Kapila
Date:
On Mon, Oct 18, 2021 at 3:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 15, 2021 at 3:40 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> >
> > On 9/23/21, 11:26 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > > On 2021-Sep-23, Jeremy Schneider wrote:
> > >
> > >> On 9/22/21 20:11, Amit Kapila wrote:
> > >> >
> > >> > On Thu, Sep 23, 2021 at 3:06 AM Jeremy Schneider <schnjere@amazon.com> wrote:
> > >> >>
> > >> >> Any chance of back-patching this?
> > >> >
> > >> > Normally, we don't back-patch code improvements unless they fix some
> > >> > bug or avoid future back-patch efforts. So, I am not inclined to
> > >> > back-patch this but if others also feel strongly about this then we
> > >> > can consider it.
> > >>
> > >> The original thread about the logical replication bugs spawned a few
> > >> different threads and code changes. The other code changes coming out of
> > >> those threads were all back-patched, but I guess I can see arguments
> > >> both ways on this one.
> > >
> > > I think that for patches that are simple debugging aids we do
> > > backpatch, with the intent to get them deployed in users' systems as
> > > soon and as widely possible.  I did that in this one, for example
> >
> > +1 for back-patching
> >
>
> I can take care of backpatching this in the next few days unless there
> is any objection.
>

Done.

-- 
With Regards,
Amit Kapila.



Re: relation OID in ReorderBufferToastReplace error message

From
Jeremy Schneider
Date:
On 10/21/21 01:42, Amit Kapila wrote:
On Mon, Oct 18, 2021 at 3:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I can take care of backpatching this in the next few days unless there is any objection.
Done.

Thanks Amit


-Jeremy

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services