Re: relation OID in ReorderBufferToastReplace error message - Mailing list pgsql-hackers

From Jeremy Schneider
Subject Re: relation OID in ReorderBufferToastReplace error message
Date
Msg-id ace5f8ea-4777-b045-ba02-6a042a071919@amazon.com
Whole thread Raw
In response to Re: relation OID in ReorderBufferToastReplace error message  (Jeremy Schneider <schnjere@amazon.com>)
Responses Re: relation OID in ReorderBufferToastReplace error message
List pgsql-hackers
On 7/2/21 18:57, Jeremy Schneider wrote:
On 7/1/21 20:44, Tom Lane wrote:
So I think the relation name is what to print here.  That's generally
what we do, and there's not much argument for this case to be different.

Works for me.  Anything in the error message is quickly and easily visible to users, without attaching debuggers or decoding WAL. A lot of PostgreSQL users won't know how to do the advanced troubleshooting, but knowing the table name might give a clue for better identifying SQL that might have been related, and could help produce better bug reports to the mailing lists in the future.

Related to that, I do think it could be useful to backpatch this - we know that users are hitting this error on older versions.  Even though it's most likely that the speculative insert decoding memory leak bug will address the problem, this update still seems low-risk and useful to me just in the off-chance that someone hits the error again.

I have a few new thoughts, related to some new recent developments here.

Back in the first thread, Amit had asked if this was reproducible.  The user we know of who recently ran into this error hasn't been able to come up with a series of steps that reliably reproduces it, but they did encounter the error again last week on a test system after they left a test workload running for some time. (This apparently took a bit of effort to do and isn't easily repeatable.)

We investigated that system and we verified specifically for the incident last week that it definitely did *not* hit the speculative insert abort bug. We attached a debugger and checked the XLH_DELETE_IS_SUPER flag just before the decoding process error-ed out, and the flag was not set.

We looked a bit further at that system and while I'm not sure everything is solved, it did help Bertrand find one clear bug with an obvious reproduction. We have proposed that fix for the next commitfest:

https://commitfest.postgresql.org/34/3241/

The process of trying to understand this recent incident has given me some new insight about what information would be helpful up front in this error message for faster resolution.

First off, and most importantly, the current WAL record we're processing when the error is encountered. I wonder if it could easily print the LSN?

Secondly, the transaction ID. In the specific bug Bertrand found, the problem is actually not with the actual WAL record that's being processed - but rather because previous WAL records in the same transaction left the decoder process in a state where the current WAL record [a commit] generated an error.  So it's the entire transaction that needs to be examined to reproduce the error.  (Andres actually pointed this out on the original thread back in December 2019.)  I realize that once you know the LSN you can easily get the XID with pg_waldump, but personally I'd just as soon include the XID in the error message since I think it will usually be a first step for debugging any problems with WAL decoding. The I can go straight to filtering that XID on my first pg_waldump run.

Thoughts?

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: [PATCH] Use optimized single-datum tuplesort in ExecSort
Next
From: David Rowley
Date:
Subject: Re: [PATCH] Use optimized single-datum tuplesort in ExecSort