Thread: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Erik Rijkers
Date:
Hi, On master I got a FailedAssertion("HaveRegisteredOrActiveSnapshot()" on an assert-enabled instance and with (I think) data over a certain length. I whittled it down to the attached bash (careful - it drops stuff). It has 5 tsv-data lines (one long line) that COPY slurps into a table. The middle, third line causes the problem, later on. Shortening the long line to somewhere below 2000 characters fixes it again. More info in the attached .sh file. If debug-assert is 'off', the problem does not occur. (REL_14_STABLE also does not have the problem, assertions or not) thanks, Erik Rijkers
Attachment
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Kyotaro Horiguchi
Date:
At Sun, 27 Mar 2022 20:32:45 +0200, Erik Rijkers <er@xs4all.nl> wrote in > On master I got a FailedAssertion("HaveRegisteredOrActiveSnapshot()" > on an assert-enabled instance and with (I think) data over a certain > length. > > I whittled it down to the attached bash (careful - it drops stuff). > It has 5 tsv-data lines (one long line) that COPY slurps into a table. > The middle, third line causes the problem, later on. Shortening the > long line to somewhere below 2000 characters fixes it again. > > More info in the attached .sh file. It is reproducible for me. Thanks for the reproducer. > If debug-assert is 'off', the problem does not occur. (REL_14_STABLE > also does not have the problem, assertions or not) It seems like related with [1]? Inserting EnsurePortalSnapshotExists() to RunFromStore fixes this but I'm not sure where is the right place to do this.. [1] https://www.postgresql.org/message-id/flat/20210623035916.GL29179%40telsasoft.com#f802617a00cee4d013ad8fa69e1af048 For someone's information, this is more readable stack trace. #0 0x00007f43aeed037f in raise () from /lib64/libc.so.6 #1 0x00007f43aeebadb5 in abort () from /lib64/libc.so.6 #2 0x0000000000b28747 in ExceptionalCondition ( conditionName=0xba2c48 "HaveRegisteredOrActiveSnapshot()", errorType=0xba2882 "FailedAssertion", fileName=0xba2870 "toast_internals.c", lineNumber=670) at assert.c:69 #3 0x00000000004ac776 in init_toast_snapshot (toast_snapshot=0x7ffce64f7440) at toast_internals.c:670 #4 0x00000000005164ea in heap_fetch_toast_slice (toastrel=0x7f43b193cad0, valueid=16393, attrsize=1848, sliceoffset=0, slicelength=1848, result=0x1cbb948) at heaptoast.c:688 #5 0x000000000049fc86 in table_relation_fetch_toast_slice ( toastrel=0x7f43b193cad0, valueid=16393, attrsize=1848, sliceoffset=0, slicelength=1848, result=0x1cbb948) at ../../../../src/include/access/tableam.h:1892 #6 0x00000000004a0a0f in toast_fetch_datum (attr=0x1d6b171) at detoast.c:375 #7 0x000000000049fffb in detoast_attr (attr=0x1d6b171) at detoast.c:123 #8 0x0000000000b345ba in pg_detoast_datum_packed (datum=0x1d6b171) at fmgr.c:1757 #9 0x0000000000aece72 in text_to_cstring (t=0x1d6b171) at varlena.c:225 #10 0x0000000000aedda2 in textout (fcinfo=0x7ffce64f77a0) at varlena.c:574 #11 0x0000000000b331bf in FunctionCall1Coll (flinfo=0x1d695e0, collation=0, arg1=30847345) at fmgr.c:1138 #12 0x0000000000b3422b in OutputFunctionCall (flinfo=0x1d695e0, val=30847345) at fmgr.c:1575 #13 0x00000000004a6b6c in printtup (slot=0x1cb81f0, self=0x1c96e90) at printtup.c:357 #14 0x000000000099499f in RunFromStore (portal=0x1cf9380, direction=ForwardScanDirection, count=0, dest=0x1c96e90) at pquery.c:1096 #15 0x00000000009944e3 in PortalRunSelect (portal=0x1cf9380, forward=true, count=0, dest=0x1c96e90) at pquery.c:917 #16 0x00000000009941d3 in PortalRun (portal=0x1cf9380, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1c96e90, altdest=0x1c96e90, qc=0x7ffce64f7ac0) at pquery.c:765 #17 0x000000000098df4b in exec_simple_query ( query_string=0x1c96030 "fetch all in myportal;") at postgres.c:1250 #18 0x00000000009923a3 in PostgresMain (dbname=0x1cc11b0 "postgres", username=0x1cc1188 "horiguti") at postgres.c:4520 #19 0x00000000008c6caf in BackendRun (port=0x1cb74c0) at postmaster.c:4593 #20 0x00000000008c6631 in BackendStartup (port=0x1cb74c0) at postmaster.c:4321 #21 0x00000000008c29cb in ServerLoop () at postmaster.c:1801 #22 0x00000000008c2298 in PostmasterMain (argc=1, argv=0x1c8e0e0) at postmaster.c:1473 #23 0x00000000007c14c3 in main (argc=1, argv=0x1c8e0e0) at main.c:202 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Kyotaro Horiguchi
Date:
At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > Inserting EnsurePortalSnapshotExists() to RunFromStore fixes this but > I'm not sure where is the right place to do this.. Then, I found that portal->holdSnapshot is that. I came up with the attached. It does the follows: 1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt. 2. Use holdSnapshot in RunFromStore if any. The rerpducer is reduced to as small as the following. CREATE TABLE t (a text); INSERT INTO t VALUES('some random text'); BEGIN; DECLARE c CURSOR FOR SELECT * FROM t; FETCH ALL IN c; But I haven't come up with a reasonable way to generate the 'some random text' yet. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Kyotaro Horiguchi
Date:
At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > Then, I found that portal->holdSnapshot is that. I came up with the > attached. It does the follows: > > 1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt. > > 2. Use holdSnapshot in RunFromStore if any. > > > The rerpducer is reduced to as small as the following. > > CREATE TABLE t (a text); > INSERT INTO t VALUES('some random text'); > BEGIN; > DECLARE c CURSOR FOR SELECT * FROM t; > FETCH ALL IN c; > > But I haven't come up with a reasonable way to generate the 'some > random text' yet. I gave up and took a straightforward way to generate one. I don't like that it uses a fixed length for the random text, but anyway it works for now... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Matthias van de Meent
Date:
On Tue, 29 Mar 2022 at 11:10, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > Then, I found that portal->holdSnapshot is that. I came up with the > > attached. It does the follows: > > > > 1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt. > > > > 2. Use holdSnapshot in RunFromStore if any. > > > > > > The rerpducer is reduced to as small as the following. > > > > CREATE TABLE t (a text); > > INSERT INTO t VALUES('some random text'); > > BEGIN; > > DECLARE c CURSOR FOR SELECT * FROM t; > > FETCH ALL IN c; > > > > But I haven't come up with a reasonable way to generate the 'some > > random text' yet. > > I gave up and took a straightforward way to generate one. > > I don't like that it uses a fixed length for the random text, but > anyway it works for now... An shorter (?) reproducer might be the following, which forces any value for 'a' to be toasted and thus triggering the check in init_toast_snapshot regardless of value length: CREATE TABLE t (a text); ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL; INSERT INTO t VALUES ('toast'); BEGIN; DECLARE c CURSOR FOR SELECT * FROM t; FETCH ALL IN c; Enjoy, -Matthias
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Erik Rijkers
Date:
Op 29-03-2022 om 12:50 schreef Matthias van de Meent: > On Tue, 29 Mar 2022 at 11:10, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: >> >> At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in >>> At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in >>> Then, I found that portal->holdSnapshot is that. I came up with the >>> attached. It does the follows: >>> >>> 1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt. >>> >>> 2. Use holdSnapshot in RunFromStore if any. >>> >>> >>> The rerpducer is reduced to as small as the following. >>> >>> CREATE TABLE t (a text); >>> INSERT INTO t VALUES('some random text'); >>> BEGIN; >>> DECLARE c CURSOR FOR SELECT * FROM t; >>> FETCH ALL IN c; >>> >>> But I haven't come up with a reasonable way to generate the 'some >>> random text' yet. >> >> I gave up and took a straightforward way to generate one. >> >> I don't like that it uses a fixed length for the random text, but >> anyway it works for now... > > An shorter (?) reproducer might be the following, which forces any > value for 'a' to be toasted and thus triggering the check in > init_toast_snapshot regardless of value length: > > CREATE TABLE t (a text); > ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL; > INSERT INTO t VALUES ('toast'); > BEGIN; > DECLARE c CURSOR FOR SELECT * FROM t; > FETCH ALL IN c; Excellent. That indeed immediately forces the error. (and the patch prevents it) Thanks! > > Enjoy, > > -Matthias
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Kyotaro Horiguchi
Date:
At Tue, 29 Mar 2022 13:29:15 +0200, Erik Rijkers <er@xs4all.nl> wrote in > Op 29-03-2022 om 12:50 schreef Matthias van de Meent: > > An shorter (?) reproducer might be the following, which forces any > > value for 'a' to be toasted and thus triggering the check in > > init_toast_snapshot regardless of value length: > > CREATE TABLE t (a text); > > ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL; > > INSERT INTO t VALUES ('toast'); > > BEGIN; > > DECLARE c CURSOR FOR SELECT * FROM t; > > FETCH ALL IN c; Yeah, unfortunately I tried that first and saw it didn't work. And it still doesn't for me. With such a short text pg_detoast_datum_pakced doesn't call detoast_attr. Actually it is VARATT_IS_1B. (@master) I think I'm missing something here. I'm going to examine around. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Kyotaro Horiguchi
Date:
At Wed, 30 Mar 2022 10:06:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Tue, 29 Mar 2022 13:29:15 +0200, Erik Rijkers <er@xs4all.nl> wrote in > > Op 29-03-2022 om 12:50 schreef Matthias van de Meent: > > > An shorter (?) reproducer might be the following, which forces any > > > value for 'a' to be toasted and thus triggering the check in > > > init_toast_snapshot regardless of value length: > > > CREATE TABLE t (a text); > > > ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL; > > > INSERT INTO t VALUES ('toast'); > > > BEGIN; > > > DECLARE c CURSOR FOR SELECT * FROM t; > > > FETCH ALL IN c; > > Yeah, unfortunately I tried that first and saw it didn't work. And it > still doesn't for me. With such a short text pg_detoast_datum_pakced > doesn't call detoast_attr. Actually it is VARATT_IS_1B. (@master) > > I think I'm missing something here. I'm going to examine around. Hmm. Strange. My memory tells that I did the same thing before.. I thought that it is somewhat related to compression since repeat('x', 4096) didin't seem working at that time, but it worked this time. Maybe I was confused between extended and external.. But, in the first place the *fix* has been found to be wrong. I'm going to search for the right fix.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Kyotaro Horiguchi
Date:
At Wed, 30 Mar 2022 11:46:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > But, in the first place the *fix* has been found to be wrong. I'm > going to search for the right fix.. FETCH uses the snapshot at DECLARE. So anyhow I needed to set the queryDesk's snapshot used in PortalRunSelect to the FETCH's portal's holdSnapshot. What I did in this version is: 1. Add a new member "snapshot" to the type DestReceiver. 2. In PortalRunSelect, set the DECLARE'd query's snapshot to the member iff the dest is tupelstore and the active snapshot is not set. 3. In FillPortalStore, copy the snapshot to the portal's holdSnapshot. 4. RunFromStore uses holdSnapshot if any. I'm not still confident on this, but it should be better than the v1. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Kyotaro Horiguchi
Date:
At Wed, 30 Mar 2022 17:58:24 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Wed, 30 Mar 2022 11:46:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > But, in the first place the *fix* has been found to be wrong. I'm > > going to search for the right fix.. > > FETCH uses the snapshot at DECLARE. So anyhow I needed to set the > queryDesk's snapshot used in PortalRunSelect to the FETCH's portal's > holdSnapshot. What I did in this version is: By the way, this is, given that the added check in init_toast_snapshot is valid, a long-standing "bug", which at least back to 12. I'm not sure what to do for this. 1. ignore the check for certain cases? 2. apply any fix only to master and call it a day. 14 and earlier doesn't have the assertion check so they don't complain. 3. apply a fix to all affected versions. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Justin Pryzby
Date:
On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote: > I'm not still confident on this, but it should be better than the v1. +Andres as this seems to be related to 277692220. I added it as an Opened Item.
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-13 15:28:19 -0500, Justin Pryzby wrote: > On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote: > > I'm not still confident on this, but it should be better than the v1. > > +Andres as this seems to be related to 277692220. FWIW, that'd just mean it's an old bug that wasn't easily noticeable before, not that it's the fault of 277692220. > I added it as an Opened Item. IOW, it'd belong in "Older bugs affecting stable branches". Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-04-13 15:28:19 -0500, Justin Pryzby wrote: > > On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote: > > > I'm not still confident on this, but it should be better than the v1. > > > > +Andres as this seems to be related to 277692220. > > FWIW, that'd just mean it's an old bug that wasn't easily noticeable > before, not that it's the fault of 277692220. I think you're still on the hook to do something about it for this release. You could decide to revert the commit adding the assertion and punt on doing thing about the underlying bug, but we can't just be like "oh, an assertion is failing, we'll get to that someday". -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote: >> FWIW, that'd just mean it's an old bug that wasn't easily noticeable >> before, not that it's the fault of 277692220. > I think you're still on the hook to do something about it for this > release. I think you're trying to shoot the messenger. As Andres says, 277692220 just exposes that there is some pre-existing bug here. It's probably related to 84f5c2908, so I was planning to take a look at it at some point, but there are a few other higher-priority bugs in the way. I see no point in reverting 277692220. Removing the Assert would prevent, or at least complicate, detection of other similar bugs. And it'd do nothing to help end users, who won't have assertions enabled anyway. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote: > >> FWIW, that'd just mean it's an old bug that wasn't easily noticeable > >> before, not that it's the fault of 277692220. > > > I think you're still on the hook to do something about it for this > > release. > > I think you're trying to shoot the messenger. As Andres says, > 277692220 just exposes that there is some pre-existing bug here. > It's probably related to 84f5c2908, so I was planning to take a > look at it at some point, but there are a few other higher-priority > bugs in the way. Well, if you're willing to look at it that's fine, but I just don't agree that it's OK to just commit things that add failing assertions and drive off into the sunset. The code is always going to have a certain number of unfixed bugs, and that's fine, and finding them is good in itself, but people want to be able to run the software in the meantime, and some of those people are developers or other individuals who want to run it with assertions enabled. It's a judgement call whether this assertion failure is going to bite enough people to be a problem, but if it were something that happened easily enough to prevent you from working on the source code, I'm sure you wouldn't be OK with leaving it in there until someone got around to looking at it. Given that it took about a month and a half for someone to report them problem, it's not as bad as all that, but I guess I won't be surprised if we keep getting complaints until something gets done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-14 11:33:45 -0400, Robert Haas wrote: > On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote: > > >> FWIW, that'd just mean it's an old bug that wasn't easily noticeable > > >> before, not that it's the fault of 277692220. > > > > > I think you're still on the hook to do something about it for this > > > release. > > > > I think you're trying to shoot the messenger. As Andres says, > > 277692220 just exposes that there is some pre-existing bug here. > > It's probably related to 84f5c2908, so I was planning to take a > > look at it at some point, but there are a few other higher-priority > > bugs in the way. > > Well, if you're willing to look at it that's fine, but I just don't > agree that it's OK to just commit things that add failing assertions > and drive off into the sunset. I'm not planning to ride into the sunset / ignore this issue. All I said is that it's imo not the right thing to say that that commit broke things in 15. And that not a semantics game, because it means that the fix needs to go back further than 277692220. Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think you're trying to shoot the messenger. As Andres says, >> 277692220 just exposes that there is some pre-existing bug here. >> It's probably related to 84f5c2908, so I was planning to take a >> look at it at some point, but there are a few other higher-priority >> bugs in the way. > Well, if you're willing to look at it that's fine, but I just don't > agree that it's OK to just commit things that add failing assertions > and drive off into the sunset. I don't think Andres had any intention of ignoring it indefinitely. What he is doing is prioritizing it lower than several of the other open items, an opinion I fully agree with. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Thu, Apr 14, 2022 at 11:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I don't think Andres had any intention of ignoring it indefinitely. > What he is doing is prioritizing it lower than several of the other > open items, an opinion I fully agree with. Well, my point is that, historically, relegating things to the older bugs section often means nothing happens, which I think would not be great in this case. However, I'll try to shut up about the procedural issues for the moment, since I seem to be in the minority. I got curious and looked at the underlying problem here and I am wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It seems to me that the code is always going to return true if there are any active snapshots, and the rest of the function is intended to test whether there is a registered snapshot other than the catalog snapshot. But I don't think that's what this code does: if (pairingheap_is_empty(&RegisteredSnapshots) || !pairingheap_is_singular(&RegisteredSnapshots)) return false; return CatalogSnapshot == NULL; So, if there are 0 active snapshots we return false. And if there is 1 active snapshot then we return true if there's no catalog snapshot and otherwise false. So far so good. But if there are 2 or more registered snapshots then the pairing heap will be neither empty nor singular so it seems to me we will return false, which seems to me to be the wrong answer. I tried this: diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c index a0be0c411a..4e8e26a362 100644 --- a/src/backend/utils/time/snapmgr.c +++ b/src/backend/utils/time/snapmgr.c @@ -1646,9 +1646,10 @@ HaveRegisteredOrActiveSnapshot(void) * removed at any time due to invalidation processing. If explicitly * registered more than one snapshot has to be in RegisteredSnapshots. */ - if (pairingheap_is_empty(&RegisteredSnapshots) || - !pairingheap_is_singular(&RegisteredSnapshots)) + if (pairingheap_is_empty(&RegisteredSnapshots)) return false; + if (!pairingheap_is_singular(&RegisteredSnapshots)) + return true; return CatalogSnapshot == NULL; } I find that 'make check-world' passes with this change, which is disturbing, because it also passes without this change. That means we don't have any tests that reach HaveRegisteredOrActiveSnapshot() with more than one registered snapshot. Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in more places, I think we should consider revising the whole approach here. The way init_toast_snapshot() is coded, we basically have some code that tests for active and registered snapshots and finds the oldest one. We error out if there isn't one. And then we get to this assertion, which checks the same stuff a second time but with an additional check to see whether we ended up with the catalog snapshot. Wouldn't it make more sense if GetOldestSnapshot() just refused to return the catalog snapshot in the first place? -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-14 12:16:45 -0400, Robert Haas wrote: > I got curious and looked at the underlying problem here and I am > wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It > seems to me that the code is always going to return true if there are > any active snapshots, and the rest of the function is intended to test > whether there is a registered snapshot other than the catalog > snapshot. But I don't think that's what this code does: > > if (pairingheap_is_empty(&RegisteredSnapshots) || > !pairingheap_is_singular(&RegisteredSnapshots)) > return false; > > return CatalogSnapshot == NULL; Certainly looks off... > I find that 'make check-world' passes with this change, which is > disturbing, because it also passes without this change. That means we > don't have any tests that reach HaveRegisteredOrActiveSnapshot() with > more than one registered snapshot. Part of that is because right now the assertion is placed "too deep" - it should be much higher up, so it's reached even if there's not actually a toast datum. But there's of other bugs preventing that :(. A lot of bugs have been hidden by the existence of CatalogSnapshot (which of course isn't something one actually can rely on). > Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in > more places, I think we should, but there's the other bugs that need to be fixed first :(. Namely that we have plenty places doing catalog accesses without an active or registered snapshot :(. > I think we should consider revising the whole approach here. The way > init_toast_snapshot() is coded, we basically have some code that tests > for active and registered snapshots and finds the oldest one. We error > out if there isn't one. And then we get to this assertion, which > checks the same stuff a second time but with an additional check to > see whether we ended up with the catalog snapshot. Wouldn't it make > more sense if GetOldestSnapshot() just refused to return the catalog > snapshot in the first place? I'm worried that that could cause additional bugs. Consider code using GetOldestSnapshot() to check if tuples need to be preserved or such. Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-14 09:54:25 -0700, Andres Freund wrote: > On 2022-04-14 12:16:45 -0400, Robert Haas wrote: > > I got curious and looked at the underlying problem here and I am > > wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It > > seems to me that the code is always going to return true if there are > > any active snapshots, and the rest of the function is intended to test > > whether there is a registered snapshot other than the catalog > > snapshot. But I don't think that's what this code does: > > > > if (pairingheap_is_empty(&RegisteredSnapshots) || > > !pairingheap_is_singular(&RegisteredSnapshots)) > > return false; > > > > return CatalogSnapshot == NULL; > > Certainly looks off... > > > > I find that 'make check-world' passes with this change, which is > > disturbing, because it also passes without this change. That means we > > don't have any tests that reach HaveRegisteredOrActiveSnapshot() with > > more than one registered snapshot. > > Part of that is because right now the assertion is placed "too deep" - > it should be much higher up, so it's reached even if there's not > actually a toast datum. But there's of other bugs preventing that :(. A > lot of bugs have been hidden by the existence of CatalogSnapshot (which > of course isn't something one actually can rely on). > > > > Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in > > more places, > > I think we should, but there's the other bugs that need to be fixed > first :(. Namely that we have plenty places doing catalog accesses > without an active or registered snapshot :(. Ah, we actually were debating some of these issues more recently, in: https://www.postgresql.org/message-id/20220311030721.olixpzcquqkw2qyt%40alap3.anarazel.de https://www.postgresql.org/message-id/20220311021047.hgtqkrl6n52srvdu%40alap3.anarazel.de It looks like the same bug, and that the patch in this thread fixes them. And that we need to backpatch the fix, right? Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Thu, Apr 14, 2022 at 12:54 PM Andres Freund <andres@anarazel.de> wrote: > Part of that is because right now the assertion is placed "too deep" - > it should be much higher up, so it's reached even if there's not > actually a toast datum. But there's of other bugs preventing that :(. A > lot of bugs have been hidden by the existence of CatalogSnapshot (which > of course isn't something one actually can rely on). I am wondering whether ffaa44cb559db332baeee7d25dedd74a61974203 bet on the wrong horse. When I invented CatalogSnapshot, I intended for it to be an ephemeral snapshot that we'd keep for as long as we could safely do so and then discard it as soon as there's any hint of a problem. But that commit really takes the opposite approach, trying to keep CatalogSnapshot valid for a longer period of time instead of just chucking it. > > Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in > > more places, > > I think we should, but there's the other bugs that need to be fixed > first :(. Namely that we have plenty places doing catalog accesses > without an active or registered snapshot :(. Ouch. > I'm worried that that could cause additional bugs. Consider code using > GetOldestSnapshot() to check if tuples need to be preserved or such. But there is no such code: GetOldestSnapshot() has only one caller. And I don't think we'd add more just to do something like that, because we have other mechanisms that are specifically designed for testing whether tuples are prunable that are better suited to such tasks. It should really only be used when you don't know which of the backend's current snapshots ought to be used for something, but are sure that using the oldest one will be good enough. And in that kind of situation, it's hard to see why using the catalog snapshot would ever be the right idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > I am wondering whether ffaa44cb559db332baeee7d25dedd74a61974203 bet on > the wrong horse. When I invented CatalogSnapshot, I intended for it to > be an ephemeral snapshot that we'd keep for as long as we could safely > do so and then discard it as soon as there's any hint of a problem. > But that commit really takes the opposite approach, trying to keep > CatalogSnapshot valid for a longer period of time instead of just > chucking it. Not really following? ISTM the point of ffaa44cb5 was that if we don't include the CatalogSnapshot in our advertised xmin, then the length of time we can safely use it is *zero*. > But there is no such code: GetOldestSnapshot() has only one caller. > And I don't think we'd add more just to do something like that, > because we have other mechanisms that are specifically designed for > testing whether tuples are prunable that are better suited to such > tasks. It should really only be used when you don't know which of the > backend's current snapshots ought to be used for something, but are > sure that using the oldest one will be good enough. And in that kind > of situation, it's hard to see why using the catalog snapshot would > ever be the right idea. What if the reason we need a snapshot is to detoast some toasted item we read from a catalog with the CatalogSnapshot? There might not be any other valid snapshot, so I don't think I buy your argument here. ... Unless your argument is that the session should always have an older non-catalog snapshot, which I'm not sure whether I buy or not. But if we do believe that, then adding mechanisms to force it to be so could be an alternative solution. But why is that better than what we're doing? regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-14 13:36:51 -0400, Tom Lane wrote: > What if the reason we need a snapshot is to detoast some toasted item > we read from a catalog with the CatalogSnapshot? There might not be > any other valid snapshot, so I don't think I buy your argument here. We definitely do have places doing that, but is it ever actually safe? Part of the catalog access might cause cache invalidations to be processed, which can invalidate the snapshot (including resetting MyProc->xmin). Afaics we always would have push or register the snapshot, either will copy the snapshot, I think? Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Thu, Apr 14, 2022 at 1:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I am wondering whether ffaa44cb559db332baeee7d25dedd74a61974203 bet on > > the wrong horse. When I invented CatalogSnapshot, I intended for it to > > be an ephemeral snapshot that we'd keep for as long as we could safely > > do so and then discard it as soon as there's any hint of a problem. > > But that commit really takes the opposite approach, trying to keep > > CatalogSnapshot valid for a longer period of time instead of just > > chucking it. > > Not really following? ISTM the point of ffaa44cb5 was that if we don't > include the CatalogSnapshot in our advertised xmin, then the length > of time we can safely use it is *zero*. No, I don't think so. I'm proposing that you shouldn't be taking a catalog snapshot unless you already hold some other snapshot, and that the catalog snapshot should always be newer than whatever that other snapshot is. If we made that be true, then your catalog snapshot couldn't ever be the thing holding back xmin -- and then I think it doesn't need to be registered. > > But there is no such code: GetOldestSnapshot() has only one caller. > > And I don't think we'd add more just to do something like that, > > because we have other mechanisms that are specifically designed for > > testing whether tuples are prunable that are better suited to such > > tasks. It should really only be used when you don't know which of the > > backend's current snapshots ought to be used for something, but are > > sure that using the oldest one will be good enough. And in that kind > > of situation, it's hard to see why using the catalog snapshot would > > ever be the right idea. > > What if the reason we need a snapshot is to detoast some toasted item > we read from a catalog with the CatalogSnapshot? There might not be > any other valid snapshot, so I don't think I buy your argument here. > > ... Unless your argument is that the session should always have an > older non-catalog snapshot, which I'm not sure whether I buy or not. > But if we do believe that, then adding mechanisms to force it to be so > could be an alternative solution. But why is that better than what > we're doing? That is exactly my argument, but I'm not actually sure whether it is in fact better. I was responding to Andres's statement that CatalogSnapshot was hiding a lot of bugs because it makes it look like we have a snapshot when we don't really. And my point is that ffaa44cb559db332baeee7d25dedd74a61974203 made such problems a lot more likely, because before that the snapshot was not in RegisteredSnapshots, and therefore anybody asking "hey, do we have a snapshot?" wasn't likely to get confused, because they would only see the catalog snapshot if they specifically went and looked at CatalogSnapshot(Set), and if you do that, hopefully you'll think about the special semantics of that snapshot and write code that works with them. But with CatalogSnapshot in RegisteredSnapshots, any code that looks through RegisteredSnapshots has to worry about whether what it's finding there is actually just the CatalogSnapshot, and if Andres's statement that we have a lot of bugs here is to be believed, then we have not done a good job finding and updating all such code. We can continue down the path of finding and fixing it -- or we can back out parts of ffaa44cb559db332baeee7d25dedd74a61974203. Just to be clear, I'm not debating that that commit fixed some real problems and I think parts of it are really necessary fixes. But to quote from the commit message: The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting for whether MyPgXact->xmin could be cleared or advanced. In normal transactions this was masked by the fact that the transaction snapshot would be older, but during backend startup and certain utility commands it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had been cleared ... And what I'm suggesting is that *perhaps* we ought to have fixed those "during backend startup and certain utility commands" by having SnapshotResetXmin() do InvalidateCatalogSnapshot(), and maybe also made the code in those commands that's doing catalog lookups hold acquire and hold a snapshot of its own around the operation. The alternative you chose, namely, incorporating the xmin into the backend's xmin computation, I think can also be made to work. I am just not sure that it's the best approach. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2022-04-14 12:16:45 -0400, Robert Haas wrote: >> I got curious and looked at the underlying problem here and I am >> wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It >> seems to me that the code is always going to return true if there are >> any active snapshots, and the rest of the function is intended to test >> whether there is a registered snapshot other than the catalog >> snapshot. But I don't think that's what this code does: >> >> if (pairingheap_is_empty(&RegisteredSnapshots) || >> !pairingheap_is_singular(&RegisteredSnapshots)) >> return false; >> >> return CatalogSnapshot == NULL; > Certainly looks off... Yeah, this is broken. Whilst waiting around for a build on wrasse's host, I reproduced the problem shown in this thread, and here's what I see at the point of the exception: (gdb) p RegisteredSnapshots $5 = {ph_compare = 0x9a6000 <xmin_cmp>, ph_arg = 0x0, ph_root = 0xec3168 <CatalogSnapshotData+72>} (gdb) p *RegisteredSnapshots.ph_root $6 = {first_child = 0x2d85d70, next_sibling = 0x0, prev_or_parent = 0x0} (gdb) p CatalogSnapshotData $7 = {snapshot_type = SNAPSHOT_MVCC, xmin = 52155, xmax = 52155, xip = 0x2d855b0, xcnt = 0, subxip = 0x2de9130, subxcnt = 0, suboverflowed = false, takenDuringRecovery = false, copied = false, curcid = 0, speculativeToken = 0, vistest = 0x0, active_count = 0, regd_count = 0, ph_node = {first_child = 0x2d85d70, next_sibling = 0x0, prev_or_parent = 0x0}, whenTaken = 0, lsn = 0, snapXactCompletionCount = 1} (gdb) p CatalogSnapshot $8 = (Snapshot) 0xec3120 <CatalogSnapshotData> (gdb) p *(Snapshot) (0x2d85d70-72) $9 = {snapshot_type = SNAPSHOT_MVCC, xmin = 52155, xmax = 52155, xip = 0x0, xcnt = 0, subxip = 0x0, subxcnt = 0, suboverflowed = false, takenDuringRecovery = false, copied = true, curcid = 0, speculativeToken = 0, vistest = 0x0, active_count = 0, regd_count = 2, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0xec3168 <CatalogSnapshotData+72>}, whenTaken = 0, lsn = 0, snapXactCompletionCount = 0} (gdb) p ActiveSnapshot $10 = (ActiveSnapshotElt *) 0x0 So in fact there IS another registered snapshot, and HaveRegisteredOrActiveSnapshot is just lying. I think the correct test is more nearly what we have in InvalidateCatalogSnapshotConditionally: if (CatalogSnapshot && ActiveSnapshot == NULL && pairingheap_is_singular(&RegisteredSnapshots)) // then the CatalogSnapshot is the only one. Ergo, this actually is a bug in 277692220. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Apr 14, 2022 at 1:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Not really following? ISTM the point of ffaa44cb5 was that if we don't >> include the CatalogSnapshot in our advertised xmin, then the length >> of time we can safely use it is *zero*. > No, I don't think so. I'm proposing that you shouldn't be taking a > catalog snapshot unless you already hold some other snapshot, and that > the catalog snapshot should always be newer than whatever that other > snapshot is. If we made that be true, then your catalog snapshot > couldn't ever be the thing holding back xmin -- and then I think it > doesn't need to be registered. If you don't register it, then you need to also make sure that it's destroyed whenever that older snapshot is. Which I think would require either a lot of useless/inefficient CatalogSnapshot destructions, or infrastructure that's more or less isomorphic to the RegisteredSnapshot heap. > That is exactly my argument, but I'm not actually sure whether it is > in fact better. I was responding to Andres's statement that > CatalogSnapshot was hiding a lot of bugs because it makes it look like > we have a snapshot when we don't really. Well, we DO have a snapshot, and it is 100% perfectly safe to use, if it's registered. Andres' complaint is that that snapshot might get invalidated when you weren't expecting it, but I'm not really convinced that we have all that many bugs of that ilk. Wouldn't CLOBBER_CACHE_ALWAYS testing find them? regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Thu, Apr 14, 2022 at 3:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you don't register it, then you need to also make sure that it's > destroyed whenever that older snapshot is. Which I think would require > either a lot of useless/inefficient CatalogSnapshot destructions, or > infrastructure that's more or less isomorphic to the RegisteredSnapshot > heap. Well, if that's true, then I agree that it's a good argument against that approach. But I guess I'm confused as to why we'd end up in that situation. Suppose we do these two things: 1. Decree that SnapshotResetXmin calls InvalidateCatalogSnapshot. It's the other way around right now, but that's only because we're registering the catalog snapshot. 2. Bomb out in GetCatalogSnapshot if you don't have an active or registered snapshot already. Is there some reason we'd need any more infrastructure than that? > Well, we DO have a snapshot, and it is 100% perfectly safe to use, if it's > registered. Andres' complaint is that that snapshot might get invalidated > when you weren't expecting it, but I'm not really convinced that we have > all that many bugs of that ilk. Wouldn't CLOBBER_CACHE_ALWAYS testing > find them? Hmm, that's a good question. I don't know. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Apr 14, 2022 at 3:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If you don't register it, then you need to also make sure that it's >> destroyed whenever that older snapshot is. > Well, if that's true, then I agree that it's a good argument against > that approach. But I guess I'm confused as to why we'd end up in that > situation. Suppose we do these two things: > 1. Decree that SnapshotResetXmin calls InvalidateCatalogSnapshot. It's > the other way around right now, but that's only because we're > registering the catalog snapshot. > 2. Bomb out in GetCatalogSnapshot if you don't have an active or > registered snapshot already. > Is there some reason we'd need any more infrastructure than that? Yes. 1. Create snapshot 1 (beginning of transaction). 2. Create catalog snapshot (okay because of snapshot 1). 3. Create snapshot 2. 4. Destroy snapshot 1. 5. Catalog snapshot is still there and is now the oldest. The implementation you propose would have to also forbid this sequence of events, which is (a) difficult and (b) would add instability to the system, since there's really no reason that this should be Not OK. I'm basically not on board with adding complication to make the system less performant and more brittle, and I don't see how the direction you want to go isn't that. (BTW, this thought experiment also puts a hole in the test added by 277692220: even if HaveRegisteredOrActiveSnapshot were doing what it claims to do, it would allow use of the catalog snapshot for detoasting after step 4, which I suppose is not what Andres intended.) regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-14 15:05:50 -0400, Tom Lane wrote: > Andres' complaint is that that snapshot might get invalidated when you > weren't expecting it, but I'm not really convinced that we have all > that many bugs of that ilk. Wouldn't CLOBBER_CACHE_ALWAYS testing > find them? Don't see why it would - we don't have any mechanism in place for enforcing that we don't update / delete a tuple we've looked up with an xmin that wasn't continually enforced. A typical pattern is to use a catalog cache (registered an all) for a syscache lookup, but then not have a registered / active snapshot until an eventual update / delete (after the syscache scan ends). Which isn't safe, because without a MyProc->xmin set, the tuple we're updating / deleting could be updated, removed and replaced with another tuple. Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Thu, Apr 14, 2022 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Well, if that's true, then I agree that it's a good argument against > > that approach. But I guess I'm confused as to why we'd end up in that > > situation. Suppose we do these two things: > > > 1. Decree that SnapshotResetXmin calls InvalidateCatalogSnapshot. It's > > the other way around right now, but that's only because we're > > registering the catalog snapshot. > > 2. Bomb out in GetCatalogSnapshot if you don't have an active or > > registered snapshot already. > > > Is there some reason we'd need any more infrastructure than that? > > Yes. > > 1. Create snapshot 1 (beginning of transaction). > 2. Create catalog snapshot (okay because of snapshot 1). > 3. Create snapshot 2. > 4. Destroy snapshot 1. > 5. Catalog snapshot is still there and is now the oldest. Sorry, I'm not seeing the problem. If we call SnapshotResetXmin() after step 4, then the catalog snapshot would get invalidated under my proposal. If we don't, then our advertised xmin has not changed and nothing can be pruned out from under us. > I'm basically not on board with adding complication to make the system > less performant and more brittle, and I don't see how the direction > you want to go isn't that. Well ... I agree that brittle is bad, but it's not clear to me which way is actually less brittle. As for performant, I think you might be misjudging the situation. My original design for removing SnapshotNow didn't even have the catalog snapshot - it just took a new snapshot every time. That was mostly fine, but Andres found a somewhat extreme test case where it exhibited a significant regression, so I added the catalog snapshot stuff to work around that. So I'm not AT ALL convinced that giving catalog snapshots longer lifetimes is a relevant thing to do. There's some value in it if you can construct a test case where the overall rate of snapshot taking is extremely high, but in normal cases that isn't so. It's certainly not worth complicating the code for backend startup or DDL commands to reduce the number of snapshots, because you're never going to have those things happening at a high enough rate to matter, or so I believe. The way you get a benefit from CatalogSnapshot is to construct a workload with a lot of really cheap SQL statements each of which has to do a bunch of catalog lookups, and then run that at high concurrency. The concurrency makes taking snapshots more expensive, because the cache lines are contended, and having the same command use the same snapshot over and over again instead of taking new ones then brings the cost down enough to be measurable. But people run SQL statements in a tight loop, not DDL or backend startup. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > Well ... I agree that brittle is bad, but it's not clear to me which > way is actually less brittle. As for performant, I think you might be > misjudging the situation. My original design for removing SnapshotNow > didn't even have the catalog snapshot - it just took a new snapshot > every time. That was mostly fine, but Andres found a somewhat extreme > test case where it exhibited a significant regression, so I added the > catalog snapshot stuff to work around that. So I'm not AT ALL > convinced that giving catalog snapshots longer lifetimes is a relevant > thing to do. Perhaps not. But right now we have to first think about correctness and how much trouble it will be to get to correctness. ISTM you are arguing for a design in which it's required that there is always a registered or active snapshot that's older than the catalog snapshot (if any). I tried revising snapmgr.c to enforce that, starting with adding @@ -421,6 +421,13 @@ GetNonHistoricCatalogSnapshot(Oid relid) if (CatalogSnapshot == NULL) { + /* + * The catalog snapshot must always be newer than some active or + * registered snapshot. (XXX explain why) + */ + Assert(ActiveSnapshot != NULL || + !pairingheap_is_empty(&RegisteredSnapshots)); + /* Get new snapshot. */ CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData); and this blew up with truly impressive thoroughness. The autovac launcher, logical replication launcher, and incoming backends all fail this assertion instantly, making it impossible to find out what else might be broken --- but I'm sure there is a lot. (If you want to try this for yourself, remember that the postmaster will relaunch the AV and LR launchers on failure, meaning that your disk will fill with core files very very quickly. Just sayin'.) So maybe we can go that direction, but it's going to require a lot of code additions to push extra snapshots in places that haven't bothered to date; and I'm not convinced that that'd be buying us anything except more GetSnapshotData() calls. Plan B is to grant catalog snapshots some more-durable status than what Plan A envisions. I'm not very sure about the details, but if you don't want to go that route then you'd better set about making the above assertion work. In the meantime, since it's clear that HaveRegisteredOrActiveSnapshot is failing to meet its contract, I'm going to go fix it. I think (based on the above argument) that what it intends to enforce is not really the system design we need, but it certainly isn't helping anyone that it enforces that design incorrectly. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-16 14:42:39 -0400, Tom Lane wrote: > In the meantime, since it's clear that HaveRegisteredOrActiveSnapshot > is failing to meet its contract, I'm going to go fix it. +1 > I think (based on the above argument) that what it intends to enforce > is not really the system design we need, but it certainly isn't > helping anyone that it enforces that design incorrectly. I think it's approximately right for the current caller. But that caller likely needs an improved design around snapshots... Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2022-04-16 14:42:39 -0400, Tom Lane wrote: >> I think (based on the above argument) that what it intends to enforce >> is not really the system design we need, but it certainly isn't >> helping anyone that it enforces that design incorrectly. > I think it's approximately right for the current caller. But that caller > likely needs an improved design around snapshots... Yeah, I think the real issue is that checking HaveRegisteredOrActiveSnapshot in this way doesn't provide a very good guarantee of what we really want to know, which is that the session's advertised xmin is old enough to prevent removal of whatever toast data we're trying to fetch. The fact that we have a snapshot at the instant of fetch doesn't prove that it existed continually since we fetched the toast reference, which seems to be the condition we actually need to assure. (And TBH I see little reason to think that whether the snapshot is the CatalogSnapshot or not changes things in any meaningful way.) I don't yet see a practical way to check for the real concern. While it's something to worry about, there's no reason to think that v15 is any worse than prior versions in this area, is there? So I'm inclined to remove this from the list of v15 open items, or at least demote the remaining concern to "older bug" status. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Sat, Apr 16, 2022 at 2:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > and this blew up with truly impressive thoroughness. The autovac > launcher, logical replication launcher, and incoming backends all > fail this assertion instantly, making it impossible to find out > what else might be broken --- but I'm sure there is a lot. OK, thanks for trying that. > In the meantime, since it's clear that HaveRegisteredOrActiveSnapshot > is failing to meet its contract, I'm going to go fix it. I think > (based on the above argument) that what it intends to enforce is not > really the system design we need, but it certainly isn't helping > anyone that it enforces that design incorrectly. +1. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Andres Freund
Date:
Hi, On 2022-04-17 11:51:58 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2022-04-16 14:42:39 -0400, Tom Lane wrote: > >> I think (based on the above argument) that what it intends to enforce > >> is not really the system design we need, but it certainly isn't > >> helping anyone that it enforces that design incorrectly. > > > I think it's approximately right for the current caller. But that caller > > likely needs an improved design around snapshots... > > Yeah, I think the real issue is that checking > HaveRegisteredOrActiveSnapshot in this way doesn't provide a very > good guarantee of what we really want to know, which is that the > session's advertised xmin is old enough to prevent removal of > whatever toast data we're trying to fetch. Right. It's better than what was there before though - I added HaveRegisteredOrActiveSnapshot() in the course of 7c38ef2a5d6cf6d8dc3834399d7a1c364d64ce64. Where the problem was that we didn't have *any* snapshot other than the catalog snapshot, and the catalog snapshot only sometimes (iirc for that bug it depended on the order in which objects were deleted). That makes such bugs much harder to detect. > The fact that we have a snapshot at the instant of fetch doesn't prove > that it existed continually since we fetched the toast reference, > which seems to be the condition we actually need to assure. Right. > (And TBH I see little reason to think that whether the snapshot is the > CatalogSnapshot or not changes things in any meaningful way.) It is a meaningful difference, see e.g. the bug referenced above. > I don't yet see a practical way to check for the real concern. > While it's something to worry about, there's no reason to think > that v15 is any worse than prior versions in this area, is there? > So I'm inclined to remove this from the list of v15 open items, > or at least demote the remaining concern to "older bug" status. Yes. Greetings, Andres Freund
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Mon, Apr 18, 2022 at 10:39 AM Andres Freund <andres@anarazel.de> wrote: > Right. It's better than what was there before though - I added > HaveRegisteredOrActiveSnapshot() in the course of > 7c38ef2a5d6cf6d8dc3834399d7a1c364d64ce64. Where the problem was that we > didn't have *any* snapshot other than the catalog snapshot, and the > catalog snapshot only sometimes (iirc for that bug it depended on the > order in which objects were deleted). That makes such bugs much harder > to detect. I still think it would be better to have GetOldestSnapshot() be smarter and refuse to return the catalog snapshot. For one thing, that way we'd be testing for the problem case in non-assert builds also. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2022-04-17 11:51:58 -0400, Tom Lane wrote: >> The fact that we have a snapshot at the instant of fetch doesn't prove >> that it existed continually since we fetched the toast reference, >> which seems to be the condition we actually need to assure. > Right. >> (And TBH I see little reason to think that whether the snapshot is the >> CatalogSnapshot or not changes things in any meaningful way.) > It is a meaningful difference, see e.g. the bug referenced above. Well, that's true given the current arrangements for managing CatalogSnapshot; but that doesn't make the CatalogSnapshot any less of a protection when it exists. The direction I was vaguely imagining is that we create some refcount-like infrastructure directly ensuring that once a snapshot is used to read a toast reference, it gets kept around until we dereference or discard that reference. With a scheme like that, there'd be no reason to discriminate against a CatalogSnapshot as being the protective snapshot. (I hasten to add that I have no idea how to make this half-baked plan work, and there may be better solutions anyway.) >> While it's something to worry about, there's no reason to think >> that v15 is any worse than prior versions in this area, is there? >> So I'm inclined to remove this from the list of v15 open items, >> or at least demote the remaining concern to "older bug" status. > Yes. OK, I'll update the open-items page. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > I still think it would be better to have GetOldestSnapshot() be > smarter and refuse to return the catalog snapshot. For one thing, that > way we'd be testing for the problem case in non-assert builds also. I was wondering about that too. On the other hand, given that we know this area is squishy, transforming fails-in-assert-builds to fails-everywhere is not necessarily desirable. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Mon, Apr 18, 2022 at 10:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I still think it would be better to have GetOldestSnapshot() be > > smarter and refuse to return the catalog snapshot. For one thing, that > > way we'd be testing for the problem case in non-assert builds also. > > I was wondering about that too. On the other hand, given that > we know this area is squishy, transforming fails-in-assert-builds > to fails-everywhere is not necessarily desirable. I agree that it's a little unclear. In general, I think if we're going to blow up and die, doing it closer to the place where the problem is happening is for the best. On the other hand, if in most practical cases we're going to stumble through and get the right answer anyway, then it's maybe not great to break a bunch of accidentally-working cases. However, it does strikes me that this principal could easily be overdone. init_toast_snapshot() could pick a random snapshot (or take a new one) in order to call InitToastSnapshot() and that would often work fine. Yet, upon realizing that things are busted, it chooses to error out instead. I approve of that choice, and don't think we should rule out the idea of making that check more robust. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > I agree that it's a little unclear. In general, I think if we're going > to blow up and die, doing it closer to the place where the problem is > happening is for the best. On the other hand, if in most practical > cases we're going to stumble through and get the right answer anyway, > then it's maybe not great to break a bunch of accidentally-working > cases. However, it does strikes me that this principal could easily be > overdone. init_toast_snapshot() could pick a random snapshot (or take > a new one) in order to call InitToastSnapshot() and that would often > work fine. Yet, upon realizing that things are busted, it chooses to > error out instead. I approve of that choice, and don't think we should > rule out the idea of making that check more robust. I'm all for improving robustness, but failing in cases that would have worked before (even if only accidentally) is not going to be seen by users as more robust. I think that this late stage of the development cycle is not the time to be putting in changes that are not actually going to fix bugs but only call greater attention to the possibility that a bug exists. TBH, given where we are in the dev cycle, I thought there was a lot of sense behind your earlier thought that HaveRegisteredOrActiveSnapshot should be reverted entirely. I'm okay with keeping it as an assertion- only check, so that it won't bother end users. I'm not okay with adding end-user-visible failures, at least not till early in v16. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Mon, Apr 18, 2022 at 11:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I agree that it's a little unclear. In general, I think if we're going > > to blow up and die, doing it closer to the place where the problem is > > happening is for the best. On the other hand, if in most practical > > cases we're going to stumble through and get the right answer anyway, > > then it's maybe not great to break a bunch of accidentally-working > > cases. However, it does strikes me that this principal could easily be > > overdone. init_toast_snapshot() could pick a random snapshot (or take > > a new one) in order to call InitToastSnapshot() and that would often > > work fine. Yet, upon realizing that things are busted, it chooses to > > error out instead. I approve of that choice, and don't think we should > > rule out the idea of making that check more robust. > > I'm all for improving robustness, but failing in cases that would have > worked before (even if only accidentally) is not going to be seen by > users as more robust. I think that this late stage of the development > cycle is not the time to be putting in changes that are not actually > going to fix bugs but only call greater attention to the possibility > that a bug exists. > > TBH, given where we are in the dev cycle, I thought there was a lot of > sense behind your earlier thought that HaveRegisteredOrActiveSnapshot > should be reverted entirely. I'm okay with keeping it as an assertion- > only check, so that it won't bother end users. I'm not okay with > adding end-user-visible failures, at least not till early in v16. I wasn't really taking a position either way about timing. If we can demonstrate that things other than HaveRegisteredOrActiveSnapshot() itself are misbehaving, then I think fixes for those bugs are potentially back-patchable no matter where we are in the release cycle, but in terms of when we make changes to try to detect bugs we don't know about yet, I could go either way on whether to do that now or wait. We can't know whether the bugs we haven't found yet will cause a big problem for someone tomorrow, ten years from now, or never. I am not really very happy about HaveRegisteredOrActiveSnapshot(), honestly. I think in the form we have it in the tree it looks under-engineered. It's not really testing the right thing (even leaving aside the bug fix) as you have been pointing out, it doesn't really mesh well with the sanity checking that was there before as I have been pointing out, and it's only used in one place. I wouldn't be sad if it got reverted. However, I don't think it's going to do us any great harm, either. Although it's a long way from the best thing we have in the tree, it's also a long way from the worst thing we have in the tree. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > I wasn't really taking a position either way about timing. If we can > demonstrate that things other than HaveRegisteredOrActiveSnapshot() > itself are misbehaving, then I think fixes for those bugs are > potentially back-patchable no matter where we are in the release > cycle, Sure, but ... > but in terms of when we make changes to try to detect bugs we > don't know about yet, I could go either way on whether to do that now > or wait. We can't know whether the bugs we haven't found yet will > cause a big problem for someone tomorrow, ten years from now, or > never. ... I think in this case we do have a pretty good idea of the possible consequences. Most of the time, an unsafe toast fetch will work fine because the toast data is still there. If you're very unlucky then it's been deleted, and vacuumed away, and then you get a "missing chunk number" error. If you're really astronomically unlucky, perhaps the toast OID has been recycled and you get the wrong data (it's not clear to me whether the toast snapshot visibility rules would prevent this). I doubt we need to factor that last scenario into practical risk estimates, though. So adding a non-assert check for snapshot misuse would effectively convert "if you're very unlucky you get a weird error" to "lucky or not, you get some other weird error", which no user is going to see as an improvement. > I am not really very happy about HaveRegisteredOrActiveSnapshot(), > honestly. Me either. If we find any other cases where it gives a false positive, I'll be for removing it rather than fixing it. But for the moment I'm content to leave it until we have a well-engineered solution to the real problem. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2022-04-17 11:51:58 -0400, Tom Lane wrote: >> The fact that we have a snapshot at the instant of fetch doesn't prove >> that it existed continually since we fetched the toast reference, >> which seems to be the condition we actually need to assure. > Right. BTW, after thinking about this for a bit I am less concerned than I was about the system being full of bugs of this ilk. The executor per se should be fine because it does everything under a live snapshot. We had bugs with cases that shove executor output into long-lived tuplestores, but we've dealt with that scenario. Catalog updates performed on tuples fetched from a catalog scan seem safe enough too. Andres was worried about catalog updates performed using tuples fetched from catcache, but that's not a problem because we detoasted every value when it went into the catcache, cf 08e261cbc. (Mind you, 08e261cbc's solution is risky performancewise, because it means we have to re-toast every value during such catalog updates, instead of being able to carry the old values of unchanged columns forward. But it's not a correctness bug.) (Also, the whining I did in 08e261cbc's commit message is no longer relevant now that we read catalogs with MVCC snapshots.) There may be some corner cases that aren't described by any of these three blanket scenarios, but they've got to be pretty few and far between. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Mon, Apr 18, 2022 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > There may be some corner cases that aren't described by any of these > three blanket scenarios, but they've got to be pretty few and far > between. My first thought whenever anything like this comes up is cursors, especially but not only holdable cursors. Also, plpgsql variables, maybe mixed with embedded COMMIT/ROLLBACK. I don't find it particularly hard to believe we have some bugs in insufficiently-well-considered parts of the system that pass around datums outside of the normal executor flow, but I don't know exactly how to find them all, either. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Apr 18, 2022 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There may be some corner cases that aren't described by any of these >> three blanket scenarios, but they've got to be pretty few and far >> between. > My first thought whenever anything like this comes up is cursors, > especially but not only holdable cursors. Also, plpgsql variables, > maybe mixed with embedded COMMIT/ROLLBACK. Those exact cases have had detoasting bugs in the past and are now fixed. > I don't find it > particularly hard to believe we have some bugs in > insufficiently-well-considered parts of the system that pass around > datums outside of the normal executor flow, but I don't know exactly > how to find them all, either. I'm not here to claim that there are precisely zero remaining bugs of this ilk. I'm just saying that I think we've flushed out most of them. I think there is some value in trying to think of a way to prove that none remain, but it's not a problem we can solve for v15. regards, tom lane
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
From
Robert Haas
Date:
On Tue, Apr 19, 2022 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not here to claim that there are precisely zero remaining bugs > of this ilk. I'm just saying that I think we've flushed out most > of them. I think there is some value in trying to think of a way > to prove that none remain, but it's not a problem we can solve > for v15. Sure, that's fine. -- Robert Haas EDB: http://www.enterprisedb.com