Thread: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17368 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 14.1 Operating system: Ubuntu 20.04 Description: The isolation test "read-only-anomaly-3" modified as follows: setup { CREATE TABLE bank_account (id TEXT PRIMARY KEY, balance DECIMAL NOT NULL); INSERT INTO bank_account (id, balance) VALUES ('X', 0), ('Y', 0); } teardown { DROP TABLE bank_account; } session s1 setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; } step s1ry { SELECT balance FROM bank_account WHERE id = 'Y'; } step s1wy { UPDATE bank_account SET balance = 20 WHERE id = 'Y'; } step s1c { COMMIT; } session s2 setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; } step s2rx { SELECT balance FROM bank_account WHERE id = 'X'; } step s2ry { SELECT balance FROM bank_account WHERE id = 'Y'; } step s2wx { UPDATE bank_account SET balance = -11 WHERE id = 'X'; } step s2c { COMMIT; } session s3 setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE; } step s3r { SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id; } step s3c { COMMIT; } session s4 setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; } step s4to { SET lock_timeout = '10ms' } step s4wx { UPDATE bank_account SET balance = -11 WHERE id = 'X'; } step s4c { COMMIT; } permutation s2rx s2ry s1ry s1wy s1c s3r s2wx s2c s3c permutation s2ry s1wy s1c s2wx s4to s4wx s4c s3r s2c s3c causes an assertion failure with the following stacktrace: Core was generated by `postgres: law isolation_regression [local] SELECT '. Program terminated with signal SIGABRT, Aborted. #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007f53a8a7c859 in __GI_abort () at abort.c:79 #2 0x000055edb1e89381 in ExceptionalCondition (conditionName=0x55edb2832980 "SxactIsROSafe(MySerializableXact)", errorType=0x55edb2831da0 "FailedAssertion", fileName=0x55edb2831d60 "predicate.c", lineNumber=1612) at assert.c:69 #3 0x000055edb1714442 in GetSafeSnapshot (origSnapshot=0x55edb6ac6e40 <CurrentSnapshotData>) at predicate.c:1612 #4 0x000055edb1714b91 in GetSerializableTransactionSnapshot (snapshot=0x55edb6ac6e40 <CurrentSnapshotData>) at predicate.c:1703 #5 0x000055edb201d623 in GetTransactionSnapshot () at snapmgr.c:291 #6 0x000055edb17613b5 in exec_simple_query ( query_string=0x625000005220 "SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id;") at postgres.c:1103 #7 0x000055edb176ea2e in PostgresMain (argc=1, argv=0x7ffdfa7290d0, dbname=<optimized out>, username=0x629000011258 "law") at postgres.c:4486 #8 0x000055edb13ea951 in BackendRun (port=0x615000004900) at postmaster.c:4530 #9 0x000055edb13e9142 in BackendStartup (port=0x615000004900) at postmaster.c:4252 #10 0x000055edb13df0a3 in ServerLoop () at postmaster.c:1745 #11 0x000055edb13dcf20 in PostmasterMain (argc=8, argv=0x607000000170) at postmaster.c:1417 #12 0x000055edb0f075c3 in main (argc=8, argv=0x607000000170) at main.c:209
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Alexander Lakhin
Date:
15.01.2022 11:00, PG Bug reporting form wrote:
This bug is separated from the #17116, because it's an unrelated issue. But it was discussed in the thread #17116 and there was proposed the working v4-0002-Fix-race-in-SERIALIZABLE-READ-ONLY.patch [1].The following bug has been logged on the website: Bug reference: 17368
I've attached the modification for the read-only-anomaly-3 test to reproduce the bug (the result is shown in the bug report).
For ease of reading, all relevant info quoted here:
On Wed, Jul 28, 2021 at 5:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:On Wed, Jul 28, 2021 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:(Except for read-only-anomaly-3, that causes another assertion fail but I believe that it's not related to the initial issue and deserves another bug report.)Huh. I tried and failed to find that one with concurrent runs. I'll wait for your next report before I do anything, just in case there's a connection.This is indeed an unrelated issue, and much older. After many kilowatt hours, I reproduced it here and worked out that it comes from GetSafeSnapshot() not expecting possibleUnsafeConflicts to be empty when WritableSxactCount == 0. More soon.
[1] https://www.postgresql.org/message-id/CA%2BhUKGJd-%3DmdUdS%2BGh-FN4rZgAg4M-V%3D%3DG7FMCU-KbUffyPJDw%40mail.gmail.comOn Fri, Jul 30, 2021 at 9:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:The problem is that if we don't take the fast exit here: if (XactReadOnly && PredXact->WritableSxactCount == 0) { ReleasePredXact(sxact); LWLockRelease(SerializableXactHashLock); return snapshot; } ... then we search for writable SSI transactions here: for (othersxact = FirstPredXact(); othersxact != NULL; othersxact = NextPredXact(othersxact)) { if (!SxactIsCommitted(othersxact) && !SxactIsDoomed(othersxact) && !SxactIsReadOnly(othersxact)) { SetPossibleUnsafeConflict(sxact, othersxact); } } ... but that can find nothing if all writers happen to be doomed. WritableSxactCount doesn't count read-only or committed transactions so we don't have to worry about those confusing us, but it does count doomed transactions. Having an empty list here is mostly fine, except that nobody will ever set our RO_SAFE flag, and in the case of DEFERRABLE, the assertion that someone has set it will fail. This is the case since: === commit bdaabb9b22caa71021754d3967b4032b194d9880 Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> Date: Fri Jul 8 00:36:30 2011 +0300 There's a small window wherein a transaction is committed but not yet on the finished list, and we shouldn't flag it as a potential conflict if so. We can also skip adding a doomed transaction to the list of possible conflicts because we know it won't commit. Dan Ports and Kevin Grittner. === I see three fixes: 1. Check if the list is empty, and if so, set our own SXACT_FLAG_RO_SAFE. Then the assertion in GetSafeSnapshot() will pass, and also non-DEFERRABLE callers will eventually see the flag and opt out. 2. Check if the list is empty, and if so, opt out immediately (as we do when WriteableSxactCount == 0). This would be strictly better than #1, because it happens sooner while we still have the lock. On the other hand, we'd have to undo more effects, and that involves writing more fragile and very rarely run code. 3. Just delete the assertion in GetSafeSnapshot(). This is attractively simple but it means that READ ONLY non-DEFERRABLE transactions arbitrarily miss out on the RO_SAFE optimisation in this (admittedly rare) case. The attached 0002 has idea #1, which I prefer for back-patching. I think #2 might be a good idea if we take the filtering logic further so that it becomes more likely that you get an empty list (for example, why don't we skip transactions that are running in a different database?), but then that'd be new code, not something we back-patch. Thoughts?
Best regards,
Alexander
Attachment
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Thomas Munro
Date:
On Sun, Jan 16, 2022 at 5:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: >> On Fri, Jul 30, 2021 at 9:41 AM Thomas Munro <thomas.munro@gmail.com> wrote: > The problem is that if we don't take the fast exit here: > > if (XactReadOnly && PredXact->WritableSxactCount == 0) > { > ReleasePredXact(sxact); > LWLockRelease(SerializableXactHashLock); > return snapshot; > } > > ... then we search for writable SSI transactions here: > > for (othersxact = FirstPredXact(); > othersxact != NULL; > othersxact = NextPredXact(othersxact)) > { > if (!SxactIsCommitted(othersxact) > && !SxactIsDoomed(othersxact) > && !SxactIsReadOnly(othersxact)) > { > SetPossibleUnsafeConflict(sxact, othersxact); > } > } > > ... but that can find nothing if all writers happen to be doomed. > WritableSxactCount doesn't count read-only or committed transactions > so we don't have to worry about those confusing us, but it does count > doomed transactions. Having an empty list here is mostly fine, except > that nobody will ever set our RO_SAFE flag, and in the case of > DEFERRABLE, the assertion that someone has set it will fail. This is > the case since: > > === > commit bdaabb9b22caa71021754d3967b4032b194d9880 > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Fri Jul 8 00:36:30 2011 +0300 > > There's a small window wherein a transaction is committed but not yet > on the finished list, and we shouldn't flag it as a potential conflict > if so. We can also skip adding a doomed transaction to the list of > possible conflicts because we know it won't commit. > > Dan Ports and Kevin Grittner. > === > > I see three fixes: > > 1. Check if the list is empty, and if so, set our own > SXACT_FLAG_RO_SAFE. Then the assertion in GetSafeSnapshot() will > pass, and also non-DEFERRABLE callers will eventually see the flag and > opt out. > > 2. Check if the list is empty, and if so, opt out immediately (as we > do when WriteableSxactCount == 0). This would be strictly better than > #1, because it happens sooner while we still have the lock. On the > other hand, we'd have to undo more effects, and that involves writing > more fragile and very rarely run code. > > 3. Just delete the assertion in GetSafeSnapshot(). This is > attractively simple but it means that READ ONLY non-DEFERRABLE > transactions arbitrarily miss out on the RO_SAFE optimisation in this > (admittedly rare) case. > > The attached 0002 has idea #1, which I prefer for back-patching. I > think #2 might be a good idea if we take the filtering logic further > so that it becomes more likely that you get an empty list (for > example, why don't we skip transactions that are running in a > different database?), but then that'd be new code, not something we > back-patch. Thoughts? While swapping unrelated bug #17116 back into my brain, I remembered this one. Coincidentally, I also had a note on my to-do list to investigate a problem that has started showing up on CI, in the new "running" tests (which seem to be finding fun new timing bugs in committed code): http://cfbot.cputube.org/highlights/assertion-30.html Duh, it's this same ancient bug that we already figured out, which dates from 2011. I plan to do some more testing and then proceed with the idea mentioned above[1]. [1] https://www.postgresql.org/message-id/attachment/125053/v4-0002-Fix-race-in-SERIALIZABLE-READ-ONLY.patch
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Thomas Munro
Date:
On Mon, Mar 6, 2023 at 4:02 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Duh, it's this same ancient bug that we already figured out, which > dates from 2011. I plan to do some more testing and then proceed with > the idea mentioned above[1]. Scratch that. I like the more ambitious idea better. I originally thought it would be less back-patchable, but that's nonsense -- this was a straight up thinko in ancient commit bdaabb9b and we should just fix it properly. That is, we should treat this case as a generalisation of the the PredXact->WritableSxactCount == 0 case, it's just that we had to do more work to figure that out. Here's a patch like that. As mentioned, this approach allows for more improvements in later patches. There is absolutely no reason for transactions in separate databases to be in each others' possibleUnsafeConflicts lists. At least, as long as PostgreSQL maintains its strict no-cross-database-access policy, which I don't see changing any time soon. So here is a draft patch like that, for the next commitfest. It's a little easier to exercise than the "everything-is-doomed" case. When I have more time I'll try to make proper tests, but basically you just need two databases: s1: \c db1 s1: CREATE TABLE t (i int); s1: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; s1: INSERT INTO t VALUES (42); s2: \c db2 s2: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE; s2: SELECT * FROM x; s2 shouldn't have to wait, and doesn't with this patch. But it does if it's in db1 instead.
Attachment
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes: > As mentioned, this approach allows for more improvements in later > patches. There is absolutely no reason for transactions in separate > databases to be in each others' possibleUnsafeConflicts lists. What if they both touched/modified a shared catalog? regards, tom lane
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Thomas Munro
Date:
On Tue, Mar 7, 2023 at 12:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > As mentioned, this approach allows for more improvements in later > > patches. There is absolutely no reason for transactions in separate > > databases to be in each others' possibleUnsafeConflicts lists. > > What if they both touched/modified a shared catalog? I was assuming that that would already skip all predicate locking because: /* * Does this relation participate in predicate locking? Temporary and system * relations are exempt. */ static inline bool PredicateLockingNeededForRelation(Relation relation) { return !(relation->rd_id < FirstUnpinnedObjectId || RelationUsesLocalBuffers(relation)); } If we ever wanted to use SSI on catalogs, or allow shared relations that aren't catalogs, or allow cross-database access, then this optimisation wouldn't fly.
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes: > On Tue, Mar 7, 2023 at 12:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> What if they both touched/modified a shared catalog? > I was assuming that that would already skip all predicate locking because: > /* > * Does this relation participate in predicate locking? Temporary and system > * relations are exempt. > */ Oh, I'd forgotten about that. > If we ever wanted to use SSI on catalogs, or allow shared relations > that aren't catalogs, or allow cross-database access, then this > optimisation wouldn't fly. I suppose there are some gotchas in the way of making that happen, so okay. regards, tom lane
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Thomas Munro
Date:
On Tue, Mar 7, 2023 at 12:39 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Scratch that. I like the more ambitious idea better. I originally > thought it would be less back-patchable, but that's nonsense -- this > was a straight up thinko in ancient commit bdaabb9b and we should just > fix it properly. That is, we should treat this case as a > generalisation of the the PredXact->WritableSxactCount == 0 case, it's > just that we had to do more work to figure that out. Here's a patch > like that. Here's a better version. I plan to push this tomorrow if there are no objections.
Attachment
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Alexander Lakhin
Date:
Hi Thomas, 08.03.2023 14:57, Thomas Munro wrote: > On Tue, Mar 7, 2023 at 12:39 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> Scratch that. I like the more ambitious idea better. I originally >> thought it would be less back-patchable, but that's nonsense -- this >> was a straight up thinko in ancient commit bdaabb9b and we should just >> fix it properly. That is, we should treat this case as a >> generalisation of the the PredXact->WritableSxactCount == 0 case, it's >> just that we had to do more work to figure that out. Here's a patch >> like that. > Here's a better version. I plan to push this tomorrow if there are no > objections. Just in case you find a test for the fix useful, maybe the attached patch could be added to the commit. (It triggers the assert on unpatched master.) Best regards, Alexander
Attachment
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Thomas Munro
Date:
On Thu, Mar 9, 2023 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > Just in case you find a test for the fix useful, maybe the attached > patch could be added to the commit. > (It triggers the assert on unpatched master.) I was wondering about that. I like the test and I was using exactly that + also some other techniques while working on the fix, but the point of that particular test spec is to show a variation of the standard example from the read-only-anomaly paper. Hmm, yeah, that's not a problem, I can just add the extra schedule with a comment to explain. Will do. Thanks!
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Thomas Munro
Date:
Pushed, without that test. I realised that the test would not be stable in the build farm. If we made the lock timeout high, the test would be slow, but if we made it low, then there would be two possible outputs depending on a race, and 10ms as you had it seems -- I guess? -- likely to be unstable under valgrind or an RPi2 or something. There is probably some clever way to write a different test schedule, but the new code is exercised by existing tests, and the assertion has been failing once every couple of days on CI since I started collecting that data a few weeks ago, so we have some kind of coverage, at least for master.
Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
From
Alexander Lakhin
Date:
09.03.2023 07:39, Thomas Munro wrote: > Pushed, without that test. > > I realised that the test would not be stable in the build farm. If we > made the lock timeout high, the test would be slow, but if we made it > low, then there would be two possible outputs depending on a race, and > 10ms as you had it seems -- I guess? -- likely to be unstable under > valgrind or an RPi2 or something. There is probably some clever way > to write a different test schedule, but the new code is exercised by > existing tests, and the assertion has been failing once every couple > of days on CI since I started collecting that data a few weeks ago, so > we have some kind of coverage, at least for master. I had thought that we can use the same timeout that can be seen in a couple of other tests, but now I see that those tests don't depend on it directly/have outweighing timeouts, so I completely agree, that the test proposed is not elaborated enough to be committed. Thank you! Best regards, Alexander