Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
Date
Msg-id CA+hUKG+4m+fvnM_Czyp3Z_h4SaqmwR7sq9XE=k0GcaL_afEfUA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: BUG #17368: Assert failed in GetSafeSnapshot() for SERIALIZABLE READ ONLY DEFERRABLE transaction
List pgsql-bugs
On Sun, Jan 16, 2022 at 5:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> On Fri, Jul 30, 2021 at 9:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> The problem is that if we don't take the fast exit here:
>
>     if (XactReadOnly && PredXact->WritableSxactCount == 0)
>     {
>         ReleasePredXact(sxact);
>         LWLockRelease(SerializableXactHashLock);
>         return snapshot;
>     }
>
> ... then we search for writable SSI transactions here:
>
>         for (othersxact = FirstPredXact();
>              othersxact != NULL;
>              othersxact = NextPredXact(othersxact))
>         {
>             if (!SxactIsCommitted(othersxact)
>                 && !SxactIsDoomed(othersxact)
>                 && !SxactIsReadOnly(othersxact))
>             {
>                 SetPossibleUnsafeConflict(sxact, othersxact);
>             }
>         }
>
> ... but that can find nothing if all writers happen to be doomed.
> WritableSxactCount doesn't count read-only or committed transactions
> so we don't have to worry about those confusing us, but it does count
> doomed transactions.  Having an empty list here is mostly fine, except
> that nobody will ever set our RO_SAFE flag, and in the case of
> DEFERRABLE, the assertion that someone has set it will fail.  This is
> the case since:
>
> ===
> commit bdaabb9b22caa71021754d3967b4032b194d9880
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   Fri Jul 8 00:36:30 2011 +0300
>
>     There's a small window wherein a transaction is committed but not yet
>     on the finished list, and we shouldn't flag it as a potential conflict
>     if so. We can also skip adding a doomed transaction to the list of
>     possible conflicts because we know it won't commit.
>
>     Dan Ports and Kevin Grittner.
> ===
>
> I see three fixes:
>
> 1.  Check if the list is empty, and if so, set our own
> SXACT_FLAG_RO_SAFE.  Then the assertion in GetSafeSnapshot() will
> pass, and also non-DEFERRABLE callers will eventually see the flag and
> opt out.
>
> 2.  Check if the list is empty, and if so, opt out immediately (as we
> do when WriteableSxactCount == 0).  This would be strictly better than
> #1, because it happens sooner while we still have the lock.  On the
> other hand, we'd have to undo more effects, and that involves writing
> more fragile and very rarely run code.
>
> 3.  Just delete the assertion in GetSafeSnapshot().  This is
> attractively simple but it means that READ ONLY non-DEFERRABLE
> transactions arbitrarily miss out on the RO_SAFE optimisation in this
> (admittedly rare) case.
>
> The attached 0002 has idea #1, which I prefer for back-patching.  I
> think #2 might be a good idea if we take the filtering logic further
> so that it becomes more likely that you get an empty list (for
> example, why don't we skip transactions that are running in a
> different database?), but then that'd be new code, not something we
> back-patch.  Thoughts?

While swapping unrelated bug #17116 back into my brain, I remembered
this one.  Coincidentally, I also had a note on my to-do list to
investigate a problem that has started showing up on CI, in the new
"running" tests (which seem to be finding fun new timing bugs in
committed code):

http://cfbot.cputube.org/highlights/assertion-30.html

Duh, it's this same ancient bug that we already figured out, which
dates from 2011.  I plan to do some more testing and then proceed with
the idea mentioned above[1].

[1] https://www.postgresql.org/message-id/attachment/125053/v4-0002-Fix-race-in-SERIALIZABLE-READ-ONLY.patch



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: found a possible bug, modulus of an integer on a partition table appears to be wrong
Next
From: Thomas Munro
Date:
Subject: Re: BUG #17116: Assert failed in SerialSetActiveSerXmin() on commit of parallelized serializable transaction