Re: BUG #17949: Adding an index introduces serialisation anomalies. - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #17949: Adding an index introduces serialisation anomalies.
Date
Msg-id CA+hUKGKmR57CgZAPRjeSQfXb_oxjSUx7673SqDRw9RF2FfCw=A@mail.gmail.com
Whole thread Raw
In response to Re: BUG #17949: Adding an index introduces serialisation anomalies.  (Dmitry Dolgov <9erthalion6@gmail.com>)
Responses Re: BUG #17949: Adding an index introduces serialisation anomalies.
Re: BUG #17949: Adding an index introduces serialisation anomalies.
List pgsql-bugs
On Sun, Jun 25, 2023 at 1:59 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've managed to resolve it, or at least reduce the chances for the issue to
> appear, via semi-randomly adding more CheckForSerializableConflictIn /
> PredicateLock around the new sublist that has to be created in
> ginHeapTupleFastInsert. I haven't seen the reproducer failing with this
> changeset after running it multiple times for a couple of minutes, where on the
> main branch, with the two fixes from Thomas included, it was failing within a
> couple of seconds.

Ahh, right, thanks.  I don't think we need to lock all those pages as
you showed, as this whole "fast" path is supposed to be covered by the
meta page (in other words, GIN is expected to have the highest
possible serialisation failure rate under SSI unless you turn fast
updates off).  But there is an ordering bug with the existing
predicate lock in that code, which allows this to happen:

  S1: CheckForSerializableConflictIn(meta)

  S2: PredicateLockPage(meta)
  S2: scan, find no tuples

  S1: BufferLock(EXCLUSIVE
  S1: modify stuff...

CheckForSerializableConflictIn() was written with the assumption that
you're inserting a tuple (ie you have the page containing the tuple
locked), so you'll either conflict with a reader who already has a
predicate lock at that point OR you'll insert first and then the
reader will see your (invisible-to-snapshot) tuples, but here we're
doing some fancy footwork with a meta page, and we screwed up the
ordering and left a window where neither of those things happens.
Perhaps it was coded that way because there is drop-then-reacquire
dance, but it's easy enough to move the check in both branches.  Does
that make sense?

Attachment

pgsql-bugs by date:

Previous
From: Thomas Munro
Date:
Subject: Re: BUG #17990: PSQL Process hangs in parallel mode
Next
From: Peter Geoghegan
Date:
Subject: Re: BUG #17949: Adding an index introduces serialisation anomalies.