BUG #17949: Adding an index introduces serialisation anomalies.

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      17949
Logged by:          Artem Anisimov
Email address:      artem.anisimov.255@gmail.com
PostgreSQL version: 15.3
Operating system:   fedora 38
Description:

Hello dear pg authors,

I have come across a behaviour in pg that contradicts
https://www.postgresql.org/docs/15/transaction-iso.html#XACT-SERIALIZABLE.

I've minimised my testcase to a scenario that essentially duplicates the
one you describe in the following paragraph:
> In particular, it is possible to see unique constraint violations
> caused by conflicts with overlapping Serializable transactions
> even after explicitly checking that the key isn't present
> before attempting to insert it.
> This can be avoided by making sure that all Serializable transactions
> that insert potentially conflicting keys explicitly check if they can
> do so first.

At the end of the report there is a C reproducer that highlights the
problem. Let me give a high-level overview of the scenario first.

I have N threads each trying to "acquire an exclusive lock" this way:
  BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  SELECT * FROM locks WHERE path = $1;
  quit if there already is a lock
  INSERT INTO locks(path, ...) VALUES($1, ...);
  COMMIT;

Once all threads have attempted to acquire a lock, I count how many of them
succeeded with their INSERT. Normally, there is only one thread that succeeds,
which is what I expect. However, if I add a HASH or a BTREE index on "path",
it becomes possible for multiple threads to INSERT successfully, which
violates serialisability.

The problem can be reproduced with the default postgresql.conf, but it takes
some time. If I increase "shared_buffers" to 1024MB, the issue appears
almost immediately (fewer chances to promote predicate locks to locks on the
whole table?).
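
(The promotion itself can be watched from another session, I believe, with
something like
  SELECT locktype, relation::regclass, page, tuple FROM pg_locks WHERE mode = 'SIReadLock';
where page- and tuple-level entries get replaced by coarser ones as the
predicate lock table fills up.)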

I've seen this behaviour with pg 13, 15 and 16 (503b055).

Now let us see the reproducer. It has two primary components:
1. test_once() spawns 32 threads that try to acquire a lock, waits for them,
and counts the number of "acquired exclusive locks",
2. try_acquire_lock() executes the transaction described above.

To build the reproducer:
$ gcc -std=c99 -o test test.c -pthread `pkg-config libpq --cflags --libs`

To run the reproducer:
$ ./test
or
$ DB_CONNSTR="dbname=abc host=def user=ghi" ./test

The reproducer:

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <assert.h>

#include <libpq-fe.h>


// Comment this to make serialisation anomalies go away.
#define WITH_INDEX


// I have seen the problem with as few as 3 threads. 32 threads make
// the issue appear much sooner.
#define NR_THREADS (32)
#define NR_RUNS (1024 * 1024)

static PGconn *conns[NR_THREADS];

static void* try_acquire_lock(void *arg)
{
        PGconn *c = arg;
        PGresult *res;
        int ntuples;

        res = PQexec(c, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE");
        assert(PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);

        res = PQexec(c, "SELECT * FROM locks WHERE path = 'xyz'");
        assert(PQresultStatus(res) == PGRES_TUPLES_OK);
        ntuples = PQntuples(res);
        PQclear(res);

        if (ntuples > 0) {
                // someone else already has a lock
                res = PQexec(c, "COMMIT");
                assert(PQresultStatus(res) == PGRES_COMMAND_OK);
                PQclear(res);
                return NULL;
        }

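        // Note: no status asserts on the INSERT and COMMIT below; under
        // SERIALIZABLE either statement may legitimately fail with a
        // serialization error (SQLSTATE 40001), which is the expected
        // outcome for all but one thread.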
        res = PQexec(c, "INSERT INTO locks(path) VALUES('xyz')");
        PQclear(res);

        res = PQexec(c, "COMMIT");
        PQclear(res);

        return NULL;
}

static void test_once(void)
{
        PGconn *c = conns[0];
        PGresult *res;
        int ntuples;

        pthread_t thrs[NR_THREADS];
        for (int i = 0; i < NR_THREADS; ++i)
                pthread_create(&thrs[i], NULL, &try_acquire_lock, conns[i]);
        for (int i = 0; i < NR_THREADS; ++i)
                pthread_join(thrs[i], NULL);

        res = PQexec(c, "SELECT * FROM locks WHERE path = 'xyz'");
        assert(PQresultStatus(res) == PGRES_TUPLES_OK);
        ntuples = PQntuples(res);
        PQclear(res);

        if (ntuples != 1)
                printf("ntuples = %d\n", ntuples);
        assert(ntuples == 1);

        res = PQexec(c, "TRUNCATE TABLE locks");
        assert(PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
}

static void prepare_db(void)
{
        PGconn *c = conns[0];
        PGresult *res;

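        // no status assert: the DROP is allowed to fail on the first run,
        // when the table does not exist yet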
        res = PQexec(c, "DROP TABLE locks");
        PQclear(res);

        res = PQexec(c, "CREATE TABLE locks (path TEXT NOT NULL)");
        assert(PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);

#ifdef WITH_INDEX
        res = PQexec(c, "CREATE INDEX ON locks USING HASH(path)");
        assert(PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
#endif
}

int main(void)
{
        const char *connstr = getenv("DB_CONNSTR");
        if (connstr == NULL)
                connstr = "dbname=postgres";

        for (int i = 0; i < NR_THREADS; ++i) {
                conns[i] = PQconnectdb(connstr);
                assert(PQstatus(conns[i]) == CONNECTION_OK);
        }
        prepare_db();

        for (int i = 0; i < NR_RUNS; ++i)
                test_once();

        for (int i = 0; i < NR_THREADS; ++i)
                PQfinish(conns[i]);

        return 0;
}


Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
Hi,

Reproduced here.  Thanks for the reproducer.  I agree that something
is wrong here.  I haven't had time to figure out what yet, but let
me share what I noticed so far... I modified your test to add a pid
column to the locks table and to insert pg_backend_pid() into
it, and got:

postgres=# select xmin, * from locks;

┌───────┬──────┬───────┐
│ xmin  │ path │  pid  │
├───────┼──────┼───────┤
│ 17634 │ xyz  │ 32932 │
│ 17639 │ xyz  │ 32957 │
└───────┴──────┴───────┘

Then I filtered the logs (having turned the logging up to capture all
queries) so I could see just those PIDs and saw this sequence:

2023-05-29 00:15:43.933 EDT [32932] LOG:  duration: 0.182 ms
statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.934 EDT [32957] LOG:  duration: 0.276 ms
statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.935 EDT [32932] LOG:  duration: 1.563 ms
statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.936 EDT [32932] LOG:  duration: 0.126 ms
statement: INSERT INTO locks(path, pid) VALUES('xyz',
pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 2.191 ms
statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 0.261 ms
statement: INSERT INTO locks(path, pid) VALUES('xyz',
pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32932] LOG:  duration: 0.222 ms  statement: COMMIT
2023-05-29 00:15:43.939 EDT [32957] LOG:  duration: 1.775 ms  statement: COMMIT

That sequence, if run (without overlap) in the logged order, is normally
rejected.  The query plan being used (at least when I run the query
myself) looks like this:

Query Text: SELECT * FROM locks WHERE path = 'xyz'
Bitmap Heap Scan on locks  (cost=4.20..13.67 rows=6 width=36)
  Recheck Cond: (path = 'xyz'::text)
  ->  Bitmap Index Scan on locks_path_idx  (cost=0.00..4.20 rows=6 width=0)
        Index Cond: (path = 'xyz'::text)



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Artem Anisimov
Hi Thomas,


thank you for the confirmation and analysis. Did you have a chance to 
take a more detailed look at the problem?


Best regards,

Artem.


On 30/05/2023 04:29, Thomas Munro wrote:
> Hi,
>
> Reproduced here.  Thanks for the reproducer.  I agree that something
> is wrong here.  I haven't had time to figure out what yet, but let
> me share what I noticed so far... I modified your test to add a pid
> column to the locks table and to insert pg_backend_pid() into
> it, and got:
>
> postgres=# select xmin, * from locks;
>
> ┌───────┬──────┬───────┐
> │ xmin  │ path │  pid  │
> ├───────┼──────┼───────┤
> │ 17634 │ xyz  │ 32932 │
> │ 17639 │ xyz  │ 32957 │
> └───────┴──────┴───────┘
>
> Then I filtered the logs (having turned the logging up to capture all
> queries) so I could see just those PIDs and saw this sequence:
>
> 2023-05-29 00:15:43.933 EDT [32932] LOG:  duration: 0.182 ms
> statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
> 2023-05-29 00:15:43.934 EDT [32957] LOG:  duration: 0.276 ms
> statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
> 2023-05-29 00:15:43.935 EDT [32932] LOG:  duration: 1.563 ms
> statement: SELECT * FROM locks WHERE path = 'xyz'
> 2023-05-29 00:15:43.936 EDT [32932] LOG:  duration: 0.126 ms
> statement: INSERT INTO locks(path, pid) VALUES('xyz',
> pg_backend_pid())
> 2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 2.191 ms
> statement: SELECT * FROM locks WHERE path = 'xyz'
> 2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 0.261 ms
> statement: INSERT INTO locks(path, pid) VALUES('xyz',
> pg_backend_pid())
> 2023-05-29 00:15:43.937 EDT [32932] LOG:  duration: 0.222 ms  statement: COMMIT
> 2023-05-29 00:15:43.939 EDT [32957] LOG:  duration: 1.775 ms  statement: COMMIT
>
> That sequence, if run (without overlap) in the logged order, is normally
> rejected.  The query plan being used (at least when I run the query
> myself) looks like this:
>
> Query Text: SELECT * FROM locks WHERE path = 'xyz'
> Bitmap Heap Scan on locks  (cost=4.20..13.67 rows=6 width=36)
>    Recheck Cond: (path = 'xyz'::text)
>    ->  Bitmap Index Scan on locks_path_idx  (cost=0.00..4.20 rows=6 width=0)
>          Index Cond: (path = 'xyz'::text)



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Tue, Jun 13, 2023 at 8:00 PM Artem Anisimov
<artem.anisimov.255@gmail.com> wrote:
> thank you for the confirmation and analysis. Did you have a chance to
> take a more detailed look at the problem?

Looking into it...



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Dmitry Dolgov
> On Mon, May 29, 2023 at 09:29:09PM -0400, Thomas Munro wrote:
>
> Hi,
>
> Reproduced here.  Thanks for the reproducer.  I agree that something
> is wrong here.  I haven't had time to figure out what yet, but let
> me share what I noticed so far... I modified your test to add a pid
> column to the locks table and to insert pg_backend_pid() into
> it, and got:
>
> postgres=# select xmin, * from locks;
>
> ┌───────┬──────┬───────┐
> │ xmin  │ path │  pid  │
> ├───────┼──────┼───────┤
> │ 17634 │ xyz  │ 32932 │
> │ 17639 │ xyz  │ 32957 │
> └───────┴──────┴───────┘
>
> Then I filtered the logs (having turned the logging up to capture all
> queries) so I could see just those PIDs and saw this sequence:
>
> 2023-05-29 00:15:43.933 EDT [32932] LOG:  duration: 0.182 ms
> statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
> 2023-05-29 00:15:43.934 EDT [32957] LOG:  duration: 0.276 ms
> statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
> 2023-05-29 00:15:43.935 EDT [32932] LOG:  duration: 1.563 ms
> statement: SELECT * FROM locks WHERE path = 'xyz'
> 2023-05-29 00:15:43.936 EDT [32932] LOG:  duration: 0.126 ms
> statement: INSERT INTO locks(path, pid) VALUES('xyz',
> pg_backend_pid())
> 2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 2.191 ms
> statement: SELECT * FROM locks WHERE path = 'xyz'
> 2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 0.261 ms
> statement: INSERT INTO locks(path, pid) VALUES('xyz',
> pg_backend_pid())
> 2023-05-29 00:15:43.937 EDT [32932] LOG:  duration: 0.222 ms  statement: COMMIT
> 2023-05-29 00:15:43.939 EDT [32957] LOG:  duration: 1.775 ms  statement: COMMIT

I've tried to reproduce it as well, adding more logging around the
serialization code. If it helps, what I observe is that the second
overlapping transaction, the one that started a bit later, does not error out
because in OnConflict_CheckForSerializationFailure (when checking for
"writer has become a pivot") there are no more conflicts received from
SHMQueueNext. All the rest of the reported serialization conflicts are
coming from this check, so I assume the incorrect transaction should
fail there too. Not sure yet why that is so.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Thu, Jun 15, 2023 at 7:29 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've tried to reproduce it as well, adding more logging around the
> serialization code. If it helps, what I observe is that the second
> overlapping transaction, the one that started a bit later, does not error out
> because in OnConflict_CheckForSerializationFailure (when checking for
> "writer has become a pivot") there are no more conflicts received from
> SHMQueueNext. All the rest of the reported serialization conflicts are
> coming from this check, so I assume the incorrect transaction should
> fail there too. Not sure yet why that is so.

Some more observations: happens on 11 and master, happens with btrees,
happens with bitmapscan disabled (eg with plain index scan), but so
far in my testing it doesn't happen if the table already contains one
other tuple (ie if you change the reproducer to insert another row
('foo') after the TRUNCATE).  There is a special case for predicate
locking empty indexes, which uses a relation-level lock (since there are
no pages to lock yet), but that doesn't seem to be wrong, and if you hack
it to lock pages 1 and 2 instead, it still reproduces.  Pondering the
empty index case made me wonder if the case "If we found one of our
own SIREAD locks to remove, remove it now" was implicated (that's
something that would not happen for a relation-level lock), but it
still reproduces if you comment out that optimisation.  So far I have
not been able to reproduce it below 8 threads.  Hmm, I wonder if there
might be a missing check/lock in some racy code path around the
initial creation of the root page...



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
How is this schedule supposed to work?

S1:  _bt_search(&buf)
S1:  if (!BufferIsValid(buf)) // because index is empty
S1:  {
S1:    PredicateLockRelation(...);
S1:    ...
S1:    return false; // no tuples for you

S2:  _bt_search(&buf)

S1:  INSERT ...

S2:  if (!BufferIsValid(buf)) // because index *was* empty
S2:  {
S2:    PredicateLockRelation(...);
S2:    ...
S2:    return false; // no tuples for you

S2:  ...

My point is that S2 won't ever scan S1's tuples, so it won't pass S1's
xid to CheckForSerializableConflictOut().  Am I missing something?  I
can repro this with NR_THREADS set to only 2, after inserting
pg_usleep(1) after the _bt_search() in _bt_first() (sched_yield()
wasn't quite enough).
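
For anyone trying to reproduce this, the instrumentation was nothing
fancier than something like this in _bt_first() (a sketch, not the
exact hunk):

    stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
    pg_usleep(1);    /* throwaway hack: widen the window between
                      * _bt_search() and PredicateLockRelation() */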



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Mon, Jun 19, 2023 at 2:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> S2:  _bt_search(&buf)

> S1:  INSERT ...

> S2:    PredicateLockRelation(...);

> My point is that S2 won't ever scan S1's tuples, so it won't pass S1's
> xid to CheckForSerializableConflictOut()

[completing that sentence a little more] ... so this was our only
chance to detect that S2 read an object that S1 wrote.  But the
_bt_search() and PredicateLockXXX() calls are not atomic and not
rechecked, so a write between them is invisible to the algorithm.  I'm
not sure about this, but it's the idea I'm exploring...



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
The attached shows one approach to the problem I described already,
namely scans of empty btrees that concurrently gain an initial root
page.  This seems to fix the posted repro case if using a btree with
enable_seqscan=off, enable_bitmapscan=off.  That's the variant I had
been working with, because it was simpler.
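
To give an idea of the shape of the change (a sketch of the approach
only, not the attached patch itself): in _bt_first(), instead of taking
the relation-level SIREAD lock and giving up in one step when the tree
looks empty, only trust the emptiness observation once the lock is
already visible to writers, e.g.:

    bool        relation_locked = false;

    for (;;)
    {
        stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
        _bt_freestack(stack);
        if (BufferIsValid(buf))
            break;              /* a root page exists: scan as usual */
        if (relation_locked)
            return false;       /* still empty, and any concurrent writer
                                 * must now see our SIREAD lock */
        PredicateLockRelation(rel, scan->xs_snapshot);
        relation_locked = true; /* lock taken; re-check emptiness */
    }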

But that turns out to be only one problem hit by the repro.
Apparently there's a second bug, if you let it use bitmap heapscans.
Or perhaps I am misdiagnosing the above.  It could be something like:
btgetbitmap not following the predicate lock protocol correctly (but
it looks OK at a glance), or bitmap heapscan not checking for
conflicts out comprehensively (xids on invisible tuples we scan), eg
not fetching heap tuples for some short cut reason somewhere.  But
that's all I have time for today.

Attachment

Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Mon, Jun 19, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... or bitmap heapscan not checking for
> conflicts out comprehensively (xids on invisible tuples we scan), eg
> not fetching heap tuples for some short cut reason somewhere. ...

Ahh, here's such a place.  I can't reproduce it with the patch already
posted + this check commented out.

--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2127,16 +2127,18 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,

        hscan->rs_cindex = 0;
        hscan->rs_ntuples = 0;

+#if 0
        /*
         * Ignore any claimed entries past what we think is the end of the
         * relation. It may have been extended after the start of our scan (we
         * only hold an AccessShareLock, and it could be inserts from this
         * backend).
         */
        if (block >= hscan->rs_nblocks)
                return false;
+#endif



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Dmitry Dolgov
> On Mon, Jun 19, 2023 at 09:30:12PM +1200, Thomas Munro wrote:
> On Mon, Jun 19, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... or bitmap heapscan not checking for
> > conflicts out comprehensively (xids on invisible tuples we scan), eg
> > not fetching heap tuples for some short cut reason somewhere. ...
>
> Ahh, here's such a place.  I can't reproduce it with the patch already
> posted + this check commented out.
>
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2127,16 +2127,18 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
>
>         hscan->rs_cindex = 0;
>         hscan->rs_ntuples = 0;
>
> +#if 0
>         /*
>          * Ignore any claimed entries past what we think is the end of the
>          * relation. It may have been extended after the start of our scan (we
>          * only hold an AccessShareLock, and it could be inserts from this
>          * backend).
>          */
>         if (block >= hscan->rs_nblocks)
>                 return false;
> +#endif

Great, thanks! Can confirm, after applying both the posted patch and the
change above the issue is not reproducible anymore.

One thing I've noticed is that one can observe a similar issue using a
gin index and int[] for the "path" column, even with the changes from
this thread applied. The gin implementation does something similar to
btree in startScanEntry -- it lands in the "No entry found" branch, but
instead of locking the relation it locks "the leaf page, to lock the
place where the entry would've been, had there been one". A similar fix,
retrying ginFindLeafPage, didn't solve the problem, even when locking
the whole relation instead, but maybe I'm missing something.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Tue, Jun 20, 2023 at 12:18 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Mon, Jun 19, 2023 at 09:30:12PM +1200, Thomas Munro wrote:
> > +#if 0
> >         /*
> >          * Ignore any claimed entries past what we think is the end of the
> >          * relation. It may have been extended after the start of our scan (we
> >          * only hold an AccessShareLock, and it could be inserts from this
> >          * backend).
> >          */
> >         if (block >= hscan->rs_nblocks)
> >                 return false;
> > +#endif
>
> Great, thanks! Can confirm, after applying both the posted patch and the
> change above the issue is not reproducible anymore.

Here's a cleaned-up version of the first two changes.  What do you
think about the assertions I make in the commit message for 0002?

> One thing I've noticed is that one can observe a similar issue using a
> gin index and int[] for the "path" column, even with the changes from
> this thread applied. The gin implementation does something similar to
> btree in startScanEntry -- it lands in the "No entry found" branch, but
> instead of locking the relation it locks "the leaf page, to lock the
> place where the entry would've been, had there been one". A similar fix,
> retrying ginFindLeafPage, didn't solve the problem, even when locking
> the whole relation instead, but maybe I'm missing something.

Ouch.  I would have to go and study gin's interlocking model, but one
superficial bug I spotted is that ginget.c's collectMatchBitmap()
calls PredicateLockPage(stack->buffer), where a block number is
expected.  I wish we had strong typedefs, to reject stuff like that at
compile time.  But fixing that alone isn't enough.
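
(For illustration only: by a strong typedef I mean something like a
one-member struct wrapper, so that a Buffer could not silently convert
to a block number, e.g.

    typedef struct BlockNum { uint32 value; } BlockNum;  /* sketch: passing a Buffer here no longer compiles */

nothing like that exists in the tree today, of course.)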

In case someone who knows more about gin is interested in helping, I
attach Artem's repro, modified to use gin.

Attachment

Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Tue, Jun 20, 2023 at 1:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jun 20, 2023 at 12:18 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > One thing I've noticed is that one can observe a similar issue using a
> > gin index and int[] for the "path" column, even with the changes from
> > this thread applied. The gin implementation does something similar to
> > btree in startScanEntry -- it lands in the "No entry found" branch, but
> > instead of locking the relation it locks "the leaf page, to lock the
> > place where the entry would've been, had there been one". A similar fix,
> > retrying ginFindLeafPage, didn't solve the problem, even when locking
> > the whole relation instead, but maybe I'm missing something.
>
> Ouch.  I would have to go and study gin's interlocking model, but one
> superficial bug I spotted is that ginget.c's collectMatchBitmap()
> calls PredicateLockPage(stack->buffer), where a block number is
> expected.  I wish we had strong typedefs, to reject stuff like that at
> compile time.  But fixing that alone isn't enough.
>
> In case someone who knows more about gin is interested in helping, I
> attach Artem's repro, modified to use gin.

This is probably going to go faster if I CC the authors of commit
0bef1c06.  Any ideas about how we're missing rw-conflicts under high
concurrency?



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Wed, Jun 21, 2023 at 10:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> This is probably going to go faster if I CC the authors of commit
> 0bef1c06.  Any ideas about how we're missing rw-conflicts under high
> concurrency?

I guess one (probably stupid) question I have: where is the
CheckForSerializableConflictIn(rootPostingTree) that I was expecting
to see when writing to an existing posting tree?  IIUC that should
pair with PredicateLockPage(btree->index, rootPostingTree, snapshot)
when reading.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Dmitry Dolgov
> On Tue, Jun 20, 2023 at 01:22:19PM +1200, Thomas Munro wrote:
> On Tue, Jun 20, 2023 at 12:18 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > > On Mon, Jun 19, 2023 at 09:30:12PM +1200, Thomas Munro wrote:
> > > +#if 0
> > >         /*
> > >          * Ignore any claimed entries past what we think is the end of the
> > >          * relation. It may have been extended after the start of our scan (we
> > >          * only hold an AccessShareLock, and it could be inserts from this
> > >          * backend).
> > >          */
> > >         if (block >= hscan->rs_nblocks)
> > >                 return false;
> > > +#endif
> >
> > Great, thanks! Can confirm, after applying both the posted patch and the
> > change above the issue is not reproducible anymore.
>
> Here's a cleaned-up version of the first two changes.  What do you
> think about the assertions I make in the commit message for 0002?

Yep, it sounds correct to me. After a quick look I couldn't find where
exactly the similar code lives in the pre-tableam version, so can't say
anything about back-patching there.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Dmitry Dolgov
> On Wed, Jun 21, 2023 at 11:20:28AM +1200, Thomas Munro wrote:
> On Wed, Jun 21, 2023 at 10:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > This is probably going to go faster if I CC the authors of commit
> > 0bef1c06.  Any ideas about how we're missing rw-conflicts under high
> > concurrency?
>
> I guess one (probably stupid) question I have: where is the
> CheckForSerializableConflictIn(rootPostingTree) that I was expecting
> to see when writing to an existing posting tree?  IIUC that should
> pair with PredicateLockPage(btree->index, rootPostingTree, snapshot)
> when reading.

I can't find it either, but based on my superficial investigation this
particular reproducer doesn't seem to hit the posting tree functionality
at all. What I observe is that the inserting transaction uses
ginHeapTupleFastCollect + ginHeapTupleFastInsert, and the corresponding
commentary says that serialization in this case depends on the
metabuffer:

    /*
     * An insertion to the pending list could logically belong anywhere in the
     * tree, so it conflicts with all serializable scans.  All scans acquire a
     * predicate lock on the metabuffer to represent that.
     */

Now the reading transaction actually does PredicateLockPage on the
metabuffer inside scanPendingInsert, but strangely enough it doesn't
lock anything because the SerializationNeededForRead condition is false.
I'm trying to verify if it's somehow a part of the issue, or something
is broken on my side.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Wed, Jun 21, 2023 at 9:04 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Now the reading transaction actually does PredicateLockPage on the
> metabuffer inside scanPendingInsert, but strangely enough it doesn't
> lock anything because the SerializationNeededForRead condition is false.
> I'm trying to verify if it's somehow a part of the issue, or something
> is broken on my side.

Maybe you were confused by the presence of non-SSI transactions in the
repro (eg the transaction that sets up the index)?

To answer my own earlier question, the conflict-in check for posting
trees is hidden in ginFindLeafPage(..., true, ...).

I spent some more time trying to grok this today.  FTR it reproduces
faster without the extra tuple that the repro I posted inserts after
TRUNCATE (the point of that was to find out whether it was an
empty-to-non-empty transition).  I still don't know what's wrong but I
am beginning to suspect the "fast" code.  It seems as though, under
high concurrency, we sometimes don't scan a recently inserted
(invisible to our snapshot, but needed for SSI checks) tuple, but I
don't yet know why.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Dmitry Dolgov
> On Thu, Jun 22, 2023 at 10:02:19PM +1200, Thomas Munro wrote:
> On Wed, Jun 21, 2023 at 9:04 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Now the reading transaction actually does PredicateLockPage on the
> > metabuffer inside scanPendingInsert, but strangely enough it doesn't
> > lock anything because the SerializationNeededForRead condition is false.
> > I'm trying to verify if it's somehow a part of the issue, or something
> > is broken on my side.
>
> Maybe you were confused by the presence of non-SSI transactions in the
> repro (eg the transaction that sets up the index)?

Yeah, sort of. I need to optimize the way I consume the logs.

> To answer my own earlier question, the conflict-in check for posting
> trees is hidden in getFindLeafPage(..., true, ...).
>
> I spent some more time trying to grok this today.  FTR it reproduces
> faster without the extra tuple that the repro I posted inserts after
> TRUNCATE (the point of that was to find out whether it was an
> empty-to-non-empty transition).  I still don't know what's wrong but I
> am beginning to suspect the "fast" code.  It seems as though, under
> high concurrency, we sometimes don't scan a recently inserted
> (invisible to our snapshot, but needed for SSI checks) tuple, but I
> don't yet know why.

Yep, it's definitely something in the "fast" path. Testing the same
scenario with an index created with (fastupdate = off) works just fine for me.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Dmitry Dolgov
> On Fri, Jun 23, 2023 at 04:05:42PM +0200, Dmitry Dolgov wrote:
> > I spent some more time trying to grok this today.  FTR it reproduces
> > faster without the extra tuple that the repro I posted inserts after
> > TRUNCATE (the point of that was to find out whether it was an
> > empty-to-non-empty transition).  I still don't know what's wrong but I
> > am beginning to suspect the "fast" code.  It seems as though, under
> > high concurrency, we sometimes don't scan a recently inserted
> > (invisible to our snapshot, but needed for SSI checks) tuple, but I
> > don't yet know why.
>
> Yep, it's definitely something in the "fast" path. Testing the same
> scenario with an index created with (fastupdate = off) works just fine for me.

I've managed to resolve it, or at least reduce the chances of the issue
appearing, by semi-randomly adding more CheckForSerializableConflictIn /
PredicateLock calls around the new sublist that has to be created in
ginHeapTupleFastInsert. I haven't seen the reproducer failing with this
changeset after running it multiple times for a couple of minutes, whereas on
the main branch, with the two fixes from Thomas included, it was failing
within a couple of seconds.

    diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
    @@ -198,6 +199,7 @@ makeSublist(Relation index, IndexTuple *tuples, int32 ntuples,
            /*
             * Write last page
             */
    +       CheckForSerializableConflictIn(index, NULL, BufferGetBlockNumber(curBuffer));
            res->tail = BufferGetBlockNumber(curBuffer);
            res->tailFreeSize = writeListPage(index, curBuffer,

    @@ -273,6 +275,11 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
                            separateList = true;
                            LockBuffer(metabuffer, GIN_UNLOCK);
                    }
    +               else
    +               {
    +                       CheckForSerializableConflictIn(index, NULL, metadata->head);
    +               }
            }

    diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
    @@ -140,7 +140,7 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
             * Predicate lock entry leaf page, following pages will be locked by
             * moveRightIfItNeeded()
             */
    -       PredicateLockPage(btree->index, stack->buffer, snapshot);
    +       PredicateLockPage(btree->index, BufferGetBlockNumber(stack->buffer), snapshot);

            for (;;)
            {
    @@ -1925,6 +1926,8 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
            /*
             * Set up the scan keys, and check for unsatisfiable query.
             */
    +       PredicateLockRelation(scan->indexRelation, scan->xs_snapshot);
            ginFreeScanKeys(so);            /* there should be no keys yet, but just to be

Now the last PredicateLockRelation does look rather weird. But without it, or
with this locking happening in scanPendingInsert (instead of locking the meta
page), or without the other changes in ginfast.c, the reproducer still fails.
This of course makes it not a proper solution by any means, but hopefully it
will help to understand the problem a bit better.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Sun, Jun 25, 2023 at 1:59 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've managed to resolve it, or at least reduce the chances of the issue
> appearing, by semi-randomly adding more CheckForSerializableConflictIn /
> PredicateLock calls around the new sublist that has to be created in
> ginHeapTupleFastInsert. I haven't seen the reproducer failing with this
> changeset after running it multiple times for a couple of minutes, whereas on
> the main branch, with the two fixes from Thomas included, it was failing
> within a couple of seconds.

Ahh, right, thanks.  I don't think we need to lock all those pages as
you showed, as this whole "fast" path is supposed to be covered by the
meta page (in other words, GIN is expected to have the highest
possible serialisation failure rate under SSI unless you turn fast
updates off).  But there is an ordering bug with the existing
predicate lock in that code, which allows this to happen:

  S1: CheckForSerializableConflictIn(meta)

  S2: PredicateLockPage(meta)
  S2: scan, find no tuples

  S1:  BufferLock(EXCLUSIVE)
  S1: modify stuff...

CheckForSerializableConflictIn() was written with the assumption that
you're inserting a tuple (ie you have the page containing the tuple
locked), so you'll either conflict with a reader who already has a
predicate lock at that point OR you'll insert first and then the
reader will see your (invisible-to-snapshot) tuples, but here we're
doing some fancy footwork with a meta page, and we screwed up the
ordering and left a window where neither of those things happens.
Perhaps it was coded that way because there is a drop-then-reacquire
dance, but it's easy enough to move the check into both branches.  Does
that make sense?
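
In schedule form, the idea of the fix is roughly:

  S1:  BufferLock(meta, EXCLUSIVE)
  S1:  CheckForSerializableConflictIn(meta)  // now ordered after the lock
  S1:  modify stuff...
  S1:  BufferUnlock(meta)

  S2:  PredicateLockPage(meta)
  S2:  scan, sees S1's (invisible-to-snapshot) tuples -> conflict out

so whichever way the two interleave, at least one side notices the other.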

Attachment

Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Peter Geoghegan
On Sun, Jun 25, 2023 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> CheckForSerializableConflictIn() was written with the assumption that
> you're inserting a tuple (ie you have the page containing the tuple
> locked), so you'll either conflict with a reader who already has a
> predicate lock at that point OR you'll insert first and then the
> reader will see your (invisible-to-snapshot) tuples, but here we're
> doing some fancy footwork with a meta page, and we screwed up the
> ordering and left a window where neither of those things happens.

FWIW, there is no fundamental reason why nbtree couldn't always
preallocate a single empty leaf page during CREATE INDEX -- this leaf
page is where values whose keyspace is between negative and positive
infinity (i.e. all items) are located. That would fix this bug. Since,
of course, it would make the theory of operation that you describe
work reliably, even with an empty index. This isn't a serious
proposal, of course -- lazily allocating the first real page has
value, and we're hardly going to throw that away just to fix this bug.
My point is that not allocating a leaf page in CREATE INDEX is the
special case here, if anything.

I'm not surprised that GIN has deeper problems than nbtree due to
things like posting trees. Many GIN features rely on the fact that GIN
only supports lossy index scans. For example, it doesn't matter if the
pending list has TIDs that appear elsewhere within the same index for
a while. It doesn't matter if you have a hard crash when merging the
pending list -- VACUUM will eventually clean everything up. Having the
same TID in two different places at the same time is tolerable. I
imagine that that kind of foundation is harder to build SSI predicate
locks on top of.

--
Peter Geoghegan



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Heikki Linnakangas
On 26/06/2023 03:25, Thomas Munro wrote:
> On Sun, Jun 25, 2023 at 1:59 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> I've managed to resolve it, or at least reduce the chances of the issue
>> appearing, by semi-randomly adding more CheckForSerializableConflictIn /
>> PredicateLock calls around the new sublist that has to be created in
>> ginHeapTupleFastInsert. I haven't seen the reproducer failing with this
>> changeset after running it multiple times for a couple of minutes, whereas on
>> the main branch, with the two fixes from Thomas included, it was failing
>> within a couple of seconds.
> 
> Ahh, right, thanks.  I don't think we need to lock all those pages as
> you showed, as this whole "fast" path is supposed to be covered by the
> meta page (in other words, GIN is expected to have the highest
> possible serialisation failure rate under SSI unless you turn fast
> updates off).  But there is an ordering bug with the existing
> predicate lock in that code, which allows this to happen:
> 
>    S1: CheckForSerializableConflictIn(meta)
> 
>    S2: PredicateLockPage(meta)
>    S2: scan, find no tuples
> 
>    S1: BufferLock(EXCLUSIVE)
>    S1: modify stuff...
> 
> CheckForSerializableConflictIn() was written with the assumption that
> you're inserting a tuple (ie you have the page containing the tuple
> locked), so you'll either conflict with a reader who already has a
> predicate lock at that point OR you'll insert first and then the
> reader will see your (invisible-to-snapshot) tuples, but here we're
> doing some fancy footwork with a meta page, and we screwed up the
> ordering and left a window where neither of those things happens.
> Perhaps it was coded that way because there is a drop-then-reacquire
> dance, but it's easy enough to move the check into both branches.  Does
> that make sense?

Yes, +1 on the patches. Any chance of constructing test cases for these? 
The above race condition is hard to reach, but some of these other bugs 
seem more testable.

Some minor nits: In 
v3-0001-Fix-race-in-SSI-interaction-with-empty-btrees.patch:

>          /*
> -         * We only get here if the index is completely empty. Lock relation
> -         * because nothing finer to lock exists.
> +         * Since we have no pages locked, it's possible for another
> +         * transaction to insert data between _bt_search() and
> +         * PredicateLockRelation().  We have to try again after taking a
> +         * relation-level predicate lock, to close a narrow window where we
> +         * wouldn't scan concurrently inserted tuples, but the writer wouldn't
> +         * see our predicate lock.
>           */

I'd like to keep the old comment here, it's good context, and add the 
new text in addition to the old.

v3-0002-Fix-race-in-SSI-interaction-with-bitmap-heap-scan.patch: Can we 
keep the optimization when not using SSI?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Mon, Jun 26, 2023 at 2:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> FWIW, there is no fundamental reason why nbtree couldn't always
> preallocate a single empty leaf page during CREATE INDEX -- this leaf
> page is where values whose keyspace is between negative and positive
> infinity (i.e. all items) are located. That would fix this bug. Since,
> of course, it would make the theory of operation that you describe
> work reliably, even with an empty index. This isn't a serious
> proposal, of course -- lazily allocating the first real page has
> value, and we're hardly going to throw that away just to fix this bug.
> My point is that not allocating a leaf page in CREATE INDEX is the
> special case here, if anything.

I did briefly wonder about creating the root page on demand here
(probably with a bogus use of BT_WRITE or something like that),
which'd be pretty much equivalent to what you're suggesting there
except it'd work for existing empty indexes in the wild, but I wasn't
sure what complications that might have and didn't look further once I
thought of the 0001 patch's approach.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Mon, Jun 26, 2023 at 8:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yes, +1 on the patches. Any chance of constructing test cases for these?

Thanks for looking.  I can't think of any good way to test
deterministically.  All three depend on unlucky scheduling.  Perhaps
if we had one of those 'pause insertion' systems that I have heard
talk of.

> >               /*
> > -              * We only get here if the index is completely empty. Lock relation
> > -              * because nothing finer to lock exists.
> > +              * Since we have no pages locked, it's possible for another
> > +              * transaction to insert data between _bt_search() and
> > +              * PredicateLockRelation().  We have to try again after taking a
> > +              * relation-level predicate lock, to close a narrow window where we
> > +              * wouldn't scan concurrently inserted tuples, but the writer wouldn't
> > +              * see our predicate lock.
> >                */
>
> I'd like to keep the old comment here, it's good context, and add the
> new text in addition to the old.

Done.

> v3-0002-Fix-race-in-SSI-interaction-with-bitmap-heap-scan.patch: Can we
> keep the optimization when not using SSI?

Done.
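
For reference, the shape of that is roughly the following sketch (not
necessarily the committed hunk; IsolationIsSerializable() is the existing
macro that tests whether the current transaction runs at SERIALIZABLE):

    /*
     * Ignore any claimed entries past what we think is the end of the
     * relation, but only outside SERIALIZABLE isolation: tuples beyond
     * our idea of the end of the relation may carry xids that still have
     * to be passed to CheckForSerializableConflictOut().
     */
    if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
        return false;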

I'll push these in a couple of days if there are no further comments.

Attachment

Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Wed, Jun 28, 2023 at 3:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'll push these in a couple of days if there are no further comments.

Done.

Thanks Artem, Dmitry and Heikki.

I wonder how we might be more systematic about this.  There are some
general principles that were not respected here, but I'm not sure if
they're even written down let alone defended with code.  Something to
think about.

It's not great to add a new use of BufferGetBlockNumber() (in terms of
false sharing just to get a value that we must have had a moment earlier
in order to pin the buffer), but we do that all the time.  That seems
like a micro-optimisation worth looking into in some systematic way
across all AMs.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From
Artem Anisimov
Date:
Hi Thomas,


thank you for the fixes. I've looked up the patches in pg's git repo,
and they got me wondering: where is the repo with pg tests? I'd be
really uneasy making changes to concurrency-related code without a
decent test suite to verify them.


Best regards,

Artem.


On 04/07/2023 01:02, Thomas Munro wrote:
> On Wed, Jun 28, 2023 at 3:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I'll push these in a couple of days if there are no further comments.
> Done.
>
> Thanks Artem, Dmitry and Heikki.
>
> I wonder how we might be more systematic about this.  There are some
> general principles that were not respected here, but I'm not sure if
> they're even written down let alone defended with code.  Something to
> think about.
>
> It's not great to add a new use of BufferGetBlockNumber() (in terms of
> false sharing just to get a value that we must have had a moment earlier
> in order to pin the buffer), but we do that all the time.  That seems
> like a micro-optimisation worth looking into in some systematic way
> across all AMs.



Re: BUG #17949: Adding an index introduces serialisation anomalies.

From: Thomas Munro
On Sat, Jul 15, 2023 at 1:05 AM Artem Anisimov
<artem.anisimov.255@gmail.com> wrote:
> thank you for the fixes. I've looked up the patches in pg's git repo,
> and they got me wondering: where is the repo with pg tests? I'd be
> really uneasy to make changes to concurrency-related code without a
> decent testsuite to verify them.

Generally, the tests for SSI are in:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=tree;f=src/test/isolation/specs

... and see also ../expected.  Typically they are created as features
are developed, but we'll add new tests to cover complicated bugfixes
if we can see how to do it.  There are also non-SSI related tests in
there because the "isolation" infrastructure turned out to be so
useful.

For the problems discovered in this thread, I couldn't see how to do
it.  These required unlucky scheduling to go wrong -- whereas the
existing test infrastructure is based on deterministic behaviour with
wait points at the statement level.  It has been suggested before that
we could perhaps have a way to insert test-harness-controlled
waitpoints.  But even if we had such infrastructure, the relevant wait
points are actually gone after the fixes (ie the window where you have
to do something in another thread to cause problems has been closed so
there is no candidate wait point left).  Such infrastructure might
have been useful for demonstrating the bugs deterministically while
the windows existed.  One of the basic techniques we often use when
trying to understand what is going on in such cases is to insert
sleeps into interesting places to widen windows and make failures
"almost" deterministic, as I did for one of the cases here.

I suppose we could in theory have a suite of 'high load' tests of a
more statistical nature that could include things like the repro you
sent in.  It would burn a whole bunch of CPU trying to break
complicated concurrency stuff in ways that have been known to be
broken in the past.  I'm not sure it's worth it though.  Sometimes
it's OK for tests to be temporarily useful, too...