Thread: BUG #17949: Adding an index introduces serialisation anomalies.
The following bug has been logged on the website:

Bug reference:      17949
Logged by:          Artem Anisimov
Email address:      artem.anisimov.255@gmail.com
PostgreSQL version: 15.3
Operating system:   fedora 38
Description:

Hello dear pg authors,

I have come across a behaviour in pg that contradicts
https://www.postgresql.org/docs/15/transaction-iso.html#XACT-SERIALIZABLE.
I've minimised my testcase to a scenario that essentially duplicates a
scenario that you describe in the following paragraph:

> In particular, it is possible to see unique constraint violations
> caused by conflicts with overlapping Serializable transactions
> even after explicitly checking that the key isn't present
> before attempting to insert it.
> This can be avoided by making sure that all Serializable transactions that
> insert potentially conflicting keys explicitly check if they can do so first.

At the end of the report there is a C reproducer that highlights the
problem. Let me give a high-level overview of the scenario first.

I have N threads, each trying to "acquire an exclusive lock" this way:

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT * FROM locks WHERE path = $1;
    -- quit if there already is a lock
    INSERT INTO locks(path, ...) VALUES($1, ...);
    COMMIT;

Once all threads have attempted to acquire a lock, I count how many of
them succeeded with the INSERT INTO. Normally, only one thread succeeds,
which is what I expect. However, if I add a HASH or a BTREE index on
"path", it becomes possible for multiple threads to do a successful
INSERT INTO, which violates serialisability.

The problem can be reproduced with the default postgresql.conf, but it
takes some time. If I increase "shared_buffers" to 1024MB, the issue
appears almost immediately (fewer chances to promote predicate locks to
locks on the whole table?). I've seen this behaviour with pg 13, 15 and
16 (503b055).

Now let us see the reproducer. It has two primary components:
1. test_once() spawns 32 threads that try to acquire a lock, waits for
   them, and counts the number of "acquired exclusive locks",
2. try_acquire_lock() executes the transaction described above.

To build the reproducer:

    $ gcc -std=c99 -o test test.c -pthread `pkg-config libpq --cflags --libs`

To run the reproducer:

    $ ./test
or
    $ DB_CONNSTR="dbname=abc host=def user=ghi" ./test

The reproducer:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <assert.h>

#include <libpq-fe.h>

// Comment this out to make serialisation anomalies go away.
#define WITH_INDEX

// I have seen the problem with as few as 3 threads. 32 threads make
// the issue appear much sooner.
#define NR_THREADS (32)
#define NR_RUNS (1024 * 1024)

static PGconn *conns[NR_THREADS];

static void* try_acquire_lock(void *arg)
{
    PGconn *c = arg;
    PGresult *res;
    int ntuples;

    res = PQexec(c, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);

    res = PQexec(c, "SELECT * FROM locks WHERE path = 'xyz'");
    assert(PQresultStatus(res) == PGRES_TUPLES_OK);
    ntuples = PQntuples(res);
    PQclear(res);

    if (ntuples > 0) {
        // someone else already has a lock
        res = PQexec(c, "COMMIT");
        assert(PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
        return NULL;
    }

    res = PQexec(c, "INSERT INTO locks(path) VALUES('xyz')");
    PQclear(res);
    res = PQexec(c, "COMMIT");
    PQclear(res);
    return NULL;
}

static void test_once(void)
{
    PGconn *c = conns[0];
    PGresult *res;
    int ntuples;
    pthread_t thrs[NR_THREADS];

    for (int i = 0; i < NR_THREADS; ++i)
        pthread_create(&thrs[i], NULL, &try_acquire_lock, conns[i]);
    for (int i = 0; i < NR_THREADS; ++i)
        pthread_join(thrs[i], NULL);

    res = PQexec(c, "SELECT * FROM locks WHERE path = 'xyz'");
    assert(PQresultStatus(res) == PGRES_TUPLES_OK);
    ntuples = PQntuples(res);
    PQclear(res);

    if (ntuples != 1)
        printf("ntuples = %d\n", ntuples);
    assert(ntuples == 1);

    res = PQexec(c, "TRUNCATE TABLE locks");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
}

static void prepare_db(void)
{
    PGconn *c = conns[0];
    PGresult *res;

    res = PQexec(c, "DROP TABLE locks");
    PQclear(res);

    res = PQexec(c, "CREATE TABLE locks (path TEXT NOT NULL)");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);

#ifdef WITH_INDEX
    res = PQexec(c, "CREATE INDEX ON locks USING HASH(path)");
    assert(PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
#endif
}

int main(void)
{
    const char *connstr = getenv("DB_CONNSTR");

    if (connstr == NULL)
        connstr = "dbname=postgres";

    for (int i = 0; i < NR_THREADS; ++i) {
        conns[i] = PQconnectdb(connstr);
        assert(PQstatus(conns[i]) == CONNECTION_OK);
    }

    prepare_db();

    for (int i = 0; i < NR_RUNS; ++i)
        test_once();

    for (int i = 0; i < NR_THREADS; ++i)
        PQfinish(conns[i]);

    return 0;
}
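For reference, when SSI detects the conflict correctly, the losing thread's INSERT or COMMIT fails with SQLSTATE 40001 (serialization_failure); the reproducer above simply ignores those errors. A hypothetical helper (not part of the reported program) that distinguishes the expected serialization failure from any other error could look like this, using only standard libpq calls:

/*
 * Hypothetical helper, not in the original reproducer; needs <string.h>
 * in addition to the includes above.  Returns 1 if the command failed
 * with the serialization failure expected under SSI, 0 otherwise.
 */
static int failed_with_serialization_error(PGresult *res)
{
    const char *sqlstate;

    if (PQresultStatus(res) == PGRES_COMMAND_OK ||
        PQresultStatus(res) == PGRES_TUPLES_OK)
        return 0;   /* command succeeded, no error to classify */

    sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
    return sqlstate != NULL && strcmp(sqlstate, "40001") == 0;
}

Passing the INSERT and COMMIT results through a check like this would confirm that the losing threads really fail with 40001 rather than with some unrelated error.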
Hi,

Reproduced here. Thanks for the reproducer. I agree that something is
wrong here, but I haven't had time to figure out what yet. Let me share
what I noticed so far... I modified your test to add a pid column to the
locks table and to insert pg_backend_pid() into it, and got:

postgres=# select xmin, * from locks;
┌───────┬──────┬───────┐
│ xmin  │ path │  pid  │
├───────┼──────┼───────┤
│ 17634 │ xyz  │ 32932 │
│ 17639 │ xyz  │ 32957 │
└───────┴──────┴───────┘

Then I filtered the logs (having turned the logging up to capture all
queries) so I could see just those PIDs and saw this sequence:

2023-05-29 00:15:43.933 EDT [32932] LOG:  duration: 0.182 ms  statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.934 EDT [32957] LOG:  duration: 0.276 ms  statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
2023-05-29 00:15:43.935 EDT [32932] LOG:  duration: 1.563 ms  statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.936 EDT [32932] LOG:  duration: 0.126 ms  statement: INSERT INTO locks(path, pid) VALUES('xyz', pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 2.191 ms  statement: SELECT * FROM locks WHERE path = 'xyz'
2023-05-29 00:15:43.937 EDT [32957] LOG:  duration: 0.261 ms  statement: INSERT INTO locks(path, pid) VALUES('xyz', pg_backend_pid())
2023-05-29 00:15:43.937 EDT [32932] LOG:  duration: 0.222 ms  statement: COMMIT
2023-05-29 00:15:43.939 EDT [32957] LOG:  duration: 1.775 ms  statement: COMMIT

That sequence, if run (without overlap) in the logged order, is normally
rejected. The query plan being used (at least when I run the query
myself) looks like this:

Query Text: SELECT * FROM locks WHERE path = 'xyz'
Bitmap Heap Scan on locks  (cost=4.20..13.67 rows=6 width=36)
  Recheck Cond: (path = 'xyz'::text)
  ->  Bitmap Index Scan on locks_path_idx  (cost=0.00..4.20 rows=6 width=0)
        Index Cond: (path = 'xyz'::text)
Hi Thomas,

thank you for the confirmation and analysis. Did you have a chance to
take a more detailed look at the problem?

Best regards,
Artem.

On 30/05/2023 04:29, Thomas Munro wrote:
> Hi,
>
> Reproduced here. Thanks for the reproducer. I agree that something is
> wrong here, but I haven't had time to figure out what yet. [...]
On Tue, Jun 13, 2023 at 8:00 PM Artem Anisimov
<artem.anisimov.255@gmail.com> wrote:
> thank you for the confirmation and analysis. Did you have a chance to
> take a more detailed look at the problem?

Looking into it...
> On Mon, May 29, 2023 at 09:29:09PM -0400, Thomas Munro wrote:
> Hi,
>
> Reproduced here. Thanks for the reproducer. I agree that something is
> wrong here, but I haven't had time to figure out what yet. [...]

I've tried to reproduce it as well, adding more logging around the
serialization code. If it helps, what I observe is that the second
overlapping transaction, the one that started a bit later, does not
error out, because in OnConflict_CheckForSerializationFailure (when
checking for "writer has become a pivot") there are no more conflicts
received from SHMQueueNext. All the rest of the reported serialization
conflicts are coming from this check, so I assume the incorrect
transaction should fail there too. Not sure yet why that is.
On Thu, Jun 15, 2023 at 7:29 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've tried to reproduce it as well, adding more logging around the
> serialization code. If it helps, what I observe is that the second
> overlapping transaction, the one that started a bit later, does not
> error out, because in OnConflict_CheckForSerializationFailure (when
> checking for "writer has become a pivot") there are no more conflicts
> received from SHMQueueNext. All the rest of the reported serialization
> conflicts are coming from this check, so I assume the incorrect
> transaction should fail there too. Not sure yet why that is.

Some more observations: it happens on 11 and master, happens with
btrees, happens with bitmapscan disabled (eg with a plain index scan),
but so far in my testing it doesn't happen if the table already contains
one other tuple (ie if you change the reproducer to insert another row
('foo') after the TRUNCATE).

There is a special case for predicate locking empty indexes, which uses
a relation-level lock (since there are no pages to lock yet), but that
doesn't seem to be wrong, and if you hack it to lock pages 1 and 2
instead, it still reproduces. Pondering the empty index case made me
wonder if the case "If we found one of our own SIREAD locks to remove,
remove it now" was implicated (that's something that would not happen
for a relation-level lock), but it still reproduces if you comment out
that optimisation.

So far I have not been able to reproduce it below 8 threads. Hmm, I
wonder if there might be a missing check/lock in some racy code path
around the initial creation of the root page...
How is this schedule supposed to work?

S1: _bt_search(&buf)
S1: if (!BufferIsValid(buf))    // because index is empty
S1: {
S1:     PredicateLockRelation(...);
S1:     ...
S1:     return false;           // no tuples for you

S2: _bt_search(&buf)

S1: INSERT ...

S2: if (!BufferIsValid(buf))    // because index *was* empty
S2: {
S2:     PredicateLockRelation(...);
S2:     ...
S2:     return false;           // no tuples for you
S2: ...

My point is that S2 won't ever scan S1's tuples, so it won't pass S1's
xid to CheckForSerializableConflictOut(). Am I missing something?

I can repro this with NR_THREADS set to only 2, after inserting
pg_usleep(1) after the _bt_search() in _bt_first() (sched_yield()
wasn't quite enough).
On Mon, Jun 19, 2023 at 2:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> S2: _bt_search(&buf)
>
> S1: INSERT ...
>
> S2: PredicateLockRelation(...);

> My point is that S2 won't ever scan S1's tuples, so it won't pass S1's
> xid to CheckForSerializableConflictOut()

[completing that sentence a little more] ... so this was our only
chance to detect that S2 read an object that S1 wrote. But the
_bt_search() and PredicateLockXXX() calls are not atomic and not
rechecked, so a write between them is invisible to the algorithm. I'm
not sure about this, but it's the idea I'm exploring...
The attached shows one approach to the problem I described already, namely scans of empty btrees that concurrently gain an initial root page. This seems to fix the posted repro case if using a btree with enable_seqscan=off, enable_bitmapscan=off. That's the variant I had been working with, because it was simpler. But that turns out to be only one problem hit by the repro. Apparently there's a second bug, if you let it use bitmap heapscans. Or perhaps I am misdiagnosing the above. It could be something like: btgetbitmap not following the predicate lock protocol correctly (but it looks OK at a glance), or bitmap heapscan not checking for conflicts out comprehensively (xids on invisible tuples we scan), eg not fetching heap tuples for some short cut reason somewhere. But that's all I have time for today.
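Schematically, the approach closes the window by taking the coarse lock first and then looking again; a sketch of the idea only, in the same style as the schedule above, not the attached patch verbatim:

S1: _bt_search(&buf)              // finds the index empty, nothing to page-lock
S1: PredicateLockRelation(...)    // take the relation-level SIREAD lock first
S1: _bt_search(&buf)              // search again under the lock's protection
S1: if still empty -> no tuples; otherwise scan the newly created root page

A tuple inserted between the two searches is picked up by the second search; a tuple inserted after the relation-level lock is taken will cause its writer's CheckForSerializableConflictIn() to see S1's SIREAD lock, so the rw-conflict is recorded either way.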
On Mon, Jun 19, 2023 at 6:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... or bitmap heapscan not checking for
> conflicts out comprehensively (xids on invisible tuples we scan), eg
> not fetching heap tuples for some short cut reason somewhere. ...

Ahh, here's such a place. I can't reproduce it with the patch already
posted + this check commented out.

--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2127,16 +2127,18 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,

 	hscan->rs_cindex = 0;
 	hscan->rs_ntuples = 0;

+#if 0
 	/*
 	 * Ignore any claimed entries past what we think is the end of the
 	 * relation. It may have been extended after the start of our scan (we
 	 * only hold an AccessShareLock, and it could be inserts from this
 	 * backend).
 	 */
 	if (block >= hscan->rs_nblocks)
 		return false;
+#endif
> On Mon, Jun 19, 2023 at 09:30:12PM +1200, Thomas Munro wrote:
> Ahh, here's such a place. I can't reproduce it with the patch already
> posted + this check commented out.
> [...]

Great, thanks! Can confirm: after applying both the posted patch and the
change above, the issue is not reproducible anymore.

One thing I've noticed is that one can observe a similar issue using a
gin index and int[] for the "path" column, even with the changes from
this thread applied. The gin implementation does something similar to
btree in startScanEntry -- it lands in the "No entry found" branch, but
instead of locking the relation it locks "the leaf page, to lock the
place where the entry would've been, had there been one". A similar fix
retrying ginFindLeafPage didn't solve the problem, even when locking the
whole relation instead, but maybe I'm missing something.
On Tue, Jun 20, 2023 at 12:18 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Great, thanks! Can confirm: after applying both the posted patch and the
> change above, the issue is not reproducible anymore.

Here's a cleaned-up version of the first two changes. What do you think
about the assertions I make in the commit message for 0002?

> One thing I've noticed is that one can observe a similar issue using a
> gin index and int[] for the "path" column, even with the changes from
> this thread applied. The gin implementation does something similar to
> btree in startScanEntry -- it lands in the "No entry found" branch, but
> instead of locking the relation it locks "the leaf page, to lock the
> place where the entry would've been, had there been one". A similar fix
> retrying ginFindLeafPage didn't solve the problem, even when locking the
> whole relation instead, but maybe I'm missing something.

Ouch. I would have to go and study gin's interlocking model, but one
superficial bug I spotted is that ginget.c's collectMatchBitmap() calls
PredicateLockPage(stack->buffer) where a block number is expected. I
wish we had strong typedefs, to reject stuff like that at compile time.
But fixing that alone isn't enough.

In case someone who knows more about gin is interested in helping, I
attach Artem's repro, modified to use gin.
On Tue, Jun 20, 2023 at 1:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Ouch. I would have to go and study gin's interlocking model, but one
> superficial bug I spotted is that ginget.c's collectMatchBitmap() calls
> PredicateLockPage(stack->buffer) where a block number is expected. [...]
>
> In case someone who knows more about gin is interested in helping, I
> attach Artem's repro, modified to use gin.

This is probably going to go faster if I CC the authors of commit
0bef1c06. Any ideas about how we're missing rw-conflicts under high
concurrency?
On Wed, Jun 21, 2023 at 10:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> This is probably going to go faster if I CC the authors of commit
> 0bef1c06. Any ideas about how we're missing rw-conflicts under high
> concurrency?

I guess one (probably stupid) question I have: where is the
CheckForSerializableConflictIn(rootPostingTree) that I was expecting to
see when writing to an existing posting tree? IIUC that should pair
with PredicateLockPage(btree->index, rootPostingTree, snapshot) when
reading.
> On Tue, Jun 20, 2023 at 01:22:19PM +1200, Thomas Munro wrote:
> Here's a cleaned-up version of the first two changes. What do you think
> about the assertions I make in the commit message for 0002?

Yep, it sounds correct to me. After a quick look I couldn't find where
exactly the similar code lives in the pre-tableam version, so I can't
say anything about back-patching there.
> On Wed, Jun 21, 2023 at 11:20:28AM +1200, Thomas Munro wrote:
> I guess one (probably stupid) question I have: where is the
> CheckForSerializableConflictIn(rootPostingTree) that I was expecting to
> see when writing to an existing posting tree? IIUC that should pair
> with PredicateLockPage(btree->index, rootPostingTree, snapshot) when
> reading.

I can't find it either, but based on my superficial investigation this
particular reproducer doesn't seem to hit the posting tree functionality
at all. What I observe is that the inserting transaction uses
ginHeapTupleFastCollect + ginHeapTupleFastInsert, and the corresponding
commentary says that serialization in this case depends on the
metabuffer:

/*
 * An insertion to the pending list could logically belong anywhere in the
 * tree, so it conflicts with all serializable scans. All scans acquire a
 * predicate lock on the metabuffer to represent that.
 */

Now the reading transaction actually does PredicateLockPage on the
metabuffer inside scanPendingInsert, but strangely enough it doesn't
lock anything because the SerializationNeededForRead condition is false.
I'm trying to verify whether that is somehow a part of the issue, or
something is broken on my side.
On Wed, Jun 21, 2023 at 9:04 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Now the reading transaction actually does PredicateLockPage on the
> metabuffer inside scanPendingInsert, but strangely enough it doesn't
> lock anything because the SerializationNeededForRead condition is false.
> I'm trying to verify whether that is somehow a part of the issue, or
> something is broken on my side.

Maybe you were confused by the presence of non-SSI transactions in the
repro (eg the transaction that sets up the index)?

To answer my own earlier question, the conflict-in check for posting
trees is hidden in ginFindLeafPage(..., true, ...).

I spent some more time trying to grok this today. FTR it reproduces
faster without the extra tuple that the repro I posted inserts after
TRUNCATE (the point of that was to find out whether it was an
empty-to-non-empty transition). I still don't know what's wrong, but I
am beginning to suspect the "fast" code. It seems as though, under high
concurrency, we sometimes don't scan a recently inserted (invisible to
our snapshot, but needed for SSI checks) tuple, but I don't yet know
why.
> On Thu, Jun 22, 2023 at 10:02:19PM +1200, Thomas Munro wrote:
> Maybe you were confused by the presence of non-SSI transactions in the
> repro (eg the transaction that sets up the index)?

Yeah, sort of. I need to optimize the way I consume the logs.

> I spent some more time trying to grok this today. FTR it reproduces
> faster without the extra tuple that the repro I posted inserts after
> TRUNCATE (the point of that was to find out whether it was an
> empty-to-non-empty transition). I still don't know what's wrong, but I
> am beginning to suspect the "fast" code. It seems as though, under high
> concurrency, we sometimes don't scan a recently inserted (invisible to
> our snapshot, but needed for SSI checks) tuple, but I don't yet know
> why.

Yep, it's definitely something in the "fast" path. Testing the same
scenario, but with an index created with (fastupdate=off), works just
fine for me.
> On Fri, Jun 23, 2023 at 04:05:42PM +0200, Dmitry Dolgov wrote:
> Yep, it's definitely something in the "fast" path. Testing the same
> scenario, but with an index created with (fastupdate=off), works just
> fine for me.

I've managed to resolve it, or at least reduce the chances for the issue
to appear, via semi-randomly adding more CheckForSerializableConflictIn /
PredicateLock around the new sublist that has to be created in
ginHeapTupleFastInsert. I haven't seen the reproducer failing with this
changeset after running it multiple times for a couple of minutes, where
on the main branch, with the two fixes from Thomas included, it was
failing within a couple of seconds.

diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
@@ -198,6 +199,7 @@ makeSublist(Relation index, IndexTuple *tuples, int32 ntuples,
 	/*
 	 * Write last page
 	 */
+	CheckForSerializableConflictIn(index, NULL, BufferGetBlockNumber(curBuffer));
 	res->tail = BufferGetBlockNumber(curBuffer);
 	res->tailFreeSize = writeListPage(index, curBuffer,

@@ -273,6 +275,11 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 			separateList = true;
 			LockBuffer(metabuffer, GIN_UNLOCK);
 		}
+		else
+		{
+			CheckForSerializableConflictIn(index, NULL, metadata->head);
+		}
 	}

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
@@ -140,7 +140,7 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
 	 * Predicate lock entry leaf page, following pages will be locked by
 	 * moveRightIfItNeeded()
 	 */
-	PredicateLockPage(btree->index, stack->buffer, snapshot);
+	PredicateLockPage(btree->index, BufferGetBlockNumber(stack->buffer), snapshot);

 	for (;;)
 	{

@@ -1925,6 +1926,8 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	/*
 	 * Set up the scan keys, and check for unsatisfiable query.
 	 */
+	PredicateLockRelation(scan->indexRelation, scan->xs_snapshot);
+
 	ginFreeScanKeys(so);		/* there should be no keys yet, but just to be

Now the last PredicateLockRelation does look rather weird. But without
it, or with this locking happening in scanPendingInsert (instead of
locking the meta page), or without other changes in ginfast.c, the
reproducer is still failing. This of course makes it not a proper
solution by any means, but hopefully it will help to understand the
problem a bit better.
On Sun, Jun 25, 2023 at 1:59 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've managed to resolve it, or at least reduce the chances for the issue
> to appear, via semi-randomly adding more CheckForSerializableConflictIn /
> PredicateLock around the new sublist that has to be created in
> ginHeapTupleFastInsert. I haven't seen the reproducer failing with this
> changeset after running it multiple times for a couple of minutes, where
> on the main branch, with the two fixes from Thomas included, it was
> failing within a couple of seconds.

Ahh, right, thanks. I don't think we need to lock all those pages as
you showed, as this whole "fast" path is supposed to be covered by the
meta page (in other words, GIN is expected to have the highest possible
serialisation failure rate under SSI unless you turn fast updates off).
But there is an ordering bug with the existing predicate lock in that
code, which allows this to happen:

S1: CheckForSerializableConflictIn(meta)

S2: PredicateLockPage(meta)
S2: scan, find no tuples

S1: BufferLock(EXCLUSIVE)
S1: modify stuff...

CheckForSerializableConflictIn() was written with the assumption that
you're inserting a tuple (ie you have the page containing the tuple
locked), so you'll either conflict with a reader who already has a
predicate lock at that point OR you'll insert first and then the reader
will see your (invisible-to-snapshot) tuples. But here we're doing some
fancy footwork with a meta page, and we screwed up the ordering and left
a window where neither of those things happens. Perhaps it was coded
that way because of the drop-then-reacquire dance, but it's easy enough
to move the check in both branches. Does that make sense?
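Concretely, that means doing the conflict-in check only once the exclusive lock that the modification is made under is held. A schematic sketch of the intended ordering (not the attached patch verbatim; it assumes the usual ginfast.c locking macros GIN_EXCLUSIVE/GIN_UNLOCK and the metapage block number GIN_METAPAGE_BLKNO):

	LockBuffer(metabuffer, GIN_EXCLUSIVE);
	/*
	 * At this point a concurrent reader has either already predicate-locked
	 * the metapage (so this check sees its SIREAD lock), or it has not yet
	 * locked the metapage and will therefore scan whatever we add below.
	 */
	CheckForSerializableConflictIn(index, NULL, GIN_METAPAGE_BLKNO);
	/* ... append to the pending list or write the new sublist ... */
	LockBuffer(metabuffer, GIN_UNLOCK);

Either way, one of the two transactions ends up with the rw-conflict recorded, which is all SSI needs.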
On Sun, Jun 25, 2023 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> CheckForSerializableConflictIn() was written with the assumption that
> you're inserting a tuple (ie you have the page containing the tuple
> locked), so you'll either conflict with a reader who already has a
> predicate lock at that point OR you'll insert first and then the reader
> will see your (invisible-to-snapshot) tuples. But here we're doing some
> fancy footwork with a meta page, and we screwed up the ordering and left
> a window where neither of those things happens.

FWIW, there is no fundamental reason why nbtree couldn't always
preallocate a single empty leaf page during CREATE INDEX -- this leaf
page is where values whose keyspace is between negative and positive
infinity (i.e. all items) are located. That would fix this bug, since it
would make the theory of operation that you describe work reliably, even
with an empty index. This isn't a serious proposal, of course -- lazily
allocating the first real page has value, and we're hardly going to
throw that away just to fix this bug. My point is that not allocating a
leaf page in CREATE INDEX is the special case here, if anything.

I'm not surprised that GIN has deeper problems than nbtree due to things
like posting trees. Many GIN features rely on the fact that GIN only
supports lossy index scans. For example, it doesn't matter if the
pending list has TIDs that appear elsewhere within the same index for a
while. It doesn't matter if you have a hard crash when merging the
pending list -- VACUUM will eventually clean everything up. Having the
same TID in two different places at the same time is tolerable. I
imagine that that kind of foundation is harder to build SSI predicate
locks on top of.

--
Peter Geoghegan
On 26/06/2023 03:25, Thomas Munro wrote:
> Ahh, right, thanks. I don't think we need to lock all those pages as
> you showed, as this whole "fast" path is supposed to be covered by the
> meta page (in other words, GIN is expected to have the highest possible
> serialisation failure rate under SSI unless you turn fast updates off).
> [...]
> Perhaps it was coded that way because of the drop-then-reacquire dance,
> but it's easy enough to move the check in both branches. Does that make
> sense?

Yes, +1 on the patches. Any chance of constructing test cases for these?
The above race condition is hard to reach, but some of these other bugs
seem more testable.

Some minor nits:

In v3-0001-Fix-race-in-SSI-interaction-with-empty-btrees.patch:

> 	/*
> -	 * We only get here if the index is completely empty. Lock relation
> -	 * because nothing finer to lock exists.
> +	 * Since we have no pages locked, it's possible for another
> +	 * transaction to insert data between _bt_search() and
> +	 * PredicateLockRelation(). We have to try again after taking a
> +	 * relation-level predicate lock, to close a narrow window where we
> +	 * wouldn't scan concurrently inserted tuples, but the writer wouldn't
> +	 * see our predicate lock.
> 	 */

I'd like to keep the old comment here, it's good context, and add the
new text in addition to the old.

v3-0002-Fix-race-in-SSI-interaction-with-bitmap-heap-scan.patch: Can we
keep the optimization when not using SSI?

--
Heikki Linnakangas
Neon (https://neon.tech)
On Mon, Jun 26, 2023 at 2:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> FWIW, there is no fundamental reason why nbtree couldn't always
> preallocate a single empty leaf page during CREATE INDEX -- this leaf
> page is where values whose keyspace is between negative and positive
> infinity (i.e. all items) are located. That would fix this bug, since it
> would make the theory of operation that you describe work reliably, even
> with an empty index. This isn't a serious proposal, of course -- lazily
> allocating the first real page has value, and we're hardly going to
> throw that away just to fix this bug. My point is that not allocating a
> leaf page in CREATE INDEX is the special case here, if anything.

I did briefly wonder about creating the root page on demand here
(probably with a bogus use of BT_WRITE or something like that), which'd
be pretty much equivalent to what you're suggesting there, except it'd
work for existing empty indexes in the wild, but I wasn't sure what
complications that might have and didn't look further once I thought of
the 0001 patch's approach.
On Mon, Jun 26, 2023 at 8:04 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yes, +1 on the patches. Any chance of constructing test cases for these?

Thanks for looking. I can't think of any good way to test
deterministically. All three depend on unlucky scheduling. Perhaps if we
had one of those 'pause insertion' systems that I have heard talk of.

> > 	/*
> > -	 * We only get here if the index is completely empty. Lock relation
> > -	 * because nothing finer to lock exists.
> > +	 * Since we have no pages locked, it's possible for another
> > +	 * transaction to insert data between _bt_search() and
> > +	 * PredicateLockRelation(). We have to try again after taking a
> > +	 * relation-level predicate lock, to close a narrow window where we
> > +	 * wouldn't scan concurrently inserted tuples, but the writer wouldn't
> > +	 * see our predicate lock.
> > 	 */
>
> I'd like to keep the old comment here, it's good context, and add the
> new text in addition to the old.

Done.

> v3-0002-Fix-race-in-SSI-interaction-with-bitmap-heap-scan.patch: Can we
> keep the optimization when not using SSI?

Done. I'll push these in a couple of days if there are no further
comments.
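For the 0002 question, one way to keep the optimization for non-SSI transactions is to make the early return in heapam_scan_bitmap_next_block() conditional on the isolation level, roughly like this (a sketch of the idea; the committed change may differ in detail):

	/*
	 * Ignore any claimed entries past what we think is the end of the
	 * relation -- but only when not running under SERIALIZABLE isolation,
	 * since an SSI transaction must still visit such blocks to notice
	 * invisible-but-conflicting tuples inserted after the scan started.
	 */
	if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
		return false;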
On Wed, Jun 28, 2023 at 3:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I'll push these in a couple of days if there are no further comments.

Done. Thanks Artem, Dmitry and Heikki.

I wonder how we might be more systematic about this. There are some
general principles that were not respected here, but I'm not sure if
they're even written down, let alone defended with code. Something to
think about.

It's not great to add a new use of BufferGetBlockNumber() (in terms of
false sharing, just to get a value that we must have had a moment
earlier in order to pin the buffer), but we do that all the time. That
seems like a micro-optimisation worth looking into in some systematic
way across all AMs.
Hi Thomas,

thank you for the fixes. I've looked up the patches in pg's git repo,
and they got me wondering: where is the repo with pg tests? I'd be
really uneasy making changes to concurrency-related code without a
decent testsuite to verify them.

Best regards,
Artem.

On 04/07/2023 01:02, Thomas Munro wrote:
> Done. Thanks Artem, Dmitry and Heikki.
>
> I wonder how we might be more systematic about this. [...]
On Sat, Jul 15, 2023 at 1:05 AM Artem Anisimov
<artem.anisimov.255@gmail.com> wrote:
> thank you for the fixes. I've looked up the patches in pg's git repo,
> and they got me wondering: where is the repo with pg tests? I'd be
> really uneasy making changes to concurrency-related code without a
> decent testsuite to verify them.

Generally, the tests for SSI are in:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=tree;f=src/test/isolation/specs

... and see also ../expected. Typically they are created as features
are developed, but we'll add new tests to cover complicated bugfixes if
we can see how to do it. There are also non-SSI-related tests in there,
because the "isolation" infrastructure turned out to be so useful.

For the problems discovered in this thread, I couldn't see how to do it.
These required unlucky scheduling to go wrong -- whereas the existing
test infrastructure is based on deterministic behaviour with wait points
at the statement level. It has been suggested before that we could
perhaps have a way to insert test-harness-controlled wait points. But
even if we had such infrastructure, the relevant wait points are
actually gone after the fixes (ie the window where you have to do
something in another thread to cause problems has been closed, so there
is no candidate wait point left). Such infrastructure might have been
useful for demonstrating the bugs deterministically while the windows
existed. One of the basic techniques we often use when trying to
understand what is going on in such cases is to insert sleeps into
interesting places to widen windows and make failures "almost"
deterministic, as I did for one of the cases here.

I suppose we could in theory have a suite of 'high load' tests of a more
statistical nature that could include things like the repro you sent in.
It would burn a whole bunch of CPU trying to break complicated
concurrency stuff in ways that have been known to be broken in the past.
I'm not sure it's worth it, though. Sometimes it's OK for tests to be
temporarily useful, too...