Thread: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17401
Logged by:          Ben Chobot
Email address:      bench@silentmedia.com
PostgreSQL version: 12.9
Operating system:   Linux (Ubuntu)
Description:

This bug is almost identical to BUG #17389, which I filed blaming pg_repack; however, further testing shows the same symptoms using vanilla REINDEX TABLE CONCURRENTLY.

1. Put some data in a table with a single btree index:

    create table public.simple_test (id int primary key);
    insert into public.simple_test(id) (select generate_series(1,1000));

2. Set up streaming replication to a secondary db.

3. In a loop on the primary, concurrently REINDEX that table:

    while true; do
        psql -tAc "select now(),relfilenode from pg_class where relname='simple_test_pkey'" >> log;
        psql -tAc "reindex table concurrently public.simple_test";
    done

4. In a loop on the secondary, have psql query the secondary db for an indexed value of that table:

    while true; do
        psql -tAc "select count(*) from simple_test where id='3'; select relfilenode from pg_class where relname='simple_test_pkey'" || break;
    done; date

With those 4 steps, the client on the replica will reliably fail to open the OID of the index within 30 minutes of looping ("ERROR: could not open relation with OID 6715827"). When we run the same client loop on the primary instead of the replica, or if we reindex without the CONCURRENTLY clause, the client loop will run for hours without failing, but neither of those workarounds is an option for us in production.

Like I said before, this isn't a new problem - we've seen it since at least 9.5 - but pre-12 we saw it using pg_repack, which is an easy (and reasonable) scapegoat. Now that we've upgraded to 12 and are still seeing it with vanilla concurrent reindexing, it seems clear that this is an actual Postgres bug?
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Peter Geoghegan
Date:
On Tue, Feb 8, 2022 at 1:01 PM PG Bug reporting form <noreply@postgresql.org> wrote:
> With those 4 steps, the client on the replica will reliably fail to open the
> OID of the index within 30 minutes of looping ("ERROR: could not open
> relation with OID 6715827"). When we run the same client loop on the primary
> instead of the replica, or if we reindex without the CONCURRENTLY clause,
> the client loop will run for hours without failing, but neither of those
> workarounds is an option for us in production.

I find the idea that we'd fail to WAL-log information that is needed during Hot Standby (to prevent this race condition) plausible. Michael?

--
Peter Geoghegan
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Tue, Feb 08, 2022 at 01:23:34PM -0800, Peter Geoghegan wrote:
> I find the idea that we'd fail to WAL-log information that is needed
> during Hot Standby (to prevent this race condition) plausible.
> Michael?

Yeah, REINDEX relies on some existing index definition, so it feels like we are missing a piece related to invalid indexes in all that. A main difference is the lock level, as exclusive locks are getting logged so the standby can react and wait on that.

The 30-minute mark is interesting. Ben, did you change any replication-related GUCs that could influence that? Say, wal_receiver_timeout, hot_standby_feedback or max_standby_streaming_delay?
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Ben Chobot
Date:
Michael Paquier wrote on 2/8/22 5:35 PM:
> On Tue, Feb 08, 2022 at 01:23:34PM -0800, Peter Geoghegan wrote:
>> I find the idea that we'd fail to WAL-log information that is needed
>> during Hot Standby (to prevent this race condition) plausible.
>> Michael?
> Yeah, REINDEX relies on some existing index definition, so it feels
> like we are missing a piece related to invalid indexes in all that. A
> main difference is the lock level, as exclusive locks are getting
> logged so the standby can react and wait on that. The 30-minute mark
> is interesting. Ben, did you change any replication-related GUCs that
> could influence that? Say, wal_receiver_timeout, hot_standby_feedback
> or max_standby_streaming_delay?

Oh, to be clear, the 30 minute mark is more "the loop has always failed this far into it" - sometimes that's 5 minutes and sometimes it's more, but I've never seen it take more than somewhere in the 20s. I was thinking it was just because of the race condition, but, to answer your question, yes, we have tuned some replication parameters. Here are the ones you asked about; did you want to see the value of any others?

=# show wal_receiver_timeout ;
 wal_receiver_timeout
──────────────────────
 1min
(1 row)

04:26:27 db: postgres@postgres, pid:29507
=# show hot_standby_feedback ;
 hot_standby_feedback
──────────────────────
 on
(1 row)

04:26:40 db: postgres@postgres, pid:29507
=# show max_standby_streaming_delay ;
 max_standby_streaming_delay
─────────────────────────────
 10min
(1 row)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 9 Feb 2022, at 01:41, PG Bug reporting form <noreply@postgresql.org> wrote:
>
> PostgreSQL version: 12.9

I thought we fixed this in 12.9. But apparently not.

Gitlab was observing a similar problem on 12.7 here [0]. And I was convinced that it was a manifestation of the relcache bug fixed in fdd965d07 [1].

I think we could adapt the current amcheck tests for replication checks too. It cannot do heapallindexed on standby, but probably can catch relcache bugs.

Best regards, Andrey Borodin.

[0] https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6120#note_796846035
[1] https://github.com/postgres/postgres/commit/fdd965d074d46765c295223b119ca437dbcac973
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Peter Geoghegan
Date:
On Tue, Feb 8, 2022 at 9:25 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> It cannot do heapallindexed on standby, but probably can catch relcache bugs.

Why not? In general heapallindexed verification works just fine on a standby.

You may be thinking of the extra verification options that are applied to the index structure itself when the user opts to call bt_index_parent_check() (rather than bt_index_check()). But that's unrelated.

--
Peter Geoghegan
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 9 Feb 2022, at 10:30, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 8, 2022 at 9:25 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> It cannot do heapallindexed on standby, but probably can catch relcache bugs.
>
> Why not? In general heapallindexed verification works just fine on a standby.

Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Peter Geoghegan
Date:
On Tue, Feb 8, 2022 at 10:52 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.

Actually, calling bt_index_check() + heapallindexed acquires only an AccessShareLock on both the table and the index.

Again, it sounds like you're talking about bt_index_parent_check() (with or without heapallindexed). That does acquire a ShareLock, which will just throw an error on a standby.

--
Peter Geoghegan
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 9 Feb 2022, at 12:13, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 8, 2022 at 10:52 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.
>
> Actually, calling bt_index_check() + heapallindexed acquires only an
> AccessShareLock on both the table and the index.
>
> Again, it sounds like you're talking about bt_index_parent_check()
> (with or without heapallindexed). That does acquire a ShareLock, which
> will just throw an error on a standby.

You are right.

So, I've composed a dirty test by reverting 7f580aa5d [0] to resurrect pgbench_background(). Now I observe in the standby log:

2022-02-10 21:46:38.909 +05 [84272] 004_cic_standby.pl LOG:  statement: SELECT bt_index_check('idx',false);
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl ERROR:  cannot check index "idx_ccold"
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl DETAIL:  Index is not valid.
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl STATEMENT:  SELECT bt_index_check('idx',false);

I think it's somewhat unexpected. But is it a real problem?

Thanks!

[0] https://github.com/postgres/postgres/commit/7f580aa5d88a9b03d66fcb9a1d7c4fcd69d9e126
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Ben Chobot
Date:
Andrey Borodin wrote on 2/10/22 8:52 AM:
> I think it's somewhat unexpected. But is it a real problem?

If you're asking me, then I can't speak to your standby log, but I'd say the bug I raised is a problem, yes. :)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 10 Feb 2022, at 22:18, Ben Chobot <bench@silentmedia.com> wrote:
>
> Andrey Borodin wrote on 2/10/22 8:52 AM:
>> I think it's somewhat unexpected. But is it a real problem?
>
> If you're asking me, then I can't speak to your standby log, but I'd say the bug I raised is a problem, yes. :)

Well, not exactly. I had attached a patch with a test that reproduces some problem with a dropped index on standby.

In this test we have a standby, and a check query on it at some point finds out that it's working with an invalid index. I'm not sure this reproduces your exact problem.

I see that we are swapping indexes on the primary in the following way:

index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
CacheInvalidateRelcacheByRelid(oldidx->tableId);                    // Not WAL-logged
CommitTransactionCommand();                                         // WAL-logged

But I do not know whether this is expected to work on standby, since the relcache invalidation is not logged. Maybe Peter or someone more experienced than me can explain how this is expected to work.

Thanks!

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andres Freund
Date:
Hi,

On 2022-02-10 22:41:00 +0500, Andrey Borodin wrote:
> index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
> CacheInvalidateRelcacheByRelid(oldidx->tableId); // Not WAL-logged
> CommitTransactionCommand(); // WAL-logged
>
> But I do not know whether this is expected to work on standby, since the relcache
> invalidation is not logged. Maybe Peter or someone more experienced than me
> can explain how this is expected to work.

What makes you say that CacheInvalidateRelcacheByRelid isn't WAL logged? It should be emitted at the commit, like other invalidation messages.

Here's the WAL for a REINDEX CONCURRENTLY, leaving out the actual catalog changes (i.e. heap/heap2/btree rmgr):

rmgr: Storage     len (rec/tot):  42/  42, tx:   0, lsn: 0/0155F3F8, prev 0/0155F3B8, desc: CREATE base/5/16399
rmgr: Standby     len (rec/tot):  42/  42, tx: 731, lsn: 0/0155F428, prev 0/0155F3F8, desc: LOCK xid 731 db 5 rel 16399
rmgr: Transaction len (rec/tot): 194/ 194, tx: 731, lsn: 0/0155F960, prev 0/0155F918, desc: COMMIT 2022-02-10 15:53:56.173778 PST; inval msgs: catcache 53 catcache 52 catcache 7 catcache 6 catcache 32 relcache 16399 relcache 16392 snapshot 2608 relcache 16392
rmgr: Transaction len (rec/tot): 146/ 146, tx: 732, lsn: 0/01561BB0, prev 0/01561B70, desc: COMMIT 2022-02-10 15:53:56.192747 PST; inval msgs: catcache 53 catcache 52 catcache 32 relcache 16392 relcache 16399 relcache 16399

It might be worth writing a test that replays WAL records one-by-one, and that checks at which record things start to break. If the problem is not reproducible that way, we have a problem "within" a single WAL record replay. If it is reproducible, it's likely a problem in the way the primary generates WAL. That would generally be very useful tooling for detailed testing of sequences of WAL records.

I'm pretty sure the problem is on the primary. Looking through ReindexRelationConcurrently() I think I found at least two problems:

1) We don't WAL-log snapshot conflicts, afaict (i.e. the WaitForOlderSnapshots() in phase 3). Which I think means that the new index can end up being used "too soon" in pre-existing snapshots on the standby.

I don't think this is the problem this thread is about, but it's definitely a problem.

2) WaitForLockersMultiple() in phase 5 / 6 isn't WAL-logged. Without waiting for sessions to see the results of phases 4 and 5, we can mark the index as dead (phase 5) and drop it (phase 6) while there are ongoing accesses.

I think this is the likely cause of the reported bug.

Both seem like a significant issue for several of the CONCURRENTLY commands, not just REINDEX.

Greetings,

Andres Freund
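[Editor's illustration, not part of the original message: a minimal sketch of how the two gaps described above could be closed using the existing hot-standby logging primitives LogStandbySnapshot() and LogAccessExclusiveLock(). The helper names and exact call sites are assumptions, not the patch posted later in this thread.]

#include "postgres.h"

#include "access/xlog.h"		/* XLogStandbyInfoActive() */
#include "storage/standby.h"	/* LogStandbySnapshot(), LogAccessExclusiveLock() */

/*
 * Problem 1 sketch: once WaitForOlderSnapshots() has finished waiting on
 * the primary, log a fresh standby snapshot so a hot-standby node knows
 * which transactions were still running at that point.
 */
static void
log_snapshot_after_wait(void)
{
	if (XLogStandbyInfoActive())
		(void) LogStandbySnapshot();
}

/*
 * Problem 2 sketch: once WaitForLockersMultiple() has finished waiting in
 * phases 5/6, log an AccessExclusiveLock on the old index so that replay
 * on the standby waits for (or cancels) standby readers of that index
 * before the records that mark it dead and drop it are applied.
 */
static void
log_conflict_before_drop(Oid dbOid, Oid oldIndexOid)
{
	if (XLogStandbyInfoActive())
		LogAccessExclusiveLock(dbOid, oldIndexOid);
}

[On replay, a logged AccessExclusiveLock is taken by the startup process, so standby sessions still holding locks on the old index are waited for or cancelled according to max_standby_streaming_delay before the drop becomes visible.]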
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Thu, Feb 10, 2022 at 04:12:40PM -0800, Andres Freund wrote:
> I'm pretty sure the problem is on the primary. Looking through
> ReindexRelationConcurrently() I think I found at least two problems:
>
> 1) We don't WAL-log snapshot conflicts, afaict (i.e. the
> WaitForOlderSnapshots() in phase 3). Which I think means that the new index
> can end up being used "too soon" in pre-existing snapshots on the standby.
>
> I don't think this is the problem this thread is about, but it's definitely a
> problem.
>
> 2) WaitForLockersMultiple() in phase 5 / 6 isn't WAL-logged. Without waiting
> for sessions to see the results of phases 4 and 5, we can mark the index as dead
> (phase 5) and drop it (phase 6) while there are ongoing accesses.
>
> I think this is the likely cause of the reported bug.

Yep, I was planning to play with this problem from next week, FWIW, just lacked time/energy to do so. And the release was shipping anyway, so there is plenty of time.

My impression is that we don't really need to change the WAL format based on the existing APIs we already have, or that in the worst case it would be possible to make things backward-compatible enough that it would not be a worry as long as the standbys are updated before the primaries.
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andres Freund
Date:
Hi,

On 2022-02-11 10:38:34 +0900, Michael Paquier wrote:
> My impression is that we don't really need to change the WAL format
> based on the existing APIs we already have, or that in the worst case
> it would be possible to make things backward-compatible enough that it
> would not be a worry as long as the standbys are updated before the
> primaries.

I think it's a question of how "concurrently" we want concurrently to be on hot standby nodes.

If we are OK with "not at all", we can just take an AEL at the end of WaitForLockersMultiple(). It's a bit harder to cause the necessary snapshot conflict, but we could e.g. extend xl_running_xacts and use the record length to keep both older primary -> newer replica, and the reverse, working.

If we want to actually make concurrently work concurrently on the replica, we'd have to work harder. That doesn't strike me as feasible for backpatching. The most important bit would be to make WaitForOlderSnapshots() wait for the xmin of hot_standby_feedback walsenders and/or physical slot xmin horizons. For phases 5 & 6 we would probably have to do something like waiting for snapshot conflicts vs walsenders/slots in addition to the WaitForLockersMultiple() for conflicts on the primary.

Greetings,

Andres Freund
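[Editor's illustration, not part of the original message: a hypothetical sketch of the second option, making WaitForOlderSnapshots() additionally wait on the xmin horizon advertised by physical replication slots when standbys run with hot_standby_feedback. The function name, polling interval, and placement are assumptions; this is not code from any posted patch.]

#include "postgres.h"

#include "access/transam.h"		/* TransactionIdFollowsOrEquals() */
#include "miscadmin.h"			/* CHECK_FOR_INTERRUPTS() */
#include "storage/procarray.h"	/* ProcArrayGetReplicationSlotXmin() */

/*
 * Poll the oldest data xmin still required by any replication slot (which
 * carries hot_standby_feedback from standbys) until it has passed
 * limitXmin, so CIC/REINDEX CONCURRENTLY would be delayed on the primary
 * instead of delaying WAL apply on the standby.
 */
static void
WaitForStandbySnapshots(TransactionId limitXmin)
{
	for (;;)
	{
		TransactionId slot_xmin;

		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL);

		/* no slot feedback to wait for, or the horizon has moved past us */
		if (!TransactionIdIsValid(slot_xmin) ||
			TransactionIdFollowsOrEquals(slot_xmin, limitXmin))
			break;

		CHECK_FOR_INTERRUPTS();
		pg_usleep(100000L);		/* recheck every 100ms */
	}
}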
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 11 Feb 2022, at 05:12, Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-10 22:41:00 +0500, Andrey Borodin wrote:
>> index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
>> CacheInvalidateRelcacheByRelid(oldidx->tableId); // Not WAL-logged
>> CommitTransactionCommand(); // WAL-logged
>>
>> But I do not know whether this is expected to work on standby, since the relcache
>> invalidation is not logged. Maybe Peter or someone more experienced than me
>> can explain how this is expected to work.
>
> What makes you say that CacheInvalidateRelcacheByRelid isn't WAL logged? It
> should be emitted at the commit, like other invalidation messages.

Thank you for the explanation!

FWIW here's a cleaner TAP-test reproduction. Sometimes I see

2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl ERROR:  could not open relation with OID 16401
2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl STATEMENT:  SELECT bt_index_check('idx',true);

in tmp_check/log/004_cic_standby_standby_1.log after running the amcheck tests. Exactly as per the initial report.

Best regards, Andrey Borodin.
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 12 Feb 2022, at 20:03, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Thank you for the explanation!
> FWIW here's a cleaner TAP-test reproduction. Sometimes I see
>
> 2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl ERROR:  could not open relation with OID 16401
> 2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl STATEMENT:  SELECT bt_index_check('idx',true);
>
> in tmp_check/log/004_cic_standby_standby_1.log after running the amcheck tests. Exactly as per the initial report.

I tried to dig into this again. And once again I simply cannot understand how it is supposed to work... I'd appreciate it if someone explained it to me.

When I do SELECT bt_index_check('idx',true), I am in fact doing the following:
1. regclassin() looks up 'idx' and converts it into an oid without taking any lock.
2. bt_index_check() gets this oid and looks up the index again.

Do we rely on the catalog snapshot here? Or how do we know that this oid still belongs to the index named 'idx' on the standby? When will the backend understand that this oid is no longer of interest and raise an error? I think it must happen during index_open().

BTW, is anyone working on this bug?

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Tue, Mar 08, 2022 at 06:44:31PM +0300, Andrey Borodin wrote:
> I tried to dig into this again. And once again I simply cannot
> understand how it is supposed to work... I'd appreciate it if someone
> explained it to me.
> When I do SELECT bt_index_check('idx',true), I am in fact doing the following:
> 1. regclassin() looks up 'idx' and converts it into an oid without taking any lock.
> 2. bt_index_check() gets this oid and looks up the index again.
> Do we rely on the catalog snapshot here? Or how do we know that this oid
> still belongs to the index named 'idx' on the standby? When will the backend
> understand that this oid is no longer of interest and raise an error?
> I think it must happen during index_open().

Yeah, I can see that, and I think that it is a second issue, rather independent of the cache lookup errors that we are seeing on a standby. As far as I have tested, it is possible to see this error as well with a sequence of CREATE/DROP INDEX CONCURRENTLY run repeatedly on a primary, with the standby doing a scan of pg_index like this one in a tight loop:

SELECT bt_index_check(indexrelid) FROM pg_index WHERE indisvalid AND indexrelid > 16000;

It is a matter of seconds to see bt_index_check() complain that one of the so-said indexes is not valid, even though the scan of pg_index was told not to select such an index, so we are missing something with the relation cache.

> BTW, is anyone working on this bug?

It took me some time to come back to this thread, but I am on it now that we are done with the commit fest business. Speaking of the cache errors when working directly on the relations rebuilt, I got a patch that seems to work properly, with a mix of standby snapshots and AELs logged. I have been running a set of REINDEX INDEX/TABLE CONCURRENTLY commands on the primary through more than 100k relfilenodes without the standby complaining, and I should be able to get this one fixed before the next minor release.
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 20 Apr 2022, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:
>
> I got a patch that seems to work properly,

Cool! Can you please share it?

> with a mix of
> standby snapshots and AELs logged.

I did not succeed with my tests along this way. But probably I was doing something wrong. Or, perhaps, these tests were encountering the second problem.

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Wed, Apr 20, 2022 at 11:23:43PM +0500, Andrey Borodin wrote:
> Cool! Can you please share it?

Sure. I am finishing this part with the attached, which would be backpatchable. Based on the analysis upthread, I have stuck with logging standby snapshots once we are done in WaitForOlderSnapshots(), so as to make sure that a standby does not attempt to look too soon at the data of any running transactions it should wait for, and with logging AELs at the end of WaitForLockersMultiple(), to avoid the lockers being accessed too early. I have to admit that a direct use of LogAccessExclusiveLock() feels a bit funky, but we cannot use LockAcquire() AFAIK, and that should be fine with all the lock tags at hand. I have stressed a primary/standby setup quite a lot with flavors of the tests from Ben upthread, and I am not seeing the cache problems anymore on the standby, with concurrent CREATE/DROP INDEX or REINDEX sequences.

> I did not succeed with my tests along this way. But probably I was
> doing something wrong. Or, perhaps, these tests were encountering
> the second problem.

There is a second thing going on with amcheck here. REINDEX CONCURRENTLY in isolation works correctly when you run bt_index_check() in parallel with it (aka attach gdb to the backend process running REINDEX CONCURRENTLY and run bt_index_check() repeatedly in parallel at the various steps, with a \watch or a new connection each time), but depending on your transaction order you may finish with an invalid index given as input to bt_index_check_internal(), roughly:

- REINDEX CONCURRENTLY begins and goes up to the transaction doing the swap, where the old index is marked as no longer valid.
- Execute bt_index_check(), which refers to the old index OID, still valid at the moment when amcheck begins checking this index, before the Relation of the parent table is opened.
- The transaction doing the swap commits, making visible that indisvalid is false for the old index.
- bt_index_check() continues, opens the parent Relation, and complains about an invalid index.

amcheck could be made more robust here, by calling RelationGetIndexList() after opening the Relation of the parent to check whether the index is still listed in the returned list, as that would look at the relcache and discard any invalid indexes on the way. Returning an error for an invalid index seems less helpful to me here than doing nothing, particularly if you begin doing a full check of the indexes based on the contents of pg_index.
--
Michael
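[Editor's illustration, not part of the original message: a minimal sketch of the relcache-based recheck described above. The helper name and the skip-vs-error policy are assumptions; the actual attached patch is not reproduced here.]

#include "postgres.h"

#include "nodes/pg_list.h"		/* list_member_oid(), list_free() */
#include "utils/rel.h"			/* Relation */
#include "utils/relcache.h"		/* RelationGetIndexList() */

/*
 * After the parent heap relation has been opened and locked, recheck
 * through the relcache that the index we were asked to verify is still in
 * the parent's index list; a caller such as bt_index_check_internal()
 * could then skip the index quietly instead of erroring out on one that
 * was invalidated or detached mid-flight.
 */
static bool
index_still_listed(Relation heaprel, Oid indexoid)
{
	List	   *indexids = RelationGetIndexList(heaprel);
	bool		found = list_member_oid(indexids, indexoid);

	list_free(indexids);
	return found;
}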
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andres Freund
Date:
Hi,

On 2022-04-21 12:54:00 +0900, Michael Paquier wrote:
> Based on the analysis upthread, I have stuck with logging standby
> snapshots once we are done in WaitForOlderSnapshots(), so as to make sure
> that a standby does not attempt to look too soon at the data of any running
> transactions it should wait for, and with logging AELs at the end of
> WaitForLockersMultiple(), to avoid the lockers being accessed too early.

Why is this necessary? The comments in WaitForOlderSnapshots() don't really explain it. This is pretty darn expensive.

> amcheck could be made more robust here, by calling
> RelationGetIndexList() after opening the Relation of the parent to
> check whether the index is still listed in the returned list, as that
> would look at the relcache and discard any invalid indexes on the way.

That seems like a weird approach. Why can't the check just be done on the relcache entry of the index itself? If that doesn't work, something is still broken in cache invalidation.

> diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
> index cd30f15eba..704d8f3744 100644
> --- a/src/backend/commands/indexcmds.c
> +++ b/src/backend/commands/indexcmds.c
> @@ -475,6 +475,16 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
>  		if (progress)
>  			pgstat_progress_update_param(PROGRESS_WAITFOR_DONE, i + 1);
>  	}
> +
> +	/*
> +	 * Take a snapshot of running transactions and write this to WAL. With
> +	 * this information, a standby is able to know that it should not access
> +	 * too soon any information we expect to wait for with a concurrent build.
> +	 *
> +	 * Skip the log of this information if disabled.
> +	 */
> +	if (XLogStandbyInfoActive())
> +		LogStandbySnapshot();
>  }

"should not access too soon" - no idea what that means / guarantees. The "Skip the log" bit is redundant with the code.

Greetings,

Andres Freund
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
Here are a few more observations from me.

> On 21 Apr 2022, at 08:54, Michael Paquier <michael@paquier.xyz> wrote:
>
> There is a second thing going on with amcheck here.

I've removed amcheck from the test and replaced it with a simple IndexOnlyScan. I'm not proposing to commit this test anyway, so I left it in amcheck.

The test reliably fails in 30s or earlier:

2022-04-23 10:51:06.872 +05 [36192] 004_cic_standby.pl LOG:  statement: select i from tbl where i = 236116;
2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl ERROR:  could not open relation with OID 16671
2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl STATEMENT:  select i from tbl where i = 236116;

Adding AELs seems to solve the problem. If I run the test further, approximately after 40s it will fail with:

2022-04-23 10:47:47.669 +05 [34225] 004_cic_standby.pl LOG:  statement: select i from tbl where i = 310032;
2022-04-23 10:47:47.670 +05 [34225] 004_cic_standby.pl FATAL:  terminating connection due to conflict with recovery
2022-04-23 10:47:47.670 +05 [34225] 004_cic_standby.pl DETAIL:  User was holding a relation lock for too long.

Is it a kind of deadlock?

I did not observe any changes in the behaviour of the test when adding LogStandbySnapshot(). Can we trigger its necessity somehow? Probably by solving the "conflict with recovery" issue and running the test for longer?

Thanks!

Best regards, Andrey Borodin.
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Noah Misch
Date:
Drive-by comment:

On Sat, Apr 23, 2022 at 11:02:04AM +0500, Andrey Borodin wrote:
> I've removed amcheck from the test and replaced it with a simple IndexOnlyScan. I'm not proposing to commit this test anyway, so I left it in amcheck.
> The test reliably fails in 30s or earlier:
>
> 2022-04-23 10:51:06.872 +05 [36192] 004_cic_standby.pl LOG:  statement: select i from tbl where i = 236116;
> 2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl ERROR:  could not open relation with OID 16671
> 2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl STATEMENT:  select i from tbl where i = 236116;
>
> Adding AELs seems to solve the problem.

If this thread proceeds with the current LogAccessExclusiveLock() design, I think it must include a way to disable the new behavior. That requirement would apply to any design that delays standby WAL apply in a way that hot_standby_feedback can't mitigate. We have the vacuum_truncate reloption because AEL is disruptive to standby freshness. Adding a new kind of unavoidable AEL would be similarly painful.

It would be nicer to fix this such that, with hot_standby_feedback, the fix delays CIC on the primary instead of delaying apply on the standby. If that's too hard, AEL plus an option to disable sounds fine.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Thu, Apr 21, 2022 at 09:17:10AM -0700, Andres Freund wrote:
> On 2022-04-21 12:54:00 +0900, Michael Paquier wrote:
>> Based on the analysis upthread, I have stuck with logging standby
>> snapshots once we are done in WaitForOlderSnapshots(), so as to make sure
>> that a standby does not attempt to look too soon at the data of any running
>> transactions it should wait for, and with logging AELs at the end of
>> WaitForLockersMultiple(), to avoid the lockers being accessed too early.
>
> Why is this necessary? The comments in WaitForOlderSnapshots() don't really
> explain it. This is pretty darn expensive.

I think that my brain here thought about the standby attempting to access tuples that could have been deleted before the reference snapshot was taken for the index validation in phase 3, and that logging the current snapshot would help in detecting conflicts for that. But that won't help much unless the conflicts themselves are logged in some way. Is that right? It looks like this comes down to logging more information when we wait for some of the snapshots in VirtualXactLock(). That would be new.

>> amcheck could be made more robust here, by calling
>> RelationGetIndexList() after opening the Relation of the parent to
>> check whether the index is still listed in the returned list, as that
>> would look at the relcache and discard any invalid indexes on the way.
>
> That seems like a weird approach. Why can't the check just be done on the
> relcache entry of the index itself? If that doesn't work, something is still
> broken in cache invalidation.

As far as I have looked (aka checking step by step the concurrent REINDEX and DROP INDEX flows), the cache looks fine with respect to invalidations when the index's indisvalid is switched on/off (I am taking into account the end of index_concurrently_build(), which should not need one). FWIW, I would be tempted to do something like the attached for amcheck, where we skip invalid indexes. Thoughts welcome.
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Sat, Apr 23, 2022 at 08:03:47PM -0700, Noah Misch wrote:
> If this thread proceeds with the current LogAccessExclusiveLock() design, I
> think it must include a way to disable the new behavior. That requirement
> would apply to any design that delays standby WAL apply in a way that
> hot_standby_feedback can't mitigate. We have the vacuum_truncate reloption
> because AEL is disruptive to standby freshness. Adding a new kind of
> unavoidable AEL would be similarly painful.

I am not sure that we should add more reloptions that bypass AEL requirements, FWIW, because the standby would still be open to cache errors if you don't log something, would it not? I agree that vacuum_truncate is handy, but there were a lot of drawbacks with it as well when it was introduced, as there were arguments that we'd be better off removing the use of the AEL taken for the end-of-vacuum truncation.

> It would be nicer to fix this such that, with hot_standby_feedback, the fix
> delays CIC on the primary instead of delaying apply on the standby. If that's
> too hard, AEL plus an option to disable sounds fine.

Perhaps. Still, that would be mostly what WaitForOlderSnapshots() already does, except that we'd roughly just use the limitXmin from the snapshot of the transaction currently running?
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Noah Misch
Date:
On Mon, Apr 25, 2022 at 02:11:26PM +0900, Michael Paquier wrote:
> On Sat, Apr 23, 2022 at 08:03:47PM -0700, Noah Misch wrote:
>> If this thread proceeds with the current LogAccessExclusiveLock() design, I
>> think it must include a way to disable the new behavior. That requirement
>> would apply to any design that delays standby WAL apply in a way that
>> hot_standby_feedback can't mitigate. We have the vacuum_truncate reloption
>> because AEL is disruptive to standby freshness. Adding a new kind of
>> unavoidable AEL would be similarly painful.
>
> I am not sure that we should add more reloptions that bypass AEL
> requirements, FWIW, because the standby would still be open to cache
> errors if you don't log something, would it not?

Correct. Even so, a mandatory AEL when replaying CIC would be worse.

>> It would be nicer to fix this such that, with hot_standby_feedback, the fix
>> delays CIC on the primary instead of delaying apply on the standby. If that's
>> too hard, AEL plus an option to disable sounds fine.
>
> Perhaps. Still, that would be mostly what WaitForOlderSnapshots()
> already does, except that we'd roughly just use the limitXmin from the
> snapshot of the transaction currently running?

I don't know.