Thread: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17401
Logged by:          Ben Chobot
Email address:      bench@silentmedia.com
PostgreSQL version: 12.9
Operating system:   Linux (Ubuntu)
Description:

This bug is almost identical to BUG #17389, which I filed blaming pg_repack; however, further testing shows the same symptoms using vanilla REINDEX TABLE CONCURRENTLY.

1. Put some data in a table with a single btree index:

    create table public.simple_test (id int primary key);
    insert into public.simple_test(id) (select generate_series(1,1000));

2. Set up streaming replication to a secondary db.

3. In a loop on the primary, concurrently REINDEX that table:

    while true; do
        psql -tAc "select now(),relfilenode from pg_class where relname='simple_test_pkey'" >> log;
        psql -tAc "reindex table concurrently public.simple_test";
    done

4. In a loop on the secondary, have psql query the secondary db for an indexed value of that table:

    while true; do
        psql -tAc "select count(*) from simple_test where id='3'; select relfilenode from pg_class where relname='simple_test_pkey'" || break;
    done; date

With those 4 steps, the client on the replica will reliably fail to open the OID of the index within 30 minutes of looping ("ERROR: could not open relation with OID 6715827"). When we run the same client loop on the primary instead of the replica, or if we reindex without the CONCURRENTLY clause, the client loop will run for hours without failing, but neither of those workarounds is an option for us in production.

Like I said before, this isn't a new problem - we've seen it since at least 9.5 - but pre-12 we saw it using pg_repack, which is an easy (and reasonable) scapegoat. Now that we've upgraded to 12 and are still seeing it with vanilla concurrent reindexing, it seems clear that this is an actual Postgres bug?
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Peter Geoghegan
Date:
On Tue, Feb 8, 2022 at 1:01 PM PG Bug reporting form <noreply@postgresql.org> wrote:
> With those 4 steps, the client on the replica will reliably fail to open the
> OID of the index within 30 minutes of looping ("ERROR: could not open
> relation with OID 6715827"). When we run the same client loop on the primary
> instead of the replica, or if we reindex without the CONCURRENTLY clause,
> the client loop will run for hours without failing, but neither of those
> workarounds is an option for us in production.

I find the idea that we'd fail to WAL-log information that is needed during Hot Standby (to prevent this race condition) plausible. Michael?

--
Peter Geoghegan
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Tue, Feb 08, 2022 at 01:23:34PM -0800, Peter Geoghegan wrote:
> I find the idea that we'd fail to WAL-log information that is needed
> during Hot Standby (to prevent this race condition) plausible.
> Michael?

Yeah, REINDEX relies on some existing index definition, so it feels like we are missing a piece related to invalid indexes in all that. A main difference is the lock level, as exclusive locks are getting logged so the standby can react and wait on that.

The 30-minute mark is interesting. Ben, did you change any replication-related GUCs that could influence that? Say, wal_receiver_timeout, hot_standby_feedback or max_standby_streaming_delay?
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Ben Chobot
Date:
Michael Paquier wrote on 2/8/22 5:35 PM:
> On Tue, Feb 08, 2022 at 01:23:34PM -0800, Peter Geoghegan wrote:
>> I find the idea that we'd fail to WAL-log information that is needed
>> during Hot Standby (to prevent this race condition) plausible.
>> Michael?
> Yeah, REINDEX relies on some existing index definition, so it feels
> like we are missing a piece related to invalid indexes in all that. A
> main difference is the lock level, as exclusive locks are getting
> logged so the standby can react and wait on that. The 30-minute mark
> is interesting. Ben, did you change any replication-related GUCs that
> could influence that? Say, wal_receiver_timeout, hot_standby_feedback
> or max_standby_streaming_delay?

Oh, to be clear, the 30 minute mark is more "the loop has always failed this far into it" - sometimes that's 5 minutes and sometimes it's more, but I've never seen it take more than somewhere in the 20s. I was thinking it was just because of the race condition, but, to answer your question, yes, we have tuned some replication parameters. Here are the ones you asked about; did you want to see the value of any others?

=# show wal_receiver_timeout ;
 wal_receiver_timeout
──────────────────────
 1min
(1 row)

04:26:27 db: postgres@postgres, pid:29507
=# show hot_standby_feedback ;
 hot_standby_feedback
──────────────────────
 on
(1 row)

04:26:40 db: postgres@postgres, pid:29507
=# show max_standby_streaming_delay ;
 max_standby_streaming_delay
─────────────────────────────
 10min
(1 row)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 9 Feb 2022, at 01:41, PG Bug reporting form <noreply@postgresql.org> wrote:
>
> PostgreSQL version: 12.9

I thought we fixed this in 12.9. But apparently not.

Gitlab was observing a similar problem on 12.7 here [0]. And I was convinced that it was a manifestation of the relcache bug fixed in fdd965d07 [1].

I think we could adapt the current amcheck tests for replication checks too. It cannot do heapallindexed on standby, but probably can catch relcache bugs.

Best regards, Andrey Borodin.

[0] https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6120#note_796846035
[1] https://github.com/postgres/postgres/commit/fdd965d074d46765c295223b119ca437dbcac973
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Peter Geoghegan
Date:
On Tue, Feb 8, 2022 at 9:25 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> It cannot do heapallindexed on standby, but probably can catch relcache bugs.

Why not? In general heapallindexed verification works just fine on a standby.

You may be thinking of the extra verification options that are applied to the index structure itself when the user opts to call bt_index_parent_check() (rather than bt_index_check()). But that's unrelated.

--
Peter Geoghegan
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 9 Feb 2022, at 10:30, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 8, 2022 at 9:25 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> It cannot do heapallindexed on standby, but probably can catch relcache bugs.
>
> Why not? In general heapallindexed verification works just fine on a standby.

Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Peter Geoghegan
Date:
On Tue, Feb 8, 2022 at 10:52 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.

Actually, calling bt_index_check() + heapallindexed acquires only an AccessShareLock on both the table and the index.

Again, it sounds like you're talking about bt_index_parent_check() (with or without heapallindexed). That does acquire a ShareLock, which will just throw an error on a standby.

--
Peter Geoghegan
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 9 Feb 2022, at 12:13, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 8, 2022 at 10:52 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.
>
> Actually, calling bt_index_check() + heapallindexed acquires only an
> AccessShareLock on both the table and the index.
>
> Again, it sounds like you're talking about bt_index_parent_check()
> (with or without heapallindexed). That does acquire a ShareLock, which
> will just throw an error on a standby.

You are right.

So, I've composed a dirty test by reverting 7f580aa5d [0] to resurrect pgbench_background(). Now I observe in the standby log:

2022-02-10 21:46:38.909 +05 [84272] 004_cic_standby.pl LOG:  statement: SELECT bt_index_check('idx',false);
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl ERROR:  cannot check index "idx_ccold"
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl DETAIL:  Index is not valid.
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl STATEMENT:  SELECT bt_index_check('idx',false);

I think it's somewhat unexpected. But is it a real problem?

Thanks!

[0] https://github.com/postgres/postgres/commit/7f580aa5d88a9b03d66fcb9a1d7c4fcd69d9e126
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Ben Chobot
Date:
Andrey Borodin wrote on 2/10/22 8:52 AM:
> I think it's somewhat unexpected. But is it a real problem?

If you're asking me, then I can't speak to your standby log, but I'd say the bug I raised is a problem, yes. :)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 10 Feb 2022, at 22:18, Ben Chobot <bench@silentmedia.com> wrote:
>
> Andrey Borodin wrote on 2/10/22 8:52 AM:
>> I think it's somewhat unexpected. But is it a real problem?
>
> If you're asking me, then I can't speak to your standby log, but I'd say the bug I raised is a problem, yes. :)

Well, not exactly. I had attached a patch with a test that reproduces some problem with a dropped index on standby.

In this test we have a standby, and a check query on it at some point finds out that it's working with an invalid index. I'm not sure this reproduces your exact problem.

I see that we are swapping indexes on the primary in the following way:

index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
CacheInvalidateRelcacheByRelid(oldidx->tableId);                    // Not WAL-logged
CommitTransactionCommand();                                         // WAL-logged

But I do not know whether this is expected to work on standby, since the relcache invalidation is not logged. Maybe Peter or someone more experienced than me can explain how this is expected to work.

Thanks!

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andres Freund
Date:
Hi,

On 2022-02-10 22:41:00 +0500, Andrey Borodin wrote:
> index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
> CacheInvalidateRelcacheByRelid(oldidx->tableId); // Not WAL-logged
> CommitTransactionCommand(); // WAL-logged
>
> But I do not know whether this is expected to work on standby, since the relcache
> invalidation is not logged. Maybe Peter or someone more experienced than me
> can explain how this is expected to work.

What makes you say that CacheInvalidateRelcacheByRelid isn't WAL logged? It should be emitted at the commit, like other invalidation messages.

Here's the WAL for a REINDEX CONCURRENTLY, leaving out the actual catalog changes (i.e. heap/heap2/btree rmgr):

rmgr: Storage     len (rec/tot):  42/  42, tx:   0, lsn: 0/0155F3F8, prev 0/0155F3B8, desc: CREATE base/5/16399
rmgr: Standby     len (rec/tot):  42/  42, tx: 731, lsn: 0/0155F428, prev 0/0155F3F8, desc: LOCK xid 731 db 5 rel 16399
rmgr: Transaction len (rec/tot): 194/ 194, tx: 731, lsn: 0/0155F960, prev 0/0155F918, desc: COMMIT 2022-02-10 15:53:56.173778 PST; inval msgs: catcache 53 catcache 52 catcache 7 catcache 6 catcache 32 relcache 16399 relcache 16392 snapshot 2608 relcache 16392
rmgr: Transaction len (rec/tot): 146/ 146, tx: 732, lsn: 0/01561BB0, prev 0/01561B70, desc: COMMIT 2022-02-10 15:53:56.192747 PST; inval msgs: catcache 53 catcache 52 catcache 32 relcache 16392 relcache 16399 relcache 16399

It might be worth writing a test that replays WAL records one-by-one, and that checks at which record things start to break. If the problem is not reproducible that way, we have a problem "within" a single WAL record replay. If it is reproducible, it's likely a problem in the way the primary generates WAL. That would generally be very useful tooling for detailed testing of sequences of WAL records.

I'm pretty sure the problem is on the primary. Looking through ReindexRelationConcurrently() I think I found at least two problems:

1) We don't WAL-log snapshot conflicts, afaict (i.e. the WaitForOlderSnapshots() in phase 3). Which I think means that the new index can end up being used "too soon" in pre-existing snapshots on the standby.

I don't think this is the problem this thread is about, but it's definitely a problem.

2) WaitForLockersMultiple() in phase 5 / 6 isn't WAL-logged. Without waiting for sessions to see the results of phases 4 and 5, we can mark the index as dead (phase 5) and drop it (phase 6) while there are ongoing accesses.

I think this is the likely cause of the reported bug.

Both seem like a significant issue for several of the CONCURRENTLY commands, not just REINDEX.

Greetings,

Andres Freund
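[Editor's illustration, not part of the original message: a minimal sketch of how the two gaps described above could be closed using the existing hot-standby logging primitives LogStandbySnapshot() and LogAccessExclusiveLock(). The helper names and exact call sites are assumptions, not the patch posted later in this thread.]

#include "postgres.h"

#include "access/xlog.h"		/* XLogStandbyInfoActive() */
#include "storage/standby.h"	/* LogStandbySnapshot(), LogAccessExclusiveLock() */

/*
 * Problem 1 sketch: once WaitForOlderSnapshots() has finished waiting on
 * the primary, log a fresh standby snapshot so a hot-standby node knows
 * which transactions were still running at that point.
 */
static void
log_snapshot_after_wait(void)
{
	if (XLogStandbyInfoActive())
		(void) LogStandbySnapshot();
}

/*
 * Problem 2 sketch: once WaitForLockersMultiple() has finished waiting in
 * phases 5/6, log an AccessExclusiveLock on the old index so that replay
 * on the standby waits for (or cancels) standby readers of that index
 * before the records that mark it dead and drop it are applied.
 */
static void
log_conflict_before_drop(Oid dbOid, Oid oldIndexOid)
{
	if (XLogStandbyInfoActive())
		LogAccessExclusiveLock(dbOid, oldIndexOid);
}

[On replay, a logged AccessExclusiveLock is taken by the startup process, so standby sessions still holding locks on the old index are waited for or cancelled according to max_standby_streaming_delay before the drop becomes visible.]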
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Thu, Feb 10, 2022 at 04:12:40PM -0800, Andres Freund wrote:
> I'm pretty sure the problem is on the primary. Looking through
> ReindexRelationConcurrently() I think I found at least two problems:
>
> 1) We don't WAL-log snapshot conflicts, afaict (i.e. the
> WaitForOlderSnapshots() in phase 3). Which I think means that the new index
> can end up being used "too soon" in pre-existing snapshots on the standby.
>
> I don't think this is the problem this thread is about, but it's definitely a
> problem.
>
> 2) WaitForLockersMultiple() in phase 5 / 6 isn't WAL-logged. Without waiting
> for sessions to see the results of phases 4 and 5, we can mark the index as dead
> (phase 5) and drop it (phase 6) while there are ongoing accesses.
>
> I think this is the likely cause of the reported bug.

Yep, I was planning to play with this problem from next week, FWIW, just lacked time/energy to do so. And the release was shipping anyway, so there is plenty of time.

My impression is that we don't really need to change the WAL format based on the existing APIs we already have, or that in the worst case it would be possible to make things backward-compatible enough that it would not be a worry as long as the standbys are updated before the primaries.
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andres Freund
Date:
Hi,

On 2022-02-11 10:38:34 +0900, Michael Paquier wrote:
> My impression is that we don't really need to change the WAL format
> based on the existing APIs we already have, or that in the worst case
> it would be possible to make things backward-compatible enough that it
> would not be a worry as long as the standbys are updated before the
> primaries.

I think it's a question of how "concurrently" we want concurrently to be on hot standby nodes.

If we are OK with "not at all", we can just take an AEL at the end of WaitForLockersMultiple(). It's a bit harder to cause the necessary snapshot conflict, but we could e.g. extend xl_running_xacts and use the record length to keep both older primary -> newer replica, and the reverse, working.

If we want to actually make concurrently work concurrently on the replica, we'd have to work harder. That doesn't strike me as feasible for backpatching. The most important bit would be to make WaitForOlderSnapshots() wait for the xmin of hot_standby_feedback walsenders and/or physical slot xmin horizons. For phases 5 & 6 we would probably have to do something like waiting for snapshot conflicts vs walsenders/slots in addition to the WaitForLockersMultiple() for conflicts on the primary.

Greetings,

Andres Freund
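[Editor's illustration, not part of the original message: a hypothetical sketch of the second option, making WaitForOlderSnapshots() additionally wait on the xmin horizon advertised by physical replication slots when standbys run with hot_standby_feedback. The function name, polling interval, and placement are assumptions; this is not code from any posted patch.]

#include "postgres.h"

#include "access/transam.h"		/* TransactionIdFollowsOrEquals() */
#include "miscadmin.h"			/* CHECK_FOR_INTERRUPTS() */
#include "storage/procarray.h"	/* ProcArrayGetReplicationSlotXmin() */

/*
 * Poll the oldest data xmin still required by any replication slot (which
 * carries hot_standby_feedback from standbys) until it has passed
 * limitXmin, so CIC/REINDEX CONCURRENTLY would be delayed on the primary
 * instead of delaying WAL apply on the standby.
 */
static void
WaitForStandbySnapshots(TransactionId limitXmin)
{
	for (;;)
	{
		TransactionId slot_xmin;

		ProcArrayGetReplicationSlotXmin(&slot_xmin, NULL);

		/* no slot feedback to wait for, or the horizon has moved past us */
		if (!TransactionIdIsValid(slot_xmin) ||
			TransactionIdFollowsOrEquals(slot_xmin, limitXmin))
			break;

		CHECK_FOR_INTERRUPTS();
		pg_usleep(100000L);		/* recheck every 100ms */
	}
}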
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 11 Feb 2022, at 05:12, Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-10 22:41:00 +0500, Andrey Borodin wrote:
>> index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
>> CacheInvalidateRelcacheByRelid(oldidx->tableId); // Not WAL-logged
>> CommitTransactionCommand(); // WAL-logged
>>
>> But I do not know whether this is expected to work on standby, since the relcache
>> invalidation is not logged. Maybe Peter or someone more experienced than me
>> can explain how this is expected to work.
>
> What makes you say that CacheInvalidateRelcacheByRelid isn't WAL logged? It
> should be emitted at the commit, like other invalidation messages.

Thank you for the explanation!

FWIW here's a cleaner TAP-test reproduction. Sometimes I see

2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl ERROR:  could not open relation with OID 16401
2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl STATEMENT:  SELECT bt_index_check('idx',true);

in tmp_check/log/004_cic_standby_standby_1.log after running the amcheck tests. Exactly as per the initial report.

Best regards, Andrey Borodin.
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 12 Feb 2022, at 20:03, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Thank you for the explanation!
> FWIW here's a cleaner TAP-test reproduction. Sometimes I see
>
> 2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl ERROR:  could not open relation with OID 16401
> 2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl STATEMENT:  SELECT bt_index_check('idx',true);
>
> in tmp_check/log/004_cic_standby_standby_1.log after running the amcheck tests. Exactly as per the initial report.

I tried to dig into this again. And once again I simply cannot understand how it is supposed to work... I'd appreciate it if someone explained it to me.

When I do SELECT bt_index_check('idx',true), I am in fact doing the following:
1. regclassin() looks up 'idx' and converts it into an oid without taking any lock.
2. bt_index_check() gets this oid and looks up the index again.

Do we rely on the catalog snapshot here? Or how do we know that this oid still belongs to the index named 'idx' on the standby? When will the backend understand that this oid is no longer of interest and raise an error? I think it must happen during index_open().

BTW, is anyone working on this bug?

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Tue, Mar 08, 2022 at 06:44:31PM +0300, Andrey Borodin wrote:
> I tried to dig into this again. And once again I simply cannot
> understand how it is supposed to work... I'd appreciate it if someone
> explained it to me.
> When I do SELECT bt_index_check('idx',true), I am in fact doing the following:
> 1. regclassin() looks up 'idx' and converts it into an oid without taking any lock.
> 2. bt_index_check() gets this oid and looks up the index again.
> Do we rely on the catalog snapshot here? Or how do we know that this oid
> still belongs to the index named 'idx' on the standby? When will the backend
> understand that this oid is no longer of interest and raise an error?
> I think it must happen during index_open().

Yeah, I can see that, and I think that it is a second issue, rather independent of the cache lookup errors that we are seeing on a standby. As far as I have tested, it is possible to see this error as well with a sequence of CREATE/DROP INDEX CONCURRENTLY run repeatedly on a primary, with the standby doing a scan of pg_index like this one in a tight loop:

SELECT bt_index_check(indexrelid) FROM pg_index WHERE indisvalid AND indexrelid > 16000;

It is a matter of seconds to see bt_index_check() complain that one of the so-said indexes is not valid, even though the scan of pg_index was told not to select such an index, so we are missing something with the relation cache.

> BTW, is anyone working on this bug?

It took me some time to come back to this thread, but I am on it now that we are done with the commit fest business. Speaking of the cache errors when working directly on the relations rebuilt, I got a patch that seems to work properly, with a mix of standby snapshots and AELs logged. I have been running a set of REINDEX INDEX/TABLE CONCURRENTLY commands on the primary through more than 100k relfilenodes without the standby complaining, and I should be able to get this one fixed before the next minor release.
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
> On 20 Apr 2022, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:
>
> I got a patch that seems to work properly,

Cool! Can you please share it?

> with a mix of
> standby snapshots and AELs logged.

I did not succeed with my tests along this way. But probably I was doing something wrong. Or, perhaps, these tests were encountering the second problem.

Best regards, Andrey Borodin.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Wed, Apr 20, 2022 at 11:23:43PM +0500, Andrey Borodin wrote:
> Cool! Can you please share it?

Sure. I am finishing this part with the attached, which would be backpatchable. Based on the analysis upthread, I have stuck with logging standby snapshots once we are done in WaitForOlderSnapshots(), so as to make sure that a standby does not attempt to look too soon at the data of any running transactions it should wait for, and with logging AELs at the end of WaitForLockersMultiple(), to avoid the lockers being accessed too early. I have to admit that a direct use of LogAccessExclusiveLock() feels a bit funky, but we cannot use LockAcquire() AFAIK, and that should be fine with all the lock tags at hand. I have stressed a primary/standby setup quite a lot with flavors of the tests from Ben upthread, and I am not seeing the cache problems anymore on the standby, with concurrent CREATE/DROP INDEX or REINDEX sequences.

> I did not succeed with my tests along this way. But probably I was
> doing something wrong. Or, perhaps, these tests were encountering
> the second problem.

There is a second thing going on with amcheck here. REINDEX CONCURRENTLY in isolation works correctly when you run bt_index_check() in parallel with it (aka attach gdb to the backend process running REINDEX CONCURRENTLY and run bt_index_check() repeatedly in parallel at the various steps, with a \watch or a new connection each time), but depending on your transaction order you may finish with an invalid index given as input to bt_index_check_internal(), roughly:

- REINDEX CONCURRENTLY begins and goes up to the transaction doing the swap, where the old index is marked as no longer valid.
- Execute bt_index_check(), which refers to the old index OID, still valid at the moment when amcheck begins checking this index, before the Relation of the parent table is opened.
- The transaction doing the swap commits, making visible that indisvalid is false for the old index.
- bt_index_check() continues, opens the parent Relation, and complains about an invalid index.

amcheck could be made more robust here, by calling RelationGetIndexList() after opening the Relation of the parent to check whether the index is still listed in the returned list, as that would look at the relcache and discard any invalid indexes on the way. Returning an error for an invalid index seems less helpful to me here than doing nothing, particularly if you begin doing a full check of the indexes based on the contents of pg_index.
--
Michael
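[Editor's illustration, not part of the original message: a minimal sketch of the relcache-based recheck described above. The helper name and the skip-vs-error policy are assumptions; the actual attached patch is not reproduced here.]

#include "postgres.h"

#include "nodes/pg_list.h"		/* list_member_oid(), list_free() */
#include "utils/rel.h"			/* Relation */
#include "utils/relcache.h"		/* RelationGetIndexList() */

/*
 * After the parent heap relation has been opened and locked, recheck
 * through the relcache that the index we were asked to verify is still in
 * the parent's index list; a caller such as bt_index_check_internal()
 * could then skip the index quietly instead of erroring out on one that
 * was invalidated or detached mid-flight.
 */
static bool
index_still_listed(Relation heaprel, Oid indexoid)
{
	List	   *indexids = RelationGetIndexList(heaprel);
	bool		found = list_member_oid(indexids, indexoid);

	list_free(indexids);
	return found;
}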
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andres Freund
Date:
Hi,

On 2022-04-21 12:54:00 +0900, Michael Paquier wrote:
> Based on the analysis upthread, I have stuck with logging standby
> snapshots once we are done in WaitForOlderSnapshots(), so as to make sure
> that a standby does not attempt to look too soon at the data of any running
> transactions it should wait for, and with logging AELs at the end of
> WaitForLockersMultiple(), to avoid the lockers being accessed too early.

Why is this necessary? The comments in WaitForOlderSnapshots() don't really explain it. This is pretty darn expensive.

> amcheck could be made more robust here, by calling
> RelationGetIndexList() after opening the Relation of the parent to
> check whether the index is still listed in the returned list, as that
> would look at the relcache and discard any invalid indexes on the way.

That seems like a weird approach. Why can't the check just be done on the relcache entry of the index itself? If that doesn't work, something is still broken in cache invalidation.

> diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
> index cd30f15eba..704d8f3744 100644
> --- a/src/backend/commands/indexcmds.c
> +++ b/src/backend/commands/indexcmds.c
> @@ -475,6 +475,16 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
>  		if (progress)
>  			pgstat_progress_update_param(PROGRESS_WAITFOR_DONE, i + 1);
>  	}
> +
> +	/*
> +	 * Take a snapshot of running transactions and write this to WAL. With
> +	 * this information, a standby is able to know that it should not access
> +	 * too soon any information we expect to wait for with a concurrent build.
> +	 *
> +	 * Skip the log of this information if disabled.
> +	 */
> +	if (XLogStandbyInfoActive())
> +		LogStandbySnapshot();
>  }

"should not access too soon" - no idea what that means / guarantees. The "Skip the log" bit is redundant with the code.

Greetings,

Andres Freund
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Andrey Borodin
Date:
Here are a few more observations from me.

> On 21 Apr 2022, at 08:54, Michael Paquier <michael@paquier.xyz> wrote:
>
> There is a second thing going on with amcheck here.

I've removed amcheck from the test and replaced it with a simple IndexOnlyScan. I'm not proposing to commit this test anyway, so I left it in amcheck.

The test reliably fails in 30s or earlier:

2022-04-23 10:51:06.872 +05 [36192] 004_cic_standby.pl LOG:  statement: select i from tbl where i = 236116;
2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl ERROR:  could not open relation with OID 16671
2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl STATEMENT:  select i from tbl where i = 236116;

Adding AELs seems to solve the problem. If I run the test further, approximately after 40s it will fail with:

2022-04-23 10:47:47.669 +05 [34225] 004_cic_standby.pl LOG:  statement: select i from tbl where i = 310032;
2022-04-23 10:47:47.670 +05 [34225] 004_cic_standby.pl FATAL:  terminating connection due to conflict with recovery
2022-04-23 10:47:47.670 +05 [34225] 004_cic_standby.pl DETAIL:  User was holding a relation lock for too long.

Is it a kind of deadlock?

I did not observe any changes in the behaviour of the test when adding LogStandbySnapshot(). Can we trigger its necessity somehow? Probably by solving the "conflict with recovery" issue and running the test for longer?

Thanks!

Best regards, Andrey Borodin.
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Noah Misch
Date:
Drive-by comment:

On Sat, Apr 23, 2022 at 11:02:04AM +0500, Andrey Borodin wrote:
> I've removed amcheck from the test and replaced it with a simple IndexOnlyScan. I'm not proposing to commit this test anyway, so I left it in amcheck.
> The test reliably fails in 30s or earlier:
>
> 2022-04-23 10:51:06.872 +05 [36192] 004_cic_standby.pl LOG:  statement: select i from tbl where i = 236116;
> 2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl ERROR:  could not open relation with OID 16671
> 2022-04-23 10:51:06.890 +05 [36192] 004_cic_standby.pl STATEMENT:  select i from tbl where i = 236116;
>
> Adding AELs seems to solve the problem.

If this thread proceeds with the current LogAccessExclusiveLock() design, I think it must include a way to disable the new behavior. That requirement would apply to any design that delays standby WAL apply in a way that hot_standby_feedback can't mitigate. We have the vacuum_truncate reloption because AEL is disruptive to standby freshness. Adding a new kind of unavoidable AEL would be similarly painful.

It would be nicer to fix this such that, with hot_standby_feedback, the fix delays CIC on the primary instead of delaying apply on the standby. If that's too hard, AEL plus an option to disable sounds fine.
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Thu, Apr 21, 2022 at 09:17:10AM -0700, Andres Freund wrote:
> On 2022-04-21 12:54:00 +0900, Michael Paquier wrote:
>> Based on the analysis upthread, I have stuck with logging standby
>> snapshots once we are done in WaitForOlderSnapshots(), so as to make sure
>> that a standby does not attempt to look too soon at the data of any running
>> transactions it should wait for, and with logging AELs at the end of
>> WaitForLockersMultiple(), to avoid the lockers being accessed too early.
>
> Why is this necessary? The comments in WaitForOlderSnapshots() don't really
> explain it. This is pretty darn expensive.

I think that my brain here thought about the standby attempting to access tuples that could have been deleted before the reference snapshot was taken for the index validation in phase 3, and that logging the current snapshot would help in detecting conflicts for that. But that won't help much unless the conflicts themselves are logged in some way. Is that right? It looks like this comes down to logging more information when we wait for some of the snapshots in VirtualXactLock(). That would be new.

>> amcheck could be made more robust here, by calling
>> RelationGetIndexList() after opening the Relation of the parent to
>> check whether the index is still listed in the returned list, as that
>> would look at the relcache and discard any invalid indexes on the way.
>
> That seems like a weird approach. Why can't the check just be done on the
> relcache entry of the index itself? If that doesn't work, something is still
> broken in cache invalidation.

As far as I have looked (aka checking step by step the concurrent REINDEX and DROP INDEX flows), the cache looks fine with respect to invalidations when the index's indisvalid is switched on/off (I am taking into account the end of index_concurrently_build(), which should not need one). FWIW, I would be tempted to do something like the attached for amcheck, where we skip invalid indexes. Thoughts welcome.
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Michael Paquier
Date:
On Sat, Apr 23, 2022 at 08:03:47PM -0700, Noah Misch wrote:
> If this thread proceeds with the current LogAccessExclusiveLock() design, I
> think it must include a way to disable the new behavior. That requirement
> would apply to any design that delays standby WAL apply in a way that
> hot_standby_feedback can't mitigate. We have the vacuum_truncate reloption
> because AEL is disruptive to standby freshness. Adding a new kind of
> unavoidable AEL would be similarly painful.

I am not sure that we should add more reloptions that bypass AEL requirements, FWIW, because the standby would still be open to cache errors if you don't log something, would it not? I agree that vacuum_truncate is handy, but there were a lot of drawbacks with it as well when it was introduced, as there were arguments that we'd be better off removing the use of the AEL taken for the end-of-vacuum truncation.

> It would be nicer to fix this such that, with hot_standby_feedback, the fix
> delays CIC on the primary instead of delaying apply on the standby. If that's
> too hard, AEL plus an option to disable sounds fine.

Perhaps. Still, that would be mostly what WaitForOlderSnapshots() already does, except that we'd roughly just use the limitXmin from the snapshot of the transaction currently running?
--
Michael
Attachment
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
From
Noah Misch
Date:
On Mon, Apr 25, 2022 at 02:11:26PM +0900, Michael Paquier wrote:
> On Sat, Apr 23, 2022 at 08:03:47PM -0700, Noah Misch wrote:
>> If this thread proceeds with the current LogAccessExclusiveLock() design, I
>> think it must include a way to disable the new behavior. That requirement
>> would apply to any design that delays standby WAL apply in a way that
>> hot_standby_feedback can't mitigate. We have the vacuum_truncate reloption
>> because AEL is disruptive to standby freshness. Adding a new kind of
>> unavoidable AEL would be similarly painful.
>
> I am not sure that we should add more reloptions that bypass AEL
> requirements, FWIW, because the standby would still be open to cache
> errors if you don't log something, would it not?

Correct. Even so, a mandatory AEL when replaying CIC would be worse.

>> It would be nicer to fix this such that, with hot_standby_feedback, the fix
>> delays CIC on the primary instead of delaying apply on the standby. If that's
>> too hard, AEL plus an option to disable sounds fine.
>
> Perhaps. Still, that would be mostly what WaitForOlderSnapshots()
> already does, except that we'd roughly just use the limitXmin from the
> snapshot of the transaction currently running?

I don't know.