BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
Msg-id 17401-9df851bb16dde397@postgresql.org
Responses Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica  (Peter Geoghegan <pg@bowt.ie>)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica  (Andrey Borodin <x4mmm@yandex-team.ru>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17401
Logged by:          Ben Chobot
Email address:      bench@silentmedia.com
PostgreSQL version: 12.9
Operating system:   Linux (Ubuntu)
Description:

This bug is almost identical to BUG #17389, which I filed blaming
pg_repack; however, further testing shows the same symptoms using vanilla
REINDEX TABLE CONCURRENTLY.

1. Put some data in a table with a single btree index:
create table public.simple_test (id int primary key);
insert into public.simple_test(id) (select generate_series(1,1000));

2. Set up streaming replication to a secondary db.

3. In a loop on the primary, concurrently REINDEX that table:
while true; do
  psql -tAc "select now(),relfilenode from pg_class where relname='simple_test_pkey'" >> log
  psql -tAc "reindex table concurrently public.simple_test"
done

4. In a loop on the secondary, have psql query the secondary db for an
indexed value of that table:
while true; do
  psql -tAc "select count(*) from simple_test where id='3'; select relfilenode from pg_class where relname='simple_test_pkey'" || break
done; date

With those 4 steps, the client on the replica will reliably fail to open the
index by its OID within 30 minutes of looping. ("ERROR:  could not open
relation with OID 6715827") When we run the same client loop on the primary
instead of the replica, or if we reindex without the CONCURRENTLY clause,
then the client loop will run for hours without failing, but neither of those
workarounds is an option for us in production.
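
The log written in step 3 and the `date` printed when the step 4 loop breaks can be lined up to see which index relfilenode the primary had most recently recorded before the failure. What follows is only a minimal sketch under some assumptions: the helper name `last_relfilenode_before` is hypothetical, each log line is assumed to have the `timestamp|relfilenode` shape that `psql -tA` produces, and the failure time is assumed to have been converted by hand into the same timestamp format and time zone as `now()`, so that plain string comparison orders the entries.

```shell
# Hypothetical helper: print the last relfilenode logged at or before a given time.
# Assumes log lines look like "<timestamp>|<relfilenode>" and were appended in
# chronological order, with all timestamps in one format and time zone.
last_relfilenode_before() {
    fail_ts=$1
    log_file=$2
    awk -F'|' -v ts="$fail_ts" '
        # Appending "" forces string (not numeric) comparison in awk.
        NF == 2 && ($1 "") <= (ts "") { node = $2 }  # keep the newest entry not after ts
        END { if (node != "") print node }
    ' "$log_file"
}
```

It would be called as `last_relfilenode_before "<failure timestamp>" log`. Note that the log records relfilenodes, while the error message reports an OID, so the two numbers are not expected to match directly; the output only narrows down which reindex cycle was in progress around the failure.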

Like I said before, this isn't a new problem - we've seen it since at least
9.5 - but pre-12 we saw it using pg_repack, which is an easy (and
reasonable) scapegoat. But now that we've upgraded to 12 and are still
seeing it using vanilla concurrent reindexing, it seems clearer that this is
an actual Postgres bug?

