Thread: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem

The following bug has been logged on the website:

Bug reference:      18213
Logged by:          Fei Changhong
Email address:      feichanghong@qq.com
PostgreSQL version: 16.1
Operating system:   Operating system: centos 7,Kernel version: 5.10.13
Description:

We found that transactions with repeatable read isolation level on Standby
have a "nonrepeatable read" problem, that is, the results of two select are
different, and the hot_standby_feedback guc has been set to on. This problem
occurs when "create index concurrently" is executed on RW, and can be
reproduced through the following steps:
RW initialization data:
```
create table t(a int, b int, c char(1024));
create index on t(b);
insert into t select i, i, i from generate_series(1, 6)i;
```
Start a repeatable read transaction on Standby:
```
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select txid_current_snapshot();
select a, b from t where a = 5;
```
RW execute update and create index concurrently:
```
update t set b = -1 where a = 5;
create index CONCURRENTLY t1_idx_btree_id on t(a);
```
Standby executes select again (still within the existing repeatable read
transaction). We expect that the results of the two select should be the
same, but the second select did not get any data.
```
postgres=*# select a, b from t where a = 5;
  a | b
---+---
  5 | 5
(1 row)

postgres=*# set enable_seqscan to off;
SET
postgres=*# select a, b from t where a = 5;
  a | b
---+---
(0 rows)
```

The environment in which the problem occurs:
Operating system: centos 7
Kernel version: 5.10.134
PostgreSQL code branch/commit:
master/d053a879bb360fb72c46de2a013e741d3f7c9e8d
Client: psql

We have also analyzed the causes of this problem, as follows:
1. RW uses an MVCC snapshot when executing "create index concurrently", so
the updated old tuple cannot be accessed through the index if the update is
not a HOT.
2. RW did not consider the oldest xmin of the snapshot on Standby when
creating the index concurrently.
3. Even in a repeatable read transaction, catlog snapshot will get new one,
so the tuple (5, 5) cannot be accessed when selecting through the newly
created index, but seqscan can still access it.

We have several ideas to fix this problem, but they all have obvious
flaws:
1. In the WaitForOlderSnapshots function in the final stage of "create index
concurrently", wait for replication_slot_xmin to exceed limitXmin. In this
solution, when there are long transactions on Standby, "create index
concurrently" operations may be blocked.
2. When selecting an index path in Standby, snapshot->xmin must be greater
than the transaction ID that creates the index. This solution will make the
invalidate of the plancache very complicated, otherwise the process will not
be able to use the newly created index in the future.
3. Consider replication_slot_xmin when creating snapshots for scaning the
heap table, and add all the tuple that Standby may access to the index. This
solution is relatively hacky and will increase the space occupied by the
index.


Hi,

 Thanks for the report!
 
PG Bug reporting form <noreply@postgresql.org> writes:

> Standby executes select again (still within the existing repeatable read
> transaction). We expect that the results of the two select should be the
> same, but the second select did not get any data.
> ```
> postgres=*# select a, b from t where a = 5;
>   a | b
> ---+---
>   5 | 5
> (1 row)
>
> postgres=*# set enable_seqscan to off;
> SET
> postgres=*# select a, b from t where a = 5;
>   a | b
> ---+---
> (0 rows)
> ```

I can confirm this bug in the current master and continue with your test:

postgres=*# set enable_seqscan to off;
SET
postgres=*# select a, b from t where a = 5;
 a | b 
---+---
(0 rows)

postgres=*# set enable_seqscan to  on;
SET
postgres=*# select a, b from t where a = 5;
 a | b 
---+---
 5 | 5
(1 row)

Different plan yieds different result is bad. 

> We have several ideas to fix this problem, but they all have obvious
> flaws:
> 1. In the WaitForOlderSnapshots function in the final stage of "create index
> concurrently", wait for replication_slot_xmin to exceed limitXmin. In this
> solution, when there are long transactions on Standby, "create index
> concurrently" operations may be blocked.

+1 for this one. The current code doesn't take the advice from the
comments for validate_index, where it says..

 * ....  Also, we index only tuples that are valid
 * as of the start of the scan (see table_index_build_scan), whereas a normal
 * build takes care to include recently-dead tuples.  This is OK because
 * we won't mark the index valid until all transactions that might be able
 * to see those tuples are gone.  The reason for doing that is ...

In the current code, it doesn't consider the sessiones in standby.

I guess you have worked out a fix for this, if so, could you provide
one for review / test? 

-- 
Best Regards
Andy Fan






Indeed, I simply implemented and verified the solution. In the above test, the "create index" command on RW will hang until the transaction on standby is committed or aborted.
In addition, even if there is no select query on the standby, RW's "create index" command may wait for a period of time, which affected by the wal_receiver_status_interval parameter.
You can see attachment for the patch.


------------------ 原始邮件 ------------------
发件人: "zhihuifan1213" <zhihuifan1213@163.com>;
发送时间: 2023年11月29日(星期三) 中午1:26
收件人: "费长红"<feichanghong@qq.com>;"pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
主题: Re: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem


Hi,

 Thanks for the report!
 
PG Bug reporting form <noreply@postgresql.org> writes:

> Standby executes select again (still within the existing repeatable read
> transaction). We expect that the results of the two select should be the
> same, but the second select did not get any data.
> ```
> postgres=*# select a, b from t where a = 5;
>   a | b
> ---+---
>   5 | 5
> (1 row)
>
> postgres=*# set enable_seqscan to off;
> SET
> postgres=*# select a, b from t where a = 5;
>   a | b
> ---+---
> (0 rows)
> ```

I can confirm this bug in the current master and continue with your test:

postgres=*# set enable_seqscan to off;
SET
postgres=*# select a, b from t where a = 5;
 a | b
---+---
(0 rows)

postgres=*# set enable_seqscan to  on;
SET
postgres=*# select a, b from t where a = 5;
 a | b
---+---
 5 | 5
(1 row)

Different plan yieds different result is bad.

> We have several ideas to fix this problem, but they all have obvious
> flaws:
> 1. In the WaitForOlderSnapshots function in the final stage of "create index
> concurrently", wait for replication_slot_xmin to exceed limitXmin. In this
> solution, when there are long transactions on Standby, "create index
> concurrently" operations may be blocked.

+1 for this one. The current code doesn't take the advice from the
comments for validate_index, where it says..

 * ....  Also, we index only tuples that are valid
 * as of the start of the scan (see table_index_build_scan), whereas a normal
 * build takes care to include recently-dead tuples.  This is OK because
 * we won't mark the index valid until all transactions that might be able
 * to see those tuples are gone.  The reason for doing that is ...

In the current code, it doesn't consider the sessiones in standby.

I guess you have worked out a fix for this, if so, could you provide
one for review / test?

--
Best Regards
Andy Fan
Attachment
"费长红" <feichanghong@qq.com> writes:

>
-------------------------------------------------------------------------------------------------------------------------
>
>  *  起个啥名好呢
>    feichanghong@qq.com
>
> Indeed, I simply implemented and verified the solution. In the above test, the "create index" command on RW will hang
> until the transaction on standby is committed or aborted.

I think this is acceptable since its behavior is same as we wait for the
transactions in primary.

> In addition, even if there is no select query on the standby, RW's "create index" command may wait for a period of
time,
> which affected by the wal_receiver_status_interval parameter.

The "wait" happens at the very last of index building, so the building
stages gives an enough time for standby to exceeds the xmin, so in a
real case, I think the overhead will be pretty low.

> You can see attachment for the patch.

I take a look at the that, function GetReplicationSlotXmin should be not
needed. You can use ProcArrayGetReplicationSlotXmin directly.

After searching the caller of WaitForOlderSnapshots, I think this bug
probably has impact on "deteach partition concurrently" / "reindex
concurrently" features.  All of them needs the same fix.

--
Best Regards
Andy Fan




Yes, the existing "ProcArrayGetReplicationSlotXmin" function has met the functional requirements and I have updated it in the attachment.



Best regards,
Changhong Fei

------------------ Original ------------------
From: "zhihuifan1213" <zhihuifan1213@163.com>;
Date: Wed, Nov 29, 2023 03:50 PM
To: "费长红"<feichanghong@qq.com>;
Cc: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: 回复: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem


"费长红" <feichanghong@qq.com> writes:

> -------------------------------------------------------------------------------------------------------------------------
>
>  *  起个啥名好呢 
>    feichanghong@qq.com 

> Indeed, I simply implemented and verified the solution. In the above test, the "create index" command on RW will hang
> until the transaction on standby is committed or aborted.

I think this is acceptable since its behavior is same as we wait for the
transactions in primary.

> In addition, even if there is no select query on the standby, RW's "create index" command may wait for a period of time,
> which affected by the wal_receiver_status_interval parameter.

The "wait" happens at the very last of index building, so the building
stages gives an enough time for standby to exceeds the xmin, so in a
real case, I think the overhead will be pretty low.

> You can see attachment for the patch.

I take a look at the that, function GetReplicationSlotXmin should be not
needed. You can use ProcArrayGetReplicationSlotXmin directly.

After searching the caller of WaitForOlderSnapshots, I think this bug
probably has impact on "deteach partition concurrently" / "reindex
concurrently" features.  All of them needs the same fix.

--
Best Regards
Andy Fan
Attachment
At Tue, 5 Dec 2023 19:47:27 +0800, "费长红" <feichanghong@qq.com> wrote in 
> Yes, the existing "ProcArrayGetReplicationSlotXmin" function has met the functional requirements and I have updated
itin the attachment.
 


I'm not sure if it's valid to use slot xmin for this purpose. First
off, it overlooks standbys that don't use a slot. Also, inactive slots
can lead to unnecessary blocks on CREATE INDEX CONCURRENTLY. I'm not
entirely sure but we should avoid a situation that primary gets
blocked due to standby activities in asynchronous replication.
Finally, I honestly don't think we want to add another instance of
sleep-polling using pg_sleep() for this kind of wait.

We could add a variant of SyncRepWaitForLSN() for standby xmin for
synchronous replication.  However, as I mentioned earlier, I don't
think this approach is good for asynchronous replication. It might be
a good idea to handle it similarly to replication conflicts, by
canceling conflicting transactions on standbys. However, as of now, I
don't have a clear idea of the specific conditions that should trigger
conflist detection.

regards

-- 
Kyotaro Horiguchi
NTT Open Source Software Center