Thread: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 18213 Logged by: Fei Changhong Email address: feichanghong@qq.com PostgreSQL version: 16.1 Operating system: Operating system: centos 7,Kernel version: 5.10.13 Description: We found that transactions with repeatable read isolation level on Standby have a "nonrepeatable read" problem, that is, the results of two select are different, and the hot_standby_feedback guc has been set to on. This problem occurs when "create index concurrently" is executed on RW, and can be reproduced through the following steps: RW initialization data: ``` create table t(a int, b int, c char(1024)); create index on t(b); insert into t select i, i, i from generate_series(1, 6)i; ``` Start a repeatable read transaction on Standby: ``` BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; select txid_current_snapshot(); select a, b from t where a = 5; ``` RW execute update and create index concurrently: ``` update t set b = -1 where a = 5; create index CONCURRENTLY t1_idx_btree_id on t(a); ``` Standby executes select again (still within the existing repeatable read transaction). We expect that the results of the two select should be the same, but the second select did not get any data. ``` postgres=*# select a, b from t where a = 5; a | b ---+--- 5 | 5 (1 row) postgres=*# set enable_seqscan to off; SET postgres=*# select a, b from t where a = 5; a | b ---+--- (0 rows) ``` The environment in which the problem occurs: Operating system: centos 7 Kernel version: 5.10.134 PostgreSQL code branch/commit: master/d053a879bb360fb72c46de2a013e741d3f7c9e8d Client: psql We have also analyzed the causes of this problem, as follows: 1. RW uses an MVCC snapshot when executing "create index concurrently", so the updated old tuple cannot be accessed through the index if the update is not a HOT. 2. RW did not consider the oldest xmin of the snapshot on Standby when creating the index concurrently. 3. Even in a repeatable read transaction, catlog snapshot will get new one, so the tuple (5, 5) cannot be accessed when selecting through the newly created index, but seqscan can still access it. We have several ideas to fix this problem, but they all have obvious flaws: 1. In the WaitForOlderSnapshots function in the final stage of "create index concurrently", wait for replication_slot_xmin to exceed limitXmin. In this solution, when there are long transactions on Standby, "create index concurrently" operations may be blocked. 2. When selecting an index path in Standby, snapshot->xmin must be greater than the transaction ID that creates the index. This solution will make the invalidate of the plancache very complicated, otherwise the process will not be able to use the newly created index in the future. 3. Consider replication_slot_xmin when creating snapshots for scaning the heap table, and add all the tuple that Standby may access to the index. This solution is relatively hacky and will increase the space occupied by the index.
Re: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
From
zhihuifan1213@163.com
Date:
Hi, Thanks for the report! PG Bug reporting form <noreply@postgresql.org> writes: > Standby executes select again (still within the existing repeatable read > transaction). We expect that the results of the two select should be the > same, but the second select did not get any data. > ``` > postgres=*# select a, b from t where a = 5; > a | b > ---+--- > 5 | 5 > (1 row) > > postgres=*# set enable_seqscan to off; > SET > postgres=*# select a, b from t where a = 5; > a | b > ---+--- > (0 rows) > ``` I can confirm this bug in the current master and continue with your test: postgres=*# set enable_seqscan to off; SET postgres=*# select a, b from t where a = 5; a | b ---+--- (0 rows) postgres=*# set enable_seqscan to on; SET postgres=*# select a, b from t where a = 5; a | b ---+--- 5 | 5 (1 row) Different plan yieds different result is bad. > We have several ideas to fix this problem, but they all have obvious > flaws: > 1. In the WaitForOlderSnapshots function in the final stage of "create index > concurrently", wait for replication_slot_xmin to exceed limitXmin. In this > solution, when there are long transactions on Standby, "create index > concurrently" operations may be blocked. +1 for this one. The current code doesn't take the advice from the comments for validate_index, where it says.. * .... Also, we index only tuples that are valid * as of the start of the scan (see table_index_build_scan), whereas a normal * build takes care to include recently-dead tuples. This is OK because * we won't mark the index valid until all transactions that might be able * to see those tuples are gone. The reason for doing that is ... In the current code, it doesn't consider the sessiones in standby. I guess you have worked out a fix for this, if so, could you provide one for review / test? -- Best Regards Andy Fan
回复: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
From
"费长红"
Date:
Indeed, I simply implemented and verified the solution. In the above test, the "create index" command on RW will hang until the transaction on standby is committed or aborted.
In addition, even if there is no select query on the standby, RW's "create index" command may wait for a period of time, which affected by the wal_receiver_status_interval parameter.
You can see attachment for the patch.
------------------ 原始邮件 ------------------
发件人: "zhihuifan1213" <zhihuifan1213@163.com>;
发送时间: 2023年11月29日(星期三) 中午1:26
收件人: "费长红"<feichanghong@qq.com>;"pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
主题: Re: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
Hi,
Thanks for the report!
PG Bug reporting form <noreply@postgresql.org> writes:
> Standby executes select again (still within the existing repeatable read
> transaction). We expect that the results of the two select should be the
> same, but the second select did not get any data.
> ```
> postgres=*# select a, b from t where a = 5;
> a | b
> ---+---
> 5 | 5
> (1 row)
>
> postgres=*# set enable_seqscan to off;
> SET
> postgres=*# select a, b from t where a = 5;
> a | b
> ---+---
> (0 rows)
> ```
I can confirm this bug in the current master and continue with your test:
postgres=*# set enable_seqscan to off;
SET
postgres=*# select a, b from t where a = 5;
a | b
---+---
(0 rows)
postgres=*# set enable_seqscan to on;
SET
postgres=*# select a, b from t where a = 5;
a | b
---+---
5 | 5
(1 row)
Different plan yieds different result is bad.
> We have several ideas to fix this problem, but they all have obvious
> flaws:
> 1. In the WaitForOlderSnapshots function in the final stage of "create index
> concurrently", wait for replication_slot_xmin to exceed limitXmin. In this
> solution, when there are long transactions on Standby, "create index
> concurrently" operations may be blocked.
+1 for this one. The current code doesn't take the advice from the
comments for validate_index, where it says..
* .... Also, we index only tuples that are valid
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
* to see those tuples are gone. The reason for doing that is ...
In the current code, it doesn't consider the sessiones in standby.
I guess you have worked out a fix for this, if so, could you provide
one for review / test?
--
Best Regards
Andy Fan
Attachment
Re: 回复: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
From
zhihuifan1213@163.com
Date:
"费长红" <feichanghong@qq.com> writes: > ------------------------------------------------------------------------------------------------------------------------- > > * 起个啥名好呢 > feichanghong@qq.com > > Indeed, I simply implemented and verified the solution. In the above test, the "create index" command on RW will hang > until the transaction on standby is committed or aborted. I think this is acceptable since its behavior is same as we wait for the transactions in primary. > In addition, even if there is no select query on the standby, RW's "create index" command may wait for a period of time, > which affected by the wal_receiver_status_interval parameter. The "wait" happens at the very last of index building, so the building stages gives an enough time for standby to exceeds the xmin, so in a real case, I think the overhead will be pretty low. > You can see attachment for the patch. I take a look at the that, function GetReplicationSlotXmin should be not needed. You can use ProcArrayGetReplicationSlotXmin directly. After searching the caller of WaitForOlderSnapshots, I think this bug probably has impact on "deteach partition concurrently" / "reindex concurrently" features. All of them needs the same fix. -- Best Regards Andy Fan
Re: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
From
"费长红"
Date:
Yes, the existing "ProcArrayGetReplicationSlotXmin" function has met the functional requirements and I have updated it in the attachment.
Changhong Fei
------------------ Original ------------------
From: "zhihuifan1213" <zhihuifan1213@163.com>;
Date: Wed, Nov 29, 2023 03:50 PM
To: "费长红"<feichanghong@qq.com>;
Cc: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: 回复: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
"费长红" <feichanghong@qq.com> writes:
> -------------------------------------------------------------------------------------------------------------------------
>
> * 起个啥名好呢
> feichanghong@qq.com
>
> Indeed, I simply implemented and verified the solution. In the above test, the "create index" command on RW will hang
> until the transaction on standby is committed or aborted.
I think this is acceptable since its behavior is same as we wait for the
transactions in primary.
> In addition, even if there is no select query on the standby, RW's "create index" command may wait for a period of time,
> which affected by the wal_receiver_status_interval parameter.
The "wait" happens at the very last of index building, so the building
stages gives an enough time for standby to exceeds the xmin, so in a
real case, I think the overhead will be pretty low.
> You can see attachment for the patch.
I take a look at the that, function GetReplicationSlotXmin should be not
needed. You can use ProcArrayGetReplicationSlotXmin directly.
After searching the caller of WaitForOlderSnapshots, I think this bug
probably has impact on "deteach partition concurrently" / "reindex
concurrently" features. All of them needs the same fix.
--
Best Regards
Andy Fan
Attachment
Re: BUG #18213: Standby's repeatable read isolation level transaction encountered a "nonrepeatable read" problem
From
Kyotaro Horiguchi
Date:
At Tue, 5 Dec 2023 19:47:27 +0800, "费长红" <feichanghong@qq.com> wrote in > Yes, the existing "ProcArrayGetReplicationSlotXmin" function has met the functional requirements and I have updated itin the attachment. I'm not sure if it's valid to use slot xmin for this purpose. First off, it overlooks standbys that don't use a slot. Also, inactive slots can lead to unnecessary blocks on CREATE INDEX CONCURRENTLY. I'm not entirely sure but we should avoid a situation that primary gets blocked due to standby activities in asynchronous replication. Finally, I honestly don't think we want to add another instance of sleep-polling using pg_sleep() for this kind of wait. We could add a variant of SyncRepWaitForLSN() for standby xmin for synchronous replication. However, as I mentioned earlier, I don't think this approach is good for asynchronous replication. It might be a good idea to handle it similarly to replication conflicts, by canceling conflicting transactions on standbys. However, as of now, I don't have a clear idea of the specific conditions that should trigger conflist detection. regards -- Kyotaro Horiguchi NTT Open Source Software Center