Thread: [PATCH] Replica sends an incorrect epoch in its hot standby feedbackto the Master
[PATCH] Replica sends an incorrect epoch in its hot standby feedbackto the Master
Hello Postgres hackers,
The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12, since the commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it.
Problem
The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where a checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback causes the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executed at the master immediately thereafter, they may use a newer “oldest Xmin” Y > X, and prematurely delete a tuple T such that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary pain to customers.
Scenario
The experimental setup is to start 2 write transactions and get their transaction ids, then create a checkpoint, and then have the oldest of the 2 transactions perform a write operation. The WAL replay of that write operation breaks the monotonically increasing property of “nextXid” tracked by the replica, and causes it to send a wrong epoch in its next feedback.
- Set the following parameters on the replica and the master.
- replica:
i. set hot_standby = on and hot_standby_feedback = on to provide hot standby feedback.
ii. set log_min_messages = DEBUG2 so that hot standby feedback[2] gets logged.
- master:
i. set checkpoint_timeout = 3600 (1 hour) to turn off automatic checkpoints during the experiment.
ii. set autovacuum = off to avoid interference from the vacuum process.
- Make sure no other read/write queries get executed on the replica and the master during this experiment, since they might change the nextXid on the replica. To achieve this, I set up a master and replica running Postgres 11 “locally” on my desktop, with the above parameter settings.
- Start a psql client to the master, and do the following:
a. Create a test table, say “test_epoch_table”. Execute “CREATE TABLE test_epoch_table(id INT, description CHAR(200));”.
b. Begin a transaction, say A, and get the current transaction ID. Execute “begin; select txid_current();”.
- Start a second psql client to the master. Begin a transaction, say B, and get the current transaction ID. Execute “begin; select txid_current();”.
- Start a third psql client to the master, and create a checkpoint. Execute “checkpoint;”.
- From transaction A, insert a tuple into “test_epoch_table”. Execute “insert into test_epoch_table(id, description) values(1, 'one');”
- Open the Postgres log file on the replica and look for the latest hot standby feedback. The log message will show that the replica sent an incorrect epoch of 1, instead of 0.
Analysis
The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wraps around to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3], when the WAL replay for the insertion at step 6 is executed at the replica.
For example, before step 3b is executed, suppose the nextXid at the replica and master is 100, and that the variable “latestObservedXid”, which tracks the latest observed Xid at the replica, is 99. Now, suppose the Xid for transaction A is 100, and that for transaction B is 101. After step 4 is executed, the latestObservedXid at the replica still remains at 99, since the WAL replay for RUNNING_XACTS doesn’t advance latestObservedXid, once the replica reaches the STANDBY_SNAPSHOT_READY state. See [4] for reference. After step 5 is executed, the WAL replay of the checkpoint advances nextXid at the replica to 102, but doesn’t invoke RecordKnownAssignedTransactionIds, and hence latestObservedXid still remains 99. After step 6 is executed, the WAL replay of the insertion invokes RecordKnownAssignedTransactionIds with an input Xid of 100, advances latestObservedXid to 100, and sets nextXid at the replica to 101 (at line [3]) breaking its monotonicity. The hot standby feedback, which is generated immediately after the WAL replay of the insertion, invokes the function “GetNextXidAndEpoch”, which incorrectly calculates the epoch [5].
Versions impacted by this problem: 11, 10, and 9.6
Attached is the git diff of the patch for version 11, which is a 1-line change excluding the curly braces. Patch for versions 10 and 9.6 involve the same change.
Please let me know if you have any questions.
Thanks,
Eka Palamadai
Amazon Web Services
[1] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/heap/pruneheap.c#L74
[2] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walreceiver.c#L1244
[3] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L3259
[4] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L693
[5] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/transam/xlog.c#L8444
[6] https://github.com/postgres/postgres/commit/2fc7af5e966043a412e8e69c135fae55a2db6d4f
Attachment
Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master
On Fri, Feb 7, 2020 at 1:03 PM Palamadai, Eka <ekanatha@amazon.com> wrote: > The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12, sincethe commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it. I'm happy that that stuff is already fixing bugs we didn't know we had, but, yeah, it looks like it really only fixed it incidentally by moving all the duplicated "assign if higher" code into a function, not through the magical power of 64 bit xids. > The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where acheckpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback causesthe master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executed at themaster immediately thereafter, they may use a newer “oldest Xmin” Y > X, and prematurely delete a tuple T such that X< t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the replicareplays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary painto customers. Ouch. Thanks for this analysis! > The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wrapsaround to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3],when the WAL replay for the insertion at step 6 is executed at the replica. I haven't tried your repro or studied this closely yet, but yes, that assignment to nextXid does indeed look pretty fishy. Other similar code elsewhere always does a check like in your patch, before clobbering nextXid.
Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master
Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patchto "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464. Thanks, Eka Palamadai Amazon Web Services On 2/11/20, 11:28 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote: On Fri, Feb 7, 2020 at 1:03 PM Palamadai, Eka <ekanatha@amazon.com> wrote: > The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12,since the commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it. I'm happy that that stuff is already fixing bugs we didn't know we had, but, yeah, it looks like it really only fixed it incidentally by moving all the duplicated "assign if higher" code into a function, not through the magical power of 64 bit xids. > The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, wherea checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedbackcauses the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executedat the master immediately thereafter, they may use a newer “oldest Xmin” Y > X, and prematurely delete a tuple Tsuch that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently,when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q causingunnecessary pain to customers. Ouch. Thanks for this analysis! > The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wrapsaround to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3],when the WAL replay for the insertion at step 6 is executed at the replica. I haven't tried your repro or studied this closely yet, but yes, that assignment to nextXid does indeed look pretty fishy. Other similar code elsewhere always does a check like in your patch, before clobbering nextXid.
Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master
Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.
Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master
On 3/2/20 10:33 AM, Juan José Santamaría Flecha wrote: > On Fri, Feb 21, 2020 at 8:15 PM Palamadai, Eka <ekanatha@amazon.com > <mailto:ekanatha@amazon.com>> wrote: > > Please, do not top post on this list. > > Thanks a lot for the feedback. Please let me know if you have any > further comments. Meanwhile, I have also added this patch to > "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464. > > > Apparently, there are a couple of duplicate entries in the commitfest > as: https://commitfest.postgresql.org/27/2463/ and > https://commitfest.postgresql.org/27/2462/ > > Could close those as "withdrawn"? I have marked the duplicate entries 2462 and 2463 as withdrawn. -- -David david@pgmasters.net
Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master
On Sat, Feb 22, 2020 at 6:10 AM Palamadai, Eka <ekanatha@amazon.com> wrote: > Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patchto "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464. Thanks for the excellent reproducer for this obscure bug. You said the problem exists in 9.6-11, but I'm also able to reproduce it in 9.5. That's the oldest supported release, but it probably goes back further. I confirmed that this patch fixes the immediate problem. I've attached a version of your patch with a commit message, to see if anyone has more feedback on this.
Attachment
Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master
On Wed, Mar 11, 2020 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Feb 22, 2020 at 6:10 AM Palamadai, Eka <ekanatha@amazon.com> wrote: > > Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added thispatch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464. > > Thanks for the excellent reproducer for this obscure bug. You said > the problem exists in 9.6-11, but I'm also able to reproduce it in > 9.5. That's the oldest supported release, but it probably goes back > further. I confirmed that this patch fixes the immediate problem. > I've attached a version of your patch with a commit message, to see if > anyone has more feedback on this. Pushed.