Thread: [PATCH] Replica sends an incorrect epoch in its hot standby feedbackto the Master

[PATCH] Replica sends an incorrect epoch in its hot standby feedbackto the Master

From
"Palamadai, Eka"
Date:

Hello Postgres hackers,

The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12, since the commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it.

Problem

The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where a checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback causes the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executed at the master immediately thereafter, they may use a newer “oldest Xmin” Y > X,  and prematurely delete a tuple T such that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary pain to customers.

 

Scenario

              The experimental setup is to start 2 write transactions and get their transaction ids, then create a checkpoint, and then have the oldest of the 2 transactions perform a write operation. The WAL replay of that write operation breaks the monotonically increasing property of “nextXid” tracked by the replica, and causes it to send a wrong epoch in its next feedback.

  1. Set the following parameters on the replica and the master.
    1. replica:

                                                                          i.      set hot_standby = on and hot_standby_feedback = on to provide hot standby feedback.

                                                                         ii.      set log_min_messages = DEBUG2 so that hot standby feedback[2] gets logged.

    1. master:

                                                                          i.      set checkpoint_timeout = 3600 (1 hour) to turn off automatic checkpoints during the experiment.

                                                                         ii.      set autovacuum = off to avoid interference from the vacuum process.

  1. Make sure no other read/write queries get executed on the replica and the master during this experiment, since they might change the nextXid on the replica. To achieve this, I set up a master and replica running Postgres 11 “locally” on my desktop, with the above parameter settings.   
  2. Start a psql client to the master, and do the following:

a.      Create a test table, say “test_epoch_table”. Execute “CREATE TABLE test_epoch_table(id INT, description CHAR(200));”.

b.      Begin a transaction, say A, and get the current transaction ID. Execute “begin; select txid_current();”.

  1. Start a second psql client to the master. Begin a transaction, say B, and get the current transaction ID. Execute “begin; select txid_current();”.
  2. Start a third psql client to the master, and create a checkpoint. Execute “checkpoint;”.
  3. From transaction A, insert a tuple into “test_epoch_table”. Execute “insert into test_epoch_table(id, description) values(1, 'one');”
  4. Open the Postgres log file on the replica and look for the latest hot standby feedback. The log message will show that the replica sent an incorrect epoch of 1, instead of 0.

 

Analysis

The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wraps around to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3], when the WAL replay for the insertion at step 6 is executed at the replica.

 

For example, before step 3b is executed, suppose the nextXid at the replica and master is 100, and that the variable “latestObservedXid”, which tracks the latest observed Xid at the replica, is 99. Now, suppose the Xid for transaction A is 100, and that for transaction B is 101. After step 4 is executed, the latestObservedXid at the replica still remains at 99, since the WAL replay for RUNNING_XACTS doesn’t advance latestObservedXid, once the replica reaches the STANDBY_SNAPSHOT_READY state. See [4] for reference. After step 5 is executed, the WAL replay of the checkpoint advances nextXid at the replica to 102, but doesn’t invoke RecordKnownAssignedTransactionIds, and hence latestObservedXid still remains 99.  After step 6 is executed, the WAL replay of the insertion invokes RecordKnownAssignedTransactionIds with an input Xid of 100, advances latestObservedXid to 100, and sets nextXid at the replica to 101 (at line [3]) breaking its monotonicity.  The hot standby feedback, which is generated immediately after the WAL replay of the insertion, invokes the function “GetNextXidAndEpoch”, which incorrectly calculates the epoch [5].

 

Versions impacted by this problem: 11, 10, and 9.6

 

Attached is the git diff of the patch for version 11, which is a 1-line change excluding the curly braces. Patch for versions 10 and 9.6 involve the same change.

 

Please let me know if you have any questions.

 

Thanks,
Eka Palamadai

Amazon Web Services

 

[1] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/heap/pruneheap.c#L74

[2] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walreceiver.c#L1244

[3] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L3259

[4] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L693

[5] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/transam/xlog.c#L8444

[6] https://github.com/postgres/postgres/commit/2fc7af5e966043a412e8e69c135fae55a2db6d4f

 

Attachment
On Fri, Feb 7, 2020 at 1:03 PM Palamadai, Eka <ekanatha@amazon.com> wrote:
> The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12,
sincethe commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it. 

I'm happy that that stuff is already fixing bugs we didn't know we
had, but, yeah, it looks like it really only fixed it incidentally by
moving all the duplicated "assign if higher" code into a function, not
through the magical power of 64 bit xids.

> The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where
acheckpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback
causesthe master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executed at
themaster immediately thereafter, they may use a newer “oldest Xmin” Y > X,  and prematurely delete a tuple T such that
X< t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the
replicareplays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary
painto customers. 

Ouch.  Thanks for this analysis!

> The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it
wrapsaround to the next epoch. However, in the above sequence, this property is violated on the replica in the function
“RecordKnownAssignedTransactionIds”[3],when the WAL replay for the insertion at step 6 is executed at the replica. 

I haven't tried your repro or studied this closely yet, but yes, that
assignment to nextXid does indeed look pretty fishy.  Other similar
code elsewhere always does a check like in your patch, before
clobbering nextXid.



Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master

From
"Palamadai, Eka"
Date:
Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this
patchto "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.
 

Thanks,
Eka Palamadai
Amazon Web Services

On 2/11/20, 11:28 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

    On Fri, Feb 7, 2020 at 1:03 PM Palamadai, Eka <ekanatha@amazon.com> wrote:
    > The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version
12,since the commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it.
 
    
    I'm happy that that stuff is already fixing bugs we didn't know we
    had, but, yeah, it looks like it really only fixed it incidentally by
    moving all the duplicated "assign if higher" code into a function, not
    through the magical power of 64 bit xids.
    
    > The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below,
wherea checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the
feedbackcauses the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were
executedat the master immediately thereafter, they may use a newer “oldest Xmin” Y > X,  and prematurely delete a tuple
Tsuch that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q.
Subsequently,when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q
causingunnecessary pain to customers.
 
    
    Ouch.  Thanks for this analysis!
    
    > The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it
wrapsaround to the next epoch. However, in the above sequence, this property is violated on the replica in the function
“RecordKnownAssignedTransactionIds”[3],when the WAL replay for the insertion at step 6 is executed at the replica.
 
    
    I haven't tried your repro or studied this closely yet, but yes, that
    assignment to nextXid does indeed look pretty fishy.  Other similar
    code elsewhere always does a check like in your patch, before
    clobbering nextXid.
    


Re: [PATCH] Replica sends an incorrect epoch in its hot standbyfeedback to the Master

From
Juan José Santamaría Flecha
Date:
On Fri, Feb 21, 2020 at 8:15 PM Palamadai, Eka <ekanatha@amazon.com> wrote:

Please, do not top post on this list.

Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this patch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.

Apparently, there are a couple of duplicate entries in the commitfest as: https://commitfest.postgresql.org/27/2463/ and https://commitfest.postgresql.org/27/2462/

Could close those as "withdrawn"?

Regards,

Juan José Santamaría Flecha
On 3/2/20 10:33 AM, Juan José Santamaría Flecha wrote:
> On Fri, Feb 21, 2020 at 8:15 PM Palamadai, Eka <ekanatha@amazon.com 
> <mailto:ekanatha@amazon.com>> wrote:
> 
> Please, do not top post on this list.
> 
>     Thanks a lot for the feedback. Please let me know if you have any
>     further comments. Meanwhile, I have also added this patch to
>     "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.
> 
> 
> Apparently, there are a couple of duplicate entries in the commitfest 
> as: https://commitfest.postgresql.org/27/2463/ and 
> https://commitfest.postgresql.org/27/2462/
> 
> Could close those as "withdrawn"?

I have marked the duplicate entries 2462 and 2463 as withdrawn.

-- 
-David
david@pgmasters.net



On Sat, Feb 22, 2020 at 6:10 AM Palamadai, Eka <ekanatha@amazon.com> wrote:
> Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added this
patchto "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.
 

Thanks for the excellent reproducer for this obscure bug.  You said
the problem exists in 9.6-11, but I'm also able to reproduce it in
9.5.  That's the oldest supported release, but it probably goes back
further.  I confirmed that this patch fixes the immediate problem.
I've attached a version of your patch with a commit message, to see if
anyone has more feedback on this.

Attachment
On Wed, Mar 11, 2020 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Feb 22, 2020 at 6:10 AM Palamadai, Eka <ekanatha@amazon.com> wrote:
> > Thanks a lot for the feedback. Please let me know if you have any further comments. Meanwhile, I have also added
thispatch to "Commitfest 2020-03" at https://commitfest.postgresql.org/27/2464.
 
>
> Thanks for the excellent reproducer for this obscure bug.  You said
> the problem exists in 9.6-11, but I'm also able to reproduce it in
> 9.5.  That's the oldest supported release, but it probably goes back
> further.  I confirmed that this patch fixes the immediate problem.
> I've attached a version of your patch with a commit message, to see if
> anyone has more feedback on this.

Pushed.