Thread: Race condition with restore_command on streaming replica
Hi,
I've recently been seeing a race condition with the restore_command on replicas using streaming replication.
On the primary, we are archiving WAL files to S3-compatible storage via pgBackRest. In the recovery settings of postgresql.conf on the replicas (what used to live in recovery.conf), we define the restore command as follows:
restore_command = '/usr/bin/pgbackrest --config /conf/postgresql/pgbackrest_restore.conf --stanza=formation archive-get %f "%p"'
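For completeness, the rest of the recovery-related settings on the replicas look roughly like this (a sketch rather than a verbatim copy of our config; the primary_conninfo values are placeholders, and recovery_target_timeline is just the PG 12 default written out):

recovery_target_timeline = 'latest'   # PG 12 default: follow timeline switches seen via streaming or the archive
primary_conninfo = 'host=primary-endpoint port=5432 user=replicator'   # placeholder values only
# plus an empty standby.signal file in the data directory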
We have a three-member setup - m-0, m-1, m-2. Consider the case where m-0 is the Primary and m-1 and m-2 are replicas connected to m-0.
When issuing a switchover (via Patroni) from m-0 to m-1, the connection from m-2 to m-0 is terminated. The restore_command on m-2 is run, and it looks for the .history file for the new timeline. If this happens before the history file is created and pushed to the archive, m-2 looks for the next WAL file on the existing timeline in the archive instead. That file will never be created, as the source has moved on, so m-2 hangs waiting on it. The restore_command looking for this non-existent file is only run once. This seems like an odd state to be in: the replica is waiting on a new file, but it's not actually looking for it. Is this expected, or should the restore_command be polling for that file?
Restarting m-2 causes it to reconnect, find the .history file in the archive and replication picks up. But that is less than ideal.
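In case it's useful to anyone, the only mitigation we can think of short of a restart is to wrap archive-get in a small retry loop for .history files. This is just a sketch of the idea - the script name, retry count and sleep are made up, and it papers over the symptom rather than the underlying behaviour:

#!/bin/sh
# archive_get_retry.sh (hypothetical) - used as:
#   restore_command = '/path/to/archive_get_retry.sh %f "%p"'
# Retries timeline history fetches a few times so a slightly late archive
# push doesn't leave the replica settled on the old timeline.
WAL_FILE="$1"
WAL_PATH="$2"
TRIES=1
case "$WAL_FILE" in
  *.history) TRIES=5 ;;   # only poll for .history files
esac
n=0
while [ "$n" -lt "$TRIES" ]; do
  if /usr/bin/pgbackrest --config /conf/postgresql/pgbackrest_restore.conf \
       --stanza=formation archive-get "$WAL_FILE" "$WAL_PATH"; then
    exit 0   # file restored
  fi
  n=$((n + 1))
  sleep 2
done
exit 1   # not found; Postgres treats a non-zero exit as "file not in archive"

The obvious downside is that every probe for a history file that legitimately doesn't exist yet now costs several extra seconds, and it still feels like the server should retry that lookup itself rather than settling on the old timeline after a single failed attempt.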
This is Postgres 12.4, managed by Patroni. I reported this to Patroni first for feedback, and they pointed to it being a Postgres issue, not a Patroni one.
Here is a summarized timeline of events from one occurrence; the cluster is on timeline 9:
2020-11-04 14:51:11 m-0 is shut down
2020-11-04 14:51:13.015 m-2 replication connection is terminated
2020-11-04 14:51:13.015 m-2 looks for 0000000A.history in archive, does not find it
2020-11-04 14:51:17.137 m-2 looks for 0000000900000000000000B6 in archive (this file won't ever exist)
2020-11-04 14:51:20.177 m-1 pushes 0000000A.history to archive
At this point, m-2 is stuck until it is restarted
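To confirm the stuck state on m-2, checks along these lines can be used (illustrative only; output omitted, column names as of PG 12):

# on m-2 while it is wedged
psql -Atc "SELECT pg_is_in_recovery();"
# stays 't' - the node is still a standby
psql -Atc "SELECT status, received_tli FROM pg_stat_wal_receiver;"
# returns no rows - the walreceiver has exited, so nothing is streaming
psql -Atc "SELECT timeline_id FROM pg_control_checkpoint();"
# still reports timeline 9, even though the cluster has moved on to 0000000A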
There is a detailed bug report on the Patroni list with all the logs if relevant:
https://github.com/zalando/patroni/issues/1747#issue-736244675
On Thu, Nov 5, 2020 at 1:39 AM Brad Nicholson <bradn@ca.ibm.com> wrote:
>
> Hi,
>
> I've recently been seeing a race condition with the restore_command on replicas using streaming replication.
>
> On the primary, we are archiving WAL files to S3-compatible storage via pgBackRest. In the recovery settings of postgresql.conf on the replicas (what used to live in recovery.conf), we define the restore command as follows:
>
> restore_command = '/usr/bin/pgbackrest --config /conf/postgresql/pgbackrest_restore.conf --stanza=formation archive-get %f "%p"'
>
> We have a three-member setup - m-0, m-1, m-2. Consider the case where m-0 is the Primary and m-1 and m-2 are replicas connected to m-0.
>
> When issuing a switchover (via Patroni) from m-0 to m-1, the connection from m-2 to m-0 is terminated. The restore_command on m-2 is run, and it looks for the .history file for the new timeline. If this happens before the history file is created and pushed to the archive, m-2 looks for the next WAL file on the existing timeline in the archive instead. That file will never be created, as the source has moved on, so m-2 hangs waiting on it. The restore_command looking for this non-existent file is only run once. This seems like an odd state to be in: the replica is waiting on a new file, but it's not actually looking for it. Is this expected, or should the restore_command be polling for that file?

I am not sure how Patroni does it internally, can you explain the scenario in more detail? Suppose you are executing the promote on m-1 and if the promotion is successful it will switch the timeline and it will create the timeline history file. Now, once the promotion is successful if we change the primary_conninfo on the m-2 then it will restart the walsender and look for the latest .history file which it should find either from direct streaming or through the restore_command. If you are saying that m-2 tried to look for the history file before m-1 created it then it seems like you change the primary_conninfo on m-2 before the m-1 promotion got completed.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Dilip Kumar <dilipbalaut@gmail.com> wrote on 2020/11/15 04:47:12 AM:
> I am not sure how Patroni does it internally, can you explain the
> scenario in more detail? Suppose you are executing the promote on m-1
> and if the promotion is successful it will switch the timeline and it
> will create the timeline history file. Now, once the promotion is
> successful if we change the primary_conninfo on the m-2 then it will
> restart the walsender and look for the latest .history file which it
> should find either from direct streaming or through the
> restore_command. If you are saying that m-2 tried to look for the
> history file before m-1 created it then it seems like you change the
> primary_conninfo on m-2 before the m-1 promotion got completed.
Patroni first issues a fast shutdown to m-0. Once that completes, it issues a promote on m-1.
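Concretely, the sequence amounts to something like this (commands are illustrative only - Patroni drives all of it, and $PGDATA is a placeholder):

# on m-0: fast shutdown of the old primary
pg_ctl stop -D "$PGDATA" -m fast

# on m-1, once m-0 has stopped: promote
pg_ctl promote -D "$PGDATA"
# the promotion switches the timeline (9 -> 10 here) and writes 0000000A.history
# into pg_wal; archive_command is then responsible for pushing it to the archive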
In this case, m-2 is not connecting directly to m-0 or m-1 via a hostname; it is connecting to a virtual layer that routes connections to the current primary. This does not change.
Brad.
Given a 3 node cluster M0, M1 and M2.
And M0 is the primary with M1 and M2 as the replicas - we will call this "T0" (Timeline 0).
And a "switch over" command is issued to terminate M0 and promote M1 as the new primary.

Expected Behavior:
When M0 is terminated and M1 is promoted to primary
Then a ".history" file would be written to a shared archive repo (S3) to indicate T0 has ended and T1 is created with M1 as the new primary.
And "restore_command" is executed on the replica M2 to replay any remaining WAL files from T0 as well as switch to streaming from M1 as the new primary henceforth.
Given the "switch over" process is run asynchronously.When "restore_comand" is run on M2 before M1 creates a .history fileThen M2 will understand this as - No new timeline is created, I'll wait for the next wal file from T0.But the next wal file from T0 will NEVER come because it's been terminated and M2 hangs there looking for something that will never exist.
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000002.history’: No such file or directory
cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000001000000010000004E’: No such file or directory
cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000003.history’: No such file or directory
LOG: restored log file "00000002.history" from archive
LOG: new target timeline is 2  # switches to timeline 2 after reading the .history file
cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000002000000010000004E’: No such file or directory  # this happens because primary_conninfo still needs to be changed to point to M1, which requires a restart to be picked up
cp: cannot stat ‘/var/lib/pgsql/12/backups/archives/00000001000000010000004E’: No such file or directory
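For reference, the restore_command in the environment that produced the log above is a plain cp from a local archive directory - reconstructed from the cp errors, so treat the exact line as an approximation:

restore_command = 'cp /var/lib/pgsql/12/backups/archives/%f "%p"'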