Help with synchronous replication automation

From: Todd Nine
Hi all,
  I'm in the process of writing an application/API that will provide functionality similar to RDS on Kubernetes, using Postgres.  For our first use case, I have the following setup.

Master (0) --(synchronous replica)--> Standby 1
           --(synchronous replica)--> Standby 2


I have the following configuration in my postgresql.conf on the master.

synchronous_standby_names = '1,2'

And the following connections in standby 1 and standby 2, respectively.


standby_mode = on
  primary_conninfo = 'host=postgres-toddtest-write port=5432 user=postgres application_name=1'
  trigger_file = '/tmp/postgresql.trigger.5432'
  recovery_target_timeline=latest


standby_mode = on
  primary_conninfo = 'host=postgres-toddtest-write port=5432 user=postgres application_name=2'
  trigger_file = '/tmp/postgresql.trigger.5432'
  recovery_target_timeline=latest

Replication appears to be working.  In the master's logs, I see the following:

LOG:  database system is ready to accept connections
LOG:  standby "1" is now the synchronous standby with priority 1
LOG:  standby "2" is now the synchronous standby with priority 2
LOG:  standby "2" is now the synchronous standby with priority 2
LOG:  standby "1" is now the synchronous standby with priority 1


However, when I run the following on my slave nodes, the lag time seems enormous.

SELECT  now() - pg_last_xact_replay_timestamp() AS time_lag;
    time_lag
----------------
 00:21:26.33019

As a result, I have a few questions I can't seem to find the answers to in the documentation; any help would be greatly appreciated.


1) In this doc, it states: "If the standby is the first matching standby, as specified in synchronous_standby_names on the primary, the reply messages from that standby will be used to wake users waiting for confirmation that the commit record has been received."

https://www.postgresql.org/docs/9.5/static/warm-standby.html

My understanding is that this means if Standby 1 has successfully fsynced that commit to disk, the server will return a successful commit response to the client.  What happens to Standby 2?  I'm assuming it's sent the same WAL entry asynchronously, but I wanted to be sure.

2) How can I validate the replication latency in standbys that are NOT the current hot standby?  For instance, I want to deploy or upgrade the PG nodes automatically.  I would add more hot standbys, wait for them to catch up.  I would then fail over to one of them as the new master node.  Once this is complete and working, I'd remove the old master and replicas.

In order to automate 2), I need some way to verify that the standby that's just been created is up to date, as well as receiving the latest traffic.

Thanks in advance!
Todd


Re: Help with synchronous replication automation

From: Keith


On Thu, Oct 13, 2016 at 5:39 PM, Todd Nine <todd.nine@gmail.com> wrote:
> [...]
>
> 1) In this doc, it states: "If the standby is the first matching standby, as specified in synchronous_standby_names on the primary, the reply messages from that standby will be used to wake users waiting for confirmation that the commit record has been received."
>
> https://www.postgresql.org/docs/9.5/static/warm-standby.html
>
> My understanding is that this means if Standby 1 has successfully fsynced that commit to disk, the server will return a successful commit response to the client.  What happens to Standby 2?  I'm assuming it's sent the same WAL entry asynchronously, but I wanted to be sure.

I believe you have this correct. I'd still get confirmation on this from someone else, though. As of 9.6, you also have the option of requiring a minimum number of synchronous standbys that must respond. Before 9.6, it just went down the list in order and the first one to respond was all that was necessary for the master to confirm the transaction.

https://www.postgresql.org/docs/9.6/static/warm-standby.html#SYNCHRONOUS-REPLICATION
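
For reference, a minimal sketch of that 9.6 syntax, using the standby names from the original post (double-quoted here defensively, since the names are purely numeric):

# postgresql.conf on the master, PostgreSQL 9.6+
synchronous_standby_names = '2 ("1", "2")'    # wait for BOTH listed standbys
# synchronous_standby_names = '1 ("1", "2")'  # pre-9.6 behaviour: wait only for
                                              # the highest-priority connected standby

Either way, a listed standby that is not currently being waited on still streams the same WAL; commits simply don't block on it.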
 

> 2) How can I validate the replication latency in standbys that are NOT the current hot standby? For instance, I want to deploy or upgrade the PG nodes automatically. I would add more hot standbys, wait for them to catch up. I would then fail over to one of them as the new master node. Once this is complete and working, I'd remove the old master and replicas.
>
> In order to automate 2), I need some way to verify that the standby that's just been created is up to date, as well as receiving the latest traffic.


To get a more reliable picture of the status of the slaves, you'll want to query the master for the byte lag rather than querying the slave for the last transaction replay. If the master is getting no writes, then checking replay on the slave will give a false report that it is falling behind, simply because it has run no transactions. I've explained this in more detail, with some example queries, on my blog:

https://www.keithf4.com/monitoring_streaming_slave_lag/

It's still a good monitor to have, though, and I recommend monitoring both byte lag from the master and replay from the slave. Just set your monitoring alerts appropriately.
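
A minimal sketch of both checks, using the 9.5/9.6 function and catalog names (PostgreSQL 10 renames these to pg_current_wal_lsn(), pg_wal_lsn_diff(), and so on):

-- On the master: per-standby byte lag relative to the WAL the master has
-- generated so far, plus which standby is currently the synchronous one.
SELECT application_name,
       state,
       sync_state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;

-- On a standby: how far receipt and replay have progressed, and the age of
-- the last replayed commit (only meaningful while the master is taking writes).
SELECT pg_last_xlog_receive_location() AS received,
       pg_last_xlog_replay_location()  AS replayed,
       now() - pg_last_xact_replay_timestamp() AS replay_time_lag;

For the automation in question 2, a newly added standby could be considered caught up once its replay_lag_bytes on the master stays near zero while writes are flowing.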

Re: Help with synchronous replication automation

From: Gilberto Castillo

Use 9.6, with synchronous_commit = 'remote_apply'.
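
In other words (a sketch of the relevant master setting; the rest of the configuration from the original post is assumed unchanged):

# postgresql.conf on the master, PostgreSQL 9.6+: a commit does not return
# until the synchronous standby has replayed the WAL, so a read on that
# standby immediately afterwards will see the committed data.
synchronous_commit = remote_apply

Since synchronous_commit can also be set per session or per transaction, remote_apply can be limited to the writes that actually need read-your-writes behaviour on the standby.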


--
Regards,
Gilberto Castillo
ETECSA, La Habana, Cuba