Thread: replication lag despite corrective config

From
Wyatt Alt
Date:
I've been struggling to eliminate replication lag on a Postgres 9.6.6 instance on Amazon RDS. I believe the lag is caused by early cleanup conflicts from vacuums on the master, because I can reliably resolve it by killing long-running queries on the standby. I most recently saw ten hours of lag on Saturday and addressed it this way.

The standby is running with
hot_standby_feedback = on
max_standby_streaming_delay = 5min
max_standby_archive_delay = 30s

I am not using replication slots on the primary due to reported negative interactions with pg_repack on large tables.

My rationale for the first two settings is that hot_standby_feedback should address my issues almost all the time, but that max_standby_streaming_delay would sometimes be necessary as a fallback, for instance in cases of a transient connection loss between the standby and primary. I believe these settings are mostly working, because lag is less frequent than it was when I configured them.

My questions are:
* Am I overlooking anything in my configuration?
* What would explain lag caused by query conflicts given the max_standby_streaming_delay setting? Shouldn't those queries be getting killed?
* Is there any particular diagnostic info I should be collecting on the next occurrence, to help me figure out the cause? Note that as I'm on RDS, I don't have direct access to the datadir -- just psql.

Thanks for any advice!
Wyatt

Re: replication lag despite corrective config

From
Wyatt Alt
Date:
Sorry, I see now there was a similar question a few days ago:

Two ideas proposed (aside from disconnects):
* Autovacuum is truncating a page on the master and taking an AccessExclusiveLock on the table in use on the replica
* A "pin conflict", which I'm still unfamiliar with.

The user's response says they were in the first bucket, but that argument relies on max_standby_streaming_delay being set to -1, while mine is 5 minutes. I need to understand pin conflicts better, but the likely scenario Andrew outlined doesn't apply to me: my offending queries were doing bitmap heap scans on a 300GB table.

Reading the thread I see Andres ask for the "precise conflict" the user gets -- is there a way I can get that without a datadir? And to re-frame the original question, are there causes of replication lag that max_standby_streaming_delay would not be expected to prevent, that would be resolved by killing long standby queries? If so, what's the best way to confirm?
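The closest thing I can find from psql alone is pg_stat_database_conflicts on the standby, which I believe breaks canceled queries down by conflict cause. I plan to sample it before and during the next incident to see which counter moves:

```sql
-- On the standby: cumulative counts of queries canceled, by conflict cause
SELECT datname,
       confl_tablespace, -- dropped tablespaces
       confl_lock,       -- lock timeouts, e.g. AccessExclusiveLock from vacuum truncation
       confl_snapshot,   -- early cleanup of rows a standby query could still see
       confl_bufferpin,  -- buffer pin conflicts
       confl_deadlock    -- startup-process deadlocks
FROM pg_stat_database_conflicts;
```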

Wyatt

On Mon, Nov 19, 2018 at 5:46 PM Wyatt Alt <wyatt.alt@gmail.com> wrote:

Re: replication lag despite corrective config

From
Rene Romero Benavides
Date:
Not sure about the root cause, but I can make these observations and raise some questions:
1) 9.6.6 is five bug-fix releases behind.
2) 300GB is a big table; would it make sense to partition it?
2a) Or if it's already partitioned, does the creation or dropping of partitions coincide with the conflicts?
3) Can you track long-running transactions on the master?
4) What are the isolation levels on the master and replica?
5) Check for active locks on the replica; you should see some blocked transactions during big delays. I've seen this in the past when hot_standby_feedback is turned off.
6) Any out-of-the-ordinary messages in the replica's logs? Any evidence that it has been canceling statements?
7) Are the master and replica identical in terms of resources and main parameters?
8) How is performance on both nodes while the big delay is happening (I/O, CPU load, etc.)?
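For (3) and (5), something along these lines should work from plain psql on either node (a sketch; adjust the ordering and limits to taste):

```sql
-- Oldest open transactions (run on both master and replica)
SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- Lock overview on the replica; ungranted locks point at blocking
SELECT locktype, relation::regclass AS rel, mode, granted, count(*)
FROM pg_locks
GROUP BY 1, 2, 3, 4
ORDER BY granted, count(*) DESC;
```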

Also, check this out:



--
Genius is 1% inspiration and 99% perspiration.
Thomas Alva Edison
http://pglearn.blogspot.mx/

Re: replication lag despite corrective config

From
Laurenz Albe
Date:
On Mon, 2018-11-19 at 17:46 -0800, Wyatt Alt wrote:
> I've been struggling to eliminate replication lag on a Postgres 9.6.6
> instance on Amazon RDS. I believe the lag is caused by early cleanup
> conflicts from vacuums on the master, because I can reliably resolve
> it by killing long-running queries on the standby. I most recently
> saw ten hours of lag on Saturday and addressed it this way.
> 
> The standby is running with
> hot_standby_feedback = on
> max_standby_streaming_delay = 5min
> max_standby_archive_delay = 30s

With these settings, any conflicting query will be canceled after five
minutes.  Perhaps your actual settings are different.

What do you get for

SELECT * FROM pg_settings WHERE name = 'max_standby_streaming_delay';

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: replication lag despite corrective config

From
Wyatt Alt
Date:
On 11/19/18 11:09 PM, Laurenz Albe wrote:
> With these settings, any conflicting query will be canceled after five
> minutes.  Perhaps your actual settings are different.
>
> What do you get for
>
> SELECT * FROM pg_settings WHERE name = 'max_standby_streaming_delay';
Hi Laurenz, thanks for backing up my understanding.

The settings are as I reported:

db=> show max_standby_streaming_delay;
  max_standby_streaming_delay
-----------------------------
  5min

db=> show hot_standby_feedback;
hot_standby_feedback
----------------------
  on

db=> show max_standby_archive_delay;
max_standby_archive_delay
---------------------------
  30s

db=> select * from pg_settings where name = 'max_standby_streaming_delay';
-[ RECORD 1 ]---+----------------------------------------------------------
name            | max_standby_streaming_delay
setting         | 300000
unit            | ms
category        | Replication / Standby Servers
short_desc      | Sets the maximum delay before canceling queries when a 
<snip>
extra_desc      | <<NULL>>
context         | sighup
vartype         | integer
source          | configuration file
min_val         | -1
max_val         | 2147483647
enumvals        | <<NULL>>
boot_val        | 30000
reset_val       | 300000
sourcefile      | <<NULL>>
sourceline      | <<NULL>>
pending_restart | f

In case it's of additional interest, the full configs for the replica 
and the primary are here:

https://gist.github.com/wkalt/a33e7628c63bfc71a6d0ee85103302d9

and here: https://gist.github.com/wkalt/a9dd5ceb0b01b50615c51063ee972117.



Re: replication lag despite corrective config

From
Wyatt Alt
Date:

Hi Rene,

On 11/19/18 8:46 PM, Rene Romero Benavides wrote:
> Not sure about the root cause, but I can make these observations and raise some questions:
> 1) 9.6.6 is five bug-fix releases behind.
Valid point to raise.
> 2) 300GB is a big table; would it make sense to partition it?
> 2a) Or if it's already partitioned, does the creation or dropping of partitions coincide with the conflicts?
Partitioning is in the works, but none at the moment.

> 3) Can you track long-running transactions on the master?
> 4) What are the isolation levels on the master and replica?

Transaction times on the master max out around two minutes. On the replica they are much longer -- numerous 1 - 2 hour transactions per day, and occasional ones as long as 10 - 20 hours. Isolation levels are read committed everywhere.

> 5) Check for active locks on the replica; you should see some blocked transactions during big delays. I've seen this in the past when hot_standby_feedback is turned off.
> 6) Any out-of-the-ordinary messages in the replica's logs? Any evidence that it has been canceling statements?

I'll make a note to record the active locks next time. I haven't seen anything unusual in the logs during these incidents, but have observed statements getting canceled at other times, which is why I think the config mostly works.

> 7) Are the master and replica identical in terms of resources and main parameters?
> 8) How is performance on both nodes while the big delay is happening (I/O, CPU load, etc.)?

This brings up a good detail I forgot to mention originally. During the last incident, I/O utilization on the replica was near 100%, and had been for several hours, which I believe was due to the long queries I canceled. Now that I think about it, I wonder if the lag may have arisen from I/O contention between the queries and WAL replay, rather than a query conflict.
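One way I might distinguish the two next time: if WAL is still arriving but not being replayed, the problem is on the replay side (a conflict, or replay starved of I/O); if receipt itself stalls, it's the connection. A sketch using the 9.6 function names (these were renamed to pg_last_wal_* in later versions):

```sql
-- On the standby: compare WAL received vs WAL replayed, and replay staleness
SELECT pg_last_xlog_receive_location() AS received_lsn,
       pg_last_xlog_replay_location()  AS replayed_lsn,
       now() - pg_last_xact_replay_timestamp() AS replay_delay;
```

If received_lsn keeps advancing while replayed_lsn stays pinned, something on the standby is holding up replay.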

Thanks, interesting reading.


Re: replication lag despite corrective config

From
Rene Romero Benavides
Date:
You're welcome.
Since last Saturday, when you addressed the ten-hour delay, have you seen more such delay incidents with the current settings? And what were the previous settings?
Beware that hot_standby_feedback = on combined with such long queries on the replica can increase bloat on the master. Are you measuring bloat? If so, have you noticed a significant increase?
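A rough way to watch for that from psql (dead-tuple counters only, not a true bloat estimate), along with the xmin that standby feedback is holding back on the master:

```sql
-- On the master: tables with the most dead tuples
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- On the master: the oldest xmin each standby's feedback is holding back
SELECT application_name, backend_xmin
FROM pg_stat_replication;
```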

On Tue, 20 Nov 2018 at 23:08, Wyatt Alt <wyatt.alt@gmail.com> wrote:


Re: replication lag despite corrective config

From
Rene Romero Benavides
Date:
How big have the delays been since the new settings took effect? I would guess significantly smaller than before; by how much have they decreased?

On Wed, 21 Nov 2018 at 13:18, Rene Romero Benavides <rene.romero.b@gmail.com> wrote: