Thread: Hot standby read slaves exceed max delay on WAL segment. Replication lag.

Hot standby read slaves exceed max delay on WAL segment. Replication lag.

From
Shaun Duncan
Date:
This is on a production 9.0.15 install with 1 master, 5 hot standby read only slaves.

We've been trying to look into a situation where we're seeing that hot standby read slaves are receiving WAL segments, but are exceeding max_standby_archive_delay (60s) and max_standby_streaming_delay (60s) and not applying changes. The slave will get the first segment and hang (we've seen up to 30m before removing the slave from our read pool to catch up) and get further and further behind the master. Furthermore, we have seen that after a slave has caught up, putting it back into the read pool will mean will almost immediately start to see this happen again. It acts as if we had max_standby_* set to -1.

I'm just looking for some ideas, hints, or suggestions as to what might be going on here or what we might be doing wrong.

Thanks,
Shaun

Re: Hot standby read slaves exceed max delay on WAL segment. Replication lag.

From
Venkata Balaji Nagothi
Date:

On Thu, Mar 20, 2014 at 4:12 AM, Shaun Duncan <shaun.duncan@gmail.com> wrote:

This is on a production 9.0.15 install with 1 master, 5 hot standby read only slaves.

We've been trying to look into a situation where we're seeing that hot standby read slaves are receiving WAL segments, but are exceeding max_standby_archive_delay (60s) and max_standby_streaming_delay (60s) and not applying changes. The slave will get the first segment and hang (we've seen up to 30m before removing the slave from our read pool to catch up) and get further and further behind the master. Furthermore, we have seen that after a slave has caught up, putting it back into the read pool will mean will almost immediately start to see this happen again. It acts as if we had max_standby_* set to -1.

I'm just looking for some ideas, hints, or suggestions as to what might be going on here or what we might be doing wrong.


Have you noticed if the read requests on standby database are taking long and loading up the server ? If yes, any idea why the queries on standby are taking long to complete ? 

WAL replay would be paused ( until the read requests are served ) at the time of conflicting queries on the standby site.

Also, Do you see any delay in master sending the WALs to standby ?


Venkata Balaji N

Sr. Database Administrator
Fujitsu Australia




Re: Hot standby read slaves exceed max delay on WAL segment. Replication lag.

From
Venkata Balaji Nagothi
Date:
On Thu, Mar 20, 2014 at 8:19 AM, Shaun Duncan <shaun.duncan@gmail.com> wrote:


On Mar 19, 2014 5:08 PM, "Venkata Balaji Nagothi" <vbnpgc@gmail.com> wrote:
>
>
> On Thu, Mar 20, 2014 at 4:12 AM, Shaun Duncan <shaun.duncan@gmail.com> wrote:
>
>> This is on a production 9.0.15 install with 1 master, 5 hot standby read only slaves.
>>
>> We've been trying to look into a situation where we're seeing that hot standby read slaves are receiving WAL segments, but are exceeding max_standby_archive_delay (60s) and max_standby_streaming_delay (60s) and not applying changes. The slave will get the first segment and hang (we've seen up to 30m before removing the slave from our read pool to catch up) and get further and further behind the master. Furthermore, we have seen that after a slave has caught up, putting it back into the read pool will mean will almost immediately start to see this happen again. It acts as if we had max_standby_* set to -1.
>>
>> I'm just looking for some ideas, hints, or suggestions as to what might be going on here or what we might be doing wrong.
>>
>
> Have you noticed if the read requests on standby database are taking long and loading up the server ? If yes, any idea why the queries on standby are taking long to complete ? 
>
> WAL replay would be paused ( until the read requests are served ) at the time of conflicting queries on the standby site.
>
> Also, Do you see any delay in master sending the WALs to standby ?
>
>
> Venkata Balaji N
>
> Sr. Database Administrator
> Fujitsu Australia
>
>
>
>

Yes we have seen queries against the read slave degrade severely while this happens. We have at least optimized some inefficient queries. But I would think (correct me if I am wrong) that queries would be canceled after WAL replay has exceeded the max delay. This is something we haven't seen. It behaves as if it is waiting for reads to complete.

 
No. Generally queries continue to run on Standby without any issues, except in the below 2 situations -

  • When conflicting queries (like DROP TABLE, DROP DATABASE etc..)  are executed on master whilst standby is serving the read requests. This behavior can be controlled to some extent by setting max_standby_*_delay parameters (60s in your case). If there are any read requests in process at standby database,the conflicting queries would wait for 60s - if the read requests (Generally SELECTs) do not complete, then they are forcibly cancelled
           In-short max_standby_*_delay parameters only effect the replication behavior when conflicting queries are coming to standby. 
  • When master is waiting to apply VACUUM related changes (removing dead rows) at the standby site. This behavior can be controlled to some extent by "vacuum_defer_cleanup_age". This implies number of transactions executed on the master database.

If there are no conflicting queries from primary waiting to be applied at standby database - then queries would continue to run without any issues until 
statement_timeout ( if configured ) threshold is reached.

Do you see any messages like "ERROR: canceling statement due to conflict... " in standby postgres logs ?


Hope this helps !

Venkata Balaji N

Sr.Database Administrator
Fujitsu Australia


On Wed, Mar 19, 2014 at 11:31 PM, Venkata Balaji Nagothi <vbnpgc@gmail.com> wrote:
On Thu, Mar 20, 2014 at 8:19 AM, Shaun Duncan <shaun.duncan@gmail.com> wrote:


On Mar 19, 2014 5:08 PM, "Venkata Balaji Nagothi" <vbnpgc@gmail.com> wrote:
>
>
> On Thu, Mar 20, 2014 at 4:12 AM, Shaun Duncan <shaun.duncan@gmail.com> wrote:
>
>> This is on a production 9.0.15 install with 1 master, 5 hot standby read only slaves.
>>
>> We've been trying to look into a situation where we're seeing that hot standby read slaves are receiving WAL segments, but are exceeding max_standby_archive_delay (60s) and max_standby_streaming_delay (60s) and not applying changes. The slave will get the first segment and hang (we've seen up to 30m before removing the slave from our read pool to catch up) and get further and further behind the master. Furthermore, we have seen that after a slave has caught up, putting it back into the read pool will mean will almost immediately start to see this happen again. It acts as if we had max_standby_* set to -1.
>>
>> I'm just looking for some ideas, hints, or suggestions as to what might be going on here or what we might be doing wrong.
>>
>
> Have you noticed if the read requests on standby database are taking long and loading up the server ? If yes, any idea why the queries on standby are taking long to complete ? 
>
> WAL replay would be paused ( until the read requests are served ) at the time of conflicting queries on the standby site.
>
> Also, Do you see any delay in master sending the WALs to standby ?
>
>
> Venkata Balaji N
>
> Sr. Database Administrator
> Fujitsu Australia
>
>
>
>

Yes we have seen queries against the read slave degrade severely while this happens. We have at least optimized some inefficient queries. But I would think (correct me if I am wrong) that queries would be canceled after WAL replay has exceeded the max delay. This is something we haven't seen. It behaves as if it is waiting for reads to complete.

 
No. Generally queries continue to run on Standby without any issues, except in the below 2 situations -

  • When conflicting queries (like DROP TABLE, DROP DATABASE etc..)  are executed on master whilst standby is serving the read requests. This behavior can be controlled to some extent by setting max_standby_*_delay parameters (60s in your case). If there are any read requests in process at standby database,the conflicting queries would wait for 60s - if the read requests (Generally SELECTs) do not complete, then they are forcibly cancelled
           In-short max_standby_*_delay parameters only effect the replication behavior when conflicting queries are coming to standby. 
  • When master is waiting to apply VACUUM related changes (removing dead rows) at the standby site. This behavior can be controlled to some extent by "vacuum_defer_cleanup_age". This implies number of transactions executed on the master database.

If there are no conflicting queries from primary waiting to be applied at standby database - then queries would continue to run without any issues until 
statement_timeout ( if configured ) threshold is reached.

Do you see any messages like "ERROR: canceling statement due to conflict... " in standby postgres logs ?


Hope this helps !

Venkata Balaji N

Sr.Database Administrator
Fujitsu Australia



Yes, we do see statements being canceled from time to time, just not when we experience a lag situation with WALs not being applied. Is it just entirely possible that there is too much work for one of the slaves to do (hence the degraded query throughput we see) that it is unable to apply WALs? In other words, there are no conflicts that occur, but WALs cannot be applied an a timely manner? There's a noticeable pattern that in times of lag, the overall number of slow queries dramatically increases (i.e. from a few hundred to tens of thousands).

Thanks! 

Re: Hot standby read slaves exceed max delay on WAL segment. Replication lag.

From
Venkata Balaji Nagothi
Date:
Yes, we do see statements being canceled from time to time, just not when we experience a lag situation with WALs not being applied. Is it just entirely possible that there is too much work for one of the slaves to do (hence the degraded query throughput we see) that it is unable to apply WALs? In other words, there are no conflicts that occur, but WALs cannot be applied an a timely manner? There's a noticeable pattern that in times of lag, the overall number of slow queries dramatically increases (i.e. from a few hundred to tens of thousands).

1. If SQL statements are getting cancelled time-to-time - then max_standby_*_delay parameters are working as expected and there is no issue with them.

2. You need to look at the lag issue, which can occur for number of other reasons.

How do you come to a conclusion that there is a lag between master and slave ? Did you check the last xlog position on master and slave databases ?

Lag occurs even when the load is very high. Do you see any high IO wait on Master or Slave ?

I did not understand when you said "from few hundred to tens of thousands" - These are number of WALs or SQLs you are referring to ?

Do you see any other messages related to lag in Standby logfile (like " WAL archive file not found.. cannot be restored..." etc)

Same is the situation across all the slaves ?

Venkata Balaji N

Sr. Database Administrator
Fujitsu Australia