Thread: PostgreSQL 9.2 high replication lag

PostgreSQL 9.2 high replication lag

From
Lucas
Date:
Hello all,

I have a cluster of PostgreSQL 9.2 servers (yes, I know... we're working on a migration to PG 13) that has one master and one slave.
image.png

I know that the streaming replication is expected to have delays, but this is getting worse and worse everyday. You can see the replication gap is reaching >10 minutes quite often.

The servers are hosted in AWS as EC2 instances and they're both in different AZs. I also know that the streaming replication was first introduced in PG 9.0, so we're using a very old version of it.. I'm sure it has improved over the years...
What can I look for to make this lag not so high? Is there anything I could change in my postgresql.conf files? Any tips?

Thanks!

---
Regards,

Lucas

This message is encrypted. Both the Public Key and the GPG encrypted message are included in this email so that you can verify its origin.


Attachment

Re: PostgreSQL 9.2 high replication lag

From
Lucas
Date:
Okay.

Looks like I found the "issue". A while ago we set max_standby_streaming_delay to -1, because we were getting to many errors "ERROR: canceling statement due to conflict with recovery.".

According to the documentation, max_standby_streaming_delay is a configuration parameter determining how long a standby server should wait before canceling queries that conflict with pending WAL entries received via streaming replication.

My question then is: Which queries, if the slave can only receive SELECTs?

---
Regards,

Lucas

This message is encrypted. Both the Public Key and the GPG encrypted message are included in this email so that you can verify its origin.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, July 14th, 2021 at 1:09 PM, Lucas <root@sud0.nz> wrote:
Hello all,

I have a cluster of PostgreSQL 9.2 servers (yes, I know... we're working on a migration to PG 13) that has one master and one slave.
image.png

I know that the streaming replication is expected to have delays, but this is getting worse and worse everyday. You can see the replication gap is reaching >10 minutes quite often.

The servers are hosted in AWS as EC2 instances and they're both in different AZs. I also know that the streaming replication was first introduced in PG 9.0, so we're using a very old version of it.. I'm sure it has improved over the years...
What can I look for to make this lag not so high? Is there anything I could change in my postgresql.conf files? Any tips?

Thanks!

---
Regards,

Lucas

This message is encrypted. Both the Public Key and the GPG encrypted message are included in this email so that you can verify its origin.



Attachment

Re: PostgreSQL 9.2 high replication lag

From
"David G. Johnston"
Date:
On Tue, Jul 13, 2021 at 8:01 PM Lucas <root@sud0.nz> wrote:
According to the documentation, max_standby_streaming_delay is a configuration parameter determining how long a standby server should wait before canceling queries that conflict with pending WAL entries received via streaming replication.

My question then is: Which queries, if the slave can only receive SELECTs?

The select queries...

Follow the link in the docs for max_stanbdy_steaming_delay to:


for the details.

David J.

Re: PostgreSQL 9.2 high replication lag

From
Laurenz Albe
Date:
On Tue, 2021-07-13 at 20:14 -0700, David G. Johnston wrote:
> On Tue, Jul 13, 2021 at 8:01 PM Lucas <root@sud0.nz> wrote:
> > According to the documentation, max_standby_streaming_delay is a configuration parameter
> > determining how long a standby server should wait before canceling queries that conflict
> >  with pending WAL entries received via streaming replication.
> > 
> > My question then is: Which queries, if the slave can only receive SELECTs?
> 
> The select queries...
> 
> Follow the link in the docs for max_stanbdy_steaming_delay to:
> 
> https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT
> 
> for the details.

Perhaps you might find the following article interesting, where I tried to
discuss this topic in some depth:

https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: PostgreSQL 9.2 high replication lag

From
Lucas
Date:


‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Wednesday, July 14th, 2021 at 5:30 PM, Laurenz Albe laurenz.albe@cybertec.at wrote:

On Tue, 2021-07-13 at 20:14 -0700, David G. Johnston wrote:

On Tue, Jul 13, 2021 at 8:01 PM Lucas root@sud0.nz wrote:

According to the documentation, max_standby_streaming_delay is a configuration parameter

determining how long a standby server should wait before canceling queries that conflict

with pending WAL entries received via streaming replication.

My question then is: Which queries, if the slave can only receive SELECTs?

The select queries...

Follow the link in the docs for max_stanbdy_steaming_delay to:

https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT

for the details.

Perhaps you might find the following article interesting, where I tried to

discuss this topic in some depth:

https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/


Thanks!

I have changed max_standby_streaming_delay from -1 to 120s. However, I can still see the replication being delayed for up to 3 minutes quite frequent.

I will try to see if I can find what kind of query is causing that... Maybe via pg_stat_database_conflicts?
Attachment

Re: PostgreSQL 9.2 high replication lag

From
Lucas
Date:
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, July 15th, 2021 at 10:34 PM, Lucas <root@sud0.nz> wrote:


‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Wednesday, July 14th, 2021 at 5:30 PM, Laurenz Albe laurenz.albe@cybertec.at wrote:

On Tue, 2021-07-13 at 20:14 -0700, David G. Johnston wrote:

On Tue, Jul 13, 2021 at 8:01 PM Lucas root@sud0.nz wrote:

According to the documentation, max_standby_streaming_delay is a configuration parameter

determining how long a standby server should wait before canceling queries that conflict

with pending WAL entries received via streaming replication.

My question then is: Which queries, if the slave can only receive SELECTs?

The select queries...

Follow the link in the docs for max_stanbdy_steaming_delay to:

https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT

for the details.

Perhaps you might find the following article interesting, where I tried to

discuss this topic in some depth:

https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/


Thanks!

I have changed max_standby_streaming_delay from -1 to 120s. However, I can still see the replication being delayed for up to 3 minutes quite frequent.

I will try to see if I can find what kind of query is causing that... Maybe via pg_stat_database_conflicts?

Screen Shot 2021-07-16 at 10.32.07 AM.png

At 10pm (red mark) yesterday I changed the max_standby_streaming_delay = -1 to max_standby_streaming_delay = 120s.

However, it hasn't changed and the replication lag is still high all the time.

I'm deploying a new slave server but it will be in the same AZ as its master. And will monitor there and replication lag.

Please, let me know if you guys have any other suggestions.
Attachment

Re: PostgreSQL 9.2 high replication lag

From
Luca Ferrari
Date:
On Fri, Jul 16, 2021 at 12:38 AM Lucas <root@sud0.nz> wrote:
> However, it hasn't changed and the replication lag is still high all the time.
>

Since the documentation states that max_standby_streaming_delay
"[...]it is the maximum total time allowed to apply WAL data once it
has been received from the primary server[...]", could it be the
trivial explaination that the master is serving WALs with an high
delay?

Luca



Re: PostgreSQL 9.2 high replication lag

From
Ninad Shah
Date:
Version 9.2 is very old and has a lot of issues related to streaming replication. Additionally, it is already EOL.

Furthermore, max_standby_streaming_delay has no relation with the streaming lag. It's associated with queries conflicting with data. This parameter will not speed up the WAL apply operation.

Kindly consider upgrading to a supported version.


Regards,
Ninad Shah

On Fri, 23 Jul 2021 at 15:18, Luca Ferrari <fluca1978@gmail.com> wrote:
On Fri, Jul 16, 2021 at 12:38 AM Lucas <root@sud0.nz> wrote:
> However, it hasn't changed and the replication lag is still high all the time.
>

Since the documentation states that max_standby_streaming_delay
"[...]it is the maximum total time allowed to apply WAL data once it
has been received from the primary server[...]", could it be the
trivial explaination that the master is serving WALs with an high
delay?

Luca


Re: PostgreSQL 9.2 high replication lag

From
Lucas
Date:


Version 9.2 is very old and has a lot of issues related to streaming replication. Additionally, it is already EOL.

Agreed.


Furthermore, max_standby_streaming_delay has no relation with the streaming lag. It's associated with queries conflicting with data. This parameter will not speed up the WAL apply operation.

Yes, I know. But this parameter determines how long the standby server should wait before canceling standby queries that conflict with about-to-be-applied WAL entries. So, there could be lots of queries conflicting with about-to-be-applied WAL entries.
Again, I was not sure about what is causing the delay so I needed to try any possibility.

Kindly consider upgrading to a supported version.

Yes. We have a project in progress to upgrade.

On top of that, I can say that I have restarted the slave and the replication is now normal.
image.png

You can see above that the replication delay is not passing 3 minutes, which is the max_standby_streaming_delay, but the actual average is 5 seconds. So, looks like rebooting the EC2 instance resolved the issue, but I do not know why.
Attachment

Re: PostgreSQL 9.2 high replication lag

From
Lucas
Date:
Hello guys.

After setting max_standby_streaming_delay to 120s it got a lot better. But the replication delay is still happening quite often, except this time goes up to 120s only.
image.png

I know that max_standby_streaming_delay stands for how long the standby server should wait before canceling standby queries that conflict with about-to-be-applied WAL entries, but this is causing some headache and I was wondering if you guys have other suggestions?

What I could do next? Do you think by upgrading PostgreSQL from 9.2 to 13 will somehow improve that?

---
Regards,

Lucas

This message is encrypted. Both the Public Key and the GPG encrypted message are included in this email so that you can verify its origin.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Sunday, July 25th, 2021 at 8:33 PM, Lucas <root@sud0.nz> wrote:



Version 9.2 is very old and has a lot of issues related to streaming replication. Additionally, it is already EOL.

Agreed.


Furthermore, max_standby_streaming_delay has no relation with the streaming lag. It's associated with queries conflicting with data. This parameter will not speed up the WAL apply operation.

Yes, I know. But this parameter determines how long the standby server should wait before canceling standby queries that conflict with about-to-be-applied WAL entries. So, there could be lots of queries conflicting with about-to-be-applied WAL entries.
Again, I was not sure about what is causing the delay so I needed to try any possibility.

Kindly consider upgrading to a supported version.

Yes. We have a project in progress to upgrade.

On top of that, I can say that I have restarted the slave and the replication is now normal.
image.png

You can see above that the replication delay is not passing 3 minutes, which is the max_standby_streaming_delay, but the actual average is 5 seconds. So, looks like rebooting the EC2 instance resolved the issue, but I do not know why.

Attachment

Re: PostgreSQL 9.2 high replication lag

From
Ben Madin
Date:
I realise you may have other complications, but we have upgraded numerous v9.6 servers (on AWS) to v12 in the last year, with basically no problems in the upgrade, and many less problems with streaming. It might take less time to complete the upgrade??

cheers

Ben


On Wed, 14 Jul 2021 at 09:10, Lucas <root@sud0.nz> wrote:
Hello all,

I have a cluster of PostgreSQL 9.2 servers (yes, I know... we're working on a migration to PG 13) that has one master and one slave.
image.png

I know that the streaming replication is expected to have delays, but this is getting worse and worse everyday. You can see the replication gap is reaching >10 minutes quite often.

The servers are hosted in AWS as EC2 instances and they're both in different AZs. I also know that the streaming replication was first introduced in PG 9.0, so we're using a very old version of it.. I'm sure it has improved over the years...
What can I look for to make this lag not so high? Is there anything I could change in my postgresql.conf files? Any tips?

Thanks!

---
Regards,

Lucas

This message is encrypted. Both the Public Key and the GPG encrypted message are included in this email so that you can verify its origin.




--

Ausvet Logo

Dr Ben Madin
 
BVMS MVPHMgmt PhD MANZCVS GAICD
Managing Director
Skype: benmadin
Address: 
5 Shuffrey Street
Fremantle, WA 6160
Australia
Attachment

Re: PostgreSQL 9.2 high replication lag

From
Laurenz Albe
Date:
On Fri, 2021-08-20 at 01:33 +0000, Lucas wrote:
> After setting max_standby_streaming_delay to 120s it got a lot better.
>  But the replication delay is still happening quite often, except this time goes up to 120s only.

That's exactly what this parameter should do.

If you don't want the delays, either reduce the value (and get more canceled queries)
or try to reduce the number of conflicts, for example by setting "hot_standby_feedback = on".

Note that you will never be able to completely get rid of replication colflicts;
for example, there are buffer pin conflicts or lock conflicts caused by autovacuum
truncation.

See this article for more:
https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/

If you want a standby that has no apply delays and no canceled queries is usually
not possible.  Consider using two standby servers for these two purposes.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: PostgreSQL 9.2 high replication lag

From
Lucas
Date:
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 20th, 2021 at 4:56 PM, Ben Madin <ben@ausvet.com.au> wrote:

I realise you may have other complications, but we have upgraded numerous v9.6 servers (on AWS) to v12 in the last year, with basically no problems in the upgrade, and many less problems with streaming. It might take less time to complete the upgrade??


Hi Ben. Thanks for your reply.

I have a Bucardo replication between my EC2 instance PG 9.2 database and RDS running PG 13. We have completed a couple of tests already (including a stress/load test). But the business thinks is too risky to move all customers to RDS at once. They wanted to do it gradually which complicates my life as I would need a bidirectional replication in place.

So, I'm still trying to convince them to migrate all customers at once.... :(

Lucas
Attachment

Re: PostgreSQL 9.2 high replication lag

From
Lucas
Date:

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐


On Friday, August 20th, 2021 at 5:29 PM, Laurenz Albe laurenz.albe@cybertec.at wrote:

On Fri, 2021-08-20 at 01:33 +0000, Lucas wrote:

After setting max_standby_streaming_delay to 120s it got a lot better.

But the replication delay is still happening quite often, except this time goes up to 120s only.

That's exactly what this parameter should do.

If you don't want the delays, either reduce the value (and get more canceled queries)

or try to reduce the number of conflicts, for example by setting "hot_standby_feedback = on".


Yes, I already have the hot_standby_feedback = on set to on on all slaves.

Note that you will never be able to completely get rid of replication colflicts;

for example, there are buffer pin conflicts or lock conflicts caused by autovacuum

truncation.

See this article for more:

https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/

If you want a standby that has no apply delays and no canceled queries is usually

not possible. Consider using two standby servers for these two purposes.


Thanks for sharing this. I feel relief a bit to know that replication conflicts will always "be there". Since I started this email thread, we have deployed a couple of extra slaves to share the load between them. This has helped a lot with the replication delay, but it is still there... 

I think I'll end up lowering max_standby_streaming_delay and dealing with conflits when they happen. Let me ask you; Is there a way to know what kind of conflicts are being responsible for the replication delay? How could I check this?

Thanks
Lucas
Attachment