Thread: Replication Lag Causes

Replication Lag Causes

From

Mike Wilson

Date:

01 November 2014, 22:33:21

I have two 9.3.4 PG instances that back a large internet website that has very seasonal traffic and can generate large query loads. My instances are in a master-slave streaming replication setup and are stable and in general perform very well. The only issues we have with the boxes is that when the master is busy the slave may start to lag excessively. I can give specifics as to what heavily loaded means and additionally the postgresql.conf for both boxes but my basic questions are:

* What causes streaming replication lag to increase?

* What parameters can be tuned to reduce streaming replication lag?

* Can a loaded slave affect lag adversely?

* Can increasing max_wal_senders help reduce lag?

The reason I ask this is that as mentioned above the servers are stable and are real troopers in general as they back a very popular web site that puts the master under heavy seasonal load at times. At those times though we see an almost exponential growth in streaming replication lag compared to load on the master.

For example, the master is a very beefy Solaris:

* 4 Recent Intel Zeons (16 physical cores)

* 256 GB of ECC RAM

* 12 TB of ZFS (spindle and SSD internal storage)

* DB on disk size is 2TB

* ZFS ARC cache of roughly 250G.

* ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:

shared_buffers = 2GB

work_mem = 128MB

max_connections = 1700 (supports roughly 100 web servers)

wal_keep_segments = 256 (roughly enough for 24 hours of operation under heavy load)

wal_sender_timeout = 60s

replication_timeout=(not set)

wal_receiver_status_interval=10s

max_wal_senders=6

* wal archiving is off

* 98% of the queries on the master complete in under 500ms.

* No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which replication lag to the slave is in hundreds milliseconds. When the production db server is heavily hit though the load may go as high as 4 on the master and the streaming replication lag may increase to more than 2 hours relatively quickly. Load on the slave is generally below 1 even when the master is heavily loaded. The traffic to the master is primarily read with about 10% DML (new users, purchase records, etc). DML statements increase proportionally when under load though. The master and slave are connected via dedicated 10G fiber link and even under heavy load the utilization of the link is nowhere near close to saturation. BTW, the slave does run some reported related queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so badly under load and if there is anything that we could do about it. I’ve been wondering particularly if we should up the max_wal_senders but from the docs it is unclear if that would help. In my testing with pg_bench on our dev boxes which were the previous production hardware for these servers I have determined that it doesn’t take much DML load on the master to get the slave to start lagging severely. I was wondering if this was expected and/or some design consideration? Possibly streaming replication isn’t meant to be used for heavily hit databases and maintain small lag times? I would like to believe that the fault is something we have done though and that there is some parameter we could tune to reduce this lag.

Any recommendations would be very helpful.

Mike Wilson

Predicate Logic Consulting

Re: Replication Lag Causes

From

Valentine Gogichashvili

Date:

01 November 2014, 23:14:51

Hello Mike,

what kind of load does the slave get?

what does the recovery process do on the slave during the times when lag is being observed? Does it use 100% of the CPU?

WAL can be replayed by only one process, so no need to increase the max_wal_senders.

Cheers,

-- Valentine Gogichashvili

On Sun, Nov 2, 2014 at 1:33 AM, Mike Wilson <mfwilson@gmail.com> wrote:

I have two 9.3.4 PG instances that back a large internet website that has very seasonal traffic and can generate large query loads. My instances are in a master-slave streaming replication setup and are stable and in general perform very well. The only issues we have with the boxes is that when the master is busy the slave may start to lag excessively. I can give specifics as to what heavily loaded means and additionally the postgresql.conf for both boxes but my basic questions are:
* What causes streaming replication lag to increase?
* What parameters can be tuned to reduce streaming replication lag?
* Can a loaded slave affect lag adversely?
* Can increasing max_wal_senders help reduce lag?

The reason I ask this is that as mentioned above the servers are stable and are real troopers in general as they back a very popular web site that puts the master under heavy seasonal load at times. At those times though we see an almost exponential growth in streaming replication lag compared to load on the master.

For example, the master is a very beefy Solaris:
* 4 Recent Intel Zeons (16 physical cores)
* 256 GB of ECC RAM
* 12 TB of ZFS (spindle and SSD internal storage)
* DB on disk size is 2TB
* ZFS ARC cache of roughly 250G.
* ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:
shared_buffers = 2GB
work_mem = 128MB
max_connections = 1700 (supports roughly 100 web servers)
wal_keep_segments = 256 (roughly enough for 24 hours of operation under heavy load)
wal_sender_timeout = 60s
replication_timeout=(not set)
wal_receiver_status_interval=10s
max_wal_senders=6
* wal archiving is off
* 98% of the queries on the master complete in under 500ms.
* No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which replication lag to the slave is in hundreds milliseconds. When the production db server is heavily hit though the load may go as high as 4 on the master and the streaming replication lag may increase to more than 2 hours relatively quickly. Load on the slave is generally below 1 even when the master is heavily loaded. The traffic to the master is primarily read with about 10% DML (new users, purchase records, etc). DML statements increase proportionally when under load though. The master and slave are connected via dedicated 10G fiber link and even under heavy load the utilization of the link is nowhere near close to saturation. BTW, the slave does run some reported related queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so badly under load and if there is anything that we could do about it. I’ve been wondering particularly if we should up the max_wal_senders but from the docs it is unclear if that would help. In my testing with pg_bench on our dev boxes which were the previous production hardware for these servers I have determined that it doesn’t take much DML load on the master to get the slave to start lagging severely. I was wondering if this was expected and/or some design consideration? Possibly streaming replication isn’t meant to be used for heavily hit databases and maintain small lag times? I would like to believe that the fault is something we have done though and that there is some parameter we could tune to reduce this lag.

Any recommendations would be very helpful.

Mike Wilson
Predicate Logic Consulting

Re: Replication Lag Causes

From

Mike Wilson

Date:

02 November 2014, 20:58:46

Load on the slave is relatively light. It averages about 1.0 due to some data ware house select queries running against it frequently. Previously only the load on the master seems to have affected our replication lag no matter what the slave was doing.

In thinking about this a bit more, the load on the master does cause increasing lag but only if the query mix begins to change to more DML than SELECTS. Basically, the amount of DML is what really appears to cause the replication to lag. This is an OLTP system backing a rather heavy commercial website where memberships are sold and when the purchase traffic increases that is when we start to see extreme lag develop on the slave.

CPU utilization on the slave during extreme lag is similar to normal operation even if the slave is lagging more than usual.

Thanks for the info on max_wal_senders. That’s good to know.

Mike Wilson

On Nov 1, 2014, at 4:14 PM, Valentine Gogichashvili <valgog@gmail.com> wrote:

Hello Mike,

what kind of load does the slave get?

what does the recovery process do on the slave during the times when lag is being observed? Does it use 100% of the CPU?

WAL can be replayed by only one process, so no need to increase the max_wal_senders.

Cheers,

-- Valentine Gogichashvili

On Sun, Nov 2, 2014 at 1:33 AM, Mike Wilson <mfwilson@gmail.com> wrote:
I have two 9.3.4 PG instances that back a large internet website that has very seasonal traffic and can generate large query loads. My instances are in a master-slave streaming replication setup and are stable and in general perform very well. The only issues we have with the boxes is that when the master is busy the slave may start to lag excessively. I can give specifics as to what heavily loaded means and additionally the postgresql.conf for both boxes but my basic questions are:
* What causes streaming replication lag to increase?
* What parameters can be tuned to reduce streaming replication lag?
* Can a loaded slave affect lag adversely?
* Can increasing max_wal_senders help reduce lag?

The reason I ask this is that as mentioned above the servers are stable and are real troopers in general as they back a very popular web site that puts the master under heavy seasonal load at times. At those times though we see an almost exponential growth in streaming replication lag compared to load on the master.

For example, the master is a very beefy Solaris:
* 4 Recent Intel Zeons (16 physical cores)
* 256 GB of ECC RAM
* 12 TB of ZFS (spindle and SSD internal storage)
* DB on disk size is 2TB
* ZFS ARC cache of roughly 250G.
* ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:
shared_buffers = 2GB
work_mem = 128MB
max_connections = 1700 (supports roughly 100 web servers)
wal_keep_segments = 256 (roughly enough for 24 hours of operation under heavy load)
wal_sender_timeout = 60s
replication_timeout=(not set)
wal_receiver_status_interval=10s
max_wal_senders=6
* wal archiving is off
* 98% of the queries on the master complete in under 500ms.
* No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which replication lag to the slave is in hundreds milliseconds. When the production db server is heavily hit though the load may go as high as 4 on the master and the streaming replication lag may increase to more than 2 hours relatively quickly. Load on the slave is generally below 1 even when the master is heavily loaded. The traffic to the master is primarily read with about 10% DML (new users, purchase records, etc). DML statements increase proportionally when under load though. The master and slave are connected via dedicated 10G fiber link and even under heavy load the utilization of the link is nowhere near close to saturation. BTW, the slave does run some reported related queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so badly under load and if there is anything that we could do about it. I’ve been wondering particularly if we should up the max_wal_senders but from the docs it is unclear if that would help. In my testing with pg_bench on our dev boxes which were the previous production hardware for these servers I have determined that it doesn’t take much DML load on the master to get the slave to start lagging severely. I was wondering if this was expected and/or some design consideration? Possibly streaming replication isn’t meant to be used for heavily hit databases and maintain small lag times? I would like to believe that the fault is something we have done though and that there is some parameter we could tune to reduce this lag.

Any recommendations would be very helpful.

Mike Wilson
Predicate Logic Consulting

Re: Replication Lag Causes

From

Mike Wilson

Date:

02 November 2014, 21:16:40

Thanks for the information Greg.

Unfortunately modifying the application stack this close to the holiday season won’t be an option so I’m left with:

1) Trying to optimize the settings I have for the query mix I have.

2) Optimize any long running DML queries (if any) to prevent lag due to locks.

3) Getting a better understanding of “what” causes lag.

#3 will probably be central to at least minimizing lag during heavy DML load. If anyone has a good resource to describe when a slave would start to lag potentially that would help me hunt for the cause. I know long running DML on the master may cause lag but I’m uncertain as to the specifics of why. During periods of lag we do have more DML than usual running against the master but the queries themselves are very quick although there might be 20-30 DML operations per second against some of our central tables that store user account information. Even under heavy DML the queries still return in under a second. Possibly a large volume of of short running DML cause replication lag issues for large tables (~20M)?

Thanks again for your help. BDR looks interesting but probably too cutting edge for my client.

Mike Wilson

On Nov 2, 2014, at 12:33 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:

Hi Mike,

Sounds very familiar. Our master fans out to 16 slaves (cascading) and we had great success with segregating database queries to different slaves and some based on network latency. I'd suggest, if possible, alter the application to use the slave for simple SELECT's and FUNCTION's performing SELECT-like only work while limiting those applications and queries that perform DML to the master (obviously). If the load on the slave increases too much, spin up another slave. I'd mention from experience that it could be the load on the slave that is giving the appearance of replication lag. This is what led us to having (1) slave per application.

There is also the BDR multi-master available in 9.4beta if you're wanting to live on the edge.

-Greg

On Sat, Nov 1, 2014 at 4:33 PM, Mike Wilson <mfwilson@gmail.com> wrote:
I have two 9.3.4 PG instances that back a large internet website that has very seasonal traffic and can generate large query loads. My instances are in a master-slave streaming replication setup and are stable and in general perform very well. The only issues we have with the boxes is that when the master is busy the slave may start to lag excessively. I can give specifics as to what heavily loaded means and additionally the postgresql.conf for both boxes but my basic questions are:
* What causes streaming replication lag to increase?
* What parameters can be tuned to reduce streaming replication lag?
* Can a loaded slave affect lag adversely?
* Can increasing max_wal_senders help reduce lag?

The reason I ask this is that as mentioned above the servers are stable and are real troopers in general as they back a very popular web site that puts the master under heavy seasonal load at times. At those times though we see an almost exponential growth in streaming replication lag compared to load on the master.

For example, the master is a very beefy Solaris:
* 4 Recent Intel Zeons (16 physical cores)
* 256 GB of ECC RAM
* 12 TB of ZFS (spindle and SSD internal storage)
* DB on disk size is 2TB
* ZFS ARC cache of roughly 250G.
* ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:
shared_buffers = 2GB
work_mem = 128MB
max_connections = 1700 (supports roughly 100 web servers)
wal_keep_segments = 256 (roughly enough for 24 hours of operation under heavy load)
wal_sender_timeout = 60s
replication_timeout=(not set)
wal_receiver_status_interval=10s
max_wal_senders=6
* wal archiving is off
* 98% of the queries on the master complete in under 500ms.
* No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which replication lag to the slave is in hundreds milliseconds. When the production db server is heavily hit though the load may go as high as 4 on the master and the streaming replication lag may increase to more than 2 hours relatively quickly. Load on the slave is generally below 1 even when the master is heavily loaded. The traffic to the master is primarily read with about 10% DML (new users, purchase records, etc). DML statements increase proportionally when under load though. The master and slave are connected via dedicated 10G fiber link and even under heavy load the utilization of the link is nowhere near close to saturation. BTW, the slave does run some reported related queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so badly under load and if there is anything that we could do about it. I’ve been wondering particularly if we should up the max_wal_senders but from the docs it is unclear if that would help. In my testing with pg_bench on our dev boxes which were the previous production hardware for these servers I have determined that it doesn’t take much DML load on the master to get the slave to start lagging severely. I was wondering if this was expected and/or some design consideration? Possibly streaming replication isn’t meant to be used for heavily hit databases and maintain small lag times? I would like to believe that the fault is something we have done though and that there is some parameter we could tune to reduce this lag.

Any recommendations would be very helpful.

Mike Wilson
Predicate Logic Consulting

Re: Replication Lag Causes

From

Felipe Santos

Date:

03 November 2014, 14:34:51

2014-11-02 19:16 GMT-02:00 Mike Wilson <mfwilson@gmail.com>:

Thanks for the information Greg.

Unfortunately modifying the application stack this close to the holiday season won’t be an option so I’m left with:
1) Trying to optimize the settings I have for the query mix I have.
2) Optimize any long running DML queries (if any) to prevent lag due to locks.
3) Getting a better understanding of “what” causes lag.

#3 will probably be central to at least minimizing lag during heavy DML load. If anyone has a good resource to describe when a slave would start to lag potentially that would help me hunt for the cause. I know long running DML on the master may cause lag but I’m uncertain as to the specifics of why. During periods of lag we do have more DML than usual running against the master but the queries themselves are very quick although there might be 20-30 DML operations per second against some of our central tables that store user account information. Even under heavy DML the queries still return in under a second. Possibly a large volume of of short running DML cause replication lag issues for large tables (~20M)?

Thanks again for your help. BDR looks interesting but probably too cutting edge for my client.

Mike Wilson

On Nov 2, 2014, at 12:33 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:

Hi Mike,

Sounds very familiar. Our master fans out to 16 slaves (cascading) and we had great success with segregating database queries to different slaves and some based on network latency. I'd suggest, if possible, alter the application to use the slave for simple SELECT's and FUNCTION's performing SELECT-like only work while limiting those applications and queries that perform DML to the master (obviously). If the load on the slave increases too much, spin up another slave. I'd mention from experience that it could be the load on the slave that is giving the appearance of replication lag. This is what led us to having (1) slave per application.

There is also the BDR multi-master available in 9.4beta if you're wanting to live on the edge.

-Greg

On Sat, Nov 1, 2014 at 4:33 PM, Mike Wilson <mfwilson@gmail.com> wrote:
I have two 9.3.4 PG instances that back a large internet website that has very seasonal traffic and can generate large query loads. My instances are in a master-slave streaming replication setup and are stable and in general perform very well. The only issues we have with the boxes is that when the master is busy the slave may start to lag excessively. I can give specifics as to what heavily loaded means and additionally the postgresql.conf for both boxes but my basic questions are:
* What causes streaming replication lag to increase?
* What parameters can be tuned to reduce streaming replication lag?
* Can a loaded slave affect lag adversely?
* Can increasing max_wal_senders help reduce lag?

The reason I ask this is that as mentioned above the servers are stable and are real troopers in general as they back a very popular web site that puts the master under heavy seasonal load at times. At those times though we see an almost exponential growth in streaming replication lag compared to load on the master.

For example, the master is a very beefy Solaris:
* 4 Recent Intel Zeons (16 physical cores)
* 256 GB of ECC RAM
* 12 TB of ZFS (spindle and SSD internal storage)
* DB on disk size is 2TB
* ZFS ARC cache of roughly 250G.
* ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:
shared_buffers = 2GB
work_mem = 128MB
max_connections = 1700 (supports roughly 100 web servers)
wal_keep_segments = 256 (roughly enough for 24 hours of operation under heavy load)
wal_sender_timeout = 60s
replication_timeout=(not set)
wal_receiver_status_interval=10s
max_wal_senders=6
* wal archiving is off
* 98% of the queries on the master complete in under 500ms.
* No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which replication lag to the slave is in hundreds milliseconds. When the production db server is heavily hit though the load may go as high as 4 on the master and the streaming replication lag may increase to more than 2 hours relatively quickly. Load on the slave is generally below 1 even when the master is heavily loaded. The traffic to the master is primarily read with about 10% DML (new users, purchase records, etc). DML statements increase proportionally when under load though. The master and slave are connected via dedicated 10G fiber link and even under heavy load the utilization of the link is nowhere near close to saturation. BTW, the slave does run some reported related queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so badly under load and if there is anything that we could do about it. I’ve been wondering particularly if we should up the max_wal_senders but from the docs it is unclear if that would help. In my testing with pg_bench on our dev boxes which were the previous production hardware for these servers I have determined that it doesn’t take much DML load on the master to get the slave to start lagging severely. I was wondering if this was expected and/or some design consideration? Possibly streaming replication isn’t meant to be used for heavily hit databases and maintain small lag times? I would like to believe that the fault is something we have done though and that there is some parameter we could tune to reduce this lag.

Any recommendations would be very helpful.

Mike Wilson
Predicate Logic Consulting

Hi Mike,

"Basically, the amount of DML is what really appears to cause the replication to lag."

You said that you are using streaming replication, and streaming means reading the logfiles and applying changes on the slave. Since DML is also logged on the logfiles, maybe you are locking yourself by trying to read and write to the same file(s).

You can try changing your replication type from streaming to Archive-WAL, that way you will send archived WALs from the master to the slave, while letting the "hot" WAL available for your DML operations only.

I've changed my replication type in the past and had no significant increase in lag.

Cheers

Re: Replication Lag Causes

From

Stuart Bishop

Date:

04 November 2014, 12:34:30

On 2 November 2014 05:33, Mike Wilson <mfwilson@gmail.com> wrote:

> Any recommendations would be very helpful.

Try using ionice and renice to increase the priority of the WAL sender
process on the master. If it helps, you are lagging because not enough
resources are being used by the sender process (rather than the slave
having trouble, for example). Lowering the number of concurrent
connections in your pgbouncer connection pool could help here.


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/