Thread: BUG #7801: Streaming failover checkpoints much slower than master, pg_xlog space problems during db load

The following bug has been logged on the website:

Bug reference:      7801
Logged by:          Brian Krznarich
Email address:      briank@openroadtech.com
PostgreSQL version: 9.2.2
Operating system:   RHEL6
Description:


I was told to resubmit this; the original submission (7777) may have been
lost.

Simply stated, pg_xlog grows out of control on a streaming-replication
backup server with a high volume of writes on the master server. This occurs
only with checkpoint_completion_target>0 and very large (e.g. 8GB)
shared_buffers. pg_xlog on the master stays a fixed size (1.2G for me).


Detail: I have two new/fast servers using streaming replication, with 9.2.2
from the PostgreSQL yum repository. The servers are connected via 10G
network, so the failover receives data as fast as it is created.

It appears that to trigger this problem you need large shared_buffers (mine
is set to 8GB) and checkpoint_completion_target > 0 (0.8 or 0.9 will do). I
have checkpoint_timeout=5min (though setting this to 30s *does not*
alleviate the problem).  The failover machine doesn't seem to be concerned
with checkpoint_timeout or checkpoint_segments.

When doing large writes (like pg_restore) the failover machine's pg_xlog
does not clear files. The reason appears to be that it is not
checkpointing (verified by log_checkpoints=on and examining the log).  If I
run 'checkpoint;' manually on the failover in psql (hot_standby=on), it
immediately cleans up pg_xlog (after a pause to actually perform the
checkpoint).
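
For reference, the manual checkpoint is nothing fancier than something like
this from a shell on the failover (adjust the connection options for your
environment):

psql -U postgres -d postgres -c "CHECKPOINT;"

and pg_xlog shrinks back to a sane size once it completes.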

To generate the data, you can just run

pgbench -i -s1000

on the master server, then watch pg_xlog grow on the failover.

This produces ~10GB of data, then creates indices.  If I run du on the
failover after the data is loaded, I see:


# du -s base pg_xlog | sort -n
10928276        pg_xlog
13144536        base
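
If it helps anyone reproduce this, I watch the growth with a trivial loop on
the failover while the pgbench load runs on the master (the data directory
path below is the one used by the PGDG RHEL6 packages; adjust for your
install):

while true; do du -sm /var/lib/pgsql/9.2/data/pg_xlog; sleep 10; done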

I have a 10GB pg_xlog.  The load completes in less than 5 minutes, which may
be relevant here, since eventually the failover will perform a checkpoint.

On the primary server, pg_xlog caps at 1.2GB.


If checkpoint_completion_target=0, or shared_buffers=256MB, pg_xlog grows to
around 1GB and stays there.

This guess may be off target, but it appears that the database only
checkpoints approximately every checkpoint_completion_target *
checkpoint_timeout seconds.  So, for a 0.9 completion target and the default
5min timeout, the failover will go for 4+ minutes without executing a single
checkpoint, regardless of how much data arrives (verified with
log_checkpoints=on).
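
For concreteness, if that guess is right, my settings work out to roughly:

0.9 (checkpoint_completion_target) * 300s (checkpoint_timeout) = 270s between restartpoints

and the entire ~10GB load arrives in under 300s, so essentially all of it can
pile up in pg_xlog before a restartpoint gets a chance to recycle anything.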


All the while the master server is spitting out 'checkpoints are occurring
too frequently' messages, which I assume is to be expected during a reload.

This means that the failover machine needs 2x the database size in free disk
space just to survive a database reload.  We run VMs with limited disk
allocation where this is not the case.  We periodically need to dump/restore,
and I'm concerned this simple operation will crash our failover machines.

I can avoid the problem by setting checkpoint_completion_target=0, which I
will do for now.

Here are settings that differ from the default initdb install:

shared_buffers = 8GB
work_mem = 64MB
maintenance_work_mem = 256MB
max_stack_depth = 8MB
wal_level = hot_standby
synchronous_commit = off
wal_buffers = 64MB
checkpoint_segments = 10
checkpoint_completion_target = 0
max_wal_senders = 2
hot_standby = on
effective_cache_size = 8GB
log_checkpoints = on


On 8 January 2013 19:24, <briank@openroadtech.com> wrote:

> Simply stated, pg_xlog grows out of control on a streaming-replication
> backup server with a high volume of writes on the master server. This occurs
> only with checkpoint_completion_target>0 and very large (eg. 8GB)
> shared_buffers. pg_xlog on the master stays a fixed size (1.2G for me).

All of this appears to be working as designed.

It will issue a restartpoint every checkpoint_timeout seconds on the standby.

checkpoint_segments is ignored on standby.

> Detail: I have two new/fast servers using streaming replication, with 9.2.2
> from the PostgreSQL yum repository. The servers are connected via 10G
> network, so the failover receives data as fast as it is created.
>
> It appears that to trigger this problem you need large shared_buffers (mine
> is set to 8GB) and checkpoint_completion_target > 0. (.8 or .9 will do). I
> have checkpoint_timeout=5min. (though setting this to 30s *does not*
> alleviate the problem).  The failover machine doesn't seem to be concerned
> with checkpoint_timeout or checkpoint_segments.
>
> When doing large writes (like pg_restore) the failover machine's pg_xlog
> does not clear files. The reason appears to be that it is not
> checkpointing(verified by log_checkpoints=on and examining the log).  If I
> run 'checkpoint;' manually on the failover in psql(hot_standby=on), it
> immediately cleans up pg_xlog(after a pause to actually perform the
> checkpoint).
>
> To generate the data,you can just run
> pgbench -i -s1000
> on the master server, then watch pg_xlog grow on the failover.
>
> This produces a ~10GB of data, then creates indices.  If I run du on the
> failover after the data is loaded, I see:
>
> # du -s base pg_xlog | sort -n
> 10928276        pg_xlog
> 13144536        base
>
> I have a 10GB pg_xlog.  The load completes in less than 5 minutes, which may
> be relevant here, since eventually the failover will perform a checkpoint.
> On the primary server, pg_xlog caps at 1.2GB.
>
> If checkpoint_completion_target=0, or shared_buffers=256MB, pg_xlog grows to
> around 1GB and stays there.
>
> This guess may be off target, but it appears that the database only
> checkpoints every approximately checkpoint_completion_target*
> checkpoint_timeout seconds.  So, for a .9 completion target and the default
> 5min timeout, the failover will go for 4+ minutes without executing a single
> checkpoint, regardless of how much data arrives (verified with
> log_checkpoints=on).
>
> All the while the master server is spitting out 'checkpoints are occurring
> too frequently' messages, which I assume is to be expected during a reload.
>
> This means that the failover machine needs 2x the database size to not crash
> during a database reload.  We run VMs with limited disk allocation where
> this is not the case.  We periodically need to dump/restore, and I'm
> concerned this simple operation will crash our failover machines.
>
> I can avoid the problem by setting checkpoint_completion_target=0, which I
> will do for now.
>
> Here are settings that differ from the default initdb install:
>
> shared_buffers = 8GB
> work_mem = 64MB
> maintenance_work_mem = 256MB
> max_stack_depth = 8MB
> wal_level = hot_standby
> synchronous_commit = off
> wal_buffers = 64MB
> checkpoint_segments = 10
> checkpoint_completion_target = 0
> max_wal_senders = 2
> hot_standby = on
> effective_cache_size = 8GB
> log_checkpoints = on


--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

On 1/8/2013 2:48 PM, Simon Riggs wrote:
> On 8 January 2013 19:24,  <briank@openroadtech.com> wrote:
>
>> Simply stated, pg_xlog grows out of control on a streaming-replication
>> backup server with a high volume of writes on the master server. This occurs
>> only with checkpoint_completion_target>0 and very large (eg. 8GB)
>> shared_buffers. pg_xlog on the master stays a fixed size (1.2G for me).
> All of this appears to be working as designed.
>
> It will issue a restartpoint every checkpoint_timeout seconds on the standby.
>
> checkpoint_segments is ignored on standby.

The documentation does not seem to agree with the last point:

"In standby mode, a restartpoint is also triggered if
checkpoint_segments log segments have been replayed since last
restartpoint and at least one checkpoint record has been replayed."

This is precisely the problem.  The failover should not go
checkpoint_timeout * checkpoint_completion_target seconds without
executing a restartpoint while thousands of WAL segments are stacking
up in pg_xlog.
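
If I read that rule correctly, my settings should keep the backlog small: with
checkpoint_segments = 10 and 16MB WAL segments, a restartpoint should be due
after about 10 * 16MB = 160MB of replayed WAL, yet pg_xlog on the failover
reaches roughly 10GB before anything is cleaned up.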

With checkpoint_completion_target=0, the standby server will happily
execute restartpoints much faster than checkpoint_timeout if it is
necessary.  Once checkpoint_completion_target > 0, no attention is paid
to the backlog of WAL data.

I honestly do not know PostgreSQL well enough to understand why
large vs. small shared_buffers changes this behavior, but it does.  If
shared_buffers is not extremely large, it seems PostgreSQL is forced to
execute restartpoints more frequently?

In general it seems like it should be safe to use the same
postgresql.conf on the master and the standby server, but this would
clearly be an exception.  One wouldn't expect a 10GB pg_xlog on a
standby where the master has no such problem.
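
If the answer really is "use different settings on the standby", one way to
keep a single shared postgresql.conf might be an include at the end of the
failover's copy (an untested sketch; 'standby.conf' is just a name I made up,
and the include has to come last so its settings win):

# appended to postgresql.conf on the failover only
include 'standby.conf'

# standby.conf
checkpoint_completion_target = 0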

Thank you for your assistance.

Brian