Handling peak WAL data on the hot standby - Mailing list pgsql-admin

From Drazen Kacar
Subject Handling peak WAL data on the hot standby
Date
Msg-id CAFxrd3tYSyy91WcmmnmW2b=i7JeZgCS+OjYLqP9_DWNVp9B5+Q@mail.gmail.com
List pgsql-admin
Hi,

I have a Postgres 9.3.9 primary and a 9.3.9 hot standby that is used for
read-only queries. This is the WAL and checkpoint configuration (it's the
same on the primary and the standby):

wal_level = hot_standby
checkpoint_segments = 10
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9
wal_keep_segments = 50

Under normal circumstances (when there is no heavy write activity) there
are 72 WAL files in pg_xlog, amounting to about 1.2 GB. The pg_xlog
directory is on a separate partition on both systems, and each partition is
about 10 GB.
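
For reference, a quick way to count the WAL segments from SQL (a sketch;
needs superuser, pg_ls_dir lists the directory contents):

SELECT count(*)
FROM pg_ls_dir('pg_xlog') AS f
WHERE f ~ '^[0-9A-F]{24}$';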

WAL files from the primary are shipped to a backup server (by the
archive_command) and the standby is configured to use them via
restore_command in its recovery.conf. The backup server is long-term
storage for WAL files (which may be used for PITR) and there's enough disk
space there for pretty much anything. Also, there were no problems with the
network connection or anything else.
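
For illustration, the setup is roughly equivalent to the following (the
real commands are wrapper scripts and the user name here is assumed; the
host and paths are the ones that show up in the logs below):

# on the primary
archive_command = 'scp %p upbackup-db@xdcbackup-01:/home/upbackup-db/xdcprod-db-1/arch/%f'

# on the standby, in recovery.conf
restore_command = 'scp upbackup-db@xdcbackup-01:/home/upbackup-db/xdcprod-db-1/arch/%f %p'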

Replication settings on the standby are:

hot_standby = on
max_standby_archive_delay = 0s
max_standby_streaming_delay = 0s
hot_standby_feedback = on

This all works nicely when there are no huge write peaks. Queries on
the standby are cancelled from time to time, but that's not a problem.

Several days ago I had an application upgrade which caused huge write
activity. Some of it came from "alter table ... add column ... default
..." statements, which rewrote the large table(s). After that there was
heavy write activity from reshuffling the data from the old to the new
data model. All in all, a large number of WAL files were generated.
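
For example, statements of this shape (made-up table and column names)
force a full table rewrite on 9.3, and both the rewrite and the copied
rows end up in the WAL:

-- adding a column with a non-NULL default rewrites the whole table on 9.3
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new';

-- reshuffling the data into the new model copies the rows again
INSERT INTO orders_v2 (id, status, payload)
SELECT id, status, payload FROM orders;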

All that caused standby's pg_xlog partition to fill up. This is an excerpt
from its logs:

2015-10-05 17:40:01 UTC [SESS:5591ab92.1148] [DB: ]: LOG:  restartpoint starting: xlog

2015-10-05 17:46:19 UTC [SESS:55f1568e.4e15] [DB: ]: FATAL:  could not write to file "pg_xlog/xlogtemp.19989": No space left on device
Connecting to xdcbackup-01...
Couldn't stat remote file: No such file or directory
File "/home/upbackup-db/xdcprod-db-1/arch/00000001000003770000007D" not found.
2015-10-05 17:46:20 UTC [SESS:5612b76c.7fee] [DB: ]: LOG:  started streaming WAL from primary at 377/7C000000 on timeline 1
2015-10-05 17:46:20 UTC [SESS:5612b76c.7fee] [DB: ]: FATAL:  could not write to file "pg_xlog/xlogtemp.32750": No space left on device

... (this repeats several times) ...

Connecting to xdcbackup-01...
Couldn't write to "pg_xlog/RECOVERYXLOG": No space left on device
Fetching /home/upbackup-db/xdcprod-db-1/arch/00000001000003770000007D to pg_xlog/RECOVERYXLOG
2015-10-05 17:46:55 UTC [SESS:5612b78f.801a] [DB: ]: LOG:  started streaming WAL from primary at 377/7C000000 on timeline 1
2015-10-05 17:46:55 UTC [SESS:5612b78f.801a] [DB: ]: FATAL:  could not write to file "pg_xlog/xlogtemp.32794": No space left on device

Lines that are not prefixed with a date are from the restore_command script.

The FATAL errors from Postgres then kept repeating until:

2015-10-05 17:49:01 UTC [SESS:5591ab92.1148] [DB: ]: LOG:  restartpoint complete: wrote 47105 buffers (18.0%); 0 transaction log file(s) added, 149 removed, 21 recycled; write=539.164 s, sync=0.443 s, total=539.833 s; sync files=1011, longest=0.219 s, average=0.000 s
2015-10-05 17:49:01 UTC [SESS:5591ab92.1148] [DB: ]: LOG:  recovery restart point at 375/FC267970
2015-10-05 17:49:01 UTC [SESS:5591ab92.1148] [DB: ]: DETAIL:  last completed transaction was at log time 2015-10-05 19:45:38.753446+02
2015-10-05 17:49:01 UTC [SESS:5591ab92.1148] [DB: ]: LOG:  restartpoint starting: xlog

Connecting to xdcbackup-01...
Fetching /home/upbackup-db/xdcprod-db-1/arch/00000001000003770000007D to pg_xlog/RECOVERYXLOG
2015-10-05 17:49:04 UTC [SESS:5591ab91.1145] [DB: ]: LOG:  restored log file "00000001000003770000007D" from archive

And all was well after that.

I'd like to know how to prevent this kind of failure in the future.

The documentation
(http://www.postgresql.org/docs/9.3/static/wal-configuration.html) says:

In archive recovery or standby mode, the server periodically performs
restartpoints, which are similar to checkpoints in normal operation: the
server forces all its state to disk, updates the pg_control file to
indicate that the already-processed WAL data need not be scanned again,
and then recycles any old log segment files in the pg_xlog directory.
Restartpoints can't be performed more frequently than checkpoints in the
master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at
least checkpoint_timeout seconds have passed since the last restartpoint.
In standby mode, a restartpoint is also triggered if at least
checkpoint_segments log segments have been replayed since the last
restartpoint.
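
(As a side note, the last restartpoint the standby has performed can be
seen with pg_controldata; a sketch, with the data directory path assumed:

pg_controldata /var/lib/postgresql/9.3/main | grep -i checkpoint

On a standby the "Latest checkpoint" fields track the restartpoints.)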

These are checkpoint messages from the primary's log around that time:

2015-10-05 17:37:55 UTC [DB: [unknown]] LOG:  checkpoint starting: xlog
2015-10-05 17:42:36 UTC [DB: [unknown]] LOG:  checkpoint complete: wrote 18532 buffers (3.5%); 0 transaction log file(s) added, 0 removed, 12 recycled; write=279.716 s, sync=0.543
s, total=280.317 s; sync files=424, longest=0.082 s, average=0.001 s
2015-10-05 17:42:36 UTC [DB: [unknown]] LOG:  checkpoint starting: xlog
2015-10-05 17:42:41 UTC [DB: [unknown]] LOG:  checkpoint complete: wrote 27620 buffers (5.3%); 0 transaction log file(s) added, 0 removed, 20 recycled; write=1.497 s, sync=3.839 s,
total=5.475 s; sync files=694, longest=1.566 s, average=0.005 s
2015-10-05 17:42:41 UTC [DB: [unknown]] LOG:  checkpoints are occurring too frequently (5 seconds apart)
2015-10-05 17:42:41 UTC [DB: [unknown]] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2015-10-05 17:42:41 UTC [DB: [unknown]] LOG:  checkpoint starting: xlog
2015-10-05 17:42:46 UTC [DB: [unknown]] LOG:  checkpoint complete: wrote 40235 buffers (7.7%); 0 transaction log file(s) added, 0 removed, 12 recycled; write=1.545 s, sync=3.455 s,
total=5.116 s; sync files=23, longest=3.411 s, average=0.150 s
2015-10-05 17:42:46 UTC [DB: [unknown]] LOG:  checkpoints are occurring too frequently (5 seconds apart)
2015-10-05 17:42:46 UTC [DB: [unknown]] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2015-10-05 17:42:46 UTC [DB: [unknown]] LOG:  checkpoint starting: xlog

...

2015-10-05 17:47:45 UTC [DB: walreceiver] ERROR:  requested WAL segment 00000001000003770000007C has already been removed
2015-10-05 17:47:49 UTC [DB: walreceiver] ERROR:  requested WAL segment 00000001000003770000007C has already been removed

The standby's restartpoint start message is at 17:40:01 and its completion
at 17:49:01, which looks consistent with checkpoint_timeout = 10min and
checkpoint_completion_target = 0.9.
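
Some rough arithmetic (assuming the usual 16 MB segment size) shows the
scale of the problem:

restartpoint writes spread over:  0.9 * 10 min ~ 9 min (matches 17:40:01 -> 17:49:01)
peak WAL rate on the primary:     ~10 segments * 16 MB per ~5 s ~ 32 MB/s
WAL produced during those 9 min:  ~32 MB/s * 540 s ~ 17 GB
freed when it finally completed:  149 removed segments * 16 MB ~ 2.3 GB (plus 21 recycled)

So at that write rate the 10 GB pg_xlog partition can't absorb even one
restartpoint cycle.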

However, in between, the partition ran out of space. The "No space left
on device" error comes from Postgres itself, so it knows that it is out of
space. I suppose it could ignore checkpoint_completion_target under these
circumstances and just perform the restartpoint as fast as possible, which
would then allow it to delete unneeded WAL files and free space on the
pg_xlog partition.

But it didn't do that. I'd like to know if this is even possible and
whether there's some kind of configuration setting that I've missed.

Another thing I'm wondering about is what would happen if the same problem
occurred again right after pg_xlog space was released on the standby
(assuming that the primary keeps generating a huge amount of WAL). Would
the partition fill up again, followed by another nine-minute wait for the
space to be reclaimed?

In general, is there anything I can do to prevent this longish wait for
pg_xlog space to be freed on the standby? Some mechanism for throttling the
amount of WAL generated on the primary would be perfectly fine. Or some
mechanism to force the restartpoint to complete immediately when pg_xlog
runs out of space on the standby. Or anything else, really.

There is the wal_writer_delay option, but I have a hard time figuring out
how to correlate that time-based setting with the (unknown) amount of WAL
data that will be generated.

I suppose the same thing could happen if you're moving several million
rows (or tens of millions of rows) from one table to another on a live
system. How are people dealing with that?
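
The only application-side throttling I can think of is doing such moves in
batches with pauses in between, roughly like this (made-up table names;
batch size and pause picked arbitrarily):

WITH moved AS (
    DELETE FROM old_table
    WHERE id IN (SELECT id FROM old_table ORDER BY id LIMIT 10000)
    RETURNING *
)
INSERT INTO new_table SELECT * FROM moved;
-- commit, wait a bit (e.g. SELECT pg_sleep(1)), then repeat until old_table is empty

That spreads the WAL generation over time, but it doesn't help with the
ALTER TABLE rewrites.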
