Thank you for your reply. Yes, archiving is enabled, and this is probably the cause of the problem. I guess this can be worked around by storing the archived WAL files on separate storage, to avoid IO contention.
On Mon, Feb 2, 2015 at 8:34 PM, Yaser Raja <yrraja@gmail.com> wrote:
Do you have WAL archiving enabled (archive_mode)? If yes, then that might be the cause of this WAL file buildup. When archiving is enabled, WAL files are removed from the pg_xlog directory only after they have been successfully archived via the command specified by "archive_command".
If archive_command starts to fail, or if fewer WAL files are archived per minute (due to IO load, network, compression, etc.) than new WAL files are written to pg_xlog per minute, then the number of WAL files will keep growing irrespective of the value set for wal_keep_segments.
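One way to check whether archiving is falling behind is to look at pg_xlog/archive_status, where the server keeps a <segment>.ready file for each WAL segment still waiting for archive_command and renames it to <segment>.done once it has been archived. The following is only a rough sketch, not an official tool; the function name archive_backlog and the example path are made up for illustration.

```python
import os


def archive_backlog(pg_xlog_dir):
    """Return (ready, done) counts from pg_xlog/archive_status.

    ready: segments still waiting for archive_command to succeed
    done:  segments already archived and eligible for removal/recycling
    """
    status_dir = os.path.join(pg_xlog_dir, "archive_status")
    names = os.listdir(status_dir)
    ready = sum(1 for name in names if name.endswith(".ready"))
    done = sum(1 for name in names if name.endswith(".done"))
    return ready, done


if __name__ == "__main__":
    # Hypothetical path; adjust to the actual data directory.
    xlog_dir = "/var/lib/postgresql/data/pg_xlog"
    if os.path.isdir(os.path.join(xlog_dir, "archive_status")):
        ready, done = archive_backlog(xlog_dir)
        print("waiting to be archived: %d, archived: %d" % (ready, done))
```

If the "ready" count keeps climbing during periods of heavy IO, the archiver is not keeping up with WAL generation, and pg_xlog will grow regardless of wal_keep_segments.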
I have a database set up with streaming replication over two nodes, and use repmgr for managing the setup. As per repmgr instructions, I have set wal_keep_segments to 5000, resulting in about 80GB of WAL files in pg_xlog. I've set up pg_xlog on its own 100GB volume.
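For reference, the 80GB figure follows from the segment size. A quick back-of-the-envelope sketch, assuming the default 16MB WAL segment size and decimal units (1GB = 1000MB); the helper names are made up for illustration:

```python
WAL_SEGMENT_MB = 16  # assumption: default segment size, not changed at build time


def wal_keep_size_gb(wal_keep_segments, segment_mb=WAL_SEGMENT_MB):
    """Disk space held by the segments retained via wal_keep_segments."""
    return wal_keep_segments * segment_mb / 1000.0


def headroom_segments(volume_gb, wal_keep_segments, segment_mb=WAL_SEGMENT_MB):
    """Extra segments the volume can hold beyond the retained minimum."""
    total_segments = int(volume_gb * 1000 // segment_mb)
    return total_segments - wal_keep_segments


print(wal_keep_size_gb(5000))        # 80.0
print(headroom_segments(100, 5000))  # 1250
```

So under these assumptions the 100GB volume leaves room for roughly 1250 segments (about 20GB) beyond the retained 5000 before it fills up.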
Mostly, the disk space used by the WAL files is a constant 80GB, with the files rotating out as new ones are written. This is how I understand it's supposed to work. However, on a couple of occasions, the total disk usage on the pg_xlog volume has grown and filled up the available space (with the predictable consequences). This seems to have happened during periods of heavy IO load on the underlying disk system.
The Postgres cluster is spread out over different volumes, but all the volumes are on the same SAN, so heavy load on one volume is heavy load on all the volumes. Am I right in suspecting that an IO bottleneck is the cause of the WAL files growing beyond the wal_keep_segments value, or are there other circumstances where the WAL might actually grow over the set value?