I have a database set up with streaming replication across two nodes, and I use repmgr to manage the setup. As per the repmgr instructions, I have set wal_keep_segments to 5000, resulting in about 80GB of WAL files in pg_xlog. I've put pg_xlog on its own 100GB volume.
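For reference, the 80GB figure can be checked with a quick calculation. This is only a sketch, and it assumes the default WAL segment size of 16MB (the function name and constant are illustrative, not anything from Postgres or repmgr):

```python
# Assumption: WAL segments use the default size of 16 MB each.
SEGMENT_SIZE_MB = 16


def retained_wal_gb(segments, segment_mb=SEGMENT_SIZE_MB):
    """Approximate space held by `segments` WAL files, in decimal GB."""
    return segments * segment_mb / 1000


print(retained_wal_gb(5000))  # 80.0
```

That leaves roughly 20GB of headroom on the 100GB volume.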
Most of the time, the disk space used by the WAL files is a constant 80GB, with old files rotating out as new ones are written. This is how I understand it's supposed to work. However, on a couple of occasions, the total disk usage on the pg_xlog volume has grown until it filled the available space (with the predictable consequences). This seems to have happened during periods of heavy IO load on the underlying disk system.
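To catch the growth before the volume fills up, something like the following could be run periodically. It is a minimal sketch, not a tested monitoring tool: the directory is passed on the command line, the 16MB segment size is an assumed default, and the function names are made up for illustration.

```python
import os
import sys

# Assumption: 5000 retained segments of 16 MB each (the default segment size).
EXPECTED_BYTES = 5000 * 16 * 1024 * 1024


def dir_usage_bytes(path):
    """Sum the sizes of the regular files directly inside `path`."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


def excess_bytes(path, expected=EXPECTED_BYTES):
    """Bytes used beyond the expected retention, or 0 if within it."""
    return max(0, dir_usage_bytes(path) - expected)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_wal.py /path/to/pg_xlog
    over = excess_bytes(sys.argv[1])
    print("WAL usage exceeds expected retention by %d bytes" % over)
```

Logging this value alongside IO statistics would show whether the growth really coincides with the load spikes.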
The Postgres cluster is spread across several volumes, but they are all on the same SAN, so heavy load on one volume means heavy load on all of them. Am I right in suspecting that the IO bottleneck is what causes the WAL files to grow beyond the wal_keep_segments value, or are there other circumstances where the WAL can grow past the configured value?