On Mon, Apr 24, 2017 at 9:17 PM, Andres Freund <andres@anarazel.de> wrote:
Questions (answer as many you can comfortably answer): - How many MB/s, segments/s do you see on busier servers? - What generates the bulk of WAL on your servers (9.5+ can use pg_xlogdump --stats to compute that)? - Are you seeing WAL writes being a bottleneck?OA - What kind of backup methods are you using and is the WAL volume a problem? - What kind of replication are you using and is the WAL volume a problem? - What are your settings for wal_compression, max_wal_size (9.5+) / checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers? - Could you quickly describe your workload?
* Postgresql 9.3
* 1500+ db servers
* Daily peak for busy databases: 75 WALs switched per second (less than 10% of the servers experience this)
* Avg per db: 2 WALs/s * Mainly generated by large batch sync processes that occur throughout the day, and by a legacy archiving process to purge older data (potentially many millions of cascading deletes). *Half the servers have (encrypted) pg_dump backups, WAL volume hasn't proved to be a problem there, though dump size is a problem for a few of the larger databases (less than 1TB).
* Inter-data-centre replication is all streaming, across DC's (over the WAN) WAL shipping is over compressed SSH tunnels.
Occasionally the streaming replication falls behind, but more commonly it is the cross-DC log shipping that becomes a problem. Some of the servers will generate 50+ GBs of WAL in a matter of minutes and that backs up immediately on the masters. Occasionally this has a knock-on effect for other servers and slows down their log shipping due to network saturation.