Thread: [PERFORM] Questionnaire: Common WAL write rates on busy servers.
Hi,
I've lately seen more and more installations where the generation of write-ahead log (WAL) is one of the primary bottlenecks. I'm curious whether that's primarily a "sampling error" of mine, or whether that's indeed more common.
The primary reason I'm curious is that I'm pondering a few potential optimizations, and would like to have some guidance which are more and which are less important.
Questions (answer as many as you can comfortably answer):
- How many MB/s, segments/s do you see on busier servers?
- What generates the bulk of WAL on your servers (9.5+ can use pg_xlogdump --stats to compute that)?
- Are you seeing WAL writes being a bottleneck?
- What kind of backup methods are you using and is the WAL volume a problem?
- What kind of replication are you using and is the WAL volume a problem?
- What are your settings for wal_compression, max_wal_size (9.5+) / checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
- Could you quickly describe your workload?
Feel free to add any information you think is pertinent ;)
Greetings, Andres Freund
Hi Andres.
On 25 Apr 2017, at 7:17, Andres Freund <andres@anarazel.de> wrote:
Hi,
I've lately seen more and more installations where the generation of
write-ahead-log (WAL) is one of the primary bottlenecks. I'm curious
whether that's primarily a "sampling error" of mine, or whether that's
indeed more common.
The primary reason I'm curious is that I'm pondering a few potential
optimizations, and would like to have some guidance which are more and
which are less important.
Questions (answer as many you can comfortably answer):
- How many MB/s, segments/s do you see on busier servers?
Nearly one WAL segment (16 MB) per second most of the time, and 3 WALs per second at the beginning of a checkpoint (due to full_page_writes).
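For reference, a back-of-envelope conversion from the segment rates above to byte rates (a rough sketch; assumes the default 16 MB WAL segment size):

```python
# Rough conversion from WAL segment switches to byte rates.
# Assumes the default 16 MB WAL segment size.
SEGMENT_MB = 16

def wal_rate_mb_per_s(segments_per_s: float) -> float:
    """MB/s of WAL implied by a given segment-switch rate."""
    return segments_per_s * SEGMENT_MB

steady = wal_rate_mb_per_s(1)      # ~16 MB/s in steady state
burst = wal_rate_mb_per_s(3)       # ~48 MB/s right after a checkpoint starts
daily_gb = steady * 86400 / 1024   # ~1350 GB (~1.3 TB) of WAL per day
```

Even the steady-state rate adds up to well over a terabyte of WAL per day, which is why archiving and replication bandwidth come up repeatedly in this thread.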
- What generates the bulk of WAL on your servers (9.5+ can use
pg_xlogdump --stats to compute that)?
Here is the output from a couple of our masters (and that is actually two hours before peak load):
$ pg_xlogdump --stats 0000000100012B2800000089 0000000100012B3000000088 | fgrep -v 0.00
Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
Heap2 55820638 ( 21.31) 1730485085 ( 22.27) 1385795249 ( 13.28) 3116280334 ( 17.12)
Heap 74366993 ( 28.39) 2288644932 ( 29.46) 5880717650 ( 56.34) 8169362582 ( 44.87)
Btree 84655827 ( 32.32) 2243526276 ( 28.88) 3170518879 ( 30.38) 5414045155 ( 29.74)
-------- -------- -------- --------
Total 261933790 7769663301 [42.67%] 10437031778 [57.33%] 18206695079 [100%]
$
$ pg_xlogdump --stats 000000010000D17F000000A5 000000010000D19100000004 | fgrep -v 0.00
Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
Heap2 13676881 ( 18.95) 422289539 ( 19.97) 15319927851 ( 25.63) 15742217390 ( 25.44)
Heap 22284283 ( 30.88) 715293050 ( 33.83) 17119265188 ( 28.64) 17834558238 ( 28.82)
Btree 27640155 ( 38.30) 725674896 ( 34.32) 19244109632 ( 32.19) 19969784528 ( 32.27)
Gin 6580760 ( 9.12) 172246586 ( 8.15) 8091332009 ( 13.54) 8263578595 ( 13.35)
-------- -------- -------- --------
Total 72172983 2114133847 [3.42%] 59774634680 [96.58%] 61888768527 [100%]
$
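The bracketed percentages on the Total lines above are just the record/FPI split of the combined volume; a small sketch of how they are derived (totals copied from the two dumps above):

```python
# How the bracketed percentages on pg_xlogdump's Total line are derived:
# FPI share = FPI bytes / (record bytes + FPI bytes).
# Totals copied from the two dumps above.
def fpi_share_pct(record_bytes: int, fpi_bytes: int) -> float:
    return 100.0 * fpi_bytes / (record_bytes + fpi_bytes)

first = fpi_share_pct(7769663301, 10437031778)   # ~57.33% FPI
second = fpi_share_pct(2114133847, 59774634680)  # ~96.58% FPI: nearly all of
# the second server's WAL volume is full-page images, not logical records.
```

The contrast between the two servers suggests very different update patterns relative to checkpoint frequency.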
- Are you seeing WAL writes being a bottleneck?
We do sometimes see WALWriteLock in pg_stat_activity.wait_event, but not too often.
- What kind of backup methods are you using and is the WAL volume a
problem?
We use a fork of the barman project. In most cases that's not a problem.
- What kind of replication are you using and is the WAL volume a
problem?
Physical streaming replication. We used to have problems with network bandwidth (1 Gbit/s was consumed by transferring WAL to two replicas and one archive), but that got better after 1. upgrading to 9.5 and turning wal_compression on, 2. changing the archive command to do parallel compression and send WALs to the archive, 3. increasing checkpoint_timeout.
- What are your settings for wal_compression, max_wal_size (9.5+) /
checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
xdb301e/postgres M # SELECT name, current_setting(name) FROM pg_settings
WHERE name IN ('max_wal_size', 'checkpoint_timeout', 'wal_compression', 'wal_buffers');
name | current_setting
--------------------+-----------------
checkpoint_timeout | 1h
max_wal_size | 128GB
wal_buffers | 16MB
wal_compression | on
(4 rows)
Time: 0.938 ms
xdb301e/postgres M #
- Could you quickly describe your workload?
OLTP workload with 80% reads and 20% writes.
Feel free to add any information you think is pertinent ;)
Well, we actually worked around the WAL write rate issues by increasing checkpoint_timeout to the maximum possible (in 9.6 it can be even higher). The downside of this change is recovery time. Thanks to Postgres for its stability, but sometimes you can waste ~10 minutes just restarting Postgres to upgrade to a new minor version, and that's not really cool.
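The recovery-time downside can be estimated roughly: after a crash, redo has to replay up to about one checkpoint interval worth of WAL. A hypothetical sketch (the 100 MB/s single-process replay rate is an assumed figure, not a measurement; real rates depend heavily on hardware and record mix):

```python
# Worst-case crash-recovery estimate: redo replays up to ~one checkpoint
# interval of WAL. The replay rate is an assumed placeholder, not from the
# thread; real replay speed varies with hardware and WAL record mix.
def recovery_minutes(wal_mb_per_s: float, checkpoint_timeout_min: float,
                     replay_mb_per_s: float) -> float:
    wal_to_replay_mb = wal_mb_per_s * checkpoint_timeout_min * 60
    return wal_to_replay_mb / replay_mb_per_s / 60

# ~16 MB/s of WAL with 1h checkpoints and a hypothetical 100 MB/s replay:
est = recovery_minutes(16, 60, 100)  # ~9.6 minutes, in line with the ~10
                                     # minute restarts described above
```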
Greetings,
Andres Freund
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
On Mon, Apr 24, 2017 at 9:17 PM, Andres Freund <andres@anarazel.de> wrote:
Questions (answer as many you can comfortably answer):
- How many MB/s, segments/s do you see on busier servers?
- What generates the bulk of WAL on your servers (9.5+ can use
pg_xlogdump --stats to compute that)?
- Are you seeing WAL writes being a bottleneck?
- What kind of backup methods are you using and is the WAL volume a
problem?
- What kind of replication are you using and is the WAL volume a
problem?
- What are your settings for wal_compression, max_wal_size (9.5+) /
checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
- Could you quickly describe your workload?
* Postgresql 9.3
* 1500+ db servers
* Daily peak for busy databases: 75 WALs switched per second (less than 10% of the servers experience this)
* Avg per db: 2 WALs/s
* Mainly generated by large batch sync processes that occur throughout the day, and by a legacy archiving process to purge older data (potentially many millions of cascading deletes).
* Half the servers have (encrypted) pg_dump backups; WAL volume hasn't proved to be a problem there, though dump size is a problem for a few of the larger databases (less than 1TB).
* Inter-data-centre replication is all streaming; across DCs (over the WAN), WAL shipping is over compressed SSH tunnels.
Occasionally the streaming replication falls behind, but more commonly it is the cross-DC log shipping that becomes a problem. Some of the servers will generate 50+ GBs of WAL in a matter of minutes and that backs up immediately on the masters. Occasionally this has a knock-on effect for other servers and slows down their log shipping due to network saturation.
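A quick sanity check on why those bursts saturate the network (the 5-minute window is an assumed illustration of "a matter of minutes"; the 1 Gbit/s link speed is the figure mentioned earlier in the thread):

```python
# 50 GB of WAL generated in an assumed 5-minute burst vs. a 1 Gbit/s link.
wal_gb, window_s = 50, 5 * 60
needed_mbit_per_s = wal_gb * 1024 * 8 / window_s  # ~1365 Mbit/s required
link_mbit_per_s = 1000
ratio = needed_mbit_per_s / link_mbit_per_s       # ~1.37: the burst alone
# needs more than the entire link, before compression or any other traffic.
```

So even with compression, a single bursting server can crowd out log shipping for every other server sharing the link, which matches the knock-on effect described above.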
* checkpoint_segments: 64, checkpoint_timeout: 5 mins, wal_buffers: 16MB
Workload:
70% of servers are generally quiet, with occasional bursty reads and writes.
20% are medium use, avg a few hundred transactions/second
10% average around 5k txns/s, with bursts up to 25k txns/s for several minutes.
All servers have about 80% reads / 20% writes, though those numbers flip during big sync jobs and when the purging maintenance kicks off.
On Tue, Apr 25, 2017 at 1:17 AM, Andres Freund <andres@anarazel.de> wrote:
> Questions (answer as many you can comfortably answer):
> - How many MB/s, segments/s do you see on busier servers?
~20MB/s with FPW compression, with peaks of ~35MB/s. Writes become the bottleneck without compression; it tops out at about 40-50MB/s, and WAL archiving cannot keep up beyond that point.
> - What generates the bulk of WAL on your servers (9.5+ can use
> pg_xlogdump --stats to compute that)?
Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
XLOG 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Transaction 30 ( 0.00) 960 ( 0.00) 0 ( 0.00) 960 ( 0.00)
Storage 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
CLOG 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Database 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Tablespace 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
MultiXact 110 ( 0.01) 7456 ( 0.02) 0 ( 0.00) 7456 ( 0.00)
RelMap 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Standby 2 ( 0.00) 368 ( 0.00) 0 ( 0.00) 368 ( 0.00)
Heap2 2521 ( 0.22) 78752 ( 0.24) 4656133 ( 2.82) 4734885 ( 2.39)
Heap 539419 ( 46.52) 15646903 ( 47.14) 98720258 ( 59.87) 114367161 ( 57.73)
Btree 606573 ( 52.31) 15872182 ( 47.82) 57514798 ( 34.88) 73386980 ( 37.05)
Hash 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Gin 2866 ( 0.25) 134330 ( 0.40) 4012251 ( 2.43) 4146581 ( 2.09)
Gist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Sequence 7970 ( 0.69) 1450540 ( 4.37) 0 ( 0.00) 1450540 ( 0.73)
SPGist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
BRIN 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
CommitTs 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
ReplicationOrigin 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
-------- -------- -------- --------
Total 1159491 33191491 [16.76%] 164903440 [83.24%] 198094931 [100%]
> - Are you seeing WAL writes being a bottleneck?
Sometimes, more so without FPW compression.
> - What kind of backup methods are you using and is the WAL volume a
> problem?
> - What kind of replication are you using and is the WAL volume a
> problem?
Streaming to hot standby + WAL archiving, delayed standby as backup and PITR. Backups are regular filesystem-level snapshots of the delayed standby (with postgres down to get consistent snapshots). WAL volume getting full during periods where the hot standby lags behind (or when we have to stop it to create consistent snapshots) is an issue indeed, and we've had to provision significant storage to be able to absorb those peaks (1TB of WAL).
We bundle WAL segments into groups of 256 segments for archiving and recovery to minimize the impact of TCP slow start. We further gzip segments before transfer with pigz, and we mostly use rsync (with a wrapper script that takes care of durability and error handling) to move segments around. Getting the archive/recovery scripts to handle the load hasn't been trivial.
> - What are your settings for wal_compression, max_wal_size (9.5+) /
> checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
wal_compression = on
max_wal_size = 12GB
min_wal_size = 2GB
checkpoint_timeout = 30min
wal_buffers = -1 (16MB effective)
> - Could you quickly describe your workload?
Steady stream of (preaggregated) input events plus upserts into ~12 partitioned aggregate "matviews" (within quotes since they're manually maintained up to date). Input rate is approximately 9000 rows/s, not counting the upserts onto the aggregate matviews. Old information is regularly compressed and archived into less detailed partitions for a steady size of about 5TB.
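The 256-segment bundling described above works out to sizeable transfer units (a sketch assuming the default 16 MB segment size; the 3:1 compression ratio is an assumed figure, not from the thread):

```python
# Archiving WAL in bundles of 256 segments (16 MB each, the default size).
SEGMENT_MB = 16
bundle_mb = 256 * SEGMENT_MB       # 4096 MB (4 GB) per uncompressed bundle
# With a hypothetical 3:1 pigz compression ratio, each transfer is ~1.3 GB,
# large enough to amortize TCP slow start across the whole bundle.
compressed_gb = bundle_mb / 3 / 1024
```

Transferring one multi-gigabyte stream instead of 256 small files means the connection spends almost all of its time at full window size, which is the stated motivation for bundling.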
Hi,
On 2017-04-24 21:17:43 -0700, Andres Freund wrote:
> I've lately seen more and more installations where the generation of
> write-ahead-log (WAL) is one of the primary bottlenecks. I'm curious
> whether that's primarily a "sampling error" of mine, or whether that's
> indeed more common.
>
> The primary reason I'm curious is that I'm pondering a few potential
> optimizations, and would like to have some guidance which are more and
> which are less important.
>
> Questions (answer as many you can comfortably answer):
> - How many MB/s, segments/s do you see on busier servers?
> - What generates the bulk of WAL on your servers (9.5+ can use
> pg_xlogdump --stats to compute that)?
> - Are you seeing WAL writes being a bottleneck?
> - What kind of backup methods are you using and is the WAL volume a
> problem?
> - What kind of replication are you using and is the WAL volume a
> problem?
> - What are your settings for wal_compression, max_wal_size (9.5+) /
> checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
> - Could you quickly describe your workload?
Ok, based on the, few, answers I've got so far, my experience is indeed skewed. A number of the PG users I interacted with over the last couple of years had WAL write rates somewhere in the range of 500MB/s to 2.2GB/s (the max I've seen). At that point WAL insertion became a major bottleneck, even if storage was more than fast enough to keep up. To address these we'd need some changes, but the feedback so far suggests that it's not yet a widespread issue...
- Andres
On 04/27/2017 08:59 AM, Andres Freund wrote:
> Ok, based on the, few, answers I've got so far, my experience is indeed
> skewed. A number of the PG users I interacted with over the last couple
> of years had WAL write rates somewhere in the range of 500MB/s to 2.2GB/s
> (the max I've seen). At that point WAL insertion became a major bottleneck,
> even if storage was more than fast enough to keep up. To address these
> we'd need some changes, but the feedback so far suggests that it's not
> yet a widespread issue...
I would agree it isn't yet a widespread issue. The only people that are likely going to see this are going to be on bare metal. We should definitely plan on that issue for, say, 11.
I do have a question though: where you have seen this issue, is it with synchronous_commit on or off?
Thanks,
JD
--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.
On 2017-04-27 09:31:34 -0700, Joshua D. Drake wrote:
> On 04/27/2017 08:59 AM, Andres Freund wrote:
> > Ok, based on the, few, answers I've got so far, my experience is indeed
> > skewed. A number of the PG users I interacted with over the last couple
> > of years had WAL write rates somewhere in the range of 500MB/s to 2.2GB/s
> > (the max I've seen). At that point WAL insertion became a major bottleneck,
> > even if storage was more than fast enough to keep up. To address these
> > we'd need some changes, but the feedback so far suggests that it's not
> > yet a widespread issue...
>
> I would agree it isn't yet a widespread issue.
I'm not yet sure about that actually. I suspect a large percentage of people with such workloads aren't lingering lots on the lists.
> The only people that are likely going to see this are going to be on bare
> metal. We should definitely plan on that issue for say 11.
"plan on that issue" - heh. We're talking about major engineering projects here ;)
> I do have a question though, where you have seen this issue is it with
> synchronous_commit on or off?
Both. Whether that matters or not really depends on the workload. If you have bulk writes, it doesn't really matter much.
- Andres
On 04/27/2017 09:34 AM, Andres Freund wrote:
> On 2017-04-27 09:31:34 -0700, Joshua D. Drake wrote:
>> I would agree it isn't yet a widespread issue.
>
> I'm not yet sure about that actually. I suspect a large percentage of
> people with such workloads aren't lingering lots on the lists.
That would probably be true. I was thinking of it more as "most new users are in the cloud", and in the cloud it is going to be rare that a user can hit that level of writes (at least not without spending LOTS of money).
> "plan on that issue" - heh. We're talking about major engineering
> projects here ;)
Sorry, wasn't trying to make light of the effort. :D
> Both. Whether that matters or not really depends on the workload. If you
> have bulk writes, it doesn't really matter much.
Sure, o.k.
Thanks, Andres
On 2017-04-27 10:29:48 -0700, Joshua D. Drake wrote:
> That would probably be true. I was thinking of it more as "most new
> users are in the cloud", and in the cloud it is going to be rare that a
> user can hit that level of writes (at least not without spending LOTS
> of money).
You can get pretty decent NVMe SSD drives on several cloud providers these days, without them immediately bankrupting you. Sure, it's instance storage, but with a decent replication and archival setup, that's not necessarily an issue.
It's not that hard to get to the point where postgres can't keep up with storage, at least for some workloads.
- Andres
Hi,
On 04/25/2017 06:17 AM, Andres Freund wrote:
> I've lately seen more and more installations where the generation of
> write-ahead-log (WAL) is one of the primary bottlenecks. I'm curious
> whether that's primarily a "sampling error" of mine, or whether
> that's indeed more common.
I see those cases too. To some degree it's a sampling bias. People generally don't call us to look at the 99% of systems that perform fine, so we tend to see the 1% of systems under pressure. That doesn't make the observation irrelevant, though. Those demanding systems are one of the things that pushes us forward.
> The primary reason I'm curious is that I'm pondering a few potential
> optimizations, and would like to have some guidance which are more
> and which are less important.
I think any optimization you do will improve at least some of those busy systems.
> Questions (answer as many you can comfortably answer):
> - How many MB/s, segments/s do you see on busier servers?
That depends on the cause (see the next point).
> - What generates the bulk of WAL on your servers (9.5+ can use
> pg_xlogdump --stats to compute that)?
a) systems doing large batches - bulk loads/updates/deletes, one or a few sessions doing a lot - easily high hundreds of MB/s (on a separate device)
b) OLTP systems doing a lot of tiny/small transactions - many concurrent sessions - often end up much more limited by WAL, due to locking etc. - often the trouble is random updates all over the place, causing amplification due to FPIs (a PK on UUID is a great way to cause this unnecessarily, even on apps with a naturally tiny working set)
> - Are you seeing WAL writes being a bottleneck?
On the write-intensive systems, yes. Often the CPUs are waiting for WAL I/O to complete during COMMIT.
> - What kind of backup methods are you using and is the WAL volume a
> problem?
The large and busy systems can easily produce so much WAL that the basebackup is not the largest part of the backup.
That is somewhat solvable by using other means of obtaining the basebackup snapshot (e.g. by taking some sort of filesystem / SAN / ... snapshot). That reduces the amount of WAL needed to make the basebackup consistent, but it doesn't solve the WAL archiving issue.
> - What kind of replication are you using and is the WAL volume a
> problem?
Generally streaming replication, and yes, the amount of WAL may be an issue, partially because the standby is a single-process thing. And as it has to process something generated by N sessions on the primary, that can't end well. Interestingly enough, FPIs can actually make it way faster, because the standby does not need to read the data from disk during recovery.
> - What are your settings for wal_compression, max_wal_size (9.5+) /
> checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
I'd say the usual practice is to tune for timed checkpoints, say 30+ minutes apart (or more). wal_compression is typically 'off' (i.e. the default value).
> - Could you quickly describe your workload?
Pretty much a little bit of everything, depending on the customer.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/27/2017 06:34 PM, Andres Freund wrote:
> On 2017-04-27 09:31:34 -0700, Joshua D. Drake wrote:
>> I would agree it isn't yet a widespread issue.
>
> I'm not yet sure about that actually. I suspect a large percentage
> of people with such workloads aren't lingering lots on the lists.
To a certain extent, this is a self-fulfilling prophecy. If you know you'll have such a busy system, you probably do some research and testing first, before choosing the database. If we don't perform well enough, you pick something else. Which removes the data point.
Obviously, there are systems that start small and get busier and busier over time. And those are the ones we see.
cheers
On 04/27/2017 07:35 PM, Andres Freund wrote:
> You can get pretty decent NVMe SSD drives on several cloud
> providers these days, without them immediately bankrupting you. Sure,
> it's instance storage, but with a decent replication and archival
> setup, that's not necessarily an issue.
>
> It's not that hard to get to the point where postgres can't keep up
> with storage, at least for some workloads.
I can confirm this observation. I bought the Intel 750 NVMe SSD last year; the device has 1GB of DDR3 cache on it (power-loss protected) and can do ~1GB/s of sustained O_DIRECT sequential writes. But when running pgbench, I can't push more than ~300MB/s of WAL to it, no matter what I do, because of WALWriteLock.
regards
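The gap Tomas describes can be framed as device utilization (a sketch using the figures from the message above):

```python
# The device sustains ~1 GB/s of sequential writes, but pgbench tops out at
# ~300 MB/s of WAL -- under a third of what the storage can absorb. That is
# consistent with WALWriteLock serialization, not storage, being the limit.
device_mb_s, achieved_mb_s = 1024, 300
utilization = achieved_mb_s / device_mb_s  # ~0.29
```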
On 2017-04-28 01:29:14 +0200, Tomas Vondra wrote:
> I can confirm this observation. I bought the Intel 750 NVMe SSD last year;
> the device has 1GB of DDR3 cache on it (power-loss protected) and can do
> ~1GB/s of sustained O_DIRECT sequential writes. But when running pgbench,
> I can't push more than ~300MB/s of WAL to it, no matter what I do, because
> of WALWriteLock.
Hm, interesting. Even if you up wal_buffers to 128MB, use synchronous_commit = off, and play with wal_writer_delay/flush_after?
- Andres
On 04/28/2017 01:34 AM, Andres Freund wrote:
> Hm, interesting. Even if you up wal_buffers to 128MB, use
> synchronous_commit = off, and play with wal_writer_delay/flush_after?
I think I've tried things like that, but let me do some proper testing. I'll report the numbers in a few days.
regards