Re: rebased background worker reimplementation prototype - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: rebased background worker reimplementation prototype |
| Date | |
| Msg-id | 20190716191629.al652qwkf7nkx537@development |
| In response to | Re: rebased background worker reimplementation prototype (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
On Tue, Jul 16, 2019 at 10:53:46AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-07-12 15:47:02 +0200, Tomas Vondra wrote:
>> I've done a bit of benchmarking / testing on this, so let me report some
>> basic results. I haven't done any significant code review, I've simply
>> run a bunch of pgbench runs on different systems with different scales.
>
>Thanks!
>
>
>> System #1
>> ---------
>> * CPU: Intel i5
>> * RAM: 8GB
>> * storage: 6 x SATA SSD RAID0 (Intel S3700)
>> * autovacuum_analyze_scale_factor = 0.1
>> * autovacuum_vacuum_cost_delay = 2
>> * autovacuum_vacuum_cost_limit = 1000
>> * autovacuum_vacuum_scale_factor = 0.01
>> * bgwriter_delay = 100
>> * bgwriter_lru_maxpages = 10000
>> * checkpoint_timeout = 30min
>> * max_wal_size = 64GB
>> * shared_buffers = 1GB
>
>What's the controller situation here? Can the full SATA3 bandwidth on
>all of those drives be employed concurrently?
>

There's just an on-board SATA controller, so it might be a bottleneck. A
single drive can do ~440 MB/s sequential reads, and the whole RAID0 array
(Linux sw raid) does ~1.6GB/s, so not exactly 6x that. But I don't think
we're generating that many writes during the test.

>
>> System #2
>> ---------
>> * CPU: 2x Xeon E5-2620v5
>> * RAM: 64GB
>> * storage: 3 x 7.2k SATA RAID0, 1x NVMe
>> * autovacuum_analyze_scale_factor = 0.1
>> * autovacuum_vacuum_cost_delay = 2
>> * autovacuum_vacuum_cost_limit = 1000
>> * autovacuum_vacuum_scale_factor = 0.01
>> * bgwriter_delay = 100
>> * bgwriter_lru_maxpages = 10000
>> * checkpoint_completion_target = 0.9
>> * checkpoint_timeout = 15min
>> * max_wal_size = 32GB
>> * shared_buffers = 8GB
>
>What type of NVMe disk is this? I'm mostly wondering whether it's fast
>enough that there's no conceivable way that IO scheduling is going to
>make a meaningful difference, given other bottlenecks in postgres.
>
>In some preliminary benchmark runs I've seen fairly significant gains on
>SATA and SAS SSDs, as well as spinning rust, but I've not yet
>benchmarked on a decent NVMe SSD.
>

Intel Optane 900P 280GB (model SSDPED1D280GA) [1].

[1] https://ssd.userbenchmark.com/SpeedTest/315555/INTEL-SSDPED1D280GA

I think one of the main improvements in this generation of drives is good
performance at low queue depths. See for example [2].

[2] https://www.anandtech.com/show/12136/the-intel-optane-ssd-900p-480gb-review/5

Not sure if that plays a role here, but I've seen it affect prefetch and
similar things.

>
>> For each config I've done tests with three scales - small (fits into
>> shared buffers), medium (fits into RAM) and large (at least 2x the RAM).
>> Aside from the basic metrics (throughput etc.) I've also sampled data
>> for about 5% of transactions, to be able to look at latency stats.
>>
>> The tests were done on master and patched code (both in the 'legacy' and
>> new mode).
>
>
>
>> I haven't done any temporal analysis yet (i.e. I'm only looking at global
>> summaries, not tps over time etc).
>
>FWIW, I'm working on a tool that generates correlated graphs of OS, PG,
>pgbench stats. Especially being able to correlate the kernel's
>'Writeback' stats (grep Writeback: /proc/meminfo) and latency is very
>valuable. Sampling wait events over time also is worthwhile.
>

Good to know, although I don't think it's difficult to fetch the data from
sar and plot it. I might even already have ugly bash scripts doing that,
somewhere.
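
FWIW, for the Writeback part something as trivial as this rough Python
sketch would do (the one-second interval and the output file name are just
arbitrary choices) - it records the kernel's Writeback counter with a
timestamp, so it can later be joined against the pgbench transaction log:

#!/usr/bin/env python3
# Rough sketch: periodically sample the kernel's Writeback counter from
# /proc/meminfo and append it to a CSV with a timestamp, so it can later
# be correlated with pgbench's per-transaction latency log. The sampling
# interval and output file name are arbitrary. Run it for the duration of
# the benchmark and stop it with Ctrl-C.
import csv
import re
import time

INTERVAL_S = 1.0
OUTPUT = "writeback_samples.csv"

def read_writeback_kb():
    # /proc/meminfo lines look like "Writeback:        1234 kB"
    with open("/proc/meminfo") as f:
        for line in f:
            m = re.match(r"Writeback:\s+(\d+)\s+kB", line)
            if m:
                return int(m.group(1))
    return None

with open(OUTPUT, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["unix_time", "writeback_kb"])
    while True:
        writer.writerow(["%.3f" % time.time(), read_writeback_kb()])
        out.flush()
        time.sleep(INTERVAL_S)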
>
>> When running on the 7.2k SATA RAID, the throughput improves with the
>> medium scale - from ~340tps to ~439tps, which is a pretty significant
>> jump. But on the large scale this disappears (in fact, it seems to be a
>> bit lower than master/legacy cases). Of course, all this is just from a
>> single run (although 4h, so noise should even out).
>
>Any chance there's an order-of-test factor here? In my tests I found two
>related issues very important: 1) the first few tests are slower,
>because WAL segments don't yet exist. 2) Some poor bugger of a later
>test will get hit with anti-wraparound vacuums, even if otherwise not
>necessary.
>

Not sure - I'll check, but I find it unlikely. I need to repeat the tests
to have multiple runs.

>The fact that the master and "legacy" numbers differ significantly
>e.g. in the "xeon sata scale 1000" latency CDF does make me wonder
>whether there's an effect like that. While there might be some small
>performance difference due to different stats message sizes, and a few
>additional branches, I don't see how it could be that noticeable.
>

That's about the one case where things like anti-wraparound vacuum are
pretty much impossible, because the SATA storage is so slow ...

>
>> I've also computed latency CDF (from the 5% sample) - I've attached this
>> for the two interesting cases mentioned in the previous paragraph. This
>> shows that with the medium scale the latencies move down (with the patch,
>> both in the legacy and "new" modes), while on large scale the "new" mode
>> moves a bit to the right (to higher values).
>
>Hm. I can't yet explain that.
>
>
>> And finally, I've looked at buffer stats, i.e. number of buffers written
>> in various ways (checkpoint, bgwriter, backends) etc. Interestingly
>> enough, these numbers did not change very much - especially on the flash
>> storage. Maybe that's expected, though.
>
>Some of that is expected, e.g. because file extensions count as backend
>writes, and are going to roughly correlate with throughput, and not
>much else. But they're more similar than I'd actually expect.
>
>I do see a pretty big difference in the number of bgwriter-written
>buffers in the "new" case for scale 10000, on the nvme?
>

Right.

>For the SATA SSD case, I wonder if the throughput bottleneck is WAL
>writes. I see much more noticeable differences if I enable
>wal_compression or disable full_page_writes, because otherwise the bulk
>of the volume is WAL data. But even in that case, I see a latency
>stddev reduction with the new bgwriter around checkpoints.
>

I may try that during the next round of tests.

>
>> The one case where it did change is the "medium" scale on SATA storage,
>> where the throughput improved with the patch. But the change is kinda
>> strange, because the number of buffers evicted by the bgwriter decreased
>> (and instead they got evicted by the checkpointer). Which might explain
>> the higher throughput, because the checkpointer is probably more
>> efficient.
>
>Well, one problem with the current bgwriter implementation is that the
>victim selection isn't good. Because it doesn't perform clock sweep, and
>doesn't clean buffers with a usagecount, it'll often run until it finds
>a dirty buffer that's pretty far ahead of the clock hand, and clean
>those. But with a random test like pgbench it's somewhat likely that
>those buffers will get re-dirtied before backends actually get to
>reusing them (that's a problem with the new implementation too, the
>window just is smaller). But I'm far from sure that that's the cause
>here.
>

OK. Time for more tests, I guess.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
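
PS: For anyone who wants to redo the latency CDFs from the sampled
per-transaction logs, something like this rough sketch works (it assumes
the standard pgbench -l log format, where the third field is the
transaction latency in microseconds, and a 5% sample taken with
--sampling-rate=0.05; the input path is just a placeholder):

#!/usr/bin/env python3
# Rough sketch: build an empirical latency CDF from a pgbench per-transaction
# log (pgbench -l, optionally with --sampling-rate=0.05 for a 5% sample).
# In the standard log format the third field is the latency in microseconds.
import sys

def latency_cdf(path):
    latencies_ms = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                latencies_ms.append(int(fields[2]) / 1000.0)  # us -> ms
    latencies_ms.sort()
    n = len(latencies_ms)
    # (latency in ms, fraction of sampled transactions at or below it)
    return [(lat, (i + 1) / n) for i, lat in enumerate(latencies_ms)]

if __name__ == "__main__":
    for lat, frac in latency_cdf(sys.argv[1]):
        print("%.3f\t%.5f" % (lat, frac))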