Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers
From: Ants Aasma
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: CA+CSw_uL1JL_yGBs7R1-n_0efKK6JPGgcd9mn8OZLbMps0s7Gg@mail.gmail.com
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (Greg Smith <greg@2ndQuadrant.com>)
Responses: Re: Improvement of checkpoint IO scheduler for stable transaction responses
List: pgsql-hackers
On Tue, Jul 16, 2013 at 9:17 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 7/16/13 12:46 PM, Ants Aasma wrote:
>> Spread checkpoints sprinkle the writes out over a long period and the
>> general tuning advice is to heavily bound the amount of memory the OS
>> is willing to keep dirty.
>
> That's arguing that you can make this feature be useful if you tune in
> a particular way. That's interesting, but the goal here isn't to prove
> the existence of some workload that a change is useful for. You can
> usually find a test case that validates any performance patch as
> helpful if you search for one. Everyone who has submitted a sorted
> checkpoint patch, for example, has found some setup where it shows
> significant gains. We're trying to keep performance stable across a
> much wider set of possibilities though.
>
> Let's talk about default parameters instead, which quickly demonstrate
> where your assumptions fail. The server I happen to be running pgbench
> tests on today has 72GB of RAM running SL6 with RedHat derived kernel
> 2.6.32-358.11.1. This is a very popular middle grade server
> configuration nowadays. There dirty_background_ratio and dirty_ratio
> are 10 (percent). That means that roughly 7GB of RAM can be used for
> write caching. Note that this is a fairly low write cache tuning
> compared to a survey of systems in the field--lots of people have
> servers with earlier kernels where these numbers can be as high as 20
> or even 40% instead.
>
> The current feasible tuning for shared_buffers suggests a value of 8GB
> is near the upper limit, beyond which cache related overhead makes
> increases counterproductive. Your examples are showing 53% of
> shared_buffers dirty at checkpoint time; that's typical. The
> checkpointer is then writing out just over 4GB of data.
>
> With that background, which process here has more data to make
> decisions with?
>
> -The operating system has 7GB of writes it's trying to optimize. That
> potentially includes backend, background writer, checkpoint, temp
> table, statistics, log, and WAL data. The scheduler is also
> considering read operations.
>
> -The checkpointer process has 4GB of writes from rarely written shared
> memory it's trying to optimize.

Actually, I was arguing that the reasoning "the OS will take care of the
sorting" does not apply in reasonably common cases. My point is that the
OS isn't able to optimize the writes because spread checkpoints trickle
them out to the OS in random order over a long time. If OS writeback
behavior is left at the default configuration, it will start writing out
data before the checkpoint write phase ends (due to
dirty_expire_centisecs), missing the write combining opportunities that
would arise if we sorted the data before dumping it into the OS dirty
buffers. I'm not arguing that we try to bypass OS I/O scheduling
decisions; I'm arguing that by arranging checkpoint writes in logical
order we make pages visible to the I/O scheduler in a way that leads to
more efficient writes.

Also, I think you are overestimating the capabilities of the OS I/O
scheduler. At least on Linux, the I/O scheduler does not see pages on
the dirty list - only pages for which writeback has already been
initiated. In the default configuration this means up to 128 read and
128 write I/Os are considered. The writes are picked by basically doing
round robin over files with dirty pages and taking a clocksweep scan of
a chunk of pages from each. So in reality there is practically no
benefit in having the OS do the reordering, while flushing a large
amount of dirty pages at once does very nasty things to query latency by
overloading all of the I/O queues.
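To make the write combining point concrete, here is a minimal,
self-contained sketch of the kind of sorting I mean (not the actual
patch; every name in it is made up for illustration): collect one tag
per buffer that is dirty at checkpoint start, sort by tablespace,
relation, fork and block, and only then hand the pages to the OS, so
neighbouring blocks of the same file reach the kernel back to back.

/* Simplified sketch, not the actual patch: sort checkpoint writes so
 * that blocks of the same file are handed to the kernel in order.
 * All types and helpers here are invented for illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct CkptSortItem
{
    uint32_t tablespace;    /* tablespace OID */
    uint32_t relation;      /* relfilenode */
    uint32_t forknum;       /* main fork, FSM, VM, ... */
    uint32_t blocknum;      /* block number within the fork */
    int      buf_id;        /* which shared buffer holds the page */
} CkptSortItem;

static int
ckpt_buforder_cmp(const void *pa, const void *pb)
{
    const CkptSortItem *a = pa, *b = pb;

    if (a->tablespace != b->tablespace)
        return a->tablespace < b->tablespace ? -1 : 1;
    if (a->relation != b->relation)
        return a->relation < b->relation ? -1 : 1;
    if (a->forknum != b->forknum)
        return a->forknum < b->forknum ? -1 : 1;
    if (a->blocknum != b->blocknum)
        return a->blocknum < b->blocknum ? -1 : 1;
    return 0;
}

/* Stand-in for the real buffer write; it only shows the order in which
 * pages would reach the OS. */
static void
write_one_buffer(const CkptSortItem *it)
{
    printf("write rel %u fork %u block %u (buf %d)\n",
           (unsigned) it->relation, (unsigned) it->forknum,
           (unsigned) it->blocknum, it->buf_id);
}

int
main(void)
{
    /* Dirty buffers in the order a buffer pool scan finds them. */
    CkptSortItem items[] = {
        {1663, 16384, 0, 42, 7},
        {1663, 16385, 0,  3, 1},
        {1663, 16384, 0, 41, 9},
        {1663, 16384, 0, 43, 4},
        {1663, 16385, 0,  2, 2},
    };
    size_t n = sizeof(items) / sizeof(items[0]);

    qsort(items, n, sizeof(CkptSortItem), ckpt_buforder_cmp);
    for (size_t i = 0; i < n; i++)
        write_one_buffer(&items[i]);
    return 0;
}

In this toy run, blocks 41-43 of relation 16384 go out consecutively,
which is exactly the pattern that even a writeback window of only 128
in-flight requests can merge into larger I/Os.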
> This is why if you take the opposite approach of yours today--go
> searching for workloads where sorting is counterproductive--those are
> equally easy to find. Any test of write speed I do starts with about
> 50 different scale/client combinations. Why do I suggest pgbench-tools
> as a way to do performance tests? It's because an automated sweep of
> client setups like it does is the minimum necessary to create enough
> variation in workload for changing the database's write path. It's
> really amazing how often doing that shows a proposed change is just
> shuffling the good and bad cases around. That's been the case for
> every sorting and fsync delay change submitted so far. I'm not even
> interested in testing today's submission because I tried that
> particular approach for a few months, twice so far, and it fell apart
> on just as many workloads as it helped.

As you know, running a full suite of write benchmarks takes a very long
time, and the results are often inconclusive (the noise is greater than
the effect we are trying to measure). This is why I'm interested in
which workloads you suspect might fall apart with this patch - I can't
think of any. The worst case is that the OS fully absorbs all checkpoint
writes before writing anything out, so the sorting is a useless waste of
CPU and memory. The CPU cost here is on the order of a fraction of a
second per checkpoint, basically nothing.

>> The checkpointer has the best long term overview of the situation
>> here, OS scheduling only has the short term view of outstanding read
>> and write requests.
>
> True only if shared_buffers is large compared to the OS write cache,
> which was not the case on the example I generated with all of a
> minute's work. I regularly see servers where Linux's "Dirty" area
> becomes a multiple of the dirty buffers written by a checkpoint. I can
> usually make that happen at will with CLUSTER and VACUUM on big
> tables. The idea that the checkpointer has a long-term view while the
> OS has a short one, that presumes a setup that I would say is possible
> but not common.

Because the checkpointer is throttling itself while writing out, it
always has a longer term view than the OS. The OS doesn't know which
pages are coming before PostgreSQL writes them out.
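That throttling is easy to sketch as well. Roughly (again a simplified
illustration and not the real checkpointer code; elapsed_fraction() and
write_one_buffer() are hypothetical helpers), the checkpointer knows the
complete, already sorted list of pages before the first write and naps
whenever it is ahead of the checkpoint_completion_target schedule:

/* Simplified spread-checkpoint pacing, not the real checkpointer.
 * elapsed_fraction() is assumed to return how much of the checkpoint
 * interval has passed (0.0 .. 1.0); write_one_buffer() hands a single
 * page to the OS. */
#include <math.h>
#include <stddef.h>
#include <unistd.h>

extern double elapsed_fraction(void);       /* hypothetical helper */
extern void   write_one_buffer(int buf_id); /* hypothetical helper */

void
spread_checkpoint_writes(const int *sorted_buf_ids, size_t num_to_write,
                         double completion_target)
{
    for (size_t done = 0; done < num_to_write; done++)
    {
        write_one_buffer(sorted_buf_ids[done]);

        /* Share of the writes we should have completed by now. */
        double target = fmin(elapsed_fraction() / completion_target, 1.0);
        double actual = (double) (done + 1) / (double) num_to_write;

        /* Ahead of schedule: nap instead of flooding the I/O queues. */
        while (actual > target)
        {
            usleep(100 * 1000);     /* 100 ms between progress checks */
            target = fmin(elapsed_fraction() / completion_target, 1.0);
        }
    }
}

The full write list exists inside PostgreSQL for the entire write phase,
while the kernel only ever sees the prefix that has already been handed
over.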
>> kernel settings: dirty_background_bytes = 32M, dirty_bytes = 128M.
>
> You disclaimed this as a best case scenario. It is a low throughput /
> low latency tuning. That's fine, but if Postgres optimizes itself
> toward those cases it runs the risk of high throughput servers with
> large caches being detuned. I've posted examples before showing very
> low write caches like this leading to VACUUM running at 1/2 its normal
> speed or worse, as a simple example of where a positive change in one
> area can backfire badly on another workload. That particular problem
> was so common I updated pgbench-tools recently to track table
> maintenance time between tests, because that demonstrated an issue
> even when the TPS numbers all looked fine.

Tuning the kernel writecache down like this is obviously a tradeoff; it
only helps because Linux is so bad at scheduling writeback. Sorting
checkpoints is different, as there is no workload where it would hurt by
any measurable amount.

I picked the test case because it makes the benefit obvious and shows
that there are reasonable workloads where sorting does wonders. I have
no doubt that other workloads will benefit a lot as well, but
constructing such test cases takes a significant amount of time. I have
seen many cases where having this patch in would have made my life a lot
easier.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de