Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From: Ants Aasma
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Msg-id: CA+CSw_uL1JL_yGBs7R1-n_0efKK6JPGgcd9mn8OZLbMps0s7Gg@mail.gmail.com
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (Greg Smith <greg@2ndQuadrant.com>)
List: pgsql-hackers
On Tue, Jul 16, 2013 at 9:17 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 7/16/13 12:46 PM, Ants Aasma wrote:
>
>> Spread checkpoints sprinkle the writes out over a long
>> period and the general tuning advice is to heavily bound the amount of
>> memory the OS is willing to keep dirty.
>
>
> That's arguing that you can make this feature be useful if you tune in a
> particular way.  That's interesting, but the goal here isn't to prove the
> existence of some workload that a change is useful for.  You can usually
> find a test case that validates any performance patch as helpful if you
> search for one.  Everyone who has submitted a sorted checkpoint patch for
> example has found some setup where it shows significant gains.  We're trying
> to keep performance stable across a much wider set of possibilities though.
>
> Let's talk about default parameters instead, which quickly demonstrates
> where your assumptions fail.  The server I happen to be running pgbench
> tests on today has 72GB of RAM running SL6 with RedHat derived kernel
> 2.6.32-358.11.1.  This is a very popular middle grade server configuration
> nowadays.  There dirty_background_ratio and dirty_ratio are 10
> (percent).  That means that roughly 7GB of RAM can be used for write
> caching.  Note that this is a fairly low write cache tuning compared to a
> survey of systems in the field--lots of people have servers with earlier
> kernels where these numbers can be as high as 20 or even 40% instead.
>
> The current feasible tuning for shared_buffers suggests a value of 8GB is
> near the upper limit, beyond which cache related overhead makes increases
> counterproductive.  Your examples are showing 53% of shared_buffers dirty at
> checkpoint time; that's typical.  The checkpointer is then writing out just
> over 4GB of data.
>
> With that background what process here has more data to make decisions with?
>
> -The operating system has 7GB of writes it's trying to optimize.  That
> potentially includes backend, background writer, checkpoint, temp table,
> statistics, log, and WAL data.  The scheduler is also considering read
> operations.
>
> -The checkpointer process has 4GB of writes from rarely written shared
> memory it's trying to optimize.

Actually I was arguing that the reasoning that the OS will take care
of the sorting does not apply in reasonably common cases. My point is
that the OS isn't able to optimize the writes, because spread
checkpoints trickle them out to the OS in random order over a long
time. If OS writeback behavior is left in its default configuration,
writeback will start before the checkpoint write phase ends (due to
dirty_expire_centisecs), which misses the write combining
opportunities that would arise if we sorted the data before dumping
it to the OS dirty buffers. I'm not arguing that we try to bypass
the OS I/O scheduling decisions; I'm arguing that by arranging
checkpoint writes in logical order we make the pages visible to the
I/O scheduler in a way that leads to more efficient writes.
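
To make the idea concrete, here is a minimal sketch of the sorting
step, assuming a simplified BufTag struct (the real patch works on
PostgreSQL's shared buffer descriptors, so the names and layout here
are illustrative only):

    /* Sort checkpoint writes into logical file order so adjacent
     * blocks reach the kernel together and writeback can merge
     * them, instead of seeing a random scatter of pages. */
    #include <stdlib.h>
    #include <stdint.h>

    typedef struct BufTag
    {
        uint32_t relfilenode;   /* which relation file */
        uint32_t forknum;       /* main fork, FSM, VM, ... */
        uint32_t blocknum;      /* block within the file */
    } BufTag;

    static int
    buftag_cmp(const void *a, const void *b)
    {
        const BufTag *x = a, *y = b;

        if (x->relfilenode != y->relfilenode)
            return x->relfilenode < y->relfilenode ? -1 : 1;
        if (x->forknum != y->forknum)
            return x->forknum < y->forknum ? -1 : 1;
        if (x->blocknum != y->blocknum)
            return x->blocknum < y->blocknum ? -1 : 1;
        return 0;
    }

    /* Called once before the checkpoint write phase begins. */
    void
    sort_checkpoint_writes(BufTag *dirty, size_t n)
    {
        qsort(dirty, n, sizeof(BufTag), buftag_cmp);
    }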

Also I think you are overestimating the capabilities of the OS I/O
scheduler. At least on Linux, the I/O scheduler does not see pages on
the dirty list - only pages for which writeback has already been
initiated. In the default configuration this means up to 128 read and
128 write I/Os are considered. The writes are picked by essentially
doing round robin over the files with dirty pages and taking a
clock-sweep scan for a chunk of pages from each. So in reality there
is practically no benefit in having the OS do the reordering, while
flushing a large amount of dirty pages at once does very nasty things
to query latency by overloading all of the I/O queues.
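
If you want to check the request queue depth on a given box, it is
exposed in sysfs; a trivial reader (sda is an assumption, substitute
your device):

    /* Print the block-layer queue depth; with the default of 128
     * the scheduler considers up to 128 read and 128 write
     * requests at a time, per the description above. */
    #include <stdio.h>

    int
    main(void)
    {
        FILE *f = fopen("/sys/block/sda/queue/nr_requests", "r");
        int   nr;

        if (f && fscanf(f, "%d", &nr) == 1)
            printf("nr_requests = %d\n", nr);
        if (f)
            fclose(f);
        return 0;
    }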

> This is why if you take the opposite approach of yours today--go searching
> for workloads where sorting is counterproductive--those are equally easy to
> find.  Any test of write speed I do starts with about 50 different
> scale/client combinations.  Why do I suggest pgbench-tools as a way to do
> performance tests?  It's because an automated sweep of client setups like it
> does is the minimum necessary to create enough variation in workload for
> changing the database's write path.  It's really amazing how often doing
> that shows a proposed change is just shuffling the good and bad cases
> around.  That's been the case for every sorting and fsync delay change
> submitted so far.  I'm not even interested in testing today's submission
> because I tried that particular approach for a few months, twice so far, and
> it fell apart on just as many workloads as it helped.

As you know, running a full suite of write benchmarks takes a very
long time, and the results are often inconclusive (the noise is
greater than the effect we are trying to measure). This is why I'm
interested in which workloads you suspect might fall apart with this
patch - because I can't think of any. The worst case is that the OS
fully absorbs all checkpoint writes before writing anything out, so
the sorting is a useless waste of CPU and memory. The CPU cost here
is on the order of a fraction of a second of CPU time per checkpoint,
basically nothing.
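
To put a number on that: 4GB of dirty 8kB buffers is about 524,288
tags. A throwaway micro-benchmark, reusing the BufTag sketch from
above (numbers will vary by machine, but expect well under a second):

    /* Time qsort over ~524k randomly ordered buffer tags, i.e. the
     * sort cost for a 4GB checkpoint. The tag array itself is only
     * about 6MB, so the memory overhead is negligible too. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int
    main(void)
    {
        size_t  n = 524288;     /* 4GB / 8kB */
        BufTag *tags = malloc(n * sizeof(BufTag));
        clock_t t0;

        for (size_t i = 0; i < n; i++)
        {
            tags[i].relfilenode = rand() % 64;
            tags[i].forknum = 0;
            tags[i].blocknum = rand();
        }
        t0 = clock();
        qsort(tags, n, sizeof(BufTag), buftag_cmp);
        printf("sorted %zu tags in %.3f s\n", n,
               (double) (clock() - t0) / CLOCKS_PER_SEC);
        free(tags);
        return 0;
    }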

>> The checkpointer has the best long term overview of the situation here, OS
>> scheduling only has the short term view of outstanding read and write
>> requests.
>
>
> True only if shared_buffers is large compared to the OS write cache, which
> was not the case on the example I generated with all of a minute's work.  I
> regularly see servers where Linux's "Dirty" area becomes a multiple of the
> dirty buffers written by a checkpoint.  I can usually make that happen at
> will with CLUSTER and VACUUM on big tables.  The idea that the checkpointer
> has a long-term view while the OS has a short one, that presumes a setup
> that I would say is possible but not common.

Because the checkpointer is throttling itself while writing out, it
always has a longer term view than the OS. The OS doesn't know which
pages are coming before PostgreSQL writes them out.
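
The write phase is schematically a paced loop like the following
(hypothetical helper names, not the actual CheckpointWriteDelay
code); the kernel only ever sees the slice already handed over, while
the checkpointer knows the whole list up front:

    #include <stddef.h>
    #include <unistd.h>

    extern void write_one_buffer(size_t i);    /* assumed helpers */
    extern int  ahead_of_schedule(size_t done, size_t total);

    void
    checkpoint_write_phase(size_t num_dirty)
    {
        for (size_t i = 0; i < num_dirty; i++)
        {
            write_one_buffer(i);
            /* Spread the writes over checkpoint_completion_target
             * worth of time by napping whenever we're ahead. */
            if (ahead_of_schedule(i + 1, num_dirty))
                usleep(100 * 1000);     /* 100ms */
        }
    }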

>> kernel settings: dirty_background_bytes = 32M,
>> dirty_bytes = 128M.
>
>
> You disclaimed this as a best case scenario.  It is a low throughput / low
> latency tuning.  That's fine, but if Postgres optimizes itself toward those
> cases it runs the risk of high throughput servers with large caches being
> detuned.  I've posted examples before showing very low write caches like
> this leading to VACUUM running at 1/2 its normal speed or worse, as a simple
> example of where a positive change in one area can backfire badly on another
> workload.  That particular problem was so common I updated pgbench-tools
> recently to track table maintenance time between tests, because that
> demonstrated an issue even when the TPS numbers all looked fine.

Tuning the kernel write cache down like this is obviously a tradeoff.
That it actually helps depends on the fact that Linux is so bad at
scheduling writeback. Sorting checkpoints is different: there is no
workload where it hurts by any measurable amount. I picked the test
case because it makes the benefit obvious and shows that there are
reasonable workloads where sorting does wonders. I have no doubt that
there are other workloads that would benefit a lot, but constructing
such test cases takes a significant amount of time. I have seen many
cases where having this patch in would have made my life a lot
easier.
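
For reference, the write cache bounds from my test setup can be
applied like this (needs root; equivalent to setting the sysctls
vm.dirty_background_bytes and vm.dirty_bytes). The values are the
deliberately aggressive low-latency ones from the test, not a general
recommendation:

    /* Bound the kernel write cache to 32MB background / 128MB hard
     * limit by writing the byte-based knobs under /proc/sys/vm. */
    #include <stdio.h>

    static int
    set_vm_knob(const char *knob, long bytes)
    {
        char  path[128];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/sys/vm/%s", knob);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%ld\n", bytes);
        return fclose(f);
    }

    int
    main(void)
    {
        set_vm_knob("dirty_background_bytes", 32L << 20);
        set_vm_knob("dirty_bytes", 128L << 20);
        return 0;
    }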

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
