Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date
Msg-id 51E58E35.4070500@2ndQuadrant.com
In response to Re: Improvement of checkpoint IO scheduler for stable transaction responses  (Ants Aasma <ants@cybertec.at>)
Responses Re: Improvement of checkpoint IO scheduler for stable transaction responses
List pgsql-hackers
On 7/16/13 12:46 PM, Ants Aasma wrote:

> Spread checkpoints sprinkles the writes out over a long
> period and the general tuning advice is to heavily bound the amount of
> memory the OS is willing to keep dirty.

That's arguing that you can make this feature useful if you tune in a 
particular way.  That's interesting, but the goal here isn't to prove 
the existence of some workload that a change is useful for.  You can 
usually find a test case that validates any performance patch as helpful 
if you search for one.  Everyone who has submitted a sorted checkpoint 
patch, for example, has found some setup where it shows significant 
gains.  We're trying to keep performance stable across a much wider set 
of possibilities though.

Let's talk about default parameters instead, which quickly demonstrates 
where your assumptions fail.  The server I happen to be running pgbench 
tests on today has 72GB of RAM running SL6 with RedHat derived kernel 
2.6.32-358.11.1.  This is a very popular middle-grade server 
configuration nowadays.  There dirty_background_ratio and dirty_ratio 
are 10 (percent).  That means that roughly 7GB of 
RAM can be used for write caching.  Note that this is a fairly low write 
cache tuning compared to a survey of systems in the field--lots of 
people have servers with earlier kernels where these numbers can be as 
high as 20 or even 40% instead.
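
To make that arithmetic concrete, here's a back-of-envelope sketch in 
Python; the 72GB and 10% figures are the ones from this server, and it 
only covers the percentage-based vm.dirty_* tunables:

# Rough estimate of when Linux background writeback kicks in on the
# server described above.  Sketch only; assumes the percentage-based
# vm.dirty_* tunables are the ones in effect.
total_ram_gb = 72
dirty_background_ratio = 10   # percent

writeback_starts_at_gb = total_ram_gb * dirty_background_ratio / 100
print(f"background writeback starts around {writeback_starts_at_gb:.1f} GB dirty")
# -> roughly 7GB, the write cache figure quoted above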

The current feasible tuning for shared_buffers suggests a value of 8GB 
is near the upper limit, beyond which cache-related overhead makes 
increases counterproductive.  Your examples are showing 53% of 
shared_buffers dirty at checkpoint time; that's typical.  The 
checkpointer is then writing out just over 4GB of data.
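
The same kind of rough math gives the database side of that picture:

# 53% of an 8GB shared_buffers dirty at checkpoint time.
shared_buffers_gb = 8
dirty_fraction = 0.53
print(f"checkpointer writes about {shared_buffers_gb * dirty_fraction:.1f} GB")
# -> just over 4GB, versus the ~7GB the kernel is juggling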

With that background, which process here has more data to make decisions with?

-The operating system has 7GB of writes it's trying to optimize.  That 
potentially includes backend, background writer, checkpoint, temp table, 
statistics, log, and WAL data.  The scheduler is also considering read 
operations.

-The checkpointer process has 4GB of writes from rarely written shared 
memory it's trying to optimize.
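
If you want to see both views side by side on a live system, something 
like this works (a sketch only, assuming psycopg2 and the pg_buffercache 
extension are installed; the connection string is just a placeholder):

import psycopg2

def linux_dirty_kb():
    # The kernel's view: total dirty page cache, in kB, from /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])
    return 0

# The database's view: dirty blocks sitting in shared_buffers.
# Requires: CREATE EXTENSION pg_buffercache;
conn = psycopg2.connect("dbname=pgbench")   # placeholder connection string
cur = conn.cursor()
cur.execute("SELECT count(*) * current_setting('block_size')::int / 1024 "
            "FROM pg_buffercache WHERE isdirty")
pg_dirty_kb = cur.fetchone()[0]

print(f"kernel Dirty: {linux_dirty_kb()} kB  dirty shared_buffers: {pg_dirty_kb} kB")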

This is why if you take the opposite of your approach today--go 
searching for workloads where sorting is counterproductive--those are 
equally easy to find.  Any test of write speed I do starts with about 50 
different scale/client combinations.  Why do I suggest pgbench-tools as 
a way to do performance tests?  Because an automated sweep of 
client setups like the one it runs is the minimum necessary to create enough 
variation in workload for changing the database's write path.  It's 
really amazing how often doing that shows a proposed change is just 
shuffling the good and bad cases around.  That's been the case for every 
sorting and fsync delay change submitted so far.  I'm not even 
interested in testing today's submission because I tried that particular 
approach for a few months, twice so far, and it fell apart on just as 
many workloads as it helped.
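
For reference, the kind of scale/client sweep I mean looks roughly like 
this (a stripped-down sketch of what pgbench-tools automates; the scale 
and client values are illustrative, and real runs use much longer 
durations):

import subprocess

scales = [10, 100, 1000]          # illustrative; pgbench-tools covers more
clients = [1, 4, 16, 64]

for scale in scales:
    # Rebuild the test database at this scale.
    subprocess.run(["pgbench", "-i", "-s", str(scale), "pgbench"], check=True)
    for c in clients:
        # Short timed run per client count; collect the TPS output.
        result = subprocess.run(
            ["pgbench", "-c", str(c), "-j", str(min(c, 8)), "-T", "60", "pgbench"],
            capture_output=True, text=True, check=True)
        print(f"scale={scale} clients={c}")
        print(result.stdout)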

> The checkpointer has the best long term overview of the situation here, OS
> scheduling only has the short term view of outstanding read and write
> requests.

True only if shared_buffers is large compared to the OS write cache, 
which was not the case in the example I generated with all of a minute's 
work.  I regularly see servers where Linux's "Dirty" area becomes a 
multiple of the dirty buffers written by a checkpoint.  I can usually 
make that happen at will with CLUSTER and VACUUM on big tables.  The 
idea that the checkpointer has a long-term view while the OS has a 
short-term one presumes a setup that I would say is possible but not common.

> kernel settings: dirty_background_bytes = 32M,
> dirty_bytes = 128M.

You disclaimed this as a best-case scenario.  It is a low-throughput / 
low-latency tuning.  That's fine, but if Postgres optimizes itself 
toward those cases it runs the risk of high-throughput servers with 
large caches being detuned.  I've posted examples before showing very 
low write caches like this leading to VACUUM running at 1/2 its normal 
speed or worse, as a simple example of where a positive change in one 
area can backfire badly on another workload.  That particular problem 
was so common I updated pgbench-tools recently to track table 
maintenance time between tests, because that demonstrated an issue even 
when the TPS numbers all looked fine.
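
The tracking itself doesn't have to be fancy; the core of it is just 
timing the maintenance step between runs, something like this (a sketch, 
not the actual pgbench-tools code; assumes psycopg2 and the standard 
pgbench tables):

import time
import psycopg2

conn = psycopg2.connect("dbname=pgbench")   # placeholder connection string
conn.autocommit = True                      # VACUUM can't run inside a transaction block
cur = conn.cursor()

start = time.time()
cur.execute("VACUUM ANALYZE pgbench_accounts")
print(f"vacuum took {time.time() - start:.1f} seconds")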

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


