
From Greg Smith
Subject Re: Checkpoint sync pause
Msg-id 4F14D5EC.60204@2ndQuadrant.com
In response to Re: Checkpoint sync pause  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Checkpoint sync pause  (Josh Berkus <josh@agliodbs.com>)
Re: Checkpoint sync pause  (Robert Haas <robertmhaas@gmail.com>)
Re: Checkpoint sync pause  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On 01/16/2012 11:00 AM, Robert Haas wrote:
> Also, I am still struggling with what the right benchmarking 
> methodology even is to judge whether
> any patch in this area "works".  Can you provide more details about
> your test setup?

The "test" setup is a production server with a few hundred users at peak 
workload, reading and writing to the database.  Each RAID controller 
(there are a couple of them, each with its own tablespaces) has either 
512MB or 1GB of battery-backed write cache.  The setup that leads to the bad situation 
happens like this:

-The steady stream of backend writes that happens between checkpoints 
has filled up most of the OS write cache.  A look at /proc/meminfo 
shows around 2.5GB "Dirty:".

-Since we have shared_buffers set to 512MB to try to keep checkpoint 
storms from being too bad, there might be 300MB of dirty pages involved 
in the checkpoint.  The write phase dumps this all into Linux's cache.  
There's now closer to 3GB of dirty data there.  With 64GB of RAM, this is 
still only 4.7% though--just below the effective lower range for 
dirty_background_ratio.  Linux is perfectly content to let it all sit there.

-Sync phase begins.  Between absorption and the new checkpoint writes, 
there are >300 segments to sync waiting here.

-The first few syncs force data out of Linux's cache and into the BBWC.  
Some of these return almost instantly.  Others block for a moderate 
number of seconds.  That's not necessarily a showstopper, on XFS at 
least.  So long as the checkpointer is not being given all of the I/O in 
the system, the fact that it's stuck waiting for a sync doesn't mean the 
server is unresponsive to the needs of other backends.  Early data might 
look like this:

DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec

[Here 'gap' is a precise measurement of how closely the sync pause feature 
is tracking its target, with it set to 3 seconds.  This is from an earlier 
version of this patch.  All the timing issues I used to measure went away 
in the current implementation because it doesn't have to worry about doing 
background writer LRU work anymore, now that the checkpointer has been 
split out.]

But after a few hundred of these, every downstream cache is filled up.  
The result is some really ugly sync times, like #164 here:

DEBUG:  Sync #160 time=1147.386000 gap=2801.047000 msec
DEBUG:  Sync #161 time=0.004000 gap=4075.115000 msec
DEBUG:  Sync #162 time=0.005000 gap=2943.966000 msec
DEBUG:  Sync #163 time=962.769000 gap=3003.906000 msec
DEBUG:  Sync #164 time=45125.991000 gap=3033.228000 msec
DEBUG:  Sync #165 time=4.031000 gap=2818.013000 msec
DEBUG:  Sync #166 time=212.537000 gap=3039.979000 msec
DEBUG:  Sync #167 time=0.005000 gap=2820.023000 msec
...
DEBUG:  Sync #355 time=2.550000 gap=2806.425000 msec
LOG:  Sync 355 files longest=45125.991000 msec average=1276.177977 msec
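
For reference, that summary line is easy to reproduce from the raw DEBUG 
output.  Here's a rough sketch in Python; the message format is just 
assumed to match what this instrumented build prints:

import re, sys

pattern = re.compile(r"Sync #(\d+) time=([\d.]+) gap=([\d.]+) msec")
times = [float(m.group(2)) for m in map(pattern.search, sys.stdin) if m]

if times:
    slow = sum(t > 1000.0 for t in times)
    print("%d syncs, longest=%.3f msec, average=%.3f msec, %d over 1 second"
          % (len(times), max(times), sum(times) / len(times), slow))

Run it as e.g. "python sync-summary.py < postgresql.log" against a log 
containing those DEBUG lines.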

During the 45 second window while #164 is happening, a pile of clients 
get stuck, unable to do any I/O.  The RAID 
controller that used to have a useful mix of data is now completely 
filled with >=512MB of random writes.  It's now failing to write as fast 
as new data is coming in.  Eventually that leads to pressure building up 
in Linux's cache.  Now you're in the bad place:  dirty_background_ratio 
is crossed, Linux is now worried about spooling all cached writes to 
disk as fast as it can, the checkpointer is sync'ing its own important 
data to disk as fast as it can too, and all caches are inefficient 
because they're full.

To recreate a scenario like this, I've realized the benchmark needs to 
have a couple of characteristics:

-It has to focus on transaction latency instead of throughput.  We know 
that doing syncs more often will lower throughput due to reduced 
reordering etc.

-It cannot run at maximum possible speed all the time.  The system needs 
to keep up with the load the rest of the time, but the sync phase of 
checkpoints must cause I/O to queue faster than it drains, thus 
saturating all caches and then blocking backends.  
Ideally, "Dirty:" in /proc/meminfo will reach >90% of the 
dirty_background_ratio trigger line around the same time the sync phase 
starts (a way to check that is sketched after this list).

-There should be a lot of clients doing a mix of work.  The way Linux 
I/O works, the scheduling for readers vs. writers is complicated, and 
this is one of the few areas where things like CFQ vs. Deadline matter.
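
Since watching that trigger line comes up repeatedly here, a minimal 
sketch of the check, using the same MemTotal-based approximation as the 
4.7%-of-64GB figure above (the kernel's real notion of "dirtyable" memory 
is more involved, and this assumes dirty_background_ratio rather than 
dirty_background_bytes is in use):

def meminfo_kb(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])      # values reported in kB
    raise KeyError(field)

with open("/proc/sys/vm/dirty_background_ratio") as f:
    bg_ratio = int(f.read())                     # a percentage

dirty_pct = 100.0 * meminfo_kb("Dirty") / meminfo_kb("MemTotal")
print("Dirty: %.2f%% of RAM; dirty_background_ratio=%d%%; %.0f%% of trigger"
      % (dirty_pct, bg_ratio, 100.0 * dirty_pct / bg_ratio))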

I've realized now one reason I never got anywhere with this while 
running pgbench tests is that pgbench always runs at 100% of capacity.  
It fills all the caches involved completely as fast as it can, and every 
checkpoint starts with them already filled to capacity.  So when latency 
gets bad at checkpoint time, no amount of clever reordering will help 
keep those writes from interfering with other processes.  There just 
isn't any room left to work with.

What I think is needed instead is a write-heavy benchmark with a think 
time in it, so that we can dial the workload up to, say, 90% of I/O 
capacity, but that spikes to 100% when checkpoint sync happens.  Then 
rearrangements in syncing that reduce caching pressure should be 
visible as a latency reduction in client response times.  My guess is 
that dbt-2 can be configured to provide such a workload, and I don't see 
a way forward here except for me to fully embrace that and start over 
with it.
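
To make that concrete, here is a rough, hypothetical sketch of that sort 
of client loop--not an actual dbt-2 configuration--where the connection 
string, table, client count, and think time are all placeholders to be 
tuned until steady-state load sits near 90% of I/O capacity:

import random, statistics, threading, time
import psycopg2   # any driver works; psycopg2 is just an assumption

DSN = "dbname=bench"      # placeholder connection string
CLIENTS = 50
THINK_TIME_S = 0.2        # tune until steady load is ~90% of I/O capacity
RUN_SECONDS = 300

latencies, lock = [], threading.Lock()

def worker():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    deadline = time.time() + RUN_SECONDS
    while time.time() < deadline:
        start = time.time()
        # Placeholder write transaction; the commit forces a WAL flush.
        cur.execute("UPDATE accounts SET abalance = abalance + 1 "
                    "WHERE aid = %s", (random.randint(1, 100000),))
        conn.commit()
        with lock:
            latencies.append(time.time() - start)
        time.sleep(random.expovariate(1.0 / THINK_TIME_S))   # think time
    conn.close()

threads = [threading.Thread(target=worker) for _ in range(CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

latencies.sort()
print("p50=%.1f ms  p99=%.1f ms  max=%.1f ms" % (
    1000 * statistics.median(latencies),
    1000 * latencies[int(0.99 * (len(latencies) - 1))],
    1000 * latencies[-1]))

The interesting output is the latency tail during checkpoints, not the 
transaction count.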

> Just one random thought: I wonder if it would make sense to cap the
> delay after each sync to the time spent performing that sync.  That
> would make the tuning of the delay less sensitive to the total number
> of files, because we won't unnecessarily wait after each sync when
> they're not actually taking any time to complete.

This is one of the attractive ideas in this area that didn't work out so 
well when tested.  The problem is that writes into a battery-backed 
write cache will show zero latency for some time until the cache is 
filled...and then you're done.  You have to pause anyway, even though it 
seems write speed is massive, to give the cache some time to drain to 
disk between syncs that push data toward it.  Even though it absorbed 
your previous write with no delay, that doesn't mean it takes no time to 
process that write.  With proper write caching, that processing is just 
happening asynchronously.
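
A toy model makes that failure mode easier to see.  This is just a 
sketch with made-up numbers--cache size, drain rate, and per-sync volume 
are all invented--but it shows how capping the pause at the observed sync 
time lets the cache stay pinned full, while a fixed pause keeps draining it:

CACHE_MB = 512        # controller write cache size (invented)
DRAIN_MB_PER_S = 50   # rate the cache empties to disk (invented)
SYNC_MB = 16          # dirty data each fsync pushes at the cache (invented)
PAUSE_S = 3.0         # configured sync pause

def run(n_syncs, pause_for):
    cached, blocked, worst = 0.0, 0, 0.0
    for _ in range(n_syncs):
        # Writes that fit in the cache "complete" instantly; once it's
        # full, the fsync blocks until enough has drained to disk.
        overflow = max(0.0, cached + SYNC_MB - CACHE_MB)
        sync_s = overflow / DRAIN_MB_PER_S
        cached = min(cached + SYNC_MB, CACHE_MB)
        blocked += sync_s > 0
        worst = max(worst, sync_s)
        # The cache keeps draining during the sync and during the pause.
        cached = max(0.0, cached - DRAIN_MB_PER_S * (sync_s + pause_for(sync_s)))
    return blocked, worst

print("pause capped at sync time:", run(300, lambda t: min(PAUSE_S, t)))
print("fixed pause every time:   ", run(300, lambda t: PAUSE_S))

With the capped policy the pause collapses to zero while the cache is 
still absorbing writes, so the cache fills after a few dozen syncs and a 
steady fraction of the later ones block; with the fixed pause it drains 
between syncs and never fills.  The real-world penalty is much worse than 
the toy suggests, because a full cache also slows every other backend's I/O.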

This is related to another observation about what went wrong when we 
tried deploying my fully auto-tuning sync spread patch in production.  
If the sync phase of the checkpoint starts to fall behind, and you've 
configured for a sync pause, you have to just suck that up and accept 
you'll finish late[1].  When you do get into the situation where the 
cache is completely filled, writes will slow dramatically.  In the above 
log example, sync #164 taking 45 seconds means that #165 will surely be 
considered behind schedule now.  If you use that feedback to then reduce 
the sync pause, feeling that you are behind schedule and cannot afford 
to pause anymore, now you've degenerated right back to the original 
troubled behavior:  sync calls issued as fast as the OS will accept them, 
with no delay between them.

[1] Where I think I'm going to end up with this eventually now is that 
setting checkpoint_sync_pause is the important tunable.  The parameter 
that then gets auto-tuned is checkpoint_timeout.  If you have 300 
relations to sync and you have to wait 10 seconds between syncs to get 
latency down, the server is going to inform you that an hour between 
checkpoints is all you can do here.
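
The arithmetic behind that, as a sanity check on the idea:

files_to_sync = 300
checkpoint_sync_pause_s = 10
pause_floor_min = files_to_sync * checkpoint_sync_pause_s / 60.0
print("sync pauses alone take at least %.0f minutes" % pause_floor_min)  # 50
# Add the write phase (and its spread target) on top of that and the
# checkpoint interval the server can actually honor lands near an hour.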

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


