Checkpoint sync pause

From
Greg Smith
Date:
Last year at this point, I submitted an increasingly complicated
checkpoint sync spreading feature.  I wasn't able to prove any
repeatable drop in sync time latency from those patches.  While that was
going on, and continuing into recently, the production server that
started all this with its sync time latency issues didn't stop having
that problem.  Data collection continued, new patches were tried.

There was a really simple triage step Simon and I made before getting
into the complicated ones:  just delay for a few seconds between every
single sync call made during a checkpoint.  That approach is still
hanging around that server's patched PostgreSQL package set, and it
still works better than anything more complicated we've tried so far.
The recent split of background writer and checkpointer makes that whole
thing even easier to do without rippling out to have unexpected
consequences.

In order to tune this usefully, you need to know how many files a
typical checkpoint syncs.  That could be
available without needing log scraping using the "Publish checkpoint
timing and sync files summary data to pg_stat_bgwriter" addition I just
submitted.  People who set this new checkpoint_sync_pause value too high
can face checkpoints running over schedule, but you can measure how bad
your exposure is with the new view information.

I owe the community a lot of data to prove this is useful before I'd
expect it to be taken seriously.  I was planning to leave this whole
area alone until 9.3.  But since recent submissions may pull me back
into trying various ways of rearranging the write path for 9.2, I wanted
to have my own miniature horse in that race.  It works simply:

...
2012-01-16 02:39:01.184 EST [25052]: DEBUG:  checkpoint sync: number=34
file=base/16385/11766 time=0.006 msec
2012-01-16 02:39:01.184 EST [25052]: DEBUG:  checkpoint sync delay:
seconds left=3
2012-01-16 02:39:01.284 EST [25052]: DEBUG:  checkpoint sync delay:
seconds left=2
2012-01-16 02:39:01.385 EST [25052]: DEBUG:  checkpoint sync delay:
seconds left=1
2012-01-16 02:39:01.860 EST [25052]: DEBUG:  checkpoint sync: number=35
file=global/12007 time=375.710 msec
2012-01-16 02:39:01.860 EST [25052]: DEBUG:  checkpoint sync delay:
seconds left=3
2012-01-16 02:39:01.961 EST [25052]: DEBUG:  checkpoint sync delay:
seconds left=2
2012-01-16 02:39:02.061 EST [25052]: DEBUG:  checkpoint sync delay:
seconds left=1
2012-01-16 02:39:02.161 EST [25052]: DEBUG:  checkpoint sync: number=36
file=base/16385/11754 time=0.008 msec
2012-01-16 02:39:02.555 EST [25052]: LOG:  checkpoint complete: wrote
2586 buffers (63.1%); 1 transaction log file(s) added, 0 removed, 0
recycled; write=2.422 s, sync=13.282 s, total=16.123 s; sync files=36,
longest=1.085 s, average=0.040 s
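In rough Python pseudocode, the behavior shown in that log looks like this (a sketch only, not the actual C implementation; `fsync_file`, `log`, and the injectable `sleep` are illustrative stand-ins):

```python
import time

def sync_files_with_pause(files, pause_seconds, fsync_file, log,
                          sleep=time.sleep):
    # After each file is synced, count the configured pause down one
    # second at a time, matching the DEBUG output shown above.
    for number, path in enumerate(files, start=1):
        start = time.monotonic()
        fsync_file(path)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        log("checkpoint sync: number=%d file=%s time=%.3f msec"
            % (number, path, elapsed_ms))
        for left in range(pause_seconds, 0, -1):
            log("checkpoint sync delay: seconds left=%d" % left)
            sleep(1)
```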

No docs yet, really need a better guide to tuning checkpoints as they
exist now before there's a place to attach a discussion of this to.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Checkpoint sync pause

From
Robert Haas
Date:
On Mon, Jan 16, 2012 at 2:57 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> ...
> 2012-01-16 02:39:01.184 EST [25052]: DEBUG:  checkpoint sync: number=34
> file=base/16385/11766 time=0.006 msec
> 2012-01-16 02:39:01.184 EST [25052]: DEBUG:  checkpoint sync delay: seconds
> left=3
> 2012-01-16 02:39:01.284 EST [25052]: DEBUG:  checkpoint sync delay: seconds
> left=2
> 2012-01-16 02:39:01.385 EST [25052]: DEBUG:  checkpoint sync delay: seconds
> left=1
> 2012-01-16 02:39:01.860 EST [25052]: DEBUG:  checkpoint sync: number=35
> file=global/12007 time=375.710 msec
> 2012-01-16 02:39:01.860 EST [25052]: DEBUG:  checkpoint sync delay: seconds
> left=3
> 2012-01-16 02:39:01.961 EST [25052]: DEBUG:  checkpoint sync delay: seconds
> left=2
> 2012-01-16 02:39:02.061 EST [25052]: DEBUG:  checkpoint sync delay: seconds
> left=1
> 2012-01-16 02:39:02.161 EST [25052]: DEBUG:  checkpoint sync: number=36
> file=base/16385/11754 time=0.008 msec
> 2012-01-16 02:39:02.555 EST [25052]: LOG:  checkpoint complete: wrote 2586
> buffers (63.1%); 1 transaction log file(s) added, 0 removed, 0 recycled;
> write=2.422 s, sync=13.282 s, total=16.123 s; sync files=36, longest=1.085
> s, average=0.040 s
>
> No docs yet, really need a better guide to tuning checkpoints as they exist
> now before there's a place to attach a discussion of this to.

Yeah, I think this is an area where a really good documentation patch
might help more users than any code we could write.  On the technical
end, I dislike this a little bit because the parameter is clearly
something some people are going to want to set, but it's not at all
clear what value they should set it to and it has complex interactions
with the other checkpoint settings - and the user's hardware
configuration.  If there's no way to make it more self-tuning, then
perhaps we should just live with that, but it would be nice to come up
with something more user-transparent.  Also, I am still struggling
with what the right benchmarking methodology even is to judge whether
any patch in this area "works".  Can you provide more details about
your test setup?

Just one random thought: I wonder if it would make sense to cap the
delay after each sync to the time spent performing that sync.  That
would make the tuning of the delay less sensitive to the total number
of files, because we won't unnecessarily wait after each sync when
they're not actually taking any time to complete.  It's probably
easier to estimate the number of segments that are likely to contain
lots of dirty data than to estimate the total number of segments that
you might have touched at least once since the last checkpoint, and
there's no particular reason to think the latter is really what you
should be tuning on anyway.
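Stated as a (hypothetical) helper, just to make the proposed cap concrete:

```python
def capped_pause(last_sync_seconds, configured_pause_seconds):
    # Never sleep longer than the preceding sync call took, so files
    # that sync almost instantly don't accumulate pointless full-length
    # waits after each one.
    return min(configured_pause_seconds, last_sync_seconds)
```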

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Checkpoint sync pause

From
Greg Smith
Date:
On 01/16/2012 11:00 AM, Robert Haas wrote:
> Also, I am still struggling with what the right benchmarking 
> methodology even is to judge whether
> any patch in this area "works".  Can you provide more details about
> your test setup?

The "test" setup is a production server with a few hundred users at peak 
workload, reading and writing to the database.  Each RAID controller 
(couple of them with their own tablespaces) has either 512MB or 1GB of 
battery-backed write cache.  The setup that leads to the bad situation 
happens like this:

-The steady stream of backend writes that happen between checkpoints 
have filled up most of the OS write cache.  A look at /proc/meminfo 
shows around 2.5GB "Dirty:"

-Since we have shared_buffers set to 512MB to try and keep checkpoint 
storms from being too bad, there might be 300MB of dirty pages involved 
in the checkpoint.  The write phase dumps this all into Linux's cache.  
There's now closer to 3GB of dirty data there.  @64GB of RAM, this is 
still only 4.7% though--just below the effective lower range for 
dirty_background_ratio.  Linux is perfectly content to let it all sit there.
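The arithmetic behind that 4.7% figure, as a quick check:

```python
dirty_bytes = 3 * 1024**3   # ~3GB of dirty data after the write phase
ram_bytes = 64 * 1024**3    # 64GB of RAM
dirty_pct = 100.0 * dirty_bytes / ram_bytes
# 3/64 is ~4.7%, just under the effective 5% floor of
# dirty_background_ratio, so Linux starts no background writeback yet.
```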

-Sync phase begins.  Between absorption and the new checkpoint writes, 
there are >300 segments to sync waiting here.

-The first few syncs force data out of Linux's cache and into the BBWC.  
Some of these return almost instantly.  Others block for a moderate 
number of seconds.  That's not necessarily a showstopper, on XFS at 
least.  So long as the checkpointer is not being given all of the I/O in 
the system, the fact that it's stuck waiting for a sync doesn't mean the 
server is unresponsive to the needs of other backends.  Early data might 
look like this:

DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec

[Here 'gap' is a precise measurement of how close the sync pause feature 
is working, with it set to 3 seconds.  This is from an earlier version 
of this patch.  All the timing issues I used to measure went away in the 
current implementation because it doesn't have to worry about doing 
background writer LRU work anymore, with the checkpointer split out]

But after a few hundred of these, every downstream cache is filled up.  
The result is seeing more really ugly sync times, like #164 here:

DEBUG:  Sync #160 time=1147.386000 gap=2801.047000 msec
DEBUG:  Sync #161 time=0.004000 gap=4075.115000 msec
DEBUG:  Sync #162 time=0.005000 gap=2943.966000 msec
DEBUG:  Sync #163 time=962.769000 gap=3003.906000 msec
DEBUG:  Sync #164 time=45125.991000 gap=3033.228000 msec
DEBUG:  Sync #165 time=4.031000 gap=2818.013000 msec
DEBUG:  Sync #166 time=212.537000 gap=3039.979000 msec
DEBUG:  Sync #167 time=0.005000 gap=2820.023000 msec
...
DEBUG:  Sync #355 time=2.550000 gap=2806.425000 msec
LOG:  Sync 355 files longest=45125.991000 msec average=1276.177977 msec
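As an aside, the longest/average figures in that final LOG line can be recomputed from the per-sync DEBUG lines; a small Python sketch (the regex assumes the exact format shown above):

```python
import re
from statistics import fmean

def summarize_syncs(lines):
    # Pull the per-sync times (msec) out of the DEBUG lines and
    # recompute the longest/average summary the final LOG line reports.
    times = [float(m.group(1)) for line in lines
             if (m := re.search(r"Sync #\d+ time=([\d.]+) ", line))]
    return max(times), fmean(times)
```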

At the same time #164 is happening, that 45 second long window, a pile 
of clients will get stuck where they can't do any I/O.  The RAID 
controller that used to have a useful mix of data is now completely 
filled with >=512MB of random writes.  It's now failing to write as fast 
as new data is coming in.  Eventually that leads to pressure building up 
in Linux's cache.  Now you're in the bad place:  dirty_background_ratio 
is crossed, Linux is now worried about spooling all cached writes to 
disk as fast as it can, the checkpointer is sync'ing its own important 
data to disk as fast as it can too, and all caches are inefficient 
because they're full.

To recreate a scenario like this, I've realized the benchmark needs to 
have a couple of characteristics:

-It has to focus on transaction latency instead of throughput.  We know 
that doing syncs more often will lower throughput due to reduced 
reordering etc.

-It cannot run at maximum possible speed all the time.  It needs to be 
the case that the system keeps up with the load during the rest of the 
time, but the sync phase of checkpoints causes I/O to queue faster than 
it's draining, thus saturating all caches and then blocking backends.  
Ideally, "Dirty:" in /proc/meminfo will reach >90% of the 
dirty_background_ratio trigger line around the same time the sync phase 
starts.

-There should be a lot of clients doing a mix of work.  The way Linux 
I/O works, the scheduling for readers vs. writers is complicated, and 
this is one of the few areas where things like CFQ vs. Deadline matter.
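The "Dirty: approaching the dirty_background_ratio trigger" condition above can be watched from /proc/meminfo; a sketch (field values in that file are in kB; the function name is mine):

```python
def dirty_fraction_of_trigger(meminfo_text, dirty_background_ratio):
    # Returns how close "Dirty:" is to the dirty_background_ratio
    # writeback trigger: 1.0 means background writeback is about to
    # kick in.  meminfo_text is the contents of /proc/meminfo.
    kb = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            kb[name] = int(parts[0])  # values are reported in kB
    trigger_kb = kb["MemTotal"] * dirty_background_ratio / 100.0
    return kb["Dirty"] / trigger_kb
```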

I've realized now one reason I never got anywhere with this while 
running pgbench tests is that pgbench always runs at 100% of capacity.  
It fills all the caches involved completely as fast as it can, and every 
checkpoint starts with them already filled to capacity.  So when latency 
gets bad at checkpoint time, no amount of clever reordering will help 
keep those writes from interfering with other processes.  There just 
isn't any room to work with left.

What I think is needed instead is a write-heavy benchmark with a think 
time in it, so that we can dial the workload up to, say, 90% of I/O 
capacity, but that spikes to 100% when checkpoint sync happens.  Then 
rearrangements in syncing that reduce caching pressure should be 
visible as a latency reduction in client response times.  My guess is 
that dbt-2 can be configured to provide such a workload, and I don't see 
a way forward here except for me to fully embrace that and start over 
with it.

> Just one random thought: I wonder if it would make sense to cap the
> delay after each sync to the time spent performing that sync.  That
> would make the tuning of the delay less sensitive to the total number
> of files, because we won't unnecessarily wait after each sync when
> they're not actually taking any time to complete.

This is one of the attractive ideas in this area that didn't work out so 
well when tested.  The problem is that writes into a battery-backed 
write cache will show zero latency for some time until the cache is 
filled...and then you're done.  You have to pause anyway, even though it 
seems write speed is massive, to give the cache some time to drain to 
disk between syncs that push data toward it.  Even though it absorbed 
your previous write with no delay, that doesn't mean it takes no time to 
process that write.  With proper write caching, that processing is just 
happening asynchronously.

This is related to another observation, noting what went wrong when we 
tried deploying my fully auto-tuning sync spread patch onto production.  
If the sync phase of the checkpoint starts to fall behind, and you've 
configured for a sync pause, you have to just suck that up and accept 
you'll finish late[1].  When you do get into the situation where the 
cache is completely filled, writes will slow dramatically.  In the above 
log example, sync #164 taking 45 seconds means that #165 will surely be 
considered behind schedule now.  If you use that feedback to then reduce 
the sync pause, feeling that you are behind schedule and cannot afford 
to pause anymore, now you've degenerated right back to the original 
troubled behavior:  sync calls, as fast as they can be accepted by the 
OS, no delay between them.

[1] Where I think I'm going to end up with this eventually now is that 
setting checkpoint_sync_pause is the important tunable.  The parameter 
that then gets auto-tuned is checkpoint_timeout.  If you have 300 
relations to sync and you have to wait 10 seconds between syncs to get 
latency down, the server is going to inform you an hour between 
checkpoints is all you can do here.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Checkpoint sync pause

From
Josh Berkus
Date:
On 1/16/12 5:59 PM, Greg Smith wrote:
> 
> What I think is needed instead is a write-heavy benchmark with a think
> time in it, so that we can dial the workload up to, say, 90% of I/O
> capacity, but that spikes to 100% when checkpoint sync happens.  Then
> rearrangements in syncing that reduces caching pressure should be
> visible as a latency reduction in client response times.  My guess is
> that dbt-2 can be configured to provide such a workload, and I don't see
> a way forward here except for me to fully embrace that and start over
> with it.

You can do this with custom pgbench workloads, thanks to random and
sleep functions.  Somebody went and made pgbench programmable; I don't
remember who.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Checkpoint sync pause

From
Robert Haas
Date:
On Mon, Jan 16, 2012 at 8:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> [ interesting description of problem scenario and necessary conditions for reproducing it ]

This is about what I thought was happening, but I'm still not quite
sure how to recreate it in the lab.

Have you had a chance to test whether Linux 3.2 does any better in this
area?  As I understand it, it doesn't do anything particularly
interesting about the willingness of the kernel to cache gigantic
amounts of dirty data, but (1) supposedly it does a better job not
yanking the disk head around by just putting foreground processes to
sleep while writes happen in the background, rather than having the
foreground processes compete with the background writer for control of
the disk head; and (2) instead of having a sharp edge where background
writing kicks in, it tries to gradually ratchet up the pressure to get
things written out.

Somehow I can't shake the feeling that this is fundamentally a Linux
problem, and that it's going to be nearly impossible to work around in
user space without some help from the kernel.  I guess in some sense
it's reasonable that calling fsync() blasts the data at the platter at
top speed, but if that leads to starving everyone else on the system
then it starts to seem a lot less reasonable: part of the kernel's job
is to guarantee all processes fair access to shared resources, and if
it doesn't do that, we're always going to be playing catch-up.

>> Just one random thought: I wonder if it would make sense to cap the
>> delay after each sync to the time spent performing that sync.  That
>> would make the tuning of the delay less sensitive to the total number
>> of files, because we won't unnecessarily wait after each sync when
>> they're not actually taking any time to complete.
>
> This is one of the attractive ideas in this area that didn't work out so
> well when tested.  The problem is that writes into a battery-backed write
> cache will show zero latency for some time until the cache is filled...and
> then you're done.  You have to pause anyway, even though it seems write
> speed is massive, to give the cache some time to drain to disk between syncs
> that push data toward it.  Even though it absorbed your previous write with
> no delay, that doesn't mean it takes no time to process that write.  With
> proper write caching, that processing is just happening asynchronously.

Hmm, OK.  Well, to borrow a page from one of your other ideas, how
about keeping track of the number of fsync requests queued for each
file, and make the delay proportional to that number?  We might have
written the same block more than once, so it could be an overestimate,
but it rubs me the wrong way to think that a checkpoint is going to
finish late because somebody ran a CREATE TABLE statement that touched
5 or 6 catalogs, and now we've got to pause for 15-18 seconds because
they've each got one dirty block.  :-(
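One way to make that concrete (hypothetical helper; the per-request cost would itself need tuning):

```python
def per_file_pause(queued_fsync_requests, seconds_per_request, max_pause):
    # Scale the pause with the number of fsync requests queued for the
    # file (an overestimate when the same block was written more than
    # once), capped, so a catalog file with one dirty block doesn't
    # earn a full multi-second pause.
    return min(queued_fsync_requests * seconds_per_request, max_pause)
```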

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Checkpoint sync pause

From
Jeff Janes
Date:
On Mon, Jan 16, 2012 at 5:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 01/16/2012 11:00 AM, Robert Haas wrote:
>>
>> Also, I am still struggling with what the right benchmarking methodology
>> even is to judge whether
>> any patch in this area "works".  Can you provide more details about
>> your test setup?
>
>
> The "test" setup is a production server with a few hundred users at peak
> workload, reading and writing to the database.  Each RAID controller (couple
> of them with their own tablespaces) has either 512MB or 1GB of
> battery-backed write cache.  The setup that leads to the bad situation
> happens like this:
>
> -The steady stream of backend writes that happen between checkpoints have
> filled up most of the OS write cache.  A look at /proc/meminfo shows around
> 2.5GB "Dirty:"

"backend writes" includes bgwriter writes, right?

>
> -Since we have shared_buffers set to 512MB to try and keep checkpoint storms
> from being too bad, there might be 300MB of dirty pages involved in the
> checkpoint.  The write phase dumps this all into Linux's cache.  There's now
> closer to 3GB of dirty data there.  @64GB of RAM, this is still only 4.7%
> though--just below the effective lower range for dirty_background_ratio.

Has using a newer kernel with dirty_background_bytes been tried, so it
could be set to a lower level?  If so, how did it do?  Or does it just
refuse to obey below the 5% level, as well?

>  Linux is perfectly content to let it all sit there.
>
> -Sync phase begins.  Between absorption and the new checkpoint writes, there
> are >300 segments to sync waiting here.

If there is 3GB of dirty data spread over >300 segments and each
segment is about full-sized (1GB), then on average <1% of each segment
is dirty?

If that analysis holds, then it seems like there is simply an awful lot
of data that has to be written randomly, no matter how clever the
re-ordering is.  In other words, it is not that a harried or panicked
kernel or RAID controller is failing to do good re-ordering when it has
opportunities to; it is just that you dirty your data too randomly for
substantial reordering to be possible even under ideal conditions.
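The back-of-envelope behind that estimate:

```python
dirty_gb = 3.0       # dirty data involved in the checkpoint
segments = 300       # segments with at least one pending fsync
fraction_dirty = dirty_gb / segments   # per full-sized 1GB segment
# ~0.01: on average about 1% of each segment is dirty, i.e. scattered
# 8K pages, leaving little room for write reordering to help.
```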

Does the BBWC, once given an fsync command and reporting success,
write out those blocks forthwith, or does it lolly-gag around like the
kernel (under non-fsync) does?  If it is waiting around for
write-combining opportunities that will never actually materialize in
sufficient quantities to make up for the wait, how to get it to stop?

Was the sorted checkpoint with an fsync after every file (real file,
not VFD) one of the changes you tried?

> -The first few syncs force data out of Linux's cache and into the BBWC.
>  Some of these return almost instantly.  Others block for a moderate number
> of seconds.  That's not necessarily a showstopper, on XFS at least.  So long
> as the checkpointer is not being given all of the I/O in the system, the
> fact that it's stuck waiting for a sync doesn't mean the server is
> unresponsive to the needs of other backends.  Early data might look like
> this:
>
> DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
> DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
> DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
> DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
> DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
> DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
> DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
> DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
> DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
> DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec

Syncs 3 and 5 kind of surprise me.  It seems like the times should be
more bimodal.  If the cache is already full, why doesn't the system
promptly collapse, like it does later?  And if it is not, why would it
take 12 seconds, or even 2 seconds?  Or is this just evidence that the
gaps you are inserting are partially, but not completely, effective?

>
> [Here 'gap' is a precise measurement of how close the sync pause feature is
> working, with it set to 3 seconds.  This is from an earlier version of this
> patch.  All the timing issues I used to measure went away in the current
> implementation because it doesn't have to worry about doing background
> writer LRU work anymore, with the checkpointer split out]
>
> But after a few hundred of these, every downstream cache is filled up.  The
> result is seeing more really ugly sync times, like #164 here:
>
> DEBUG:  Sync #160 time=1147.386000 gap=2801.047000 msec
> DEBUG:  Sync #161 time=0.004000 gap=4075.115000 msec
> DEBUG:  Sync #162 time=0.005000 gap=2943.966000 msec
> DEBUG:  Sync #163 time=962.769000 gap=3003.906000 msec
> DEBUG:  Sync #164 time=45125.991000 gap=3033.228000 msec
> DEBUG:  Sync #165 time=4.031000 gap=2818.013000 msec
> DEBUG:  Sync #166 time=212.537000 gap=3039.979000 msec
> DEBUG:  Sync #167 time=0.005000 gap=2820.023000 msec
> ...
> DEBUG:  Sync #355 time=2.550000 gap=2806.425000 msec
> LOG:  Sync 355 files longest=45125.991000 msec average=1276.177977 msec
>
> At the same time #164 is happening, that 45 second long window, a pile of
> clients will get stuck where they can't do any I/O.

What I/O are they trying to do?  It seems like all your data is in RAM
(if not, I'm surprised you can get queries to run fast enough to
create this much dirty data).  So they probably aren't blocking on
reads which are being interfered with by all the attempted writes.
Your WAL is probably on a separate controller which is not impacted
(unless the impaction is at the kernel level).  What other writes
would there be?  The only major ones I can think of is backend writes
of shared_buffers, or writes of CLOG buffers because you can't read in
a new one until an existing one is cleaned.

The current shared_buffer allocation method (or my misunderstanding of
it) reminds me of the joke about the guy who walks into his kitchen
with a cow-pie in his hand and tells his wife "Look what I almost
stepped in".  If you find a buffer that is usagecount=0 and unpinned,
but dirty, then why is it dirty?  It is likely to be dirty because the
background writer can't keep up.  And if the background writer can't
keep up, it is probably having trouble with writes blocking.  So, for
Pete's sake, don't try to write it out yourself!  If you can't find a
clean, reusable buffer in a reasonable number of attempts, I guess at
some point you need to punt and write one out.  But currently it grabs
the first unpinned usagecount=0 buffer it sees and writes it out if
dirty, without even checking if the next one might be clean.
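Jeff's suggested alternative, sketched in Python (not the actual buffer manager code; buffers are modeled as dicts with `pinned`, `usage`, and `dirty` keys):

```python
def find_victim(buffers, max_attempts):
    # Prefer a clean, unpinned, usagecount=0 buffer for up to
    # max_attempts candidates; only after that settle for a dirty one
    # and accept having to write it out ourselves.
    fallback = None
    for attempt, buf in enumerate(buffers):
        if attempt >= max_attempts and fallback is not None:
            break                      # punt: use the dirty fallback
        if buf["pinned"] or buf["usage"] > 0:
            continue                   # not reusable at all
        if not buf["dirty"]:
            return buf                 # ideal: reusable immediately
        if fallback is None:
            fallback = buf             # remember, but keep looking
    return fallback
```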



> The RAID controller
> that used to have a useful mix of data is now completely filled with >=512MB
> of random writes.  It's now failing to write as fast as new data is coming
> in.  Eventually that leads to pressure building up in Linux's cache.  Now
> you're in the bad place:  dirty_background_ratio is crossed, Linux is now
> worried about spooling all cached writes to disk as fast as it can, the
> checkpointer is sync'ing its own important data to disk as fast as it can
> too, and all caches are inefficient because they're full.
>
> To recreate a scenario like this, I've realized the benchmark needs to have
> a couple of characteristics:
>
> -It has to focus on transaction latency instead of throughput.  We know that
> doing syncs more often will lower throughput due to reduced reordering etc.

One option for pgbench I've contemplated was better latency reporting.
I don't really want to have very large log files (and just writing them
out can produce IO that competes with the IO you actually care about,
if you don't have a lot of controllers around to isolate everything).
I'd like to see a report every 5 seconds about what the longest latency
was over that last 5 seconds.
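That per-window reporting could look like this (a Python sketch, not a pgbench patch):

```python
def worst_latency_per_window(samples, window_seconds=5.0):
    # samples: (timestamp_seconds, latency_msec) pairs.  Keep only the
    # worst latency per window instead of logging every transaction.
    worst = {}
    for ts, latency in samples:
        bucket = int(ts // window_seconds)
        worst[bucket] = max(worst.get(bucket, 0.0), latency)
    return [worst[b] for b in sorted(worst)]
```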

Doing syncs more often would only lower throughput by reduced
reordering if there were substantial opportunities for reordering to
start with.  If you are dirtying 300 segments but only 3GB of data,
then unless a lot of those segments are far from full you might not
have that much opportunity for reordering to start with.

Also, what limits the amount of work that needs to get done?  If you
make a change that decreases throughput but also decreases latency,
then something else has got to give.  If the available throughput is
less than the work that needs to get done, either connections and
latency will go up until the system completely falls over, or
some attempts to connect will get refused.  The only other possibility
I see is that some clients will get sick of the latency and decide
that their particular chunk of work doesn't need to get done at this
particular time after all--a decision often made by humans but rarely
by benchmarking programs.

> -It cannot run at maximum possible speed all the time.  It needs to be the
> case that the system keeps up with the load during the rest of the time, but
> the sync phase of checkpoints causes I/O to queue faster than it's draining,
> thus saturating all caches and then blocking backends.  Ideally, "Dirty:" in
> /proc/meminfo will reach >90% of the dirty_background_ratio trigger line
> around the same time the sync phase starts.
>
> -There should be a lot of clients doing a mix of work.  The way Linux I/O
> works, the scheduling for readers vs. writers is complicated, and this is
> one of the few areas where things like CFQ vs. Deadline matter.

OK, my idea was just to dial back pgbench's -c until it fits the
criteria of your previous paragraph, but I guess dialing it back would
change the kernels perception of scheduling priorities.  Do you use
connection poolers?

> I've realized now one reason I never got anywhere with this while running
> pgbench tests is that pgbench always runs at 100% of capacity.  It fills all
> the caches involved completely as fast as it can, and every checkpoint
> starts with them already filled to capacity.  So when latency gets bad at
> checkpoint time, no amount of clever reordering will help keep those writes
> from interfering with other processes.  There just isn't any room to work
> with left.

But it only does that if you tell it to.  If you just keep cranking up
-c until TPS flat-lines or decreases, then of course you will be in
that situation.  You can insert sleeps, or just run fewer connections,
assuming that that doesn't run afoul of the IO scheduler.

The imbalance between selects versus updates/inserts is harder to
correct for.  I've tried to model some things just by starting a
pgbench -S and a default pgbench with a strategic ratio of clients for
each, but it would be nice if this were easier/more automatic.
Another nice feature would be if you could define a weight with -f.
If I want one transaction file to execute 50 times more often than
another, it's not so nice to have to specify 51 -f switches with 50 of them
all being the same file.
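The weighted-script idea, sketched in Python (pgbench itself is C; this just illustrates the selection rule):

```python
import random

def pick_script(scripts, weights, rng=random):
    # Choose the next transaction file by weight, so a 50:1 mix doesn't
    # need 51 -f switches with 50 of them naming the same file.
    return rng.choices(scripts, weights=weights, k=1)[0]
```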

What problems do you see with pgbench?  Can you not reproduce
something similar to the production latency problems, or can you
reproduce them, but things that fix the problem in pgbench don't
translate to production?  Or the other way around, things that work in
production didn't work in pgbench?

> What I think is needed instead is a write-heavy benchmark with a think time
> in it, so that we can dial the workload up to, say, 90% of I/O capacity, but
> that spikes to 100% when checkpoint sync happens.  Then rearrangements in
> syncing that reduces caching pressure should be visible as a latency
> reduction in client response times.  My guess is that dbt-2 can be
> configured to provide such a workload, and I don't see a way forward here
> except for me to fully embrace that and start over with it.

But I would think that pgbench can be configured to do that as well,
and would probably offer a wider array of other testers.  Of course, if
they have to copy and specify 30 different -f files, maybe getting
dbt-2 to install and run would be easier than that.  My attempts at
getting dbt-5 to work for me do not make me eager to jump from pgbench
to try other things.

...

> [1] Where I think I'm going to end up with this eventually now is that
> setting checkpoint_sync_pause is the important tunable.  The parameter that
> then gets auto-tuned is checkpoint_timeout.  If you have 300 relations to
> sync and you have to wait 10 seconds between syncs to get latency down, the
> server is going to inform you an hour between checkpoints is all you can do
> here.

Do we have a theoretical guess about how fast you should be able to
go, based on the RAID capacity and the speed and density at which you
dirty data?

If you just keep spreading out the syncs until a certain latency is
reached, then you are continuing to dirty data at a prodigious rate
between those syncs.  I think it is quite likely you will hit a vicious
circle, where one time the server informs you that one-hour checkpoints
are all you can do, then an hour later tells you that 5 hour
checkpoints are all you can do, and then 5 hours later it tells you
that a week is really the best it can do.

At some point you need to buy more RAID controllers.  Any way to know
how close to that point you are?

Cheers,

Jeff


Re: Checkpoint sync pause

From
Greg Smith
Date:
On 02/03/2012 11:41 PM, Jeff Janes wrote:
>> -The steady stream of backend writes that happen between checkpoints have
>> filled up most of the OS write cache.  A look at /proc/meminfo shows around
>> 2.5GB "Dirty:"
> "backend writes" includes bgwriter writes, right?

Right.

> Has using a newer kernel with dirty_background_bytes been tried, so it
> could be set to a lower level?  If so, how did it do?  Or does it just
> refuse to obey below the 5% level, as well?

Trying to dip below 5% using dirty_background_bytes slows VACUUM down 
faster than it improves checkpoint latency.  Since the sort of servers 
that have checkpoint issues are quite often ones that have VACUUM issues, 
too, that whole path doesn't seem very productive.  The one test I 
haven't tried yet is whether increasing the size of the VACUUM ring 
buffer might improve how well the server responds to a lower write cache.
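For reference, the "Dirty:" figure being discussed comes straight out of /proc/meminfo; a small helper for watching it might look like this (hypothetical monitoring code, not part of any patch):

```python
def parse_dirty_bytes(meminfo_text):
    """Return the Dirty: value from /proc/meminfo contents, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("Dirty:"):
            return int(line.split()[1]) * 1024   # values are reported in kB
    raise ValueError("no Dirty: line found")

# e.g. with open("/proc/meminfo") as f: print(parse_dirty_bytes(f.read()))
sample = "MemTotal:       16435900 kB\nDirty:           2621440 kB\n"
assert parse_dirty_bytes(sample) == 2621440 * 1024   # the ~2.5 GB seen here
```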

> If there is 3GB of dirty data spread over>300 segments each segment
> is about full-sized (1GB) then on average<1% of each segment is
> dirty?
>
> If that analysis holds, then it seem like there is simply an awful lot
> of data has to be written randomly, no matter how clever the
> re-ordering is.  In other words, it is not that a harried or panicked
> kernel or RAID control is failing to do good re-ordering when it has
> opportunities to, it is just that you dirty your data too randomly for
> substantial reordering to be possible even under ideal conditions.

Averages are deceptive here.  This data follows the usual distribution 
for real-world data, which is that there is a hot chunk of data that 
receives far more writes than average (particularly index blocks), along 
with a long tail of segments that are only seeing one or two 8K blocks 
modified (catalog data, stats, application metadata).
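The deceptiveness of the average is easy to demonstrate with invented but plausible numbers: a few hot segments absorbing most of the writes, and a long tail holding a block or two each.

```python
# Toy model: 300 segments, 3 GB dirty in total; 10 hot segments
# absorb 90% of it -- all numbers invented for illustration.
SEGMENTS, HOT, DIRTY_MB = 300, 10, 3072

hot_each = DIRTY_MB * 0.9 / HOT                # MB dirty per hot segment
tail_each = DIRTY_MB * 0.1 / (SEGMENTS - HOT)  # MB dirty per tail segment
mean = DIRTY_MB / SEGMENTS

assert round(mean, 2) == 10.24    # "about 1% of each 1 GB segment" on average
assert round(hot_each) == 276     # but hot segments are ~27% dirty
assert tail_each < 1.1            # and the tail holds only a block or two
```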

Plenty of useful reordering happens here.  It's happening in Linux's 
cache and in the controller's cache.  The constant stream of 
checkpoint syncs doesn't stop that.  It does seem to do two bad things 
though:  a) it makes some of these bad cache-filled situations more likely, 
and b) it doesn't leave any I/O capacity unused for clients to get some 
work done.  One of the real possibilities I've been considering more 
lately is that the value we've seen from the pauses during sync isn't so 
much about optimizing I/O; instead it comes from allowing a brief 
window of client backend I/O to slip in between the cache-filling 
checkpoint syncs.

> Does the BBWC, once given an fsync command and reporting success,
> write out those block forthwith, or does it lolly-gag around like the
> kernel (under non-fsync) does?  If it is waiting around for
> write-combing opportunities that will never actually materialize in
> sufficient quantities to make up for the wait, how to get it to stop?
>
> Was the sorted checkpoint with an fsync after every file (real file,
> not VFD) one of the changes you tried?

As far as I know the typical BBWC is always working.  When a read or a 
write comes in, it starts moving immediately.  When it gets behind, it 
starts making seek decisions more intelligently based on visibility of 
the whole queue.  But they don't delay doing any work at all the way 
Linux does.

I haven't had very good luck with sorting checkpoints at the PostgreSQL 
relation level on server-size systems.  There is a lot of sorting 
already happening at both the OS (~3GB) and BBWC (>=512MB) levels on 
this server.  My own tests on my smaller test server--with a scaled down 
OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a 
useful technique on top of that.  It's never bubbled up to being 
considered a likely win on the production one as a result.

>> DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
>> DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
>> DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
>> DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
>> DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
>> DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
>> DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
>> DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
>> DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
>> DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec
> Syncs 3 and 5 kind of surprise me.  It seems like the times should be
> more bimodal.  If the cache is already full, why doesn't the system
> promptly collapse, like it does later?  And if it is not, why would it
> take 12 seconds, or even 2 seconds?  Or is this just evidence that the
> gaps you are inserting are partially, but not completely, effective?

Given a mix of completely random I/O, a 24 disk array like this system 
has is lucky to hit 20MB/s clearing it out.  It doesn't take too much of 
that before even 12 seconds makes sense.  The 45 second pauses are the 
ones demonstrating the controller's cache is completely overwhelmed.  
It's rare to see caching turn truly bimodal, because the model for it 
has both a variable ingress and egress rate involved.  Even as the 
checkpoint sync is pushing stuff in, at the same time writes are being 
evacuated at some speed out the other end.
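That ingress/egress point, that sync times vary smoothly rather than bimodally because the cache drains while the checkpoint fills it, can be sketched with a crude fluid model (rates and sizes invented for illustration):

```python
def sync_wait(cache_fill_mb, cache_size_mb, push_mb, drain_mbps):
    """Seconds an fsync blocks in a crude fluid model: only the overflow
    beyond cache capacity has to drain to disk before the sync returns."""
    overflow = max(0.0, cache_fill_mb + push_mb - cache_size_mb)
    return overflow / drain_mbps

# A 512 MB write cache draining random writes at ~20 MB/s:
assert sync_wait(100, 512, 50, 20.0) == 0.0    # cache absorbs it entirely
assert sync_wait(500, 512, 250, 20.0) == 11.9  # ~12 s, like sync #3 above
```

The model is obviously too simple, but it shows why the observed times land all over the range between "instant" and "collapse" rather than at two poles.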

> What I/O are they trying to do?  It seems like all your data is in RAM
> (if not, I'm surprised you can get queries to ran fast enough to
> create this much dirty data).  So they probably aren't blocking on
> reads which are being interfered with by all the attempted writes.

Reads on infrequently read data.  Long tail again; even though caching 
is close to 100%, the occasional outlier client who wants some rarely 
accessed page with their personal data on it shows up.  Pollute the 
write caches badly enough, and what happens to reads mixed into there 
gets very fuzzy.  Depends on the exact mechanics of the I/O scheduler 
used in the kernel version deployed.

> The current shared_buffer allocation method (or my misunderstanding of
> it) reminds me of the joke about the guy who walks into his kitchen
> with a cow-pie in his hand and tells his wife "Look what I almost
> stepped in".  If you find a buffer that is usagecount=0 and unpinned,
> but dirty, then why is it dirty?  It is likely to be dirty because the
> background writer can't keep up.  And if the background writer can't
> keep up, it is probably having trouble with writes blocking.  So, for
> Pete's sake, don't try to write it out yourself!  If you can't find a
> clean, reusable buffer in a reasonable number of attempts, I guess at
> some point you need to punt and write one out.  But currently it grabs
> the first unpinned usagecount=0 buffer it sees and writes it out if
> dirty, without even checking if the next one might be clean.

Don't forget that in the version deployed here, the background writer 
isn't running during the sync phase.  I think the direction you're 
talking about here circles back to "why doesn't the BGW just put things 
it finds clean onto the free list?", a direction which would make 
"nothing on the free list" a noteworthy event suggesting the BGW needs 
to run more often.

> One option for pgbench I've contemplated was better latency reporting.
>   I don't really want to have to mine very large log files (and just
> writing them out can produce IO that competes with the IO you actually
> care about, if you don't have a lot of controllers around to isolate
> everything.).

Every time I've measured this, I've found it to be <1% of the total 
I/O.  The single line of data with latency counts, written buffered, is 
pretty slim compared with the >=8K any write transaction is likely to 
have touched.  The only time I've found the disk writing overhead 
becoming serious on an absolute scale is when I'm running read-only 
in-memory benchmarks, where the rate might hit >100K TPS.  But by 
definition, that sort of test has I/O bandwidth to spare, so there it 
doesn't actually impact results much.  Just a fraction of a core doing 
some sequential writes.

> Also, what limits the amount of work that needs to get done?  If you
> make a change that decreases throughput but also decreases latency,
> then something else has got to give.

The thing that is giving way here is total time taken to execute the 
checkpoint.  There's even a theoretical gain possible from that.  It's 
possible to prove (using the pg_stat_bgwriter counts) that having 
checkpoints less frequently decreases total I/O, because there are fewer 
writes of the most popular blocks happening.  Right now, when I tune 
that to decrease total I/O the upper limit is when it starts spiking up 
latency.  This new GUC is trying to allow a different way to increase 
checkpoint time that seems to do less of that.
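The "less total I/O with longer checkpoints" effect follows from a hot block being written at most once per checkpoint cycle, however often it is re-dirtied in between.  A toy accounting (numbers invented):

```python
def checkpoint_writes(run_seconds, checkpoint_interval_s, hot_blocks):
    """Each hot block is dirty again by the next checkpoint, so it is
    written once per cycle: fewer cycles means fewer total writes."""
    cycles = run_seconds // checkpoint_interval_s
    return cycles * hot_blocks

# One hour of a workload constantly re-dirtying 10,000 hot blocks:
assert checkpoint_writes(3600, 300, 10_000) == 120_000  # 5 min checkpoints
assert checkpoint_writes(3600, 900, 10_000) == 40_000   # 15 min checkpoints
```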

> What problems do you see with pgbench?  Can you not reproduce
> something similar to the production latency problems, or can you
> reproduce them, but things that fix the problem in pgbench don't
> translate to production?  Or the other way around, things that work in
> production didn't work in pgbench?

I can't simulate something similar enough to the production latency 
problem.  Your comments about doing something like specifying 50 "-f" 
files or a weighting are in the right area; it might be possible to hack 
a better simulation with an approach like that.  The thing that makes 
wandering that way even harder than it seems at first is how we split 
the pgbench work among multiple worker threads.

I'm not using connection pooling for the pgbench simulations I'm doing.  
There's some of that happening in the production application server.

> But I would think that pgbench can be configured to do that as well,
> and would probably offer a wider array of other testers.  Of course,if
> they have to copy and specify 30 different -f files, maybe getting
> dbt-2 to install and run would be easier than that.  My attempts at
> getting dbt-5 to work for me do not make me eager jump from pgbench to
> try more other things.

dbt-5 is a work in progress, known to be tricky to get going.  dbt-2 is 
mature enough that it was used for this sort of role in 8.3 
development.  And it's even used by other database systems for similar 
testing.  It's the closest thing to an open-source standard for 
write-heavy workloads that we'll find here.

What I'm doing right now is recording a large amount of pgbench data for 
my test system here, to validate it has the typical problems pgbench 
runs into.  Once that's done I expect to switch to dbt-2 and see whether 
it's a more useful latency test environment.  That plan is working out 
fine so far; it just hit a couple of weeks of unanticipated delay.

> Do we have a theoretical guess on about how fast you should be able to
> go, based on the RAID capacity and the speed and density at which you
> dirty data?

This is a hard question to answer; it's something I've been thinking 
about and modeling a lot lately.  The problem is that the speed an array 
writes at depends on how many reads or writes it does during each seek 
and/or rotation.  The array here can do 1GB/s of all sequential I/O, and 
15 - 20MB/s on all random I/O.  The more efficiently writes are 
scheduled, the more like sequential I/O the workload becomes.  Any 
attempt to even try to estimate real-world throughput needs the number 
of concurrent processes as another input, and the complexity of the 
resulting model is high.
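One crude way to start on such a model is to blend the array's two measured endpoints (1 GB/s all-sequential, ~20 MB/s all-random) by the fraction of I/O the schedulers manage to make sequential.  The linear blend below is my own simplification, not a validated model:

```python
def est_throughput_mbps(seq_fraction, random_mbps=20.0, seq_mbps=1000.0):
    """Naive linear blend between the array's measured endpoints."""
    assert 0.0 <= seq_fraction <= 1.0
    return random_mbps + seq_fraction * (seq_mbps - random_mbps)

assert est_throughput_mbps(0.0) == 20.0     # fully random workload
assert est_throughput_mbps(1.0) == 1000.0   # fully sequential workload
assert est_throughput_mbps(0.5) == 510.0
```

The hard part, as noted above, is that seq_fraction itself depends on concurrency and scheduling, which is exactly where the complexity comes back in.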

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com    Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support    www.2ndQuadrant.com



Re: Checkpoint sync pause

From
Jeff Janes
Date:
On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On 02/03/2012 11:41 PM, Jeff Janes wrote:
>>>
>>> -The steady stream of backend writes that happen between checkpoints have
>>> filled up most of the OS write cache.  A look at /proc/meminfo shows
>>> around
>>> 2.5GB "Dirty:"
>>
>> "backend writes" includes bgwriter writes, right?
>
>
> Right.
>
>
>> Has using a newer kernal with dirty_background_bytes been tried, so it
>> could be set to a lower level?  If so, how did it do?  Or does it just
>> refuse to obey below the 5% level, as well?
>
>
> Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster
> than it improves checkpoint latency.

Does it cause VACUUM to create latency for other processes (like the
checkpoint syncs do, by gumming up the IO for everyone) or does VACUUM
just slow down without affecting other tasks?

It seems to me that just lowering dirty_background_bytes (while not
also lowering dirty_bytes) should not cause the latter to happen, but
it seems like these kernel tunables never do exactly what they
advertise.

This may not be relevant to the current situation, but I wonder if we
don't need a "vacuum_cost_page_dirty_seq" so that if the pages we are
dirtying are consecutive (or at least closely spaced) they cost less,
in anticipation that the eventual writes will be combined and thus
consume less IO resources.  I would think it would be common for some
regions of a table to be intensely dirtied, and some to be lightly
dirtied (but still aggregating up to a considerable amount of random
IO).   But the vacuum process might also need to be made more
"bursty", as even if it generates sequential dirty pages the IO system
might write them randomly anyway if there are too many delays
interspersed.
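A vacuum_cost_page_dirty_seq could plausibly work like this.  To be clear, no such GUC exists; the default of 20 for the existing vacuum_cost_page_dirty is real, but everything else here is a sketch:

```python
VACUUM_COST_PAGE_DIRTY = 20       # existing GUC's default value
VACUUM_COST_PAGE_DIRTY_SEQ = 5    # hypothetical cheaper rate

def dirty_page_cost(page_numbers, gap=1):
    """Charge the cheap rate when a dirtied page closely follows the
    previously dirtied one, anticipating that the eventual writes
    will be combined into sequential I/O."""
    cost, prev = 0, None
    for page in page_numbers:
        if prev is not None and 0 < page - prev <= gap:
            cost += VACUUM_COST_PAGE_DIRTY_SEQ
        else:
            cost += VACUUM_COST_PAGE_DIRTY
        prev = page
    return cost

assert dirty_page_cost([10, 11, 12, 13]) == 20 + 3 * 5  # one seek, then sequential
assert dirty_page_cost([10, 500, 900]) == 3 * 20        # all random
```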


> Since the sort of servers that have
> checkpoint issues are quite often ones that have VACUUM ones, too, that
> whole path doesn't seem very productive.  The one test I haven't tried yet
> is whether increasing the size of the VACUUM ring buffer might improve how
> well the server responds to a lower write cache.

I wouldn't expect this to help.  It seems like it would hurt, as it
just leaves the data for even longer (however long it takes to
circumnavigate the ring buffer) before there is any possibility of it
getting written.  I guess it does increase the chances that the dirty
pages will "accidentally" get written by the bgwriter rather than the
vacuum itself, but I doubt that that would be significant.

...
>>
>> Was the sorted checkpoint with an fsync after every file (real file,
>> not VFD) one of the changes you tried?
>
>
...
>
> I haven't had very good luck with sorting checkpoints at the PostgreSQL
> relation level on server-size systems.  There is a lot of sorting already
> happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server.
>  My own tests on my smaller test server--with a scaled down OS (~750MB) and
> BBWC (256MB) cache--haven't ever validated sorting as a useful technique on
> top of that.  It's never bubbled up to being considered a likely win on the
> production one as a result.

Without sorted checkpoints (or some other fancier method) you have to
write out the entire pool before you can do any fsyncs.  Or you have
to do multiple fsyncs of the same file, with at least one occurring
after the entire pool was written.  With a sorted checkpoint, you can
start issuing once-only fsyncs very early in the checkpoint process.
I think that on large servers, that would be the main benefit, not the
actually more efficient IO.  (On small servers I've seen sorted
checkpoints be much faster on shutdown checkpoints, but not on natural
checkpoints, and presumably this improvement *is* due to better
ordering).

On your servers, you need big delays between fsyncs and not between
writes (as they are buffered until the fsync).  But in other
situations, people need the delays between the writes.  By using
sorted checkpoints with fsyncs between each file, the delays between
writes are naturally delays between fsyncs as well.  So I think the
benefit of using sorted checkpoints is that code to improve your
situations is less likely to degrade someone else's situation, without
having to introduce an extra layer of tunables.
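The scheme described here, sort the dirty blocks, fsync each file as soon as its writes are issued, and let any pause throttle writes and fsyncs together, might be sketched as follows.  The write/fsync helpers stand in for the real smgr/bufmgr calls:

```python
import time

def sorted_checkpoint(dirty_buffers, write, fsync, pause_s=0.0):
    """dirty_buffers: iterable of (file_id, block_no) pairs.
    Sorting groups each file's blocks together, so its single fsync
    can be issued as soon as that file is finished -- and one pause
    between files throttles both the writes and the fsyncs."""
    order = sorted(dirty_buffers)
    current = None
    for file_id, block in order:
        if current is not None and file_id != current:
            fsync(current)
            if pause_s:
                time.sleep(pause_s)
        write(file_id, block)
        current = file_id
    if current is not None:
        fsync(current)      # last file's once-only fsync

log = []
sorted_checkpoint(
    [(2, 7), (1, 3), (2, 1), (1, 9)],
    write=lambda f, b: log.append(("write", f, b)),
    fsync=lambda f: log.append(("fsync", f)),
)
assert log == [("write", 1, 3), ("write", 1, 9), ("fsync", 1),
               ("write", 2, 1), ("write", 2, 7), ("fsync", 2)]
```

Note that the first fsync is issued long before the last write, which is the early-fsync benefit claimed above.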


>
>> What I/O are they trying to do?  It seems like all your data is in RAM
>> (if not, I'm surprised you can get queries to ran fast enough to
>> create this much dirty data).  So they probably aren't blocking on
>> reads which are being interfered with by all the attempted writes.
>
>
> Reads on infrequently read data.  Long tail again; even though caching is
> close to 100%, the occasional outlier client who wants some rarely accessed
> page with their personal data on it shows up.  Pollute the write caches
> badly enough, and what happens to reads mixed into there gets very fuzzy.
>  Depends on the exact mechanics of the I/O scheduler used in the kernel
> version deployed.

OK, but I would still think it is a minority of transactions which
need at least one of those infrequently read pages, and most do not.  So
a few clients would freeze, but the rest should keep going until they
either try to execute a read themselves, or they run into a
heavyweight lock held by someone else who is read-blocking.  So if
1/1000 of all transactions need to make a disk read, but clients are
running at 100s of TPS, then I guess after a few tens of seconds all
clients will be blocked on reads and you will see total freeze up.
But it seems more likely to me that they are in fact freezing on
writes.  Is there a way to directly observe what they are blocking on?
I wish "top" would separate %wait into read and write.

>
>
>> The current shared_buffer allocation method (or my misunderstanding of
>> it) reminds me of the joke about the guy who walks into his kitchen
>> with a cow-pie in his hand and tells his wife "Look what I almost
>> stepped in".  If you find a buffer that is usagecount=0 and unpinned,
>> but dirty, then why is it dirty?  It is likely to be dirty because the
>> background writer can't keep up.  And if the background writer can't
>> keep up, it is probably having trouble with writes blocking.  So, for
>> Pete's sake, don't try to write it out yourself!  If you can't find a
>> clean, reusable buffer in a reasonable number of attempts, I guess at
>> some point you need to punt and write one out.  But currently it grabs
>> the first unpinned usagecount=0 buffer it sees and writes it out if
>> dirty, without even checking if the next one might be clean.
>
>
> Don't forget that in the version deployed here, the background writer isn't
> running during the sync phase.

Oh, I had thought you had compiled your own custom work around to
that.  So much of the problem might go away upon a new release and an
upgrade, as far as we know?

>  I think the direction you're talking about
> here circles back to "why doesn't the BGW just put things it finds clean
> onto the free list?",

I wouldn't put it that way, because to me the freelist is the code
located in freelist.c.  The linked list is a freelist.  But the clock
sweep is also a freelist, just implemented in a different way.

If the hypothetical BGW doesn't remove the entry from the buffer
mapping table and invalidate it when it adds to the linked list, then
we might pull a "free" buffer from the linked list and discover it is
not actually free.  If we want to make it so that it does remove the
entry from the buffer mapping table (which doesn't seem like a good
idea to me) we could implement that just as well with the clock-sweep
as we could with the linked list.

I think the linked list is a bit of a red herring.  Many of the
concepts people discuss implementing on the linked list could just as
easily be implemented with the clock sweep.  And I've seen no evidence
at all that the clock sweep is the problem.  The LWLock that protects it
can obviously be a problem, but that seems to be due to the overhead
of acquiring a contended lock, not the work done under the lock.
Reducing the lock-strength around this might be a good idea, but that
reduction could be done just as easily (and as far as I can tell, more
easily) with the clock sweep than the linked list.

> a direction which would make "nothing on the free
> list" a noteworthy event suggesting the BGW needs to run more often.

Isn't seeing a dirty unpinned usage_count==0 buffer in the clocksweep
just as noteworthy as seeing an empty linked list?  From what I can
tell, you can't dirty a buffer without pinning it, you can't pin a
buffer without making usage_count>0, and we never decrement
usage_count on a pinned buffer.  So, the only way to see a dirty
buffer that is unpinned and has zero usage_count is if another normal
backend saw it unpinned and decremented the count, which would have to
be a full clock sweep ago, and the bgwriter hasn't visited it since
then.
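That chain of invariants can be checked against a tiny state model of a buffer header (a simplification; the real code keeps this state in bit flags, and the 5 below is BM_MAX_USAGE_COUNT):

```python
class Buf:
    def __init__(self):
        self.pins = 0
        self.usage_count = 0
        self.dirty = False

    def pin(self):
        self.pins += 1
        self.usage_count = min(self.usage_count + 1, 5)  # capped at max

    def unpin(self):
        self.pins -= 1

    def mark_dirty(self):
        assert self.pins > 0, "can't dirty a buffer without pinning it"
        self.dirty = True

    def sweep_tick(self):
        # the clock sweep only decrements unpinned buffers
        if self.pins == 0 and self.usage_count > 0:
            self.usage_count -= 1

b = Buf()
b.pin(); b.mark_dirty(); b.unpin()
assert b.dirty and b.usage_count == 1   # not yet a reclaim candidate
b.sweep_tick()                          # a full sweep must pass first...
assert b.dirty and b.usage_count == 0   # ...before the "cow-pie" case appears
```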

If our goal is to autotune the bgwriter_* parameters, then detecting
either an empty linked list or dirty but usable buffer in the clock
sweep would be a good way to do that.  But, I think the bigger issue
is to assume that the bgwriter is already tuned as well as it can be,
and that beating on it further will not improve its morale.  If the IO
write caches are all full, there is nothing bgwriter can do about it
by running more often.  In that case, we can't really do anything
about the dirty pages it is leaving around our yard.  But what we can
do is not pick up those little piles of toxic waste and bring them
into our living rooms.  That is, don't try to write out the dirty page
in the foreground, instead go looking for a clean one.  We can evict
it without doing a write, and hopefully we can read in the replacement
either from OS cache, or from disk if reads are not as gummed up as
writes are.
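The eviction policy being argued for, prefer a clean reclaimable buffer and only fall back to writing a dirty one after a bounded search, might look like this.  It illustrates the idea only; the real allocation logic lives in StrategyGetBuffer():

```python
def pick_victim(buffers, start, max_attempts):
    """buffers: list of dicts with 'pins', 'usage_count', 'dirty'.
    Sweep from `start`; return (index, needs_write).  A clean
    reclaimable buffer wins immediately; the first dirty reclaimable
    one is remembered as a fallback once the attempt budget runs out."""
    n, fallback = len(buffers), None
    for step in range(max_attempts):
        i = (start + step) % n
        b = buffers[i]
        if b["pins"] == 0 and b["usage_count"] == 0:
            if not b["dirty"]:
                return i, False          # evict without any write
            if fallback is None:
                fallback = i
        elif b["pins"] == 0:
            b["usage_count"] -= 1        # normal clock-sweep decrement
    return fallback, True                # punt: write this one out

bufs = [{"pins": 0, "usage_count": 0, "dirty": True},
        {"pins": 1, "usage_count": 3, "dirty": False},
        {"pins": 0, "usage_count": 0, "dirty": False}]
assert pick_victim(bufs, 0, 3) == (2, False)  # skipped the dirty buffer
```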


>> But I would think that pgbench can be configured to do that as well,
>> and would probably offer a wider array of other testers.  Of course,if
>> they have to copy and specify 30 different -f files, maybe getting
>> dbt-2 to install and run would be easier than that.  My attempts at
>> getting dbt-5 to work for me do not make me eager jump from pgbench to
>> try more other things.
>
>
> dbt-5 is a work in progress, known to be tricky to get going.  dbt-2 is
> mature enough that it was used for this sort of role in 8.3 development.
>  And it's even used by other database systems for similar testing.  It's the
> closest thing to an open-source standard for write-heavy workloads as we'll
> find here.

OK, thanks for the reassurance.  I'll no longer be afraid to give it a
try if I get an opportunity.

Cheers,

Jeff


Re: Checkpoint sync pause

From
Amit Kapila
Date:
>> Without sorted checkpoints (or some other fancier method) you have to
>> write out the entire pool before you can do any fsyncs.  Or you have
>> to do multiple fsyncs of the same file, with at least one occurring
>> after the entire pool was written.  With a sorted checkpoint, you can
>> start issuing once-only fsyncs very early in the checkpoint process.
>> I think that on large servers, that would be the main benefit, not the
>> actually more efficient IO.  (On small servers I've seen sorted
>> checkpoints be much faster on shutdown checkpoints, but not on natural
>> checkpoints, and presumably this improvement *is* due to better
>> ordering).

>> On your servers, you need big delays between fsyncs and not between
>> writes (as they are buffered until the fsync).  But in other
>> situations, people need the delays between the writes.  By using
>> sorted checkpoints with fsyncs between each file, the delays between
>> writes are naturally delays between fsyncs as well.  So I think the
>> benefit of using sorted checkpoints is that code to improve your
>> situations is less likely to degrade someone else's situation, without
>> having to introduce an extra layer of tunables.

What I understood is that you are suggesting it is better to do sorted
checkpoints, which essentially means flushing nearby buffers together.
However, if it is done this way, it might violate an Oracle patent
(20050044311 - Reducing disk IO by full-cache write-merging).  I am not
very sure about it, but you may want to look into it.

>> I think the linked list is a bit of a red herring.  Many of the
>> concepts people discuss implementing on the linked list could just as
>> easily be implemented with the clock sweep.  And I've seen no evidence
>> at all that the clock sweep is the problem.  The LWLock that protects
>> can obviously be a problem, but that seems to be due to the overhead
>> of acquiring a contended lock, not the work done under the lock.
>> Reducing the lock-strength around this might be a good idea, but that
>> reduction could be done just as easily (and as far as I can tell, more
>> easily) with the clock sweep than the linked list.

With the clock sweep, there is a good chance the backend needs to traverse
many buffers to find a suitable one.
However, if clean buffers are put on the freelist, one can be picked
directly from there.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Jeff Janes
Sent: Monday, February 13, 2012 12:14 AM
To: Greg Smith
Cc: Robert Haas; PostgreSQL-development
Subject: Re: [HACKERS] Checkpoint sync pause

On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On 02/03/2012 11:41 PM, Jeff Janes wrote:
>>>
>>> -The steady stream of backend writes that happen between checkpoints
have
>>> filled up most of the OS write cache.  A look at /proc/meminfo shows
>>> around
>>> 2.5GB "Dirty:"
>>
>> "backend writes" includes bgwriter writes, right?
>
>
> Right.
>
>
>> Has using a newer kernal with dirty_background_bytes been tried, so it
>> could be set to a lower level?  If so, how did it do?  Or does it just
>> refuse to obey below the 5% level, as well?
>
>
> Trying to dip below 5% using dirty_background_bytes slows VACUUM down
faster
> than it improves checkpoint latency.

Does it cause VACUUM to create latency for other processes (like the
checkpoint syncs do, by gumming up the IO for everyone) or does VACUUM
just slow down without effecting other tasks?

It seems to me that just lowering dirty_background_bytes (while not
also lowering dirty_bytes) should not cause the latter to happen, but
it seems like these kernel tunables never do exactly what they
advertise.

This may not be relevant to the current situation, but I wonder if we
don't need a "vacuum_cost_page_dirty_seq" so that if the pages we are
dirtying are consecutive (or at least closely spaced) they cost less,
in anticipation that the eventual writes will be combined and thus
consume less IO resources.  I would think it would be common for some
regions of table to be intensely dirtied, and some to be lightly
dirtied (but still aggregating up to a considerable amount of random
IO).   But the vacuum process might also need to be made more
"bursty", as even if it generates sequential dirty pages the IO system
might write them randomly anyway if there are too many delays
interspersed


> Since the sort of servers that have
> checkpoint issues are quite often ones that have VACUUM ones, too, that
> whole path doesn't seem very productive.  The one test I haven't tried yet
> is whether increasing the size of the VACUUM ring buffer might improve how
> well the server responds to a lower write cache.

I wouldn't expect this to help.  It seems like it would hurt, as it
just leaves the data for even longer (however long it takes to
circumnavigate the ring buffer) before there is any possibility of it
getting written.  I guess it does increase the chances that the dirty
pages will "accidentally" get written by the bgwriter rather than the
vacuum itself, but I doubt that that would be significant.

...
>>
>> Was the sorted checkpoint with an fsync after every file (real file,
>> not VFD) one of the changes you tried?
>
>
...
>
> I haven't had very good luck with sorting checkpoints at the PostgreSQL
> relation level on server-size systems.  There is a lot of sorting already
> happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server.
>  My own tests on my smaller test server--with a scaled down OS (~750MB)
and
> BBWC (256MB) cache--haven't ever validated sorting as a useful technique
on
> top of that.  It's never bubbled up to being considered a likely win on
the
> production one as a result.

Without sorted checkpoints (or some other fancier method) you have to
write out the entire pool before you can do any fsyncs.  Or you have
to do multiple fsyncs of the same file, with at least one occurring
after the entire pool was written.  With a sorted checkpoint, you can
start issuing once-only fsyncs very early in the checkpoint process.
I think that on large servers, that would be the main benefit, not the
actually more efficient IO.  (On small servers I've seen sorted
checkpoints be much faster on shutdown checkpoints, but not on natural
checkpoints, and presumably this improvement *is* due to better
ordering).

On your servers, you need big delays between fsyncs and not between
writes (as they are buffered until the fsync).  But in other
situations, people need the delays between the writes.  By using
sorted checkpoints with fsyncs between each file, the delays between
writes are naturally delays between fsyncs as well.  So I think the
benefit of using sorted checkpoints is that code to improve your
situations is less likely to degrade someone else's situation, without
having to introduce an extra layer of tunables.


>
>> What I/O are they trying to do?  It seems like all your data is in RAM
>> (if not, I'm surprised you can get queries to ran fast enough to
>> create this much dirty data).  So they probably aren't blocking on
>> reads which are being interfered with by all the attempted writes.
>
>
> Reads on infrequently read data.  Long tail again; even though caching is
> close to 100%, the occasional outlier client who wants some rarely
accessed
> page with their personal data on it shows up.  Pollute the write caches
> badly enough, and what happens to reads mixed into there gets very fuzzy.
>  Depends on the exact mechanics of the I/O scheduler used in the kernel
> version deployed.

OK, but I would still think it is a minority of transactions which
need at least one of those infrequently read data and most do not.  So
a few clients would freeze, but the rest should keep going until they
either try to execute a read themselves, or they run into a
heavyweight lock held by someone else who is read-blocking.  So if
1/1000 of all transactions need to make a disk read, but clients are
running at 100s of TPS, then I guess after a few tens of seconds all
clients will be blocked on reads and you will see total freeze up.
But it seems more likely to me that they are in fact freezing on
writes.  Is there a way to directly observe what they are blocking on?I wish "top" would separate %wait into read and
write.

>
>
>> The current shared_buffer allocation method (or my misunderstanding of
>> it) reminds me of the joke about the guy who walks into his kitchen
>> with a cow-pie in his hand and tells his wife "Look what I almost
>> stepped in".  If you find a buffer that is usagecount=0 and unpinned,
>> but dirty, then why is it dirty?  It is likely to be dirty because the
>> background writer can't keep up.  And if the background writer can't
>> keep up, it is probably having trouble with writes blocking.  So, for
>> Pete's sake, don't try to write it out yourself!  If you can't find a
>> clean, reusable buffer in a reasonable number of attempts, I guess at
>> some point you need to punt and write one out.  But currently it grabs
>> the first unpinned usagecount=0 buffer it sees and writes it out if
>> dirty, without even checking if the next one might be clean.
>
>
> Don't forget that in the version deployed here, the background writer
isn't
> running during the sync phase.

Oh, I had thought you had compiled your own custom work around to
that.  So much of the problem might go away upon a new release and an
upgrade, as far as we know?

>  I think the direction you're talking about
> here circles back to "why doesn't the BGW just put things it finds clean
> onto the free list?",

I wouldn't put it that way, because to me the freelist is the code
located in freelist.c.  The linked list is a freelist.  But the clock
sweep is also a freelist, just implemented in a different way.

If the hypothetical BGW doesn't remove the entry from the buffer
mapping table and invalidate it when it adds to the linked list, then
we might pull a "free" buffer from the linked list and discover it is
not actually free.  If we want to make it so that it does remove the
entry from the buffer mapping table (which doesn't seem like a good
idea to me) we could implement that just as well with the clock-sweep
as we could with the linked list.

I think the linked list is a bit of a red herring.  Many of the
concepts people discuss implementing on the linked list could just as
easily be implemented with the clock sweep.  And I've seen no evidence
at all that the clock sweep is the problem.  The LWLock that protects
it can obviously be a problem, but that seems to be due to the
overhead of acquiring a contended lock, not the work done under the
lock.
Reducing the lock-strength around this might be a good idea, but that
reduction could be done just as easily (and as far as I can tell, more
easily) with the clock sweep than the linked list.

> a direction which would make "nothing on the free
> list" a noteworthy event suggesting the BGW needs to run more often.

Isn't seeing a dirty unpinned usage_count==0 buffer in the clock sweep
just as noteworthy as seeing an empty linked list?  From what I can
tell, you can't dirty a buffer without pinning it, you can't pin a
buffer without making usage_count>0, and we never decrement
usage_count on a pinned buffer.  So, the only way to see a dirty
buffer that is unpinned and has zero usage_count is if another normal
backend saw it unpinned and decremented the count, which would have to
be a full clock sweep ago, and the bgwriter hasn't visited it since
then.
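To spell that invariant out, here is a toy Python state machine (nothing
like the real bufmgr.c code; the class and method names are made up)
encoding the three rules: dirtying requires a pin, pinning bumps
usage_count, and the sweep only decrements unpinned buffers:

```python
# Toy model of the buffer-state rules discussed above; not PostgreSQL code.
class Buffer:
    def __init__(self):
        self.pin_count = 0
        self.usage_count = 0
        self.dirty = False

    def pin(self):
        self.pin_count += 1
        self.usage_count = min(5, self.usage_count + 1)  # pin bumps the count

    def unpin(self):
        assert self.pin_count > 0
        self.pin_count -= 1

    def mark_dirty(self):
        # you can't dirty a buffer without pinning it
        assert self.pin_count > 0
        self.dirty = True

    def sweep_decrement(self):
        # the clock sweep never decrements a pinned buffer's count
        if self.pin_count == 0 and self.usage_count > 0:
            self.usage_count -= 1

b = Buffer()
b.pin(); b.mark_dirty(); b.unpin()
# Right after unpinning, usage_count is still positive, so at least one
# full sweep pass is needed before this buffer can be observed as
# dirty + unpinned + usage_count == 0.
print(b.dirty, b.usage_count)   # True 1
b.sweep_decrement()
print(b.dirty, b.usage_count)   # True 0
```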

If our goal is to autotune the bgwriter_* parameters, then detecting
either an empty linked list or dirty but usable buffer in the clock
sweep would be a good way to do that.  But, I think the bigger issue
is to assume that the bgwriter is already tuned as well as it can be,
and that beating on it further will not improve its morale.  If the IO
write caches are all full, there is nothing bgwriter can do about it
by running more often.  In that case, we can't really do anything
about the dirty pages it is leaving around our yard.  But what we can
do is not pick up those little piles of toxic waste and bring them
into our living rooms.  That is, don't try to write out the dirty page
in the foreground, instead go looking for a clean one.  We can evict
it without doing a write, and hopefully we can read in the replacement
either from OS cache, or from disk if reads are not as gummed up as
writes are.
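As a sketch of that "don't bring the toxic waste indoors" policy (toy
Python, not the actual StrategyGetBuffer() code; max_attempts is an
invented knob, and the real clock sweep has no such fallback logic):

```python
# Toy victim-selection sketch: prefer evicting a clean buffer so no
# foreground write is needed; only punt to a dirty one after
# max_attempts candidates have been examined.  Not PostgreSQL code.
from dataclasses import dataclass

@dataclass
class Buffer:
    pinned: bool = False
    dirty: bool = False
    usage_count: int = 0

def find_victim(buffers, hand, max_attempts=16):
    """Return (index, needs_write).  Clean buffers win; the first
    evictable dirty buffer is remembered only as a last resort."""
    first_dirty = None
    n = len(buffers)
    for i in range(max_attempts):
        b = buffers[(hand + i) % n]
        if b.pinned:
            continue
        if b.usage_count > 0:
            b.usage_count -= 1          # normal sweep decrement
            continue
        if not b.dirty:
            return (hand + i) % n, False  # clean: evict with no write
        if first_dirty is None:
            first_dirty = (hand + i) % n  # remember it, but keep looking
    if first_dirty is not None:
        return first_dirty, True          # punt: foreground write needed
    raise RuntimeError("no evictable buffer found")

bufs = [Buffer(dirty=True), Buffer(dirty=True), Buffer(dirty=False)]
idx, needs_write = find_victim(bufs, 0)
print(idx, needs_write)   # 2 False -- skipped both dirty buffers
```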


>> But I would think that pgbench can be configured to do that as well,
>> and would probably offer a wider array of other testers.  Of course,if
>> they have to copy and specify 30 different -f files, maybe getting
>> dbt-2 to install and run would be easier than that.  My attempts at
>> getting dbt-5 to work for me do not make me eager to jump from
>> pgbench to try other things.
>
>
> dbt-5 is a work in progress, known to be tricky to get going.  dbt-2 is
> mature enough that it was used for this sort of role in 8.3 development.
>  And it's even used by other database systems for similar testing.  It's
> the closest thing to an open-source standard for write-heavy workloads
> as we'll find here.

OK, thanks for the reassurance.  I'll no longer be afraid to give it a
try if I get an opportunity.

Cheers,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: Checkpoint sync pause

From
Jeff Janes
Date:
On Sun, Feb 12, 2012 at 10:49 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>>> Without sorted checkpoints (or some other fancier method) you have to
>>> write out the entire pool before you can do any fsyncs.  Or you have
>>> to do multiple fsyncs of the same file, with at least one occurring
>>> after the entire pool was written.  With a sorted checkpoint, you can
>>> start issuing once-only fsyncs very early in the checkpoint process.
>>> I think that on large servers, that would be the main benefit, not the
>>> actually more efficient IO.  (On small servers I've seen sorted
>>> checkpoints be much faster on shutdown checkpoints, but not on natural
>>> checkpoints, and presumably this improvement *is* due to better
>>> ordering).
>
>>> On your servers, you need big delays between fsyncs and not between
>>> writes (as they are buffered until the fsync).  But in other
>>> situations, people need the delays between the writes.  By using
>>> sorted checkpoints with fsyncs between each file, the delays between
>>> writes are naturally delays between fsyncs as well.  So I think the
>>> benefit of using sorted checkpoints is that code to improve your
>>> situations is less likely to degrade someone else's situation, without
>>> having to introduce an extra layer of tunables.
>
> What I understood is that you are suggesting it is better to do sorted
> checkpoints, which essentially means flushing nearby buffers together.

More importantly, you can issue an fsync after all pages for any given
file are written, thus naturally spreading out the fsyncs instead of
reserving them to until the end, or some arbitrary fraction of the
checkpoint cycle.  For this purpose, the buffers only need to be
sorted by physical file they are in, not by block order within the
file.
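A toy sketch of that ordering (Python with invented helper names, not
the actual checkpointer code; `write` and `fsync` stand in for the real
block-write and sync machinery):

```python
# Toy sorted-checkpoint sketch: group dirty block writes by the file
# they live in, write each file's blocks together, and fsync that file
# immediately afterwards, so fsyncs are spread across the checkpoint
# instead of piling up at the end.  Not PostgreSQL code.
from itertools import groupby

def sorted_checkpoint(dirty_blocks, write, fsync):
    """dirty_blocks: iterable of (filename, block_no) pairs.
    Sorting only by filename is enough for fsync spreading; block
    order within a file is not required for this benefit."""
    ordered = sorted(dirty_blocks, key=lambda fb: fb[0])
    for fname, blocks in groupby(ordered, key=lambda fb: fb[0]):
        for _, blkno in blocks:
            write(fname, blkno)
        fsync(fname)   # once per file, as soon as it is fully written

log = []
sorted_checkpoint(
    [("b", 7), ("a", 3), ("b", 1), ("a", 9)],
    write=lambda f, blk: log.append(("write", f, blk)),
    fsync=lambda f: log.append(("fsync", f)),
)
print(log)
```

Each fsync lands right after its file's writes, rather than all of them
waiting for the end of the write pass.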

> However, if it is done this way, it might violate an Oracle patent
> (20050044311 - Reducing disk IO by full-cache write-merging).  I am not
> very sure about it, but you may want to refer to it.

Thank you.  I was not aware of it, and am constantly astonished at
what kinds of things are patentable.

>>> I think the linked list is a bit of a red herring.  Many of the
>>> concepts people discuss implementing on the linked list could just as
>>> easily be implemented with the clock sweep.  And I've seen no evidence
>>> at all that the clock sweep is the problem.  The LWLock that protects
>>> it can obviously be a problem, but that seems to be due to the
>>> overhead of acquiring a contended lock, not the work done under the
>>> lock.  Reducing the lock-strength around this might be a good idea,
>>> but that reduction could be done just as easily (and as far as I can
>>> tell, more easily) with the clock sweep than the linked list.
>
> With the clock sweep, there are many chances that a backend needs to
> traverse further to find a suitable buffer.

Maybe, but I have not seen any evidence that this is the case.  My
analyses, experiments, and simulations show that when the buffer
allocations are high, the mere act of running the sweep that often
keeps the average usage_count low, so the average sweep is very short.
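Here's the kind of toy simulation I mean (Python; the pool size, hit
rate, and usage cap are arbitrary parameters, not measurements from a
real workload, and this is nothing like the real freelist.c code):

```python
# Toy clock-sweep simulation: N buffers with usage_count capped at 5.
# Each "allocation" first touches a few random buffers (bumping their
# counts), then sweeps forward decrementing counts until it finds a
# usage_count == 0 victim.  Frequent allocations keep counts low, so
# the average sweep stays short.  Not PostgreSQL code.
import random

N = 1000
MAX_USAGE = 5
usage = [0] * N
hand = 0

def allocate():
    """Sweep until a usage_count == 0 buffer is found; return sweep length."""
    global hand
    steps = 0
    while True:
        steps += 1
        if usage[hand] == 0:
            victim = hand
            hand = (hand + 1) % N
            usage[victim] = 1          # newly allocated page starts at 1
            return steps
        usage[hand] -= 1               # decrement and move on
        hand = (hand + 1) % N

random.seed(1)
allocs = 20000
total_steps = 0
for _ in range(allocs):
    for _ in range(3):                 # a few buffer hits per allocation
        b = random.randrange(N)
        usage[b] = min(MAX_USAGE, usage[b] + 1)
    total_steps += allocate()

avg = total_steps / allocs
print("average sweep length:", avg)
```

In steady state the sweep can only remove as much usage_count as the
hits add, so with a handful of hits per allocation the average sweep is
only a handful of steps.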

> However, if a clean buffer is put in the freelist, it can be picked
> directly from there.

Not directly; you still have to take a lock.

Cheers,

Jeff