Re: Checkpoint sync pause - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Checkpoint sync pause
Date
Msg-id 4F319603.6090501@gregsmith.com
In response to Re: Checkpoint sync pause  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On 02/03/2012 11:41 PM, Jeff Janes wrote:
>> -The steady stream of backend writes that happen between checkpoints have
>> filled up most of the OS write cache.  A look at /proc/meminfo shows around
>> 2.5GB "Dirty:"
> "backend writes" includes bgwriter writes, right?

Right.

> Has using a newer kernal with dirty_background_bytes been tried, so it
> could be set to a lower level?  If so, how did it do?  Or does it just
> refuse to obey below the 5% level, as well?

Trying to dip below 5% using dirty_background_bytes slows VACUUM down 
faster than it improves checkpoint latency.  Since the sort of servers 
that have checkpoint issues are quite often ones that have VACUUM issues, 
too, that whole path doesn't seem very productive.  The one test I 
haven't tried yet is whether increasing the size of the VACUUM ring 
buffer might improve how well the server responds to a lower write cache.

> If there is 3GB of dirty data spread over >300 segments and each segment
> is about full-sized (1GB), then on average <1% of each segment is
> dirty?
>
> If that analysis holds, then it seems like there is simply an awful lot
> of data that has to be written randomly, no matter how clever the
> re-ordering is.  In other words, it is not that a harried or panicked
> kernel or RAID controller is failing to do good re-ordering when it has
> opportunities to, it is just that you dirty your data too randomly for
> substantial reordering to be possible even under ideal conditions.

Averages are deceptive here.  This data follows the usual distribution 
for real-world data, which is that there is a hot chunk of data that 
receives far more writes than average (particularly index blocks), along 
with a long tail of segments that are only seeing one or two 8K blocks 
modified (catalog data, stats, application metadata).
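
For what it's worth, one way to eyeball that skew from inside the 
database is the pg_buffercache contrib module.  A sketch of the sort of 
query involved (this only sees shared_buffers, not the OS write cache, 
so it's just an approximation of the write distribution):

  SELECT c.relname, count(*) AS dirty_buffers
  FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = c.relfilenode
  WHERE b.isdirty
    AND b.reldatabase =
        (SELECT oid FROM pg_database WHERE datname = current_database())
  GROUP BY c.relname
  ORDER BY dirty_buffers DESC;

The heavily written tables and indexes float to the top with large 
counts, followed by a long tail of relations with only a block or two 
dirty.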

Plenty of useful reordering happens here.  It's happening in Linux's 
cache and in the controller's cache.  The constant stream of 
checkpoint syncs doesn't stop that.  It does seem to do two bad things 
though:  a) it makes some of these bad cache-filled situations more 
likely, and b) it doesn't leave any I/O capacity unused for clients to 
get some work done.  One of the real possibilities I've been considering 
more lately is that the value we've seen from the pauses during sync 
isn't so much about optimizing I/O; instead it comes from allowing a 
brief window of client backend I/O to slip in between the cache-filling 
checkpoint syncs.

> Does the BBWC, once given an fsync command and reporting success,
> write out those blocks forthwith, or does it lollygag around like the
> kernel (under non-fsync) does?  If it is waiting around for
> write-combining opportunities that will never actually materialize in
> sufficient quantities to make up for the wait, how to get it to stop?
>
> Was the sorted checkpoint with an fsync after every file (real file,
> not VFD) one of the changes you tried?

As far as I know the typical BBWC is always working.  When a read or a 
write comes in, it starts moving immediately.  When it gets behind, it 
starts making seek decisions more intelligently based on visibility of 
the whole queue.  But they don't delay doing any work at all the way 
Linux does.

I haven't had very good luck with sorting checkpoints at the PostgreSQL 
relation level on server-size systems.  There is a lot of sorting 
already happening at both the OS (~3GB) and BBWC (>=512MB) levels on 
this server.  My own tests on my smaller test server--with a scaled down 
OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a 
useful technique on top of that.  It's never bubbled up to being 
considered a likely win on the production one as a result.

>> DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
>> DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
>> DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
>> DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
>> DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
>> DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
>> DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
>> DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
>> DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
>> DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec
> Syncs 3 and 5 kind of surprise me.  It seems like the times should be
> more bimodal.  If the cache is already full, why doesn't the system
> promptly collapse, like it does later?  And if it is not, why would it
> take 12 seconds, or even 2 seconds?  Or is this just evidence that the
> gaps you are inserting are partially, but not completely, effective?

Given a mix of completely random I/O, a 24 disk array like this system 
has is lucky to hit 20MB/s clearing it out.  It doesn't take too much of 
that before even 12 seconds makes sense; at 20MB/s, draining just 256MB 
of randomly scattered writes takes close to 13 seconds.  The 45 second 
pauses are the ones demonstrating the controller's cache is completely 
overwhelmed.  It's rare to see caching turn truly bimodal, because the 
model for it has both a variable ingress and egress rate involved.  Even 
as the checkpoint sync is pushing stuff in, writes are being evacuated 
at some speed out the other end.

> What I/O are they trying to do?  It seems like all your data is in RAM
> (if not, I'm surprised you can get queries to run fast enough to
> create this much dirty data).  So they probably aren't blocking on
> reads which are being interfered with by all the attempted writes.

Reads on infrequently read data.  Long tail again; even though caching 
is close to 100%, the occasional outlier client who wants some rarely 
accessed page with their personal data on it shows up.  Pollute the 
write caches badly enough, and what happens to reads mixed in there 
gets very fuzzy.  It depends on the exact mechanics of the I/O scheduler 
used in the kernel version deployed.

> The current shared_buffer allocation method (or my misunderstanding of
> it) reminds me of the joke about the guy who walks into his kitchen
> with a cow-pie in his hand and tells his wife "Look what I almost
> stepped in".  If you find a buffer that is usagecount=0 and unpinned,
> but dirty, then why is it dirty?  It is likely to be dirty because the
> background writer can't keep up.  And if the background writer can't
> keep up, it is probably having trouble with writes blocking.  So, for
> Pete's sake, don't try to write it out yourself!  If you can't find a
> clean, reusable buffer in a reasonable number of attempts, I guess at
> some point you need to punt and write one out.  But currently it grabs
> the first unpinned usagecount=0 buffer it sees and writes it out if
> dirty, without even checking if the next one might be clean.

Don't forget that in the version deployed here, the background writer 
isn't running during the sync phase.  I think the direction you're 
talking about here circles back to "why doesn't the BGW just put things 
it finds clean onto the free list?", a direction which would make 
"nothing on the free list" a noteworthy event suggesting the BGW needs 
to run more often.
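
One way to see how often the "usagecount=0 but dirty" case actually 
comes up is to snapshot the buffer pool with pg_buffercache (assuming 
that contrib module is installed) while the system is under load:

  SELECT usagecount, isdirty, count(*) AS buffers
  FROM pg_buffercache
  GROUP BY usagecount, isdirty
  ORDER BY usagecount, isdirty;

A large pile of usagecount=0, dirty buffers during the sync phase is 
exactly the situation you're describing, where allocations keep stepping 
in the cow-pie.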

> One option for pgbench I've contemplated was better latency reporting.
>   I don't really want to have to mine very large log files (and just
> writing them out can produce IO that competes with the IO you actually
> care about, if you don't have a lot of controllers around to isolate
> everything).

Every time I've measured this, I've found it to be <1% of the total 
I/O.  The single line of data with latency counts, written buffered, is 
pretty slim compared with the >=8K any write transaction is likely to 
have touched.  The only time I've found the disk writing overhead 
becoming serious on an absolute scale is when I'm running read-only 
in-memory benchmarks, where the rate might hit >100K TPS.  But by 
definition, that sort of test has I/O bandwidth to spare, so there it 
doesn't actually impact results much.  Just a fraction of a core doing 
some sequential writes.

> Also, what limits the amount of work that needs to get done?  If you
> make a change that decreases throughput but also decreases latency,
> then something else has got to give.

The thing that is giving way here is total time taken to execute the 
checkpoint.  There's even a theoretical gain possible from that.  It's 
possible to prove (using the pg_stat_bgwriter counts) that having 
checkpoints less frequently decreases total I/O, because there are fewer 
writes of the most popular blocks happening.  Right now, when I tune 
that to decrease total I/O the upper limit is when it starts spiking up 
latency.  This new GUC is trying to allow a different way to increase 
checkpoint time that seems to do less of that.
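
The comparison involved is just snapshotting pg_stat_bgwriter before 
and after a timed run at each checkpoint setting.  A minimal sketch; the 
delta in buffers_checkpoint, times the (default) 8K block size, is the 
write volume the checkpoints were responsible for:

  SELECT now() AS snapshot_time,
         checkpoints_timed, checkpoints_req,
         buffers_checkpoint, buffers_clean, buffers_backend
  FROM pg_stat_bgwriter;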

> What problems do you see with pgbench?  Can you not reproduce
> something similar to the production latency problems, or can you
> reproduce them, but things that fix the problem in pgbench don't
> translate to production?  Or the other way around, things that work in
> production didn't work in pgbench?

I can't simulate something similar enough to the production latency 
problem.  Your comments about doing something like specifying 50 "-f" 
files or a weighting are in the right area; it might be possible to hack 
a better simulation with an approach like that.  The thing that makes 
wandering that way even harder than it seems at first is how we split 
the pgbench work among multiple worker threads.
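
To give an idea of the direction, here is a hypothetical custom script 
along those lines (the 1000-row hot range and the 90/10 split are made 
up for illustration), concentrating most UPDATE traffic on a hot slice 
while still touching the long tail:

  \set naccounts 100000 * :scale
  \setrandom pick 1 10
  \setrandom hot_aid 1 1000
  \setrandom tail_aid 1 :naccounts
  UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = CASE WHEN :pick > 1 THEN :hot_aid ELSE :tail_aid END;

Feeding pgbench several variations of that, each via its own -f file 
with a different hot range, gets closer to the production access 
pattern than the stock TPC-B-like script does.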

I'm not using connection pooling for the pgbench simulations I'm doing.  
There's some of that happening in the production application server.

> But I would think that pgbench can be configured to do that as well,
> and would probably offer a wider array of other testers.  Of course, if
> they have to copy and specify 30 different -f files, maybe getting
> dbt-2 to install and run would be easier than that.  My attempts at
> getting dbt-5 to work for me do not make me eager jump from pgbench to
> try more other things.

dbt-5 is a work in progress, known to be tricky to get going.  dbt-2 is 
mature enough that it was used for this sort of role in 8.3 
development.  And it's even used by other database systems for similar 
testing.  It's the closest thing to an open-source standard for 
write-heavy workloads that we'll find here.

What I'm doing right now is recording a large amount of pgbench data for 
my test system here, to validate it has the typical problems pgbench 
runs into.  Once that's done I expect to switch to dbt-2 and see whether 
it's a more useful latency test environment.  That plan is working out 
fine so far; it just hit a couple of weeks of unanticipated delay.

> Do we have a theoretical guess on about how fast you should be able to
> go, based on the RAID capacity and the speed and density at which you
> dirty data?

This is a hard question to answer; it's something I've been thinking 
about and modeling a lot lately.  The problem is that the speed an array 
writes at depends on how many reads or writes it does during each seek 
and/or rotation.  The array here can do 1GB/s of all sequential I/O, and 
15 - 20MB/s on all random I/O.  The more efficiently writes are 
scheduled, the more like sequential I/O the workload becomes.  Any 
attempt to even try to estimate real-world throughput needs the number 
of concurrent processes as another input, and the complexity of the 
resulting model is high.
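
As a toy illustration of the shape of the problem (the linear blend 
here is just a guess at the curve; the endpoints are this array's 
measured numbers), the estimate swings enormously with how sequential 
the scheduler manages to make the writes:

  SELECT seq_fraction,
         round(seq_fraction * 1000 + (1 - seq_fraction) * 20) AS est_mb_per_sec
  FROM (VALUES (0.05), (0.25), (0.50)) AS t(seq_fraction);

And that's before the number of concurrent processes even enters the 
picture.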

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com    Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support    www.2ndQuadrant.com


