Re: Checkpoint spikes - Mailing list pgsql-performance

From Greg Smith
Subject Re: Checkpoint spikes
Date
Msg-id 4B1762E1.7010002@2ndquadrant.com
Whole thread Raw
In response to Re: Analyse without locking?  (Richard Neill <rn214@cam.ac.uk>)
Responses Re: Checkpoint spikes  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-performance
Richard Neill wrote:
> On the other hand, the system never seems to be I/O bound. (we have at
> least 25 MB/sec of write bandwidth, and use a small fraction of that
> normally).
I would bet that if you sampled vmstat or iostat every single second,
you'd discover there's a large burst in write speed for the same few
seconds that queries are stuck.  If you're averaging out the data over a
5 second or longer period, you'll never see it--the spike will get lost
in the average.  You just can't monitor checkpoint spikes unless you're
watching I/O with an extremely tight time resolution.  Watching the
"Writeback" figure in /proc/meminfo is helpful too, that is where I
normally see everything jammed up.

> Here's the typical checkpoint logs:
> 2009-12-03 06:21:21 GMT LOG:  checkpoint complete: wrote 12400 buffers
> (2.2%); 0 transaction log file(s) added, 0 removed, 12 recycled;
> write=149.883 s, sync=5.143 s, total=155.040 s
See that "sync" number there?  That's your problem; while that sync
operation is going on, everybody else is grinding to a halt waiting for
it.  Not a coincidence that the duration is about the same amount of
time that your queries are getting stuck.  This example shows 12400
buffers = 97MB of total data written.  Since those writes are pretty
random I/O, it's easily possible to get stuck for a few seconds waiting
for that much data to make it out to disk.  You only gave the write
phase a couple of minutes to spread things out over; meanwhile, Linux
may not even bother starting to write things out until 30 seconds into
that, so the effective time between when writes to disk start and when
the matching sync happens on your system is extremely small.  That's not
good--you have to give that several minutes of breathing room if you
want to avoid checkpoint spikes.

> We're using 8.4.1, on ext4 with SSD. Is it possible that something
> exotic is occurring to do with write barriers (on by default in ext4,
> and we haven't changed this).
> Perhaps a low priority IO process for writing the previous WAL to disk
> is blocking a high-priority transaction (which is trying to write to
> the new WAL). If the latter is trying to sync, could the large amount
> of lower priority IO be getting in the way thanks to write barriers?
> If so, can I safely turn off write barriers?
Linux is pretty dumb in general here.  fsync operations will usually end
up writing out way more of the OS buffer cache than they need to.  And
the write cache can get quite big before pdflush decides it should
actually do some work, the whole thing is optimized for throughput
rather than latency.  I don't really trust barriers at all, so I don't
know if there's some specific tuning you can do with those to improve
things.  Your whole system is bleeding edge craziness IMHO--SSD, ext4,
write barriers, all stuff that just doesn't work reliably yet far as I'm
concerned.

...but that's not what you want to hear.  When I can suggest that should
help is increasing checkpoint_segments (>32), checkpoint_timeout (>=10
minutes), checkpoint_completion_target (0.9), and lowering the amount of
writes Linux will cache before it gets more aggressive about flushing
them.  Those things will fight the root cause of the problem, by giving
more time between the "write" and "sync" phases of the checkpoint.  It's
ok if "write" takes a long while, decreasing the "sync" number is your
goal you need to keep your eye on.

I've written a couple of articles on this specific topic if you want
more background on the underlying issues, it's kind of heavy reading:

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


pgsql-performance by date:

Previous
From: Richard Neill
Date:
Subject: Re: Analyse without locking?
Next
From: Craig Ringer
Date:
Subject: Re: [BUGS] BUG #5228: Execution of prepared query is slow when timestamp parameter is used