Thread: 8.3.9 - latency spikes with Linux (and tuning for consistently low latency)

8.3.9 - latency spikes with Linux (and tuning for consistently low latency)

From

Marinos Yannikos

Date:

15 April 2010, 14:14:10

Hi,

we are seeing latency spikes in the 2-3 second range (sometimes 8-10s) for
queries that usually take 3-4ms on our systems and I am running out of things to
try to get rid of them. Perhaps someone here has more ideas - here's a
description of the systems and what I've tried with no impact at all:

2 x 6-core Opterons (2431)
32GB RAM
2 SATA disks (WD1500HLFS) in software RAID-1
Linux 2.6.26 64 bit (Debian kernel)
PostgreSQL 8.3.9 (Debian package)
FS mounted with option noatime
vm.dirty_ratio = 80

3 DB clusters, 2 of which are actively used, all on the same RAID-1 FS
fsync=off
shared_buffers=5GB (database size is ~4.7GB on disk right now)
temp_buffers=50MB
work_mem=500MB
wal_buffers=256MB (*)
checkpoint_segments=256 (*)
commit_delay=100000 (*)
autovacuum=off (*)

(*) added while testing, no change w.r.t. the spikes seen at all

The databases have moderate read load (no burst load, typical web backend) and
somewhat regular write load (updates in batches, always single-row
update/delete/inserts using the primary key, 90% updates, a few 100s to 1000s
rows together, without explicit transactions/locking).

This is how long the queries take (seen from the client):
Thu Apr 15 18:16:14 CEST 2010 real      0m0.004s
Thu Apr 15 18:16:15 CEST 2010 real      0m0.004s
Thu Apr 15 18:16:16 CEST 2010 real      0m0.003s
Thu Apr 15 18:16:17 CEST 2010 real      0m0.005s
Thu Apr 15 18:16:18 CEST 2010 real      0m0.068s
Thu Apr 15 18:16:19 CEST 2010 real      0m0.004s
Thu Apr 15 18:16:20 CEST 2010 real      0m0.005s
Thu Apr 15 18:16:21 CEST 2010 real      0m0.235s
Thu Apr 15 18:16:22 CEST 2010 real      0m0.005s
Thu Apr 15 18:16:23 CEST 2010 real      0m3.006s <== !
Thu Apr 15 18:16:27 CEST 2010 real      0m0.004s
Thu Apr 15 18:16:28 CEST 2010 real      0m0.084s
Thu Apr 15 18:16:29 CEST 2010 real      0m0.003s
Thu Apr 15 18:16:30 CEST 2010 real      0m0.005s
Thu Apr 15 18:16:32 CEST 2010 real      0m0.038s
Thu Apr 15 18:16:33 CEST 2010 real      0m0.005s
Thu Apr 15 18:16:34 CEST 2010 real      0m0.005s

The spikes aren't periodic, i.e. not every 10,20,30 seconds or 5 minutes etc,
they seem completely random... PostgreSQL also reports (due to
log_min_duration_statement=1000) small bursts of queries that take much longer
than they should:

[nothing for a few minutes]
2010-04-15 16:50:03 CEST LOG:  duration: 8995.934 ms  statement: select ...
2010-04-15 16:50:04 CEST LOG:  duration: 3383.780 ms  statement: select ...
2010-04-15 16:50:04 CEST LOG:  duration: 3328.523 ms  statement: select ...
2010-04-15 16:50:05 CEST LOG:  duration: 1120.108 ms  statement: select ...
2010-04-15 16:50:05 CEST LOG:  duration: 1079.879 ms  statement: select ...
[nothing for a few minutes]
(explain analyze yields 5-17ms for the above queries)

Things I've tried apart from the PostgreSQL parameters above:
- switching from ext3 with default journal settings to data=writeback
- switching to ext2
- vm.dirty_background_ratio set to 1, 10, 20, 60
- vm.dirty_expire_centisecs set to 3000 (default), 8640000 (1 day)
- fsync on
- some inofficial Debian 2.6.32 kernel and ext3 with data=writeback (because of
http://lwn.net/Articles/328363/ although it seems to address fsync latency and
not read latency)
- running irqbalance

All these had no visible impact on the latency spikes.

I can also exclude faulty hardware with some certainty (since we have 12
identical systems with this problem).

I am suspecting some strange software RAID or kernel problem, unless the default
bgwriter settings can actually cause selects to get stuck for so long when there
are too many dirty buffers (I hope not). Unless I'm missing something, I only
have a non-RAID setup or ramdisks (tmpfs), or SSDs left to try to get rid of
these, so any suggestion will be greatly appreciated. Generally, I'd be very
interested in hearing how people tune their databases and their hardware/Linux
for consistently low query latency (esp. when everything should fit in memory).

Regards,
  Marinos

Re: 8.3.9 - latency spikes with Linux (and tuning for consistently low latency)

From

Tom Lane

Date:

15 April 2010, 14:38:17

Marinos Yannikos <mjy@geizhals.at> writes:
> we are seeing latency spikes in the 2-3 second range (sometimes 8-10s) for
> queries that usually take 3-4ms on our systems and I am running out of things to
> try to get rid of them.

Have you checked whether the spikes correlate with checkpoints?  Turn
on log_checkpoints and watch for awhile.  If so, fooling with the
checkpoint parameters might give some relief.  However, 8.3 already has
the spread-checkpoint code so I'm not sure how much more win can be had
there.

More generally, you should watch vmstat/iostat output and see if you
can correlate the spikes with I/O activity, CPU peaks, etc.

A different line of thought is that maybe the delays have to do with lock
contention --- log_lock_waits might help you identify that.

> fsync=off

That's pretty scary.

> work_mem=500MB

Yipes.  I don't think you have enough RAM for that to be safe.

> commit_delay=100000 (*)

This is probably not a good idea.

> autovacuum=off (*)

Nor this.

            regards, tom lane

Re: 8.3.9 - latency spikes with Linux (and tuning for consistently low latency)

From

"Kevin Grittner"

Date:

15 April 2010, 14:47:44

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Have you checked whether the spikes correlate with checkpoints?
> Turn on log_checkpoints and watch for awhile.  If so, fooling with
> the checkpoint parameters might give some relief.

If that by itself doesn't do it, I've found that making the
background writer more aggressive can help.  We've had good luck
with:

bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4.0

If you still have checkpoint-related latency issues, you could try
scaling back shared_buffers, letting the OS cache handle more of the
data.

Also, if you have a RAID controller with a battery-backed RAM cache,
make sure it is configured for write-back.

-Kevin

Re: 8.3.9 - latency spikes with Linux (and tuning for consistently low latency)

From

Greg Smith

Date:

15 April 2010, 19:45:30

Marinos Yannikos wrote:
> vm.dirty_ratio = 80

This is tuned the opposite direction of what you want.  The default
tuning in the generation of kernels you're using is:

/proc/sys/vm/dirty_ratio = 10
/proc/sys/vm/dirty_background_ratio = 5

And those should be considered upper limits if you want to tune for latency.

Unfortunately, even 5% will still allow 1.6GB of dirty data to queue up
without being written given 32GB of RAM, which is still plenty to lead
to a multi-second pause at times.

> 3 DB clusters, 2 of which are actively used, all on the same
> [software] RAID-1 FS

So your basic problem here is that you don't have enough disk I/O to
support this load.  You can tune it all day and that fundamental issue
will never go away.  You'd need a battery-backed write controller
capable of hardware RAID to even have a shot at supporting a system with
this much RAM without long latency pauses.  I'd normally break out the
WAL onto a separate volume too.

> [nothing for a few minutes]
> 2010-04-15 16:50:03 CEST LOG:  duration: 8995.934 ms  statement:
> select ...
> 2010-04-15 16:50:04 CEST LOG:  duration: 3383.780 ms  statement:
> select ...
> 2010-04-15 16:50:04 CEST LOG:  duration: 3328.523 ms  statement:
> select ...
> 2010-04-15 16:50:05 CEST LOG:  duration: 1120.108 ms  statement:
> select ...
> 2010-04-15 16:50:05 CEST LOG:  duration: 1079.879 ms  statement:
> select ...
> [nothing for a few minutes]

Guessing five minutes each time?  You should turn on checkpoint_logs to
be sure, but I'd bet money that's the interval, and that these are
checkpoint spikes.  If the checkpoing log shows up at about the same
time as all these queries that were blocking behind it, that's what
you've got.

> shared_buffers=5GB (database size is ~4.7GB on disk right now)

The best shot you have at making this problem a little better just with
software tuning is to reduce this to something much smaller; 128MB -
256MB would be my starting suggestion.  Make sure checkpoint_segments is
still set to a high value.

The other thing you could try is to tune like this:

checkpoint_segments=256MB
checkpoint_timeout=20min

Which would get you 4X as much checkpoint spreading as you have now.

> fsync=off

This is just generally a bad idea.

> work_mem=500MB
> wal_buffers=256MB (*)
> commit_delay=100000 (*)

That's way too big a value for work_mem; there's no sense making
wal_buffers bigger than 16MB; and you shouldn't ever adjust
commit_delay.  It's a mostly broken feature that might even introduce
latency issues in your situation.  None of these are likely related to
your problem today though.

> I am suspecting some strange software RAID or kernel problem, unless
> the default bgwriter settings can actually cause selects to get stuck
> for so long when there are too many dirty buffers (I hope not).

This fairly simple:  your kernel is configured to allow the system to
cache hundreds of megabytes, if not gigabytes, of writes.  There is no
way to make that go completely away because the Linux kernel has an
unfortunate design in terms of being low latency.  I've written two
papers in this area:

http://www.westnet.com/~gsmith/content/linux-pdflush.htm
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

And I doubt I could get the worst case on these tuned down to under a
second using software RAID without a proper disk controller.
Periodically, the database must get everything in RAM flushed out to
disk, and the only way to make that happen instantly is for there to be
a hardware write cache to dump it into, and the most common way to get
one of those is to buy a hardware RAID card.

> Unless I'm missing something, I only have a non-RAID setup or ramdisks
> (tmpfs), or SSDs left to try to get rid of these

Battery-backed write caching controller, and then re-tune afterwards.
Nothing else will improve your situation very much.  SSDs have their own
issues under heavy writes and the RAID has nothing to do with your
problem.  If this is disposable data and you can run from a RAM disk,
now that would work, but now you've got some serious work to do in order
to make that persistent.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us