Re: sniff test on some PG 8.4 numbers - Mailing list pgsql-performance

From Greg Smith
Subject Re: sniff test on some PG 8.4 numbers
Date
Msg-id 513D5D62.3030400@2ndQuadrant.com
In response to Re: sniff test on some PG 8.4 numbers  (Jon Nelson <jnelson+pgsql@jamponi.net>)
Responses Re: sniff test on some PG 8.4 numbers
List pgsql-performance
On 3/10/13 9:18 PM, Jon Nelson wrote:

> The following is with ext4, nobarrier, and noatime. As noted in the
> original post, I have done a fair bit of system tuning. I have the
> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
> respectively.

That's good, but be aware those values are still essentially unlimited
write caches.  A server with 4 good but regular hard drives might do as
little as 10MB/s of random writes on a real workload.  If 2GB of data
ends up dirty, the flushing that happens at the end of a database
checkpoint has to clear all of that out of RAM.  At 10MB/s, that means
roughly a 3 minute long cache flush to push out 2GB.  It's not unusual
for pgbench tests to pause for over a minute straight when that happens.
With your setup, where checkpoints happen every 5 minutes, this is only
happening once per test run.  The disruption isn't easily visible if you
look at the average rate; it's outweighed by the periods where writes
happen very fast because the cache isn't full yet.  You have to get
pgbench to plot latency over time to see the stalls and then analyze
that data.  This problem is the main reason I put together the
pgbench-tools set for running things: once you have to process the
latency files and make graphs from them, looking at the results by hand
becomes a pain.
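
If you want to see those stalls yourself without pulling in
pgbench-tools, here is a rough sketch (assuming the 9.2 per-transaction
log format) that turns on pgbench's latency logging and buckets the
results by second:

  pgbench -j 8 -c 32 -M prepared -T 600 -l
  # Each line in the pgbench_log.* files is:
  #   client_id  transaction_no  latency_us  file_no  epoch_sec  epoch_usec
  # Average latency per second, in milliseconds; multi-second checkpoint
  # stalls show up as obvious spikes:
  awk '{ sum[$5] += $3; n[$5]++ }
       END { for (t in sum) printf "%s %.1f\n", t, sum[t]/n[t]/1000 }' \
      pgbench_log.* | sort -n > latency_by_second.txt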

> I built 9.2 and, using 9.2 with the following pgbench invocation:
>
> pgbench  -j 8  -c 32 -M prepared -T 600
>
> transaction type: TPC-B (sort of)
> scaling factor: 400

I misread this completely in your message before; I thought you wrote
4000.  A scaling factor of 400 creates a database that's roughly 6GB in
size.  Your test is basically seeing how fast the system memory and the
RAID cache can move things around.  In that situation, your read and
write numbers are reasonable.  They aren't actually telling you anything
useful about the disks though, because the disks are barely involved
here.  You've sniffed the CPU, memory, and RAID controller and they
smell fine.  You'll need at least an order of magnitude increase in
scale to get a whiff of the disks.

Each pgbench scale factor adds approximately 16MB to the database.  You
don't actually stress the drives until that total size is at least 2X as
big as RAM.  We had to raise the limit on pgbench scales recently
because earlier versions only go up to a scale of ~20,000, and that's
not a big enough scale to test many servers now.
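
To put rough numbers on that: I don't know how much RAM this box has,
so treat the 64GB below purely as an illustration of the arithmetic for
picking a disk-bound scale:

  RAM_MB=65536                       # assumed 64GB; substitute your own
  SCALE=$(( (RAM_MB * 2) / 16 ))     # ~16MB per scale factor, target 2X RAM
  echo "$SCALE"                      # 8192 with the assumption above
  pgbench -i -s "$SCALE" pgbench     # ~128GB database; db name is just an example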

On the select-only tests, much of the increase from ~100K to ~200K is
probably from going from 8.4 to 9.2.  There are two major and several
minor tuning changes that make it much more efficient at that specific
task.
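
For reference, the select-only variant I'm talking about is just the
same run with -S, for example:

  pgbench -S -j 8 -c 32 -M prepared -T 600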

> These are the *only* changes I've made to the config file:
>
> shared_buffers = 32GB
> wal_buffers = 16MB
> checkpoint_segments = 1024

Note that these are the only changes that actually impact pgbench
results.  The test doesn't stress very many parts of the system, such as
the query optimizer.

Also be aware these values may not be practical to use in production.
You can expect bad latency issues due to having shared_buffers so large.
All of that memory, wherever it's been modified, has to be reconciled
and written to disk at each checkpoint, and 32GB of such work is a lot.
I have systems where we can't make shared_buffers any bigger than 4GB
before checkpoint pauses get too bad.
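
One way to see how heavy that checkpoint work really is, sketched here
with a database name that's only an example:

  # In postgresql.conf, log each checkpoint's write and sync timing:
  #   log_checkpoints = on
  # Then watch how many buffers checkpoints are having to write:
  psql -d pgbench -c \
    "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
       FROM pg_stat_bgwriter"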

Similarly, setting checkpoint_segments to 1024 means that you might go
through 16GB of writes before a checkpoint happens.  That's great for
average performance... but when that checkpoint does hit, you're facing
a large random I/O backlog.
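
The arithmetic behind that 16GB figure, for reference:

  # WAL segments are 16MB each, so the ceiling between checkpoints is roughly:
  echo $(( 1024 * 16 ))    # => 16384 MB, i.e. 16GB of WAL
  # Dropping checkpoint_segments to 64 would cap that at about 1GB instead,
  # at the cost of more frequent checkpoints.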

There's not much you can do about all this on the Linux side.  If you
drop the dirty_* parameters too much, maintenance operations like VACUUM
start to get slow.  Really all you can do is avoid setting
shared_buffers and checkpoint_segments too high, so the checkpoint
backlog never gets gigantic.  The tuning you've done uses higher values
than we normally recommend, and it's not quite practical to deploy like
that.  That and the very small database are probably why your numbers
are so high.
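
If you do experiment with lowering them, the bytes-based knobs you're
already using are the right ones; the values below are only an
illustration of the trade-off, not a recommendation for your hardware:

  # Smaller write cache: shorter checkpoint-time flush backlog, but bulk
  # operations like VACUUM have less room to stream writes through.
  sysctl -w vm.dirty_background_bytes=268435456    # 256MB
  sysctl -w vm.dirty_bytes=1073741824              # 1GB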

> Note: I did get better results with HT on vs. with HT off, so I've
> left HT on for now.

pgbench select-only in particular does like hyper-threading.  We get
occasional reports of more memory-bound workloads actually slowing down
when it's turned on.  I consider it a wash and leave it on.  Purchasing
and management people tend to get annoyed if they discover the core
count of the server is half what they thought they were buying.  The
potential downside of HT isn't big enough to be worth opening that can
of worms, unless you've run real application-level tests that prove it
hurts.
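
If you ever want to double-check what the box is presenting, something
like this on a reasonably recent Linux shows whether HT is active:

  lscpu | grep -E '^(Socket|Core|Thread)'
  # "Thread(s) per core: 2" means hyper-threading is on;
  # physical cores = Socket(s) x Core(s) per socket.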

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

