Re: Hardware/OS recommendations for large databases ( - Mailing list pgsql-performance

From: Alan Stange
Subject: Re: Hardware/OS recommendations for large databases (
Date:
Msg-id: 438241E5.2010701@rentec.com
In response to: Re: Hardware/OS recommendations for large databases (  ("Luke Lonergan" <llonergan@greenplum.com>)
Responses: Re: Hardware/OS recommendations for large databases (  ("Luke Lonergan" <llonergan@greenplum.com>)
Re: Hardware/OS recommendations for large databases (  ("Luke Lonergan" <llonergan@greenplum.com>)
List: pgsql-performance

Luke,

It's time to back yourself up with some numbers.  You're claiming that
portions of postgresql need a significant rewrite, and you haven't done
the work to make that case.

You've apparently made some mistakes in using dd to benchmark a storage
system.  Use lmdd, umount the file system before the read, and post your
results.  Using a file 2x the size of memory doesn't work correctly.
You can quote any other numbers you want, but until you use lmdd
correctly you should be ignored.  Ideally, since postgresql uses 1GB
files, you'll want to use 1GB files for dd as well.
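Something like this is what I have in mind (a sketch only; the mount
point and file name are placeholders taken from my earlier example):

# write 1GB (one postgresql segment) with a sync before close
lmdd if=internal of=/fidb1/bigfile bs=8k count=131072 sync=1

# remount to drop the cached pages, then time the read back
umount /fidb1
mount /fidb1
lmdd if=/fidb1/bigfile of=internal bs=8k count=131072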

Luke Lonergan wrote:
> Alan,
>
> On 11/21/05 6:57 AM, "Alan Stange" <stange@rentec.com> wrote:
>
>
>> $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
>> 800000+0 records in
>> 800000+0 records out
>>
>> real    0m13.780s
>> user    0m0.134s
>> sys     0m13.510s
>>
>> Oops.   I just wrote 470MB/s to a file system that has a peak write
>> speed of 200MB/s.
>>
> How much RAM on this machine?
>
Doesn't matter.  The result will always be wrong without a call to
sync() or fsync() before the close() if you're trying to measure the
speed of the disk subsystem.  Add that sync() and the result will be
correct for any memory size.  Just for completeness: Solaris implicitly
calls sync() as part of close().  Bonnie used to get this wrong, so
quoting Bonnie isn't any good.  Note that on some systems using 2x
memory for these tests is almost OK.  For example, Solaris used to have
a high-water mark that would throttle processes and not allow more than
a few hundred KB of writes to be outstanding on a file.  Linux/XFS
clearly allows a lot of write data to be outstanding.  It's better to
understand the tools, what they do, and why they can be wrong than to
simply quote some other tool that makes the same mistakes.
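If you must use plain dd, the old trick is to time the sync together
with the write so the flush is included in the measurement (a sketch;
the size and path are placeholders):

time sh -c 'dd if=/dev/zero of=/fidb1/bigfile bs=8k count=131072 && sync'

Note that sync flushes everything dirty in the system, not just this
one file, so if anything this errs on the slow side.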

I find that postgresql is able to achieve about 175MB/s on average from
a system capable of delivering 200MB/s peak, and it does this with a
lot of cpu time to spare.  Maybe dd can do a little better and deliver
185MB/s.  If I were to double the speed of my IO system, I might find
that a single postgresql instance can sink about 300MB/s of data (based
on the last numbers I posted).  That's why I have multi-cpu opterons
and run more than one query/client: they soak up the remaining IO
capacity.
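For example (a sketch; the table names are made up), two clients doing
sequential scans at once will together soak up more of the IO capacity
than either can alone:

psql -c "select count(*) from big_table_1" &
psql -c "select count(*) from big_table_2" &
wait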

It is guaranteed that postgresql will hit some performance threshold in
the future, and rewrites of some core functionality may be needed, but
no numbers posted here so far have made the case that postgresql is in
trouble now.  In the meantime: build balanced systems with cpus that
match the capabilities of the storage subsystems, use 32KB block sizes
for large-memory databases doing lots of sequential scans, use file
systems tuned for large files, use opterons, etc.


As always, one has to post some numbers.   Here's an example of how dd
doesn't do what you might expect:

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k
16.7772 MB in 0.0235 secs, 714.5931 MB/sec

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k sync=1
16.7772 MB in 0.1410 secs, 118.9696 MB/sec

Both numbers are "correct".  But one measures the kernel's ability to
absorb 2000 8KB writes with no guarantee that the data is on disk, and
the second measures the disk subsystem's ability to write 16MB of data.
dd is equivalent to the first result.  You can't use the first type of
result and complain that postgresql is slow.  If you wrote 16G of data
on a machine with 8G of memory, then your dd result is possibly too
fast by a factor of two, as 8G of the data might not be on disk yet.
We won't know until you post some results.

Cheers,

-- Alan

