
From: Alan Stange
Subject: Re: Hardware/OS recommendations for large databases (
Date:
Msg-id: 4381E077.80009@rentec.com
In response to: Re: Hardware/OS recommendations for large databases (  ("Luke Lonergan" <llonergan@greenplum.com>)
List: pgsql-performance
Luke Lonergan wrote:
> OK - slower this time:
> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
> which all are capped at 120MB/s when doing sequential scans with different
> versions of Postgres.
>
Postgresql issues the exact same sequence of read() calls as does dd.
So why is dd so much faster?
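(Easy enough to check: strace a backend while it runs the sequential
scan; the pid below is whatever your backend's pid happens to be.)

$ strace -e trace=read,lseek -p <backend pid> 2>&1 | head

You should see a steady stream of sequential read() calls against the
relation file, just like the ones dd issues.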

I'd be careful with the dd read of a 16GB file on an 8GB system.  Make
sure you umount the file system first, so that all of the file is
flushed from memory.  Some systems do free-behind on sequential reads to
avoid blowing out the page cache; without it, you'd find that half of
your 16GB file is still in memory.  The same point holds for the writes:
when dd finishes, not all the data is on disk.  You need to issue a
sync() call to make that happen, or use lmdd to ensure that the data is
actually all written.  In other words, I think your dd results are
possibly misleading.

It's trivial to demonstrate:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real    0m13.780s
user    0m0.134s
sys     0m13.510s

Oops.  I just wrote 470MB/s to a file system that has a peak write
speed of 200MB/s.

Now, you might say that since you wrote a 16GB file on an 8GB machine
this isn't an issue.  It still makes your dd numbers look fast, though,
as some of the data will be unwritten when dd exits.
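
For what it's worth, here's one way to get more honest dd numbers in
both directions, using the same file and mount point as above (this
assumes /fidb1 has an fstab entry so the bare mount works):

$ time sh -c 'dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000 && sync'
$ umount /fidb1 && mount /fidb1          # drop any cached pages for /fidb1
$ time dd if=/fidb1/bigfile of=/dev/null bs=8k

The first number now includes the time to flush everything to disk, and
the second is a true cold-cache read.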


I'd also suggest running dd on the same files as postgresql.  I suspect
you'd find that the on-disk layout of the postgresql files isn't that
good, as they are grown bit by bit, unlike a file created by dd'ing one
large file in a single pass.
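
Something along these lines would do it; the table name here is made
up, and the path depends on where your data directory lives:

$ psql -c "select relfilenode from pg_class where relname = 'bigtable'"
$ time dd if=$PGDATA/base/<db oid>/<relfilenode> of=/dev/null bs=32k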

> Understand my point: It doesn't matter that there is idle or iowait on the
> CPU, the postgres executor is not able to drive the I/O rate for two
> reasons: there is a lot of CPU used for the scan (the 40% you reported) and
> a lack of asynchrony (the iowait time).  That means that by speeding up the
> CPU you only reduce the first part, but you don't fix the second and v.v.
>
> With more aggressive readahead, the second problem (the I/O asynchrony) is
> handled better by the Linux kernel and filesystem.  That's what we're seeing
> with XFS.

I think your point doesn't hold up.  Every time you make it, I come away
posting another result showing it to be incorrect.

The point you're making doesn't match my experience with *any* storage or
program I've ever used, including postgresql.   Your point suggests that
the storage system is idle  and that postgresql is broken because it
isn't able to use the resources available...even when the cpu is very
idle.  How can that make sense?   The issue here is that the storage
system is very active doing reads on the files...which might be somewhat
poorly allocated on disk because postgresql grows the tables bit by bit.

I had the same readahead in Reiser and in XFS.   The XFS performance was
better because XFS does a better job of large file allocation on disk,
thus resulting in many fewer seeks (generated by the file system itself)
to read the files back in.   As an example, some file systems like UFS
purposely scatter large files across cylinder groups to avoid forcing
large seeks on small files; one can tune this behavior so that large
files are more tightly allocated.
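
If anyone wants to poke at this on their own systems, the readahead
setting and the actual on-disk layout are both easy to inspect on Linux
(the device and file names below are placeholders):

$ blockdev --getra /dev/sdX            # current readahead, in 512-byte sectors
$ xfs_bmap -v /path/to/relation/file   # extent map on XFS
$ filefrag -v /path/to/relation/file   # extent count on most other filesystems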



Of course, because this is engineering, I have another obligatory data
point:   This time it's a 4.2GB table using 137,138  32KB pages with
nearly 41 million rows.

A "select count(1)" on the table completes in 14.6 seconds, for an
average read rate of 320 MB/s.
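
The 14.6 seconds is just wall clock time for the query; with a made-up
table and database name, that's nothing more than:

$ time psql -d mydb -c "select count(1) from bigtable"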

One cpu was idle, the other averaged 32% system time and 68% user time
for the 14 second period.  This is on a 2.2GHz Opteron.  A faster cpu
would show increased performance, as I really am finally cpu bound.

Postgresql is clearly able to issue the relevant sequential read()
system calls and sink the resulting data without a problem if the file
system is capable of providing the data.  It can do this up to a speed
of ~300MB/s on this class of system.   Now it should be fairly simple to
tweak the few spots where some excess memory copies are being done and
up this result substantially.  I hope postgresql is always using the
libc memcpy as that's going to be a lot faster than some private routine.

-- Alan


