Thread: Re: Hardware/OS recommendations for large databases (

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 9:31 AM, "Alan Stange" <stange@rentec.com> wrote:
>
>
>> Here's the output from one iteration of iostat -k 60 while the box is
>> doing a select count(1) on a 238GB table.
>>
>> avg-cpu:  %user   %nice    %sys %iowait   %idle
>>            0.99    0.00   17.97   32.40   48.64
>>
>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdd             345.95    130732.53         0.00    7843952          0
>>
>> We're reading 130MB/s for a full minute.  About 20% of a single cpu was
>> being used.   The remainder being idle.
>>
>
> Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
> system otherwise idle?
>
Actually, this was dual cpu and there was other activity during the full
minute, but it was on other file devices, which I didn't include in the
above output.   Given that, and given what I see on the box now I'd
raise the 20% to 30% just to be more conservative.  It's all in the
kernel either way; using a different scheduler or file system would
change that result.  Even better would be using direct IO to not flush
everything else from memory and avoid some memory copies from kernel to
user space.  Note that almost none of the time is user time.  Changing
postgresql won't change the cpu usage.

One IMHO obvious improvement would be to have vacuum and analyze only do
direct IO.  Now they appear to be very effective memory flushing tools.
Table scans on tables larger than say 4x memory should probably also use
direct IO for reads.
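
For illustration, here is a minimal sketch of what a direct-IO read loop
looks like on Linux (hypothetical code, not anything from postgresql;
O_DIRECT needs _GNU_SOURCE and an aligned buffer):

/* Read a file with O_DIRECT so the data bypasses the OS page cache and
 * doesn't evict anything else.  Illustrative sketch only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 32768)) { perror("posix_memalign"); return 1; }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, 32768)) > 0)
        total += n;         /* pages read this way never land in the cache */

    printf("read %lld bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}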

>
>
>> We've done nothing fancy and achieved results you claim shouldn't be
>> possible.  This is a system that was re-installed yesterday, no tuning
>> was done to the file systems, kernel or storage array.
>>
>
> Are you happy with 130MB/s?  How much did you pay for that?  Is it more than
> $2,000, or double my 2003 PC?
>
I don't know what the system cost.   It was part of a block of dual
opterons from Sun that we got some time ago.   I think the 130MB/s is
slow given the hardware, but it's acceptable.  I'm not too price
sensitive; I care much more about reliability, uptime, etc.

>
>
>> What am I doing wrong?
>>
>> 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
>> (for a DOE lab).   And now I don't know what I'm doing,
>>
> Cool.  Would that be Sandia?
>
> We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
> complex queries.
Disk?!  4 StorageTek tape silos.  That would be .002 TB/s.  One has to
change how you think when you have that much data. And hope you don't
have a fire, because there's no backup.   That work was while I was at
BNL.   I believe they are now at 4PB of tape and 150TB of disk.

-- Alan

Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Alan Stange <stange@rentec.com> writes:

> Luke Lonergan wrote:
> > Alan,
> >
> > On 11/18/05 9:31 AM, "Alan Stange" <stange@rentec.com> wrote:
> >
> >
> >> Here's the output from one iteration of iostat -k 60 while the box is
> >> doing a select count(1) on a 238GB table.
> >>
> >> avg-cpu:  %user   %nice    %sys %iowait   %idle
> >>            0.99    0.00   17.97   32.40   48.64
> >>
> >> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> >> sdd             345.95    130732.53         0.00    7843952          0
> >>
> >> We're reading 130MB/s for a full minute.  About 20% of a single cpu was
> >> being used.   The remainder being idle.
> >>
> >
> > Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
> > system otherwise idle?
> >
> Actually, this was dual cpu

I hate to agree with him but that looks like a dual machine with one CPU
pegged. Yes most of the time is being spent in the kernel, but you're still
basically cpu limited.

That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
drives and quite respectable for a 3-disk stripe set, even reasonable for a
4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
only getting 130MB/s then it does seem likely the cpu is actually holding you
back here.

Still it doesn't show Postgres being nearly so CPU wasteful as the original
poster claimed.

> It's all in the kernel either way; using a different scheduler or file
> system would change that result. Even better would be using direct IO to not
> flush everything else from memory and avoid some memory copies from kernel
> to user space. Note that almost none of the time is user time. Changing
> postgresql won't change the cpu usage.

Well changing to direct i/o would still be changing Postgres so that's
unclear. And there are plenty of more mundane ways that Postgres is
responsible for how efficiently or not the kernel is used. Just using fewer
syscalls to do the same amount of reading would reduce cpu consumption.


> One IMHO obvious improvement would be to have vacuum and analyze only do direct
> IO.  Now they appear to be very effective memory flushing tools.  Table scans
> on tables larger than say 4x memory should probably also use direct IO for
> reads.

--
greg

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,

On 11/18/05 10:30 AM, "Alan Stange" <stange@rentec.com> wrote:

> Actually, this was dual cpu and there was other activity during the full
> minute, but it was on other file devices, which I didn't include in the
> above output.   Given that, and given what I see on the box now I'd
> raise the 20% to 30% just to be more conservative.  It's all in the
> kernel either way; using a different scheduler or file system would
> change that result.  Even better would be using direct IO to not flush
> everything else from memory and avoid some memory copies from kernel to
> user space.  Note that almost none of the time is user time.  Changing
> postgresql won't change the cpu usage.

These are all things that possibly help on the IO wait side; however, there
is a producer/consumer problem in postgres that goes something like this:

- Read some (small number of, sometimes 1) 8k pages
- Do some work on those pages, including lots of copies
- repeat

This back and forth without threading (like AIO, or a multiprocessing
executor) causes cycling and inefficiency that limits throughput.
Optimizing some of the memcopies and other garbage out, plus increasing the
internal (postgres) readahead would probably double the disk bandwidth.
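
Schematically (a made-up sketch, not the actual executor code), that loop
looks like this - the read and the per-page work are strictly serialized, so
neither the disk nor the CPU is kept continuously busy:

/* Synchronous scan loop as described above.  Illustration only. */
#include <unistd.h>

#define BLCKSZ 8192

long tuples_seen;

void process_page(const char *page)
{
    (void) page;                    /* stand-in for tuple work and copies */
    tuples_seen++;
}

void seq_scan(int fd)
{
    char    page[BLCKSZ];
    ssize_t n;

    for (;;)
    {
        n = read(fd, page, BLCKSZ); /* CPU blocks here (the iowait slice) */
        if (n <= 0)
            break;
        process_page(page);         /* disk sits idle here                */
    }
}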

But being disk-bound (meaning that the disk subsystem is running at full
speed) requires asynchronous I/O.  We do this now with Bizgres MPP, and we
get fully saturated disk channels on every machine.  That means that even on
one machine, we run many times faster than non-MPP postgres.

> One IMHO obvious improvement would be to have vacuum and analyze only do
> direct IO.  Now they appear to be very effective memory flushing tools.
> Table scans on tables larger than say 4x memory should probably also use
> direct IO for reads.

That's been suggested many times prior - I agree, but this also needs AIO to
be maximally effective.

> I don't know what the system cost.   It was part of a block of dual
> opterons from Sun that we got some time ago.   I think the 130MB/s is
> slow given the hardware, but it's acceptable.  I'm not too price
> sensitive; I care much more about reliability, uptime, etc.

Then I know what they cost - we have them too (V20z and V40z).  You should
be getting 400MB/s+ with external RAID.

>>> What am I doing wrong?
>>>
>>> 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
>>> (for a DOE lab).   And now I don't know what I'm doing,
>>>
>> Cool.  Would that be Sandia?
>>
>> We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
>> complex queries.
> Disk?!  4 StorageTek tape silos.  That would be .002 TB/s.  One has to
> change how you think when you have that much data. And hope you don't
> have a fire, because there's no backup.   That work was while I was at
> BNL.   I believe they are now at 4PB of tape and 150TB of disk.

We had 1.5 Petabytes on 2 STK Silos at NAVO from 1996-1998 where I ran R&D.
We also had a Cray T932, an SGI Origin 3000 with 256 CPUs, a Cray T3E with
1280 CPUs, 2 Cray J916s with 1 TB of shared disk, a Cray C90-16, a Sun E10K,
etc etc, along with clusters of Alpha machines and lots of SGIs.  It's nice
to work with a $40M annual budget.

Later, working with FSL, we implemented a weather forecasting cluster built
from 512 Alpha nodes that ultimately became the #5 fastest computer on the
TOP500 supercomputing list.  That machine had a 10-way shared SAN, tape
robotics and a Myrinet interconnect and ran 64-bit Linux (in 1998).

- Luke



Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Luke Lonergan wrote:
>> I don't know what the system cost.   It was part of a block of dual
>> opterons from Sun that we got some time ago.   I think the 130MB/s is
>> slow given the hardware, but it's acceptable.  I'm not too price
>> sensitive; I care much more about reliability, uptime, etc.
>>
> Then I know what they cost - we have them too (V20z and V40z).  You should
> be getting 400MB/s+ with external RAID.
Yes, but we don't.   This is where I would normally begin a rant on how
craptacular Linux can be at times.  But, for the sake of this
discussion, postgresql isn't reading the data any more slowly than does
any other program.

And we don't have the time to experiment with the box.

I know it should be better, but it's good enough for our purposes at
this time.

-- Alan


Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Greg Stark wrote:
> Alan Stange <stange@rentec.com> writes:
>
>
>> Luke Lonergan wrote:
>>
>>> Alan,
>>>
>>> On 11/18/05 9:31 AM, "Alan Stange" <stange@rentec.com> wrote:
>>>
>>>
>>>
>>>> Here's the output from one iteration of iostat -k 60 while the box is
>>>> doing a select count(1) on a 238GB table.
>>>>
>>>> avg-cpu:  %user   %nice    %sys %iowait   %idle
>>>>            0.99    0.00   17.97   32.40   48.64
>>>>
>>>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>>> sdd             345.95    130732.53         0.00    7843952          0
>>>>
>>>> We're reading 130MB/s for a full minute.  About 20% of a single cpu was
>>>> being used.   The remainder being idle.
>>>>
>>>>
>>> Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
>>> system otherwise idle?
>>>
>>>
>> Actually, this was dual cpu
>>
>
> I hate to agree with him but that looks like a dual machine with one CPU
> pegged. Yes most of the time is being spent in the kernel, but you're still
> basically cpu limited.
>
> That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
> drives and quite respectable for a 3-disk stripe set, even reasonable for a
> 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
> only getting 130MB/s then it does seem likely the cpu is actually holding you
> back here.
>
> Still it doesn't show Postgres being nearly so CPU wasteful as the original
> poster claimed.
>
Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy
and 60% idle (aka iowait in the above numbers).
Of that 40%, other things were happening as well during the 1 minute
snapshot.   During some iostat outputs that I didn't post the cpu time
was ~ 20%.

So, you can take your pick.   The single cpu usage is somewhere between
20% and 40%.  As I can't remove other users of the system, it's the best
measurement that I can make right now.

Either way, it's not close to being cpu bound.  This is with Opteron
248, 2.2Ghz cpus.

Note that the storage system has been a bit disappointing:  it's an IBM
FAStT600 with a 200MB/s fiber attachment.   It could be better, but
it's not been the bottleneck in our work, so we haven't put any energy
into it.

>> It's all in the kernel either way; using a different scheduler or file
>> system would change that result. Even better would be using direct IO to not
>> flush everything else from memory and avoid some memory copies from kernel
>> to user space. Note that almost none of the time is user time. Changing
>> postgresql won't change the cpu usage.
>>
> Well changing to direct i/o would still be changing Postgres so that's
> unclear. And there are plenty of more mundane ways that Postgres is
> responsible for how efficiently or not the kernel is used. Just using fewer
> syscalls to do the same amount of reading would reduce cpu consumption.
Absolutely.  This is why we're using a 32KB block size and also switched
to using O_SYNC for the WAL syncing method.     That's many MB/s that
don't need to be cached in the kernel (thus evicting other data), and we
avoid all the fsync/fdatasync syscalls.
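
At the syscall level the difference is roughly this (a sketch only, not the
postgresql source):

/* With O_SYNC each write() returns only once the data is on stable storage,
 * so no separate fsync()/fdatasync() call is needed afterwards. */
#include <fcntl.h>
#include <unistd.h>

void wal_write_osync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    (void) write(fd, buf, len);     /* durable when this returns        */
    close(fd);
}

void wal_write_fdatasync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    (void) write(fd, buf, len);     /* may only be in the OS cache here */
    fdatasync(fd);                  /* extra syscall to force it out    */
    close(fd);
}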

The purpose of direct IO isn't to make the vacuum or analyze faster, but
to lessen their impact on queries with someone waiting for the
results.   That's our biggest hit:  running a sequential scan on 240GB
of data and flushing everything else out of memory.

Now that I think about this a bit, a big chunk of time is probably
being lost in TLB misses and other virtual memory events that would be
avoided if a larger page size was being used.

-- Alan


Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,

On 11/18/05 11:39 AM, "Alan Stange" <stange@rentec.com> wrote:

> Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy
> and 60% idle (aka iowait in the above numbers).

The "aka iowait" is the problem here - iowait is not idle (otherwise it
would be in the "idle" column).

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as predicted,
one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
be context switched to other users, but as I pointed out earlier, that's not
useful for getting response on decision support queries.

Thanks for your data, it exemplifies many of the points brought up:
- Lots of disks and expensive I/O hardware does not help improve performance
on large table queries because I/O bandwidth does not scale beyond
110-120MB/s on the fastest CPUs
- OLTP performance optimizations are different than decision support

Regards,

- Luke



Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 11:39 AM, "Alan Stange" <stange@rentec.com> wrote:
>
>
>> Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy
>> and 60% idle (aka iowait in the above numbers).
>>
>
> The "aka iowait" is the problem here - iowait is not idle (otherwise it
> would be in the "idle" column).
>
> Iowait is time spent waiting on blocking io calls.  As another poster
> pointed out, you have a two CPU system, and during your scan, as predicted,
> one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
> be context switched to other users, but as I pointed out earlier, that's not
> useful for getting response on decision support queries.
>
iowait time is idle time. Period.   This point has been debated
endlessly for Solaris and other OS's as well.

Here's the man page:
              %iowait
                     Show  the  percentage  of  time that the CPU or CPUs were
                     idle during which the system had an outstanding disk  I/O
                     request.

If the system had some other cpu bound work to perform you wouldn't ever
see any iowait time.  Anyone claiming the cpu was 100% busy on the
sequential scan using the one set of numbers I posted is
misunderstanding the actual metrics.

> Thanks for your data, it exemplifies many of the points brought up:
> - Lots of disks and expensive I/O hardware does not help improve performance
> on large table queries because I/O bandwidth does not scale beyond
> 110-120MB/s on the fastest CPUs
>
I don't think that is the conclusion from the anecdotal numbers I posted.
This file subsystem doesn't perform as well as expected for any tool.
Bonnie, dd, star, etc., don't get a better data rate either.   In fact,
the storage system wasn't built for performance; it was built to
reliably hold a big chunk of data.   Even so,  postgresql is reading at
130MB/s on it, using about 30% of a single cpu, almost all of which was
system time.   I would get the same 130MB/s on a system with cpus that
were substantially slower; the limitation isn't the cpus, or
postgresql.  It's the IO system that is poorly configured for this test,
not postgresql's ability to use it.

In fact, given the numbers I posted, it's clear this system could
handily generate more than 120 MB/s using a single cpu given a better IO
subsystem;  it has cpu time to spare.   A simple test can be done:
build the database in /dev/shm and time the scans.  It's the same read()
system call being used and now one has made the IO system "infinitely
fast".  The claim is being made that standard postgresql is unable to
generate more than 120MB/s of IO on any IO system due to an inefficient
use of the kernel API and excessive memory copies, etc.  Having the
database be on a ram based file system is an example of "expensive IO
hardware" and all else would be the same.   Hmmm, now that I think about
this,  I could throw a medium sized table onto /dev/shm using
tablespaces on one of our 8GB linux boxes.    So why is this experiment
not valid, or what is it about the above assertion that I am missing?


Anyway, if one cares about high speed sequential IO, then one should use
a much larger block size to start.   Using 8KB IOs is inappropriate for
such a configuration.  We happen to be using 32KB blocks on our largest
database and it's been the best move for us.

-- Alan

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Another data point.

We had some down time on our system today to complete some maintenance
work.  It took the opportunity to rebuild the 700GB file system using
XFS instead of Reiser.

One iostat output for 30 seconds is

avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.58    0.00   19.69   31.94   46.78

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             343.73    175035.73       277.55    5251072       8326

while doing a select count(1) on the same large table as before.
Subsequent iostat output all showed that this data rate was being
maintained.  The system is otherwise mostly idle during this measurement.

The sequential read rate is 175MB/s.  The system is the same as earlier,
one cpu is idle and the second is ~40% busy doing the scan and ~60%
idle.   This is  postgresql 8.1rc1, 32KB block size.  No tuning except
for using a 1024KB read ahead.

The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel
controller).  I see no reason why this configuration wouldn't generate
higher IO rates if a faster IO connection were available.

Can you explain again why you think there's an IO ceiling of 120MB/s
because I really don't understand?

-- Alan



Re: Hardware/OS recommendations for large databases (

From
William Yu
Date:
Alan Stange wrote:
> Luke Lonergan wrote:
>> The "aka iowait" is the problem here - iowait is not idle (otherwise it
>> would be in the "idle" column).
>>
>> Iowait is time spent waiting on blocking io calls.  As another poster
>> pointed out, you have a two CPU system, and during your scan, as
>
> iowait time is idle time. Period.   This point has been debated
> endlessly for Solaris and other OS's as well.

I'm sure the theory is nice but here's my experience with iowait
just a minute ago. I run Linux/XFce as my desktop -- decided I wanted to
lookup some stuff in Wikipedia under Mozilla and my computer system
became completely unusable for nearly a minute while who knows what
Mozilla was doing. (Probably loading all the language packs.) I could
not even switch to IRC (already loaded) to chat with other people while
Mozilla was chewing up all my disk I/O.

So I went to another computer, connected to mine remotely (slow...) and
checked top. 90% in the "wa" column which I assume is the iowait column.
It may be idle in theory but it's not a very useful idle -- wasn't able
to switch to any programs already running, couldn't click on the XFce
launchbar to run any new programs.

Re: Hardware/OS recommendations for large databases (

From
"Steinar H. Gunderson"
Date:
On Sat, Nov 19, 2005 at 08:13:09AM -0800, Luke Lonergan wrote:
> Iowait is time spent waiting on blocking io calls.

To be picky, iowait is time spent in the idle task while the I/O queue is not
empty. It does not matter if the I/O is blocking or not (from userspace's
point of view), and if the I/O was blocking (say, PIO) from the kernel's
point of view, it would be counted in system.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
William Yu wrote:
> Alan Stange wrote:
>> Luke Lonergan wrote:
>>> The "aka iowait" is the problem here - iowait is not idle (otherwise it
>>> would be in the "idle" column).
>>>
>>> Iowait is time spent waiting on blocking io calls.  As another poster
>>> pointed out, you have a two CPU system, and during your scan, as
>>
>> iowait time is idle time. Period.   This point has been debated
>> endlessly for Solaris and other OS's as well.
>
> I'm sure the theory is nice but here's my experience with iowait
> just a minute ago. I run Linux/XFce as my desktop -- decided I wanted
> to lookup some stuff in Wikipedia under Mozilla and my computer system
> became completely unusable for nearly a minute while who knows what
> Mozilla was doing. (Probably loading all the language packs.) I could
> not even switch to IRC (already loaded) to chat with other people
> while Mozilla was chewing up all my disk I/O.
>
> So I went to another computer, connected to mine remotely (slow...)
> and checked top. 90% in the "wa" column which I assume is the iowait
> column. It may be idle in theory but it's not a very useful idle --
> wasn't able to switch to any programs already running, couldn't click
> on the XFce launchbar to run any new programs.

So, you have a sucky computer.    I'm sorry, but iowait is still idle
time, whether you believe it or not.

-- Alan


Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Alan Stange <stange@rentec.com> writes:

> > Iowait is time spent waiting on blocking io calls.  As another poster
> > pointed out, you have a two CPU system, and during your scan, as predicted,
> > one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
> > be context switched to other users, but as I pointed out earlier, that's not
> > useful for getting response on decision support queries.

I don't think that's true. If the syscall was preemptable then it wouldn't
show up under "iowait", but rather "idle". The time spent in iowait is time in
uninterruptable sleeps where no other process can be scheduled.

> iowait time is idle time. Period.   This point has been debated endlessly for
> Solaris and other OS's as well.
>
> Here's the man page:
>               %iowait
>                      Show  the  percentage  of  time that the CPU or CPUs were
>                      idle during which the system had an outstanding disk  I/O
>                      request.
>
> If the system had some other cpu bound work to perform you wouldn't ever see
> any iowait time.  Anyone claiming the cpu was 100% busy on the sequential scan
> using the one set of numbers I posted is misunderstanding the actual metrics.

That's easy to test. rerun the test with another process running a simple C
program like "main() {while(1);}" (or two invocations of that on your system
because of the extra processor). I bet you'll see about half the percentage of
iowait because postgres will get half as much opportunity to schedule i/o. If
what you are saying were true then you should get 0% iowait.
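
Something like this, one copy per processor:

/* Trivial CPU hog for the experiment above; start two of them alongside
 * the scan and watch the iowait column in iostat. */
int main(void)
{
    while (1)
        ;       /* burn one CPU */
    return 0;
}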

--
greg

Re: Hardware/OS recommendations for large databases (

From
"Steinar H. Gunderson"
Date:
On Sun, Nov 20, 2005 at 09:22:41AM -0500, Greg Stark wrote:
> I don't think that's true. If the syscall was preemptable then it wouldn't
> show up under "iowait", but rather "idle". The time spent in iowait is time in
> uninterruptable sleeps where no other process can be scheduled.

You are confusing userspace with kernel space. When a process is stuck in
uninterruptable sleep, it means _that process_ can't be interrupted (say,
by a signal). The kernel can preempt it without problems.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Greg Stark wrote:
> Alan Stange <stange@rentec.com> writes:
>
>
>>> Iowait is time spent waiting on blocking io calls.  As another poster
>>> pointed out, you have a two CPU system, and during your scan, as predicted,
>>> one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
>>> be context switched to other users, but as I pointed out earlier, that's not
>>> useful for getting response on decision support queries.
>>>
>
> I don't think that's true. If the syscall was preemptable then it wouldn't
> show up under "iowait", but rather "idle". The time spent in iowait is time in
> uninterruptable sleeps where no other process can be scheduled.
>
That would be wrong.   The time spent in iowait is idle time.  The
iowait stat would be 0 on a machine with a compute bound runnable
process available for each cpu.

Come on people, read the man page or look at the source code.   Just
stop making stuff up.


>
>> iowait time is idle time. Period.   This point has been debated endlessly for
>> Solaris and other OS's as well.
>>
>> Here's the man page:
>>               %iowait
>>                      Show  the  percentage  of  time that the CPU or CPUs were
>>                      idle during which the system had an outstanding disk  I/O
>>                      request.
>>
>> If the system had some other cpu bound work to perform you wouldn't ever see
>> any iowait time.  Anyone claiming the cpu was 100% busy on the sequential scan
>> using the one set of numbers I posted is misunderstanding the actual metrics.
>>
>
> That's easy to test. rerun the test with another process running a simple C
> program like "main() {while(1);}" (or two invocations of that on your system
> because of the extra processor). I bet you'll see about half the percentage of
> iowait because postgres will get half as much opportunity to schedule i/o. If
> what you are saying were true then you should get 0% iowait.
Yes, I did this once about 10 years ago.   But instead of saying "I bet"
and guessing at the result, you should try it yourself.     Without
guessing, I can tell you that the iowait time will go to 0%.  You can do
this loop in the shell, so there's no code to write.  Also, it helps to
do this with the shell running at a lower priority.

-- Alan



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Alan Stange wrote:
> Another data point.
> We had some down time on our system today to complete some maintenance
> work.  It took the opportunity to rebuild the 700GB file system using
> XFS instead of Reiser.
>
> One iostat output for 30 seconds is
>
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.58    0.00   19.69   31.94   46.78
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdd             343.73    175035.73       277.55    5251072       8326
>
> while doing a select count(1) on the same large table as before.
> Subsequent iostat output all showed that this data rate was being
> maintained.  The system is otherwise mostly idle during this measurement.
>
> The sequential read rate is 175MB/s.  The system is the same as earlier,
> one cpu is idle and the second is ~40% busy doing the scan and ~60%
> idle.   This is  postgresql 8.1rc1, 32KB block size.  No tuning except
> for using a 1024KB read ahead.
>
> The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel
> controller).  I see no reason why this configuration wouldn't generate
> higher IO rates if a faster IO connection were available.
>
> Can you explain again why you think there's an IO ceiling of 120MB/s
> because I really don't understand?
>

I think what is going on here is that Luke's observation of the 120 MB/s
rate is taken from data using 8K block size - it looks like we can get
higher rates with 32K.

A quick test on my P3 system seems to support this (the numbers are a
bit feeble, but the difference is interesting):

The test is SELECT 1 FROM table, stopping Pg and unmounting the file
system after each test.

8K blocksize:
25 s elapsed
48 % idle from vmstat (dual cpu system)
70 % busy from gstat (Freebsd GEOM io monitor)
181819 pages in relation
56 MB/s effective IO throughput


32K blocksize:
23 s elapsed
44 % idle from vmstat
80 % busy from gstat
45249 pages in relation
60 MB/s effective IO throughput


I re-ran these several times - very repeatable (+/- 0.25 seconds).

This is Freebsd 6.0 with the readahead set to 16 blocks, UFS2 filesystem
created with 32K blocksize (both cases). It might be interesting to see
the effect of using 16K (the default) with the 8K Pg block size; I would
expect this to widen the gap.

Cheers

Mark


Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Mark Kirkwood wrote:

> The test is SELECT 1 FROM table

That should read "The test is SELECT count(1) FROM table...."

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,


On 11/19/05 8:43 PM, "Alan Stange" <stange@rentec.com> wrote:

> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdd             343.73    175035.73       277.55    5251072       8326
>
> while doing a select count(1) on the same large table as before.
> Subsequent iostat output all showed that this data rate was being
> maintained.  The system is otherwise mostly idle during this measurement.

Yes - interesting.  Note the other result using XFS that I posted earlier
where I got 240+MB/s.  XFS has more aggressive readahead, which is why I
used it.

> Can you explain again why you think there's an IO ceiling of 120MB/s
> because I really don't understand?

OK - slower this time:

We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
which all are capped at 120MB/s when doing sequential scans with different
versions of Postgres.

Understand my point: It doesn't matter that there is idle or iowait on the
CPU, the postgres executor is not able to drive the I/O rate for two
reasons: there is a lot of CPU used for the scan (the 40% you reported) and
a lack of asynchrony (the iowait time).  That means that by speeding up the
CPU you only reduce the first part, but you don't fix the second and v.v.

With more aggressive readahead, the second problem (the I/O asynchrony) is
handled better by the Linux kernel and filesystem.  That's what we're seeing
with XFS.

- Luke



Re: Hardware/OS recommendations for large databases (

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> OK - slower this time:

> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
> which all are capped at 120MB/s when doing sequential scans with different
> versions of Postgres.

Luke, sometime it would be nice if you would post your raw evidence
and let other people do their own analysis.  I for one have gotten
tired of reading sweeping generalizations unbacked by any data.

I find the notion of a magic 120MB/s barrier, independent of either
CPU or disk speed, to be pretty dubious to say the least.  I would
like to know exactly what the "wide variety" of data points you
haven't shown us are.

            regards, tom lane

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Luke Lonergan wrote:
> OK - slower this time:
> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
> which all are capped at 120MB/s when doing sequential scans with different
> versions of Postgres.
>
Postgresql issues the exact same sequence of read() calls as does dd.
So why is dd so much faster?

I'd be careful with the dd read of a 16GB file on an 8GB system.  Make
sure you umount the file system first, to make sure all of the file is
flushed from memory.   Some systems use a freebehind on sequential reads
to avoid flushing memory...and you'd find that 1/2 of your 16GB file is
still in memory.   The same point also holds for the writes:  when dd
finishes not all the data is on disk.   You need to issue a sync() call
to make that happen.  Use lmdd to ensure that the data is actually all
written.   In other words, I think your dd results are possibly misleading.

It's trivial to demonstrate:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real    0m13.780s
user    0m0.134s
sys     0m13.510s

Oops.   I just wrote 470MB/s to a file system that has a peak write speed
of 200MB/s.

Now, you might say that you wrote a 16GB file on an 8 GB machine so this
isn't an issue.   It does make your dd numbers look fast as some of the
data will be unwritten.


I'd also suggest running dd on the same files as postgresql.  I suspect
you'd find that the layout of the postgresql files isn't that good as
they are grown bit by bit, unlike the file created by simply dd'ing a
large file.

> Understand my point: It doesn't matter that there is idle or iowait on the
> CPU, the postgres executor is not able to drive the I/O rate for two
> reasons: there is a lot of CPU used for the scan (the 40% you reported) and
> a lack of asynchrony (the iowait time).  That means that by speeding up the
> CPU you only reduce the first part, but you don't fix the second and v.v.
>
> With more aggressive readahead, the second problem (the I/O asynchrony) is
> handled better by the Linux kernel and filesystem.  That's what we're seeing
> with XFS.

I think your point doesn't hold up.  Every time you make it, I come away
posting another result showing it to be incorrect.

The point you're making doesn't match my experience with *any* storage or
program I've ever used, including postgresql.   Your point suggests that
the storage system is idle  and that postgresql is broken because it
isn't able to use the resources available...even when the cpu is very
idle.  How can that make sense?   The issue here is that the storage
system is very active doing reads on the files...which might be somewhat
poorly allocated on disk because postgresql grows the tables bit by bit.

I had the same readahead in Reiser and in XFS.   The XFS performance was
better because XFS does a better job of large file allocation on disk,
thus resulting in many fewer seeks (generated by the file system itself)
to read the files back in.   As an example, some file systems like UFS
purposely scatter large files across cylinder groups to avoid forcing
large seeks on small files; one can tune this behavior so that large
files are more tightly allocated.



Of course, because this is engineering, I have another obligatory data
point:   This time it's a 4.2GB table using 137,138  32KB pages with
nearly 41 million rows.

A "select count(1)" on the table completes in 14.6 seconds, for an
average read rate of 320 MB/s.

One cpu was idle, the other averaged 32% system time and 68% user time
for the 14 second period.   This is on a 2.2Ghz Opteron.   A faster cpu
would show increased performance as I really am cpu bound finally.

Postgresql is clearly able to issue the relevant sequential read()
system calls and sink the resulting data without a problem if the file
system is capable of providing the data.  It can do this up to a speed
of ~300MB/s on this class of system.   Now it should be fairly simple to
tweak the few spots where some excess memory copies are being done and
up this result substantially.  I hope postgresql is always using the
libc memcpy as that's going to be a lot faster than some private routine.

-- Alan



Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Alan Stange <stange@rentec.com> writes:

> The point you're making doesn't match my experience with *any* storage or program
> I've ever used, including postgresql.   Your point suggests that the storage
> system is idle  and that postgresql is broken because it isn't able to use the
> resources available...even when the cpu is very idle.  How can that make sense?

Well I think what he's saying is that Postgres is issuing a read, then waiting
for the data to return. Then it does some processing, and goes back to issue
another read. The CPU is idle half the time because Postgres isn't capable of
doing any work while waiting for i/o, and the i/o system is idle half the time
while the CPU intensive part happens.

(Consider as a pathological example a program that reads 8k then sleeps for
10ms, and loops doing that 1,000 times. Now consider the same program
optimized to read 8M asynchronously and sleep for 10s. By the time it's
finished sleeping it has probably read in all 8M. Whereas the program that
read 8k in little chunks interleaved with small sleeps would probably take
twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
idle.)

It's a reasonable theory and it's not inconsistent with the results you sent.
But it's not exactly proven either. Nor is it clear how to improve matters.
Adding additional threads to handle the i/o adds an enormous amount of
complexity and creates lots of opportunity for other contention that could
easily eat all of the gains.

I also fear that heading in that direction could push Postgres even further
from the niche of software that works fine even on low end hardware into the
realm of software that only works on high end hardware. It's already suffering
a bit from that.

--
greg

Re: Hardware/OS recommendations for large databases (

From
Alvaro Herrera
Date:
Greg Stark wrote:

> I also fear that heading in that direction could push Postgres even further
> from the niche of software that works fine even on low end hardware into the
> realm of software that only works on high end hardware. It's already suffering
> a bit from that.

What's high end hardware for you?  I do development on a Celeron 533
machine with 448 MB of RAM and I find it to work well (for a "slow"
value of "well", certainly.)  If you're talking about embedded hardware,
that's another matter entirely and I don't think we really support the
idea of running Postgres on one of those things.

It's certainly true that the memory requirements have increased a
bit, but I don't think it really qualifies as "high end" even on 8.1.

--
Alvaro Herrera                           Developer, http://www.PostgreSQL.org
Jude: I wish humans laid eggs
Ringlord: Why would you want humans to lay eggs?
Jude: So I can eat them

Re: Hardware/OS recommendations for large databases (

From
Bill McGonigle
Date:
Would it be worth first agreeing on a common set of criteria to
measure?  I see many data points going back and forth but not much
agreement on what's worth measuring and how to measure.

I'm not necessarily trying to herd cats, but it sure would be swell to
have the several knowledgeable minds here come up with something that
could be uniformly tested on a range of machines, possibly even integrated
into pg_bench or something.  Disagreements on criteria or methodology
should be methodically testable.

Then I have dreams of a new pg_autotune that would know about these
kinds of system-level settings.

I haven't been on this list for long, and only using postgres for a
handful of years, so forgive it if this has been hashed out before.

-Bill
-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
bill@bfccomputing.com           Mobile: 603.252.2606
http://www.bfccomputing.com/    Pager: 603.442.1833
Jabber: flowerpt@gmail.com      Text: bill+text@bfccomputing.com
Blog: http://blog.bfccomputing.com/


Re: Hardware/OS recommendations for large databases (

From
Michael Stone
Date:
On Mon, Nov 21, 2005 at 02:01:26PM -0500, Greg Stark wrote:
>I also fear that heading in that direction could push Postgres even further
>from the niche of software that works fine even on low end hardware into the
>realm of software that only works on high end hardware. It's already suffering
>a bit from that.

Well, there are already a bunch of open source DB's that can handle
the low end. postgres is the closest thing to being able to handle the
high end.

Mike Stone

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,

On 11/21/05 6:57 AM, "Alan Stange" <stange@rentec.com> wrote:

> $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
> 800000+0 records in
> 800000+0 records out
>
> real    0m13.780s
> user    0m0.134s
> sys     0m13.510s
>
> Oops.   I just wrote 470MB/s to a file system that has a peak write speed
> of 200MB/s.

How much RAM on this machine?

> Now, you might say that you wrote a 16GB file on an 8 GB machine so this
> isn't an issue.   It does make your dd numbers look fast as some of the
> data will be unwritten.

This simple test, at 2x memory, correlates very closely to Bonnie++ numbers
for sequential scan.  What's more, we see close to the same peak in practice
with multiple scanners.  Furthermore, if you run two of them simultaneously
(on two filesystems), you can also see the I/O limited.

> I'd also suggest running dd on the same files as postgresql.  I suspect
> you'd find that the layout of the postgresql files isn't that good as
> they are grown bit by bit, unlike the file created by simply dd'ing a
> large file.

Can happen if you're not careful with filesystems (see above).

There's nothing "wrong" with the dd test.

> I think your point doesn't hold up.  Every time you make it, I come away
> posting another result showing it to be incorrect.

Prove it - your Reiserfs number was about the same.

I also posted an XFS number that was substantially higher than 110-120.

> The point you're making doesn't match my experience with *any* storage or
> program I've ever used, including postgresql.   Your point suggests that
> the storage system is idle  and that postgresql is broken because it
> isn't able to use the resources available...even when the cpu is very
> idle.  How can that make sense?   The issue here is that the storage
> system is very active doing reads on the files...which might be somewhat
> poorly allocated on disk because postgresql grows the tables bit by bit.

Then you've made my point - if the problem is contiguity of files on disk,
then larger allocation blocks would help on the CPU side.

The objective is clear: given a high performance filesystem, how much of the
available bandwidth can Postgres achieve?  I think what we're seeing is that
XFS is dramatically improving that objective.

> I had the same readahead in Reiser and in XFS.   The XFS performance was
> better because XFS does a better job of large file allocation on disk,
> thus resulting in many fewer seeks (generated by the file system itself)
> to read the files back in.   As an example, some file systems like UFS
> purposely scatter large files across cylinder groups to avoid forcing
> large seeks on small files; one can tune this behavior so that large
> files are more tightly allocated.

Our other tests have used ext3, reiser and Solaris 10 UFS, so this might
make some sense.

> Of course, because this is engineering, I have another obligatory data
> point:   This time it's a 4.2GB table using 137,138  32KB pages with
> nearly 41 million rows.
>
> A "select count(1)" on the table completes in 14.6 seconds, for an
> average read rate of 320 MB/s.

So, assuming that the net memory scan rate is about 2GB/s, and two copies
(one from FS cache to buffer cache, one from buffer cache to the agg node),
you have a 700MB/s filesystem with the equivalent of DirectIO (no FS cache)
because you are reading directly from the I/O cache.  You got half of that
because the I/O processing in the executor is limited to 320MB/s on that
fast CPU.

My point is this: if you were to decrease the filesystem speed to say
400MB/s and still use the equivalent of DirectIO, I think Postgres would not
deliver 320MB/s, but rather something like 220MB/s due to the
producer/consumer arch of the executor.  If you get that part, then we're on
the same track, otherwise we disagree.
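
(Back of the envelope, assuming the read and the per-page work are strictly
serialized: 1/rate_scan = 1/rate_io + 1/rate_cpu.  The observed 320MB/s
against a ~700MB/s filesystem implies a CPU-side rate of roughly 590MB/s, and
plugging 400MB/s back in gives about 240MB/s - the same ballpark as the
220MB/s above.  Rough numbers only.)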

> One cpu was idle, the other averaged 32% system time and 68 user time
> for the 14 second period.   This is on a 2.2Ghz Opteron.   A faster cpu
> would show increased performance as I really am cpu bound finally.

Yep, with the equivalent of DirectIO you are.

- Luke



Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Tom,

On 11/21/05 6:56 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> OK - slower this time:
>
>> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
>> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
>> which all are capped at 120MB/s when doing sequential scans with different
>> versions of Postgres.
>
> Luke, sometime it would be nice if you would post your raw evidence
> and let other people do their own analysis.  I for one have gotten
> tired of reading sweeping generalizations unbacked by any data.

This has partly been a challenge to get others to post their results.

> I find the notion of a magic 120MB/s barrier, independent of either
> CPU or disk speed, to be pretty dubious to say the least.  I would
> like to know exactly what the "wide variety" of data points you
> haven't shown us are.

I'll try to put up some of them, they've occurred over the last 3 years on
various platforms including:
- Dual 3.2GHz Xeon, 2 x Adaptec U320 SCSI attached to 6 x 10K RPM disks,
Linux 2.6.4(?) - 2.6.10 kernel, ext2/3 and Reiser filesystems
120-130MB/s Postgres seq scan rate on 7.4 and 8.0.

- Dual 1.8 GHz Opteron, 2 x LSI U320 SCSI attached to 6 x 10K RPM disks,
Linux 2.6.10 kernel, ext2/3 and Reiser filesystems
110-120MB/s Postgres seq scan rate on 8.0

- Same machine as above running Solaris 10, with UFS filesystem.  When I/O
caching is tuned, we reach the same 110-120MB/s Postgres seq scan rate

- Same machine as above with 7 x 15K RPM 144GB disks in an external disk
tray, same scan rate

Only when we got these new SATA systems and tried out XFS with large
readahead have we been able to break past the 120-130MB/s.  After Alan's
post, it seems that XFS might be a big part of that.  I think we'll test
ext2/3 against XFS on the same machine to find out.

It may have to wait a week, as many of us are on vacation.

- Luke



Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Luke,

it's time to back yourself up with some numbers.   You're claiming the
need for a significant rewrite of portions of postgresql and you haven't
done the work to make that case.

You've apparently made some mistakes on the use of dd to benchmark a
storage system.   Use lmdd and umount the file system before the read
and post your results.  Using a file 2x the size of memory doesn't work
correctly.  You can quote any other numbers you want, but until you use
lmdd correctly you should be ignored.  Ideally, since postgresql uses
1GB files, you'll want to use 1GB files for dd as well.

Luke Lonergan wrote:
> Alan,
>
> On 11/21/05 6:57 AM, "Alan Stange" <stange@rentec.com> wrote:
>
>
>> $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
>> 800000+0 records in
>> 800000+0 records out
>>
>> real    0m13.780s
>> user    0m0.134s
>> sys     0m13.510s
>>
>> Oops.   I just wrote 470MB/s to a file system that has a peak write speed
>> of 200MB/s.
>>
> How much RAM on this machine?
>
Doesn't matter.  The result will always be wrong without a call to
sync() or fsync() before the close() if you're trying to measure the
speed of the disk subsystem.   Add that sync() and the result will be
correct for any memory size.  Just for completeness:  Solaris implicitly
calls sync() as part of close.   Bonnie used to get this wrong, so
quoting Bonnie isn't any good.   Note that on some systems using 2x
memory for these tests is almost OK.  For example, Solaris used to have
a hiwater mark that would throttle processes and not allow more than a
few 100K of  writes to be outstanding on a file.  Linux/XFS clearly
allows a lot of write data to be outstanding.  It's best to understand
the tools and know what they do and why they can be wrong than simply
quoting some other tool that makes the same mistakes.
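
The same point in code form (a made-up timing sketch; the path and sizes are
arbitrary):

/* Time N sequential 8KB writes twice: once as dd reports it (no sync before
 * close) and once after an fsync(), which is what lmdd sync=1 measures. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    static char buf[8192];
    int     i, fd;
    double  t0, t_cached, t_synced;

    memset(buf, 0, sizeof buf);
    fd = open("/tmp/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0600);

    t0 = now();
    for (i = 0; i < 800000; i++)
        (void) write(fd, buf, sizeof buf);  /* much of this stays in cache */
    t_cached = now() - t0;

    fsync(fd);                              /* force everything to disk    */
    t_synced = now() - t0;
    close(fd);

    printf("before fsync: %.1f MB/s\n", 800000 * 8.0 / 1024 / t_cached);
    printf("after fsync:  %.1f MB/s\n", 800000 * 8.0 / 1024 / t_synced);
    return 0;
}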

I find that postgresql is able to achieve about 175MB/s on average from
a system capable of delivering 200MB/s peak and it does this with a lot
of cpu time to spare.   Maybe dd can do a little better and deliver
185MB/s.    If I were to double the speed of my IO system, I might find
that a single postgresql instance can sink about 300MB/s of data (based
on the last numbers I posted).  That's why I have multi-cpu opterons and
more than one query/client as they soak up the remaining IO capacity.

It is guaranteed that postgresql will hit some threshold of performance
in the future and possible rewrites of some core functionality will be
needed, but no numbers posted here so far have made the case that
postgresql is in trouble now.     In the mean time, build balanced
systems with cpus that match the capabilities of the storage subsystems,
use 32KB block sizes for large memory databases that are doing lots of
sequential scans, use file systems tuned for large files, use opterons, etc.


As always, one has to post some numbers.   Here's an example of how dd
doesn't do what you might expect:

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k
16.7772 MB in 0.0235 secs, 714.5931 MB/sec

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k sync=1
16.7772 MB in 0.1410 secs, 118.9696 MB/sec

Both numbers are "correct".  But one measures the kernels ability to
absorb 2000 8KB writes with no guarantee that the data is on disk and
the second measures the disk subsystems ability to write 16MB of data.
dd is equivalent to the first result.  You can't use the first type of
result and complain that postgresql is slow.  If you wrote 16G of data
on a machine with 8G memory then your dd result is possibly too fast by
a factor of two as 8G of the data might not be on disk yet.  We won't
know until you post some results.

Cheers,

-- Alan


Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,

Unless noted otherwise all results posted are for block device readahead set
to 16M using "blockdev --setra=16384 <block_device>".  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[root@modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s
============================================================================

So lmdd with sync=1 is apparently equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both sides for this set of comparisons.

First, let's test ext2 versus "ext3, data=ordered", versus reiserfs versus
xfs:





Re: Hardware/OS recommendations for large databases (

From
Michael Stone
Date:
On Mon, Nov 21, 2005 at 10:14:29AM -0800, Luke Lonergan wrote:
>This has partly been a challenge to get others to post their results.

You'll find that people respond better if you don't play games with
them.

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke

================= Results ===================

Unless noted otherwise all results posted are for block device readahead set
to 16M using "blockdev --setra=16384 <block_device>".  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[root@modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s
============================================================================

So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus "ext3, data=ordered", versus xfs:

============================================================================
16GB write, then read
============================================================================
-----------------------
ext2:
-----------------------
[root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[root@modena1 dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

-----------------------
ext3, data=ordered:
-----------------------
[root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 137.1607 secs, 119.4511 MB/sec

[root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 48.7398 secs, 336.1527 MB/sec

-----------------------
xfs:
-----------------------
[root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 52.6141 secs, 311.3994 MB/sec

[root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
============================================================================

I'm liking xfs!  Something about the way files are laid out, as Alan
suggested, seems to dramatically improve write performance and perhaps
consequently the read also improves.  There doesn't seem to be a difference
between ext3 and ext2, as expected.

Now on to the Postgres 8 tests.  We'll do a 16GB table size to ensure that
we aren't reading from the read cache.  I'll write this file through
Postgres COPY to be sure that the file layout is as Postgres creates it. The
alternative would be to use COPY once, then tar/untar onto different
filesystems, but that may not duplicate the real world results.

These tests will use Bizgres 0_8_1, which is an augmented 8.0.3.  None of
the augmentations act to improve the executor I/O though, so for these
purposes it should be the same as 8.0.3.

============================================================================
26GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

3159138*8192/1000000
25879 Million Bytes, or 25.9GB

-----------------------
xfs:
-----------------------
llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 394908.501 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99425.223 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99187.205 ms

-----------------------
ext2:
-----------------------
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 395286.475 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 195756.381 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 122822.090 ms
============================================================================
Analysis of Postgres 8.0.3 results
============================================================================
                              ext2        xfs
Write Speed (MB/s)            114         311
Read Speed (MB/s)             332         407
Postgres Seq Scan (MB/s)      212         263
Scan % of lmdd Read Speed     63.9%       64.6%

Well - looks like we get linear scaling with disk/file subsystem speedup.

- Luke



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:

> So that leaves the question - why not more than 64% of the I/O scan rate?
> And why is it a flat 64% as the I/O subsystem increases in speed from
> 333-400MB/s?
>

It might be interesting to see what effect reducing the cpu consumption
  entailed by the count aggregation has - by (say) writing a little bit
of code to heap scan the desired relation (sample attached).

Cheers

Mark




/*
 * fastcount.c
 *
 * Do a count that uses considerably less CPU time than an aggregate.
 */

#include "postgres.h"

#include "funcapi.h"
#include "access/heapam.h"
#include "catalog/namespace.h"
#include "utils/builtins.h"


extern Datum fastcount(PG_FUNCTION_ARGS);


PG_FUNCTION_INFO_V1(fastcount);
Datum
fastcount(PG_FUNCTION_ARGS)
{
    text       *relname = PG_GETARG_TEXT_P(0);
    RangeVar   *relrv;
    Relation    rel;
    HeapScanDesc scan;
    HeapTuple    tuple;
    int64        result = 0;

    /* Use the name to get a suitable range variable and open the relation. */
    relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
    rel = heap_openrv(relrv, AccessShareLock);

    /* Start a heap scan on the relation. */
    scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        result++;
    }

    /* End the scan and close up the relation. */
    heap_endscan(scan);
    heap_close(rel, AccessShareLock);


    PG_RETURN_INT64(result);
}
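
To build and register it, something like this should work (a rough sketch -
the install paths are examples only, and it assumes gcc plus the pg_config
belonging to the target installation):

# compile against the server headers and install the shared library
gcc -O2 -fpic -shared -I"$(pg_config --includedir-server)" \
    -o /usr/local/pgsql/lib/fastcount.so fastcount.c

# register the C function; STRICT because the code doesn't handle NULL input
psql -c "CREATE FUNCTION fastcount(text) RETURNS bigint
         AS '/usr/local/pgsql/lib/fastcount', 'fastcount' LANGUAGE C STRICT;"

After that, select fastcount('tablename') should return the same number as
count(1) on the same table.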

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Luke,

- XFS will probably generate better data rates with larger files.   You
really need to use the same file size as PostgreSQL does.  Why compare the
speed of reading a 16GB file with the speed of reading 1GB files?  They
won't be the same.  If need be, write some code that does the test, or
modify lmdd to read a sequence of 1GB files (see the sketch below).   Will
this make a difference?  You don't know until you do it.   Any time you
cross a couple of powers of two in computing, you should expect some
differences.

- Did you umount the file system before reading the 16GB file back in?
Because if you didn't, then your read numbers are possibly garbage.
When the read began, 8GB of the file was in memory.   You'd be very naive
to think that the read of the first 8GB somehow flushed that cached data
out of memory.  After all, why would the kernel flush pages from file X
when you're in the middle of a sequential read of... file X?   I'm not sure
how Linux handles this, but Solaris would have found the 8GB still in
memory.

- What was the hardware and disk configuration on which these numbers
were generated?   For example, if you have a U320 controller, how did
the read rate become larger than 320MB/s?

- How did the results change from before?   Just posting the new results
is misleading, given all the boasting we've had to read about your past
results.

- There are two results below for writing to ext2: one at 209MB/s and
one at 113MB/s.  Why are they different?

- What was the CPU usage during these tests?   We see postgresql doing
200+MB/s of IO.   You've claimed many times that the machine would be
compute-bound at lower IO rates, so how much idle time does the CPU
still have?

- You wrote:  "We'll do a 16GB table size to ensure that we aren't
reading from the read cache."  Do you really believe that?   You have
to umount the file system before each test to ensure you're really
measuring the disk IO rate.   If I'm reading your results correctly, you
have three results each for ext2 and xfs, each faster than the prior one,
so it looks like at least one is clearly reading from the read cache.

- Gee, it's so nice of you to drop your 120MB/s observation.  I guess my
reading at 300MB/s wasn't convincing enough.  Yeah, I think it was the
cpus too...

- I wouldn't focus on the flat 64% of the data rate number.  It'll
probably be different on other systems.

I'm all for testing and testing.   It seems you still cut a corner
without umounting the file system first.  Maybe I'm a little too old
school on this, but I wouldn't spend a dime until you've done the
measurements correctly.
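
Something along these lines would do it (a rough sketch - the mount point,
file names, and file count are placeholders, and it assumes /dbfast1 is in
/etc/fstab so it can be remounted by mount point):

# write sixteen 1GB files (the segment size postgresql itself uses), then sync
for i in $(seq 1 16); do
    dd if=/dev/zero of=/dbfast1/seg.$i bs=8k count=131072    # 131072 x 8KB = 1GB
done
sync

# drop the page cache the honest way, then time the read-back
umount /dbfast1 && mount /dbfast1
time bash -c 'for i in $(seq 1 16); do dd if=/dbfast1/seg.$i of=/dev/null bs=8k; done'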

Good Luck.

-- Alan



Luke Lonergan wrote:
> Alan,
>
> Looks like Postgres gets sensible scan rate scaling as the filesystem speed
> increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
> got faster since I last tested this.
>
> The scaling looks like 64% of the I/O subsystem speed is available to the
> executor - so as the I/O subsystem increases in scan rate, so does Postgres'
> executor scan speed.
>
> So that leaves the question - why not more than 64% of the I/O scan rate?
> And why is it a flat 64% as the I/O subsystem increases in speed from
> 333-400MB/s?
>
> - Luke
>
> ================= Results ===================
>
> Unless noted otherwise all results posted are for block device readahead set
> to 16M using "blockdev --setra=16384 <block_device>".  All are using the
> 2.6.9-11 Centos 4.1 kernel.
>
> For those who don't have lmdd, here is a comparison of two results on an
> ext2 filesystem:
>
> ============================================================================
> [root@modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
> bs=8k count=800000 && sync)"
> 800000+0 records in
> 800000+0 records out
>
> real    0m33.057s
> user    0m0.116s
> sys     0m13.577s
>
> [root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=800000 sync=1
> 6553.6000 MB in 31.2957 secs, 209.4092 MB/sec
>
> real    0m33.032s
> user    0m0.087s
> sys     0m13.129s
> ============================================================================
>
> So lmdd with sync=1 is equivalent to a sync after a dd.
>
> I use 2x memory with dd for the *READ* performance testing, but let's make
> sure things are synced on both write and read for this set of comparisons.
>
> First, let's test ext2 versus "ext3, data=ordered", versus xfs:
>
> ============================================================================
> 16GB write, then read
> ============================================================================
> -----------------------
> ext2:
> -----------------------
> [root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 144.2670 secs, 113.5672 MB/sec
>
> [root@modena1 dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 49.3766 secs, 331.8170 MB/sec
>
> -----------------------
> ext3, data=ordered:
> -----------------------
> [root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 137.1607 secs, 119.4511 MB/sec
>
> [root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 48.7398 secs, 336.1527 MB/sec
>
> -----------------------
> xfs:
> -----------------------
> [root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 52.6141 secs, 311.3994 MB/sec
>
> [root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
> ============================================================================
>
> I'm liking xfs!  Something about the way files are layed out, as Alan
> suggested seems to dramatically improve write performance and perhaps
> consequently the read also improves.  There doesn't seem to be a difference
> between ext3 and ext2, as expected.
>
> Now on to the Postgres 8 tests.  We'll do a 16GB table size to ensure that
> we aren't reading from the read cache.  I'll write this file through
> Postgres COPY to be sure that the file layout is as Postgres creates it. The
> alternative would be to use COPY once, then tar/untar onto different
> filesystems, but that may not duplicate the real world results.
>
> These tests will use Bizgres 0_8_1, which is an augmented 8.0.3.  None of
> the augmentations act to improve the executor I/O though, so for these
> purposes it should be the same as 8.0.3.
>
> ============================================================================
> 26GB of DBT-3 data from the lineitem table
> ============================================================================
> llonergan=# select relpages from pg_class where relname='lineitem';
>  relpages
> ----------
>   3159138
> (1 row)
>
> 3159138*8192/1000000
> 25879 Million Bytes, or 25.9GB
>
> -----------------------
> xfs:
> -----------------------
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
>    count
> -----------
>  119994608
> (1 row)
>
> Time: 394908.501 ms
> llonergan=# select count(1) from lineitem;
>    count
> -----------
>  119994608
> (1 row)
>
> Time: 99425.223 ms
> llonergan=# select count(1) from lineitem;
>    count
> -----------
>  119994608
> (1 row)
>
> Time: 99187.205 ms
>
> -----------------------
> ext2:
> -----------------------
> llonergan=# select relpages from pg_class where relname='lineitem';
>  relpages
> ----------
>   3159138
> (1 row)
>
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
>    count
> -----------
>  119994608
> (1 row)
>
> Time: 395286.475 ms
> llonergan=# select count(1) from lineitem;
>    count
> -----------
>  119994608
> (1 row)
>
> Time: 195756.381 ms
> llonergan=# select count(1) from lineitem;
>    count
> -----------
>  119994608
> (1 row)
>
> Time: 122822.090 ms
> ============================================================================
> Analysis of Postgres 8.0.3 results
> ============================================================================
>                               ext2        xfs
> Write Speed                   114         311
> Read Speed                    332         407
> Postgres Seq Scan Speed       212         263
> Scan % of lmdd Read Speed     63.9%       64.6%
>
> Well - looks like we get linear scaling with disk/file subsystem speedup.
>
> - Luke
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>


Re: Hardware/OS recommendations for large databases (

From
Bruce Momjian
Date:
Greg Stark wrote:
>
> Alan Stange <stange@rentec.com> writes:
>
> > The point your making doesn't match my experience with *any* storage or program
> > I've ever used, including postgresql.   Your point suggests that the storage
> > system is idle  and that postgresql is broken because it isn't able to use the
> > resources available...even when the cpu is very idle.  How can that make sense?
>
> Well I think what he's saying is that Postgres is issuing a read, then waiting
> for the data to return. Then it does some processing, and goes back to issue
> another read. The CPU is idle half the time because Postgres isn't capable of
> doing any work while waiting for i/o, and the i/o system is idle half the time
> while the CPU intensive part happens.
>
> (Consider as a pathological example a program that reads 8k then sleeps for
> 10ms, and loops doing that 1,000 times. Now consider the same program
> optimized to read 8M asynchronously and sleep for 10s. By the time it's
> finished sleeping it has probably read in all 8M. Whereas the program that
> read 8k in little chunks interleaved with small sleeps would probably take
> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
> idle.)
>
> It's a reasonable theory and it's not inconsistent with the results you sent.
> But it's not exactly proven either. Nor is it clear how to improve matters.
> Adding additional threads to handle the i/o adds an enormous amount of
> complexity and creates lots of opportunity for other contention that could
> easily eat all of the gains.

Perfect summary.  We have a background writer now.  Ideally we would
have a background reader that reads blocks ahead into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead window is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options?  mmap()?  I have no idea.  Seems larger page
size does help.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Bruce Momjian wrote:
> Greg Stark wrote:
>
>> Alan Stange <stange@rentec.com> writes:
>>
>>
>>> The point your making doesn't match my experience with *any* storage or program
>>> I've ever used, including postgresql.   Your point suggests that the storage
>>> system is idle  and that postgresql is broken because it isn't able to use the
>>> resources available...even when the cpu is very idle.  How can that make sense?
>>>
>> Well I think what he's saying is that Postgres is issuing a read, then waiting
>> for the data to return. Then it does some processing, and goes back to issue
>> another read. The CPU is idle half the time because Postgres isn't capable of
>> doing any work while waiting for i/o, and the i/o system is idle half the time
>> while the CPU intensive part happens.
>>
>> (Consider as a pathological example a program that reads 8k then sleeps for
>> 10ms, and loops doing that 1,000 times. Now consider the same program
>> optimized to read 8M asynchronously and sleep for 10s. By the time it's
>> finished sleeping it has probably read in all 8M. Whereas the program that
>> read 8k in little chunks interleaved with small sleeps would probably take
>> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
>> idle.)
>>
>> It's a reasonable theory and it's not inconsistent with the results you sent.
>> But it's not exactly proven either. Nor is it clear how to improve matters.
>> Adding additional threads to handle the i/o adds an enormous amount of
>> complexity and creates lots of opportunity for other contention that could
>> easily eat all of the gains.
>>
>
> Perfect summary.  We have a background writer now.  Ideally we would
> have a background reader, that reads-ahead blocks into the buffer cache.
> The problem is that while there is a relatively long time between a
> buffer being dirtied and the time it must be on disk (checkpoint time),
> the read-ahead time is much shorter, requiring some kind of quick
> "create a thread" approach that could easily bog us down as outlined
> above.
>
> Right now the file system will do read-ahead for a heap scan (but not an
> index scan), but even then, there is time required to get that kernel
> block into the PostgreSQL shared buffers, backing up Luke's observation
> of heavy memcpy() usage.
>
> So what are our options?  mmap()?  I have no idea.  Seems larger page
> size does help.
For sequential scans, you do have a background reader.  It's the
kernel.  As long as you don't issue a seek() between read() calls, the
kernel will get the hint about sequential IO and begin to perform a read
ahead for you.  This is where the above analysis isn't quite right:
while postgresql is processing the returned data from the read() call,
the kernel has also issued reads as part of the read ahead, keeping the
device busy while the cpu is busy.  (I'm assuming these details for
Linux; Solaris/UFS does work this way).  Issue one seek on the file and
the read ahead algorithm will back off for a while.   This was my point
about some descriptions of how the system works not being sensible.

If your goal is sequential IO, then you must use larger block sizes.
No one would use 8KB IO for achieving high sequential IO rates.   Simply
put, read() is about the slowest way to get 8KB of data.     Switching
to 32KB blocks reduces all the system call overhead by a large margin.
Larger blocks would be better still, up to the stripe size of your
mirror.   (Of course, you're using a mirror and not raid5 if you care
about performance.)
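
A trivial way to see the effect of the read size (the path is a
placeholder; remount between runs so neither pass is served from cache):

umount /dbfast1 && mount /dbfast1
time dd if=/dbfast1/bigfile of=/dev/null bs=8k     # 8KB read() calls
umount /dbfast1 && mount /dbfast1
time dd if=/dbfast1/bigfile of=/dev/null bs=32k    # same data, 1/4 the system calls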

I don't think the memcpy of data from the kernel to userspace is that
big of an issue right now.  dd and all the high-end network interfaces
manage OK doing it, so I'd expect postgresql to do all right with it
too.   Direct IO will avoid that memcpy, but then you also don't get
any caching of the files in memory.  I'd be more concerned about any
memcpy calls or general data management within postgresql.    Does
postgresql use the platform-specific memcpy() in libc?  Some care might
be needed to ensure that the memory blocks within postgresql are all
properly aligned, to make sure that one isn't ping-ponging cache lines
around (usually done by padding the buffer sizes by an extra 32 bytes or
the L1 line size).   Whatever you do, all the usual high-performance
computing tricks should be used prior to considering any rewriting of
major code sections.

Personally, I'd like to see some detailed profiling being done using
hardware counters for CPU cycles and cache misses, etc.   Given the poor
quality of work that has been discussed here in this thread, I don't
have much confidence in any additional results at this time.
None of the analysis would be acceptable in any environment in which
I've worked.   Be sure to take a look at Sun's free Workshop tools, as
they are excellent for this sort of profiling and one doesn't need to
recompile to use them.    If I get a little time in the next week or two
I might take a crack at this.

Cheers,

-- Alan


Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Alan Stange <stange@rentec.com> writes:

> For sequential scans, you do have a background reader.  It's the kernel.  As
> long as you don't issue a seek() between read() calls, the kernel will get the
> hint about sequential IO and begin to perform a read ahead for you.  This is
> where the above analysis isn't quite right:  while postgresql is processing the
> returned data from the read() call, the kernel has also issued reads as part of
> the read ahead, keeping the device busy while the cpu is busy.  (I'm assuming
> these details for Linux; Solaris/UFS does work this way).  Issue one seek on
> the file and the read ahead algorithm will back off for a while.   This was my
> point about some descriptions of how the system works not being sensible.

Well that's certainly the hope. But we don't know that this is actually as
effective as you assume it is. It's awfully hard in the kernel to make much
more than a vague educated guess about what kind of readahead would actually
help.

This is especially true when a file isn't really being accessed in a
sequential fashion as Postgres may well do if, for example, multiple backends
are reading the same file. And as you pointed out it doesn't help at all for
random access index scans.

> If your goal is sequential IO, then one must use larger block sizes.   No one
> would use 8KB IO for achieving high sequential IO rates.   Simply put, read()
> is about the slowest way to get 8KB of data.     Switching to 32KB blocks
> reduces all the system call overhead by a large margin.  Larger blocks would be
> better still, up to the stripe size of your mirror.   (Of course, you're using
> a mirror and not raid5 if you care about performance.)

Switching to 32kB blocks throughout Postgres has pros but also major cons, not
the least of which is *extra* i/o for random-access read patterns. One of the
possible advantages of the suggestions that were made, the ones you're shouting
down, would actually be the ability to use 32kB scatter/gather reads without
necessarily switching block sizes.

(Incidentally, your parenthetical comment is a bit confused. By "mirror" I
imagine you're referring to raid1+0 since mirrors alone, aka raid1, aren't a
popular way to improve performance. But raid5 actually performs better than
raid1+0 for sequential reads.)

> Does postgresql use the platform specific memcpy() in libc? Some care might
> be needed to ensure that the memory blocks within postgresql are all
> properly aligned to make sure that one isn't ping-ponging cache lines around
> (usually done by padding the buffer sizes by an extra 32 bytes or L1 line
> size). Whatever you do, all the usual high performance computing tricks
> should be used prior to considering any rewriting of major code sections.

So your philosophy is to worry about microoptimizations before worrying about
architectural issues?


--
greg

Re: Hardware/OS recommendations for large databases (

From
Bruce Momjian
Date:
Alan Stange wrote:
> Bruce Momjian wrote:
> > Right now the file system will do read-ahead for a heap scan (but not an
> > index scan), but even then, there is time required to get that kernel
> > block into the PostgreSQL shared buffers, backing up Luke's observation
> > of heavy memcpy() usage.
> >
> > So what are our options?  mmap()?  I have no idea.  Seems larger page
> > size does help.

> For sequential scans, you do have a background reader.  It's the
> kernel.  As long as you don't issue a seek() between read() calls, the

I guess you missed my text of "Right now the file system will do
read-ahead", meaning the kernel.

> I don't think the memcpy of data from the kernel to userspace is that
> big of an issue right now.  dd and all the high end network interfaces
> manage OK doing it, so I'd expect postgresql to do all right with it now
> yet too.   Direct IO will avoid that memcpy, but then you also don't get
> any caching of the files in memory.  I'd be more concerned about any
> memcpy calls or general data management within postgresql.    Does
> postgresql use the platform specific memcpy() in libc?  Some care might
> be needed to ensure that the memory blocks within postgresql are all
> properly aligned to make sure that one isn't ping-ponging cache lines
> around (usually done by padding the buffer sizes by an extra 32 bytes or
> L1 line size).   Whatever you do, all the usual high performance
> computing tricks should be used prior to considering any rewriting of
> major code sections.

We have dealt with alignment and MemCpy is what we used for small-sized
copies to reduce function call overhead.  If you want to improve it,
feel free to take a look.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Bruce,

On 11/22/05 4:13 PM, "Bruce Momjian" <pgman@candle.pha.pa.us> wrote:

> Perfect summary.  We have a background writer now.  Ideally we would
> have a background reader, that reads-ahead blocks into the buffer cache.
> The problem is that while there is a relatively long time between a
> buffer being dirtied and the time it must be on disk (checkpoint time),
> the read-ahead time is much shorter, requiring some kind of quick
> "create a thread" approach that could easily bog us down as outlined
> above.

Yes, the question is "how much read-ahead buffer is needed to equate to the
38% of I/O wait time in the current executor profile?"

The idea of asynchronous buffering would seem appropriate if the executor
would use the 38% of time as useful work.

A background reader is an interesting approach - it would require admin
management of buffers where AIO would leave that in the kernel.  The
advantage over AIO would be more universal platform support I suppose?

> Right now the file system will do read-ahead for a heap scan (but not an
> index scan), but even then, there is time required to get that kernel
> block into the PostgreSQL shared buffers, backing up Luke's observation
> of heavy memcpy() usage.

As evidenced by the 16MB readahead setting still resulting in only 36% IO
wait.

> So what are our options?  mmap()?  I have no idea.  Seems larger page
> size does help.

Not sure about that, we used to run with 32KB page size and I didn't see a
benefit on seq scan at all.  I haven't seen tests in this thread that
compare 8K to 32K.

- Luke



Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,

Why not contribute something - put up proof of your stated 8KB versus 32KB page size improvement.

- Luke

Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Luke Lonergan wrote:
> Why not contribute something - put up proof of your stated 8KB versus
> 32KB page size improvement.

I did observe that 32KB block sizes were a significant win "for our
usage patterns".   It might be a win for any of the following reasons
(a rough build sketch follows the list):

0) The preliminaries:   ~300GB database with about ~50GB daily
turnover.   Our data is fairly reasonably grouped.  If we're getting one
item on a page we're usually looking at the other items as well.

1) we can live with a smaller FSM size.  We were often leaking pages
with a 10M page FSM setting.  With 32K pages, a 10M FSM size is
sufficient.   Yes, the solution to this is "run vacuum more often", but
when the vacuum was taking 10 hours at a time, that was hard to do.

2) The typical datum size in our largest table is about 2.8KB, which is
more than 1/4 of the page size and thus results in the use of a toast table.
Switching to 32KB pages allows us to store this data decently in the main
tables, thus avoiding another table and its associated large
index.   Not having the extra index in memory for a table with 90M rows
is probably beneficial.

3) vacuum time has been substantially reduced.  Vacuum analyze now runs
in the 2 to 3 hour range, depending on load.

4) less CPU time spent in the kernel.  We're basically doing 1/4 as many
system calls.
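
For reference, this is roughly how one gets 32KB pages on 8.x (a sketch
only - there's no configure switch for the block size in this release, so
BLCKSZ is edited by hand, and a fresh initdb is required because the
on-disk format changes; paths and version are examples):

cd postgresql-8.1.0
sed -i 's/^#define BLCKSZ.*/#define BLCKSZ 32768/' src/include/pg_config_manual.h
./configure --prefix=/usr/local/pgsql-32k
make && make install
/usr/local/pgsql-32k/bin/initdb -D /data/pgdata-32k    # existing 8KB clusters can't be reused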

Overall the system has now been working well.  We used to see the
database being a bottleneck at times, but now it's keeping up nicely.

Hope this helps.

Happy Thanksgiving!

-- Alan

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Alan,

On 11/23/05 2:00 PM, "Alan Stange" <stange@rentec.com> wrote:

> Luke Lonergan wrote:
>> Why not contribute something - put up proof of your stated 8KB versus
>> 32KB page size improvement.
>
> I did observe that 32KB block sizes were a significant win "for our
> usage patterns".   It might be a win for any of the following reasons:
> (* big snip *)

Though all of what you relate is interesting, it seems irrelevant to your
earlier statement here:

>> Alan Stange <stange@rentec.com> writes:
>> If your goal is sequential IO, then one must use larger block sizes.
>> No one would use 8KB IO for achieving high sequential IO rates.   Simply
>> put, read() is about the slowest way to get 8KB of data.     Switching
>> to 32KB blocks reduces all the system call overhead by a large margin.
>> Larger blocks would be better still, up to the stripe size of your
>> mirror.   (Of course, you're using a mirror and not raid5 if you care
>> about performance.)

And I am interested in seeing if your statement is correct.  Do you have any
proof of this to share?

- Luke



Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Mark,

This is an excellent idea – unfortunately I’m in Maui right now (Mahalo!) and I’m not getting to testing with this.  My first try was with 8.0.3 and it’s an 8.1 function I presume.

Not to be lazy – but any hint as to how to do the same thing for 8.0?

- Luke


On 11/21/05 9:10 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

Luke Lonergan wrote:

> So that leaves the question - why not more than 64% of the I/O scan rate?
> And why is it a flat 64% as the I/O subsystem increases in speed from
> 333-400MB/s?
>

It might be interesting to see what effect reducing the cpu consumption
  entailed by the count aggregation has - by (say) writing a little bit
of code to heap scan the desired relation (sample attached).

Cheers

Mark






/*
 * fastcount.c
 *
 * Do a count that uses considerably less CPU time than an aggregate.
 */

#include "postgres.h"

#include "funcapi.h"
#include "access/heapam.h"
#include "catalog/namespace.h"
#include "utils/builtins.h"


extern Datum fastcount(PG_FUNCTION_ARGS);


PG_FUNCTION_INFO_V1(fastcount);
Datum
fastcount(PG_FUNCTION_ARGS)
{
 text    *relname = PG_GETARG_TEXT_P(0);
 RangeVar   *relrv;
 Relation rel;
 HeapScanDesc scan;
 HeapTuple tuple;
 int64  result = 0;

 /* Use the name to get a suitable range variable and open the relation. */
 relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
 rel = heap_openrv(relrv, AccessShareLock);

 /* Start a heap scan on the relation. */
 scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
 while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 {
  result++;
 }

 /* End the scan and close up the relation. */
 heap_endscan(scan);
 heap_close(rel, AccessShareLock);


 PG_RETURN_INT64(result);
}

Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:
> Mark,
>
> This is an excellent idea – unfortunately I’m in Maui right now
> (Mahalo!) and I’m not getting to testing with this.  My first try was
> with 8.0.3 and it’s an 8.1 function I presume.
>
> Not to be lazy – but any hint as to how to do the same thing for 8.0?
>

Yeah, it's 8.1 - I didn't think to check against 8.0. The attached
variant works with 8.0.4 (textToQualifiedNameList needs 2 args)

cheers

Mark

P.s. Maui eh, sounds real nice.
/*
 * fastcount.c
 *
 * Do a count that uses considerably less CPU time than an aggregate.
 *
 * (Variant for 8.0.x - textToQualifiedNameList needs 2 args)
 */

#include "postgres.h"

#include "funcapi.h"
#include "access/heapam.h"
#include "catalog/namespace.h"
#include "utils/builtins.h"


extern Datum fastcount(PG_FUNCTION_ARGS);


PG_FUNCTION_INFO_V1(fastcount);
Datum
fastcount(PG_FUNCTION_ARGS)
{
    text       *relname = PG_GETARG_TEXT_P(0);
    RangeVar   *relrv;
    Relation    rel;
    HeapScanDesc scan;
    HeapTuple    tuple;
    int64        result = 0;

    /* Use the name to get a suitable range variable and open the relation. */
    relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname, ""));
    rel = heap_openrv(relrv, AccessShareLock);

    /* Start a heap scan on the relation. */
    scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        result++;
    }

    /* End the scan and close up the relation. */
    heap_endscan(scan);
    heap_close(rel, AccessShareLock);


    PG_RETURN_INT64(result);
}

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Mark,

See the results below and analysis - the pure HeapScan gets 94.1% of the max
available read bandwidth (cool!).  Nothing wrong with heapscan in the
presence of large readahead, which is good news.

That says it's something else in the path.  As you probably know there is a
page lock taken, a copy of the tuple from the page, lock removed, count
incremented for every iteration of the agg node on a count(*).  Is the same
true of a count(1)?

I recall that the profile is full of memcpy and memory context calls.

It would be nice to put some tracers into the executor and see where the
time is going.  I'm also curious about the impact of the new 8.1 virtual
tuples in reducing the executor overhead.  In this case my bet's on the agg
node itself, what do you think?

- Luke

On 11/21/05 9:10 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> Luke Lonergan wrote:
>
>> So that leaves the question - why not more than 64% of the I/O scan rate?
>> And why is it a flat 64% as the I/O subsystem increases in speed from
>> 333-400MB/s?
>>
>
> It might be interesting to see what effect reducing the cpu consumption
>   entailed by the count aggregation has - by (say) writing a little bit
> of code to heap scan the desired relation (sample attached).

OK - here are results for a slightly smaller (still bigger than RAM)
lineitem on the same machine, using the same xfs filesystem that achieved
407MB/s:

============================================================================
12.9GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  1579270
(1 row)

1579270*8192/1000000
12937 Million Bytes or 12.9GB

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 197870.105 ms
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 49912.164 ms
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 49218.739 ms

llonergan=# select fastcount('lineitem');
 fastcount
-----------
  59986052
(1 row)

Time: 33752.778 ms
llonergan=# select fastcount('lineitem');
 fastcount
-----------
  59986052
(1 row)

Time: 34543.646 ms
llonergan=# select fastcount('lineitem');
 fastcount
-----------
  59986052
(1 row)

Time: 34528.053 ms

============================================================================
Analysis:
============================================================================
                    Bandwidth       Percent of max
dd Read             407MB/s         100%
Count(1)            263MB/s         64.6%
HeapScan            383MB/s         94.1%

Wow - looks like the HeapScan gets almost all of the available bandwidth!

- Luke



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:

> ============================================================================
> 12.9GB of DBT-3 data from the lineitem table
> ============================================================================
> llonergan=# select relpages from pg_class where relname='lineitem';
>  relpages
> ----------
>   1579270
> (1 row)
>
> 1579270*8192/1000000
> 12937 Million Bytes or 12.9GB
>
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
>   count
> ----------
>  59986052
> (1 row)
>
> Time: 197870.105 ms

So 198 seconds is the uncached read time with count (Just for clarity,
did you clear the Pg and filesystem caches or unmount / remount the
filesystem?)

> llonergan=# select count(1) from lineitem;
>   count
> ----------
>  59986052
> (1 row)
>
> Time: 49912.164 ms
> llonergan=# select count(1) from lineitem;
>   count
> ----------
>  59986052
> (1 row)
>
> Time: 49218.739 ms
>

and ~50 seconds is the (partially) cached read time with count

> llonergan=# select fastcount('lineitem');
>  fastcount
> -----------
>   59986052
> (1 row)
>
> Time: 33752.778 ms
> llonergan=# select fastcount('lineitem');
>  fastcount
> -----------
>   59986052
> (1 row)
>
> Time: 34543.646 ms
> llonergan=# select fastcount('lineitem');
>  fastcount
> -----------
>   59986052
> (1 row)
>
> Time: 34528.053 ms
>

so ~34 seconds is the (partially) cached read time for fastcount -
I calculate this to give a ~362MB/s effective IO rate (I'm doing / by
1024*1024, not 1000*1000) FWIW.

While this is interesting, you probably want to stop Pg, unmount the
filesystem, and restart Pg to get the uncached time for fastcount too
(and how does this compare to uncached read with dd using the same block
size?).

But at this stage it certainly looks like the heapscan code is pretty
efficient - great!

Oh - and do you want to try out a 32K block size?  I'm interested to see
what level of improvement you get (as my system is hopelessly CPU-bound...)!

> ============================================================================
> Analysis:
> ============================================================================
>                     Bandwidth       Percent of max
> dd Read             407MB/s         100%
> Count(1)            263MB/s         64.6%
> HeapScan            383MB/s         94.1%


Cheers

Mark

Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:
> Mark,
>
>
> It would be nice to put some tracers into the executor and see where the
> time is going.  I'm also curious about the impact of the new 8.1 virtual
> tuples in reducing the executor overhead.  In this case my bet's on the agg
> node itself, what do you think?
>

Yeah - it's pretty clear that the count aggregate is fairly expensive
wrt cpu - However, I am not sure if all agg nodes suffer this way (guess
we could try a trivial aggregate that does nothing for all tuples bar
the last and just reports the final value it sees).
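
For instance, something along these lines (an untested sketch, old-style
CREATE AGGREGATE syntax for 8.0/8.1; int4larger is a built-in C function and
the int4 state is passed by value, so there is almost no per-tuple work
beyond the Agg node itself - database and table names are placeholders):

psql yourdb <<'SQL'
-- an aggregate that does almost nothing per tuple
CREATE AGGREGATE near_noop (BASETYPE = integer, SFUNC = int4larger, STYPE = integer);
\timing
SELECT near_noop(1) FROM lineitem;   -- agg-node overhead only
SELECT count(1) FROM lineitem;       -- for comparison
SQL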

Cheers

Mark


Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Mark,

>> Time: 197870.105 ms
>
> So 198 seconds is the uncached read time with count (Just for clarity,
> did you clear the Pg and filesystem caches or unmount / remount the
> filesystem?)

Nope - the longer time is due to the known "second write" issue with
Postgres - it writes the data to the table, but all of the pages are marked
dirty(?), so on the first scan after loading they are always written again.
This is clear as you watch vmstat - the pattern on the first seq scan is
half read / half write.
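
e.g., a rough way to watch it (database name is a placeholder):

vmstat 5 &                                   # bi = blocks read, bo = blocks written
VMSTAT_PID=$!
psql llonergan -c 'SELECT count(1) FROM lineitem;'
kill $VMSTAT_PID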

>> Time: 49218.739 ms
>>
>
> and ~50 seconds is the (partially) cached read time with count

Again - the pattern here is pure read and completely non-cached.  You see a
very nearly constant I/O rate when watching vmstat for the entire scan.

>> Time: 34528.053 ms

> so ~34 seconds is the (partially) cached read time for fastcount -
> I calculate this to give ~362Mb/s effective IO rate (I'm doing / by
> 1024*1024 not 1000*1000) FWIW.

The dd number uses 1000*1000, so I maintained it for the percentage of max.

> While this is interesting, you probably want to stop Pg, unmount the
> filesystem, and restart Pg to get the uncached time for fastcount too
> (and how does this compare to uncached read with dd using the same block
> size?).

I'll do it again sometime, but I've already deleted the file.  I've done the
following in the past to validate this though:

- Reboot machine
- Rerun scan

And we get identical results.

> But at this stage it certainly looks the the heapscan code is pretty
> efficient - great!

Yep.

> Oh - and do you want to try out 32K block size, I'm interested to see
> what level of improvement you get (as my system is hopelessly cpu bound...)!

Yah - done so in the past and not seen any - was waiting for Alan to post
his results.

>> ============================================================================
>> Analysis:
>> ============================================================================
>>                     Bandwidth       Percent of max
>> dd Read             407MB/s         100%
>> Count(1)            263MB/s         64.6%
>> HeapScan            383MB/s         94.1%

Note these are all in consistent 1000x1000 units.

Thanks for the test - neat trick!  We'll use it to do some more profiling
some time soon...

- Luke



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:

> That says it's something else in the path.  As you probably know there is a
> page lock taken, a copy of the tuple from the page, lock removed, count
> incremented for every iteration of the agg node on a count(*).  Is the same
> true of a count(1)?
>

Sorry Luke - message 3 - I seem to be suffering from a very small
working memory buffer myself right now, I think it's after a day of
working with DB2 ... :-)

Anyway, as I read src/backend/parser/gram.y:6542 - count(*) is
transformed into count(1), so these two are identical.

Cheers (last time tonight, promise!)

Mark

Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:
> Mark,
>
>
>>>Time: 197870.105 ms
>>
>>So 198 seconds is the uncached read time with count (Just for clarity,
>>did you clear the Pg and filesystem caches or unmount / remount the
>>filesystem?)
>
>
> Nope - the longer time is due to the "second write" known issue with
> Postgres - it writes the data to the table, but all of the pages are marked
> dirty?  So, always on the first scan after loading they are written again.
> This is clear as you watch vmstat - the pattern on the first seq scan is
> half read / half write.
>

Ah - indeed - first access after a COPY no? I should have thought of
that, sorry!


Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:

> Yeah - it's pretty clear that the count aggregate is fairly expensive wrt cpu -
> However, I am not sure if all agg nodes suffer this way (guess we could try a
> trivial aggregate that does nothing for all tuples bar the last and just
> reports the final value it sees).

As you mention, count(*) and count(1) are the same thing.

Last I heard the reason count(*) was so expensive was because its state
variable was a bigint. That means it doesn't fit in a Datum and has to be
alloced and stored as a pointer. And because of the Aggregate API that means
it has to be allocated and freed for every tuple processed.

There was some talk of having a special case API for count(*) and maybe
sum(...) to avoid having to do this.

There was also some talk of making Datum 8 bytes wide on platforms where that
was natural (I guess AMD64, Sparc64, Alpha, Itanic).

Afaik none of these items have happened but I don't know for sure.

--
greg

Re: Hardware/OS recommendations for large databases (

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Last I heard the reason count(*) was so expensive was because its state
> variable was a bigint. That means it doesn't fit in a Datum and has to be
> alloced and stored as a pointer. And because of the Aggregate API that means
> it has to be allocated and freed for every tuple processed.

There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
Conway IIRC).

            regards, tom lane

Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > Last I heard the reason count(*) was so expensive was because its state
> > variable was a bigint. That means it doesn't fit in a Datum and has to be
> > alloced and stored as a pointer. And because of the Aggregate API that means
> > it has to be allocated and freed for every tuple processed.
>
> There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
> Conway IIRC).

ah, cool, missed that.

--
greg

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
The same 12.9GB distributed across 4 machines using Bizgres MPP fits into
I/O cache.  The interesting result is that the query "select count(1)" is
limited in speed to 280 MB/s per CPU when run on the lineitem table.  So
when I run it spread over 4 machines, one CPU per machine, I get this:

======================================================
Bizgres MPP, 4 data segments, 1 per 2 CPUs
======================================================
llonergan=# explain select count(1) from lineitem;
                                      QUERY PLAN
----------------------------------------------------------------------------
----------
 Aggregate  (cost=582452.00..582452.00 rows=1 width=0)
   ->  Gather Motion  (cost=582452.00..582452.00 rows=1 width=0)
         ->  Aggregate  (cost=582452.00..582452.00 rows=1 width=0)
               ->  Seq Scan on lineitem  (cost=0.00..544945.00 rows=15002800
width=0)
(4 rows)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 12191.435 ms
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 11986.109 ms
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 11448.941 ms
======================================================

That's 12,937 MB in 11.45 seconds, or 1,130 MB/s.  When you divide out the
number of Postgres instances (4), that's 283MB/s per Postgres instance.

To verify that this has nothing to do with MPP, I ran it in a special
internal mode on one instance and got the same result.

So - we should be able to double this rate by running one segment per CPU,
or two per host:

======================================================
Bizgres MPP, 8 data segments, 1 per CPU
======================================================
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 6484.594 ms
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 6156.729 ms
llonergan=# select count(1) from lineitem;
  count
----------
 59986052
(1 row)

Time: 6063.416 ms
======================================================
That's 12,937 MB in 6.06 seconds, or 2,134 MB/s.  When you divide out the
number of Postgres instances (8), that's 267MB/s per Postgres instance.

So, if you want to "select count(1)", using more CPUs is a good idea!  For
most complex queries, having lots of CPUs + MPP is a good combo.

Here is an example of a sorting plan - this should probably be done with a
hash aggregation, but using 8 CPUs makes it go 8x faster:


- Luke



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
>
>>Last I heard the reason count(*) was so expensive was because its state
>>variable was a bigint. That means it doesn't fit in a Datum and has to be
>>alloced and stored as a pointer. And because of the Aggregate API that means
>>it has to be allocated and freed for every tuple processed.
>
>
> There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
> Conway IIRC).
>

It certainly makes quite a difference as I measure it:

doing select count(1) from a 181000-page table (completely uncached) on my PIII:

8.0 : 32 s
8.1 : 25 s

Note that the 'fastcount()' function takes 21 s in both cases - so all
the improvement seems to be from the count overhead reduction.

Cheers

Mark