Thread: RAID 0 not as fast as expected

RAID 0 not as fast as expected

From
"Craig A. James"
Date:
I'm experimenting with RAID, looking for an inexpensive way to boost performance.  I bought 4 Seagate 7200.9 120 GB SATA
drives and two SIIG dual-port SATA cards.  (NB: I don't plan to run RAID 0 in production, probably RAID 10, so no need
to comment on the failure rate of RAID 0.)

I used this raw serial-speed test:

   time sh -c "dd if=/dev/zero of=./bigfile bs=8k count=1000000 && sync"
   (unmount/remount)
   time sh -c "dd if=./bigfile of=/dev/null bs=8k count=1000000 && sync"

Which showed that the RAID 0 4-disk array was almost exactly twice as fast as each disk individually.  I expected 4X
performance for a 4-disk RAID 0.  My suspicion is that each of these budget SATA cards is bandwidth limited; they can't
actually handle two disks simultaneously, and I'd need to get four separate SATA cards to get 4X performance (or a more
expensive card such as the Areca someone mentioned the other day).

On the other hand, it "feels like" (using our application) the seek performance is quite a bit better, which I'd expect
given my hypothesis about the SIIG cards.  I don't have concrete benchmarks on seek speed.

Thanks,
Craig

Re: RAID 0 not as fast as expected

From
"Joshua D. Drake"
Date:
Craig A. James wrote:
> I'm experimenting with RAID, looking for an inexpensive way to boost
> performance.  I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG
> dual-port SATA cards.  (NB: I don't plan to run RAID 0 in production,
> probably RAID 10, so no need to comment on the failure rate of RAID 0.)
>
> I used this raw serial-speed test:
>
>   time sh -c "dd if=/dev/zero of=./bigfile bs=8k count=1000000 && sync"
>   (unmount/remount)
>   time sh -c "dd if=./bigfile of=/dev/null bs=8k count=1000000 && sync"
>
> Which showed that the RAID 0 4-disk array was almost exactly twice as
> fast as each disk individually.  I expected 4X performance for a 4-disk
> RAID 0.  My suspicion is that each of these budget SATA cards is

I am assuming Linux here; Linux software RAID 0 is known not to be super
duper.

Second, remember that there is overhead involved in using RAID; throughput
doesn't scale in direct proportion to the number of disks.

Joshua D. Drake

>


--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/



Re: RAID 0 not as fast as expected

From
Alan Hodgson
Date:
On Thursday 14 September 2006 11:05, "Craig A. James"
<cjames@modgraph-usa.com> wrote:
> I'm experimenting with RAID, looking for an inexpensive way to boost
> performance.  I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG
> dual-port SATA cards.  (NB: I don't plan to run RAID 0 in production,
> probably RAID 10, so no need to comment on the failure rate of RAID 0.)
>

Are those PCI cards?  If yes, it's just a bus bandwidth limit.

--
Alan

Re: RAID 0 not as fast as expected

From
"Craig A. James"
Date:
Alan Hodgson wrote:
> On Thursday 14 September 2006 11:05, "Craig A. James"
> <cjames@modgraph-usa.com> wrote:
>> I'm experimenting with RAID, looking for an inexpensive way to boost
>> performance.  I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG
>> dual-port SATA cards.  (NB: I don't plan to run RAID 0 in production,
>> probably RAID 10, so no need to comment on the failure rate of RAID 0.)
>>
>
> Are those PCI cards?  If yes, it's just a bus bandwidth limit.

Ok, that makes sense.

   One SATA disk = 52 MB/sec
   4-disk RAID0  = 106 MB/sec

   PCI at 33 MHz x 32 bits (4 bytes) = 132 MB/sec.

I guess getting to 80% of the theoretical speed is as much as I should expect.

Thanks,
Craig

Re: RAID 0 not as fast as expected

From
Scott Marlowe
Date:
On Thu, 2006-09-14 at 16:35, Craig A. James wrote:
> Alan Hodgson wrote:
> > On Thursday 14 September 2006 11:05, "Craig A. James"
> > <cjames@modgraph-usa.com> wrote:
> >> I'm experimenting with RAID, looking for an inexpensive way to boost
> >> performance.  I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG
> >> dual-port SATA cards.  (NB: I don't plan to run RAID 0 in production,
> >> probably RAID 10, so no need to comment on the failure rate of RAID 0.)
> >>
> >
> > Are those PCI cards?  If yes, it's just a bus bandwidth limit.
>
> Ok, that makes sense.
>
>    One SATA disk = 52 MB/sec
>    4-disk RAID0  = 106 MB/sec
>
>    PCI at 33 MHz x 32 bits (4 bytes) = 132 MB/sec.
>
> I guess getting to 80% of the theoretical speed is as much as I should expect.

Note that many mid- to high-end motherboards have multiple PCI buses /
channels, and you could put a card in each one and get > 132 MByte/sec
in aggregate.

But for a database, sequential throughput is almost never the real
problem.  It's usually random access that counts, and for that a RAID 10
is a pretty good choice.

Re: RAID 0 not as fast as expected

From
"Luke Lonergan"
Date:
Josh,

On 9/14/06 11:49 AM, "Joshua D. Drake" <jd@commandprompt.com> wrote:

> I am assuming Linux here; Linux software RAID 0 is known not to be super
> duper.

I've obtained 1,950 MB/s using Linux software RAID on SATA drives.

- Luke



Re: RAID 0 not as fast as expected

From
"Joshua D. Drake"
Date:
Luke Lonergan wrote:
> Josh,
>
> On 9/14/06 11:49 AM, "Joshua D. Drake" <jd@commandprompt.com> wrote:
>
>> I am assuming Linux here; Linux software RAID 0 is known not to be super
>> duper.
>
> I've obtained 1,950 MB/s using Linux software RAID on SATA drives.

With what? :)

>
> - Luke
>
>


--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/



Re: RAID 0 not as fast as expected

From
"Joshua D. Drake"
Date:
Luke Lonergan wrote:
> Josh,
>
> On 9/14/06 8:47 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote:
>
>>> I've obtained 1,950 MB/s using Linux software RAID on SATA drives.
>> With what? :)
>
> Sun X4500 (aka Thumper) running stock RedHat 4.3 (actually CentOS 4.3) with
> XFS and the linux md driver without lvm.  Here is a summary of the results:
>


Good god!

>
>  Read Test
>  RAID Level Max Readahead (KB) RAID Chunksize Max Readahead on Disks (KB)
> Max Time (s)  Read Bandwidth (MB/s)
>  0 65536 64 256 16.689  1,917.43
>  0 4096 64 256 21.269  1,504.54
>  0 65536 256 256 17.967  1,781.04
>  0 2816 256 256 18.835  1,698.96
>  0 65536 1024 256 18.538  1,726.18
>  0 65536 64 512 18.295  1,749.11
>  0 65536 64 256 18.931  1,690.35
>  0 65536 64 256 18.873  1,695.54
>  0 64768 64 256 18.545  1,725.53
>  0 131172 64 256 18.548  1,725.25
>  0 131172 64 65536 19.046  1,680.14
>  0 131172 64 524288 18.125  1,765.52
>  0 131172 64 1048576 18.701  1,711.14
>  5 2560 64 256 39.933  801.34
>  5 16777216 64 256 37.76  847.46
>  5 524288 64 256 53.497  598.16
>  5 65536 32 256 38.472  831.77
>  5 65536 32 256 38.004  842.02
>  5 65536 32 256 37.884  844.68
>  5 2560 16 256 41.39  773.13
>  5 65536 16 256 48.902  654.37
>  10 65536 64 256 83.256  384.36
>  1+0 65536 64 256 19.394  1,649.99
>  1+0 65536 64 256 19.047  1,680.05
>  1+0 65536 64 256 19.195  1,667.10
>  1+0 65536 64 256 18.806  1,701.58
>  1+0 65536 64 256 18.848  1,697.79
>  1+0 65536 64 256 18.371  1,741.88
>  1+0 65536 64 256 21.446  1,492.12
>  1+0 65536 64 256 20.254  1,579.93
>
>


--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/



Re: RAID 0 not as fast as expected

From
"Spiegelberg, Greg"
Date:
That's an all-PCI-X box, which makes sense.  There are 6 SATA controllers
in that little beastie also.  You can always count on Sun to provide
over-engineered boxes.




Re: RAID 0 not as fast as expected

From
"Luke Lonergan"
Date:
Greg, Josh,

Something I found out while doing this - lvm (and lvm2) slows the block
stream down dramatically.  At first I was using it for convenience's sake
to implement partitions on top of the md devices, but I found I was stuck
at about 700 MB/s.  Removing lvm2 from the picture allowed me to get
within chucking distance of 2GB/s.
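
For reference, the no-lvm setup is just the md device used directly (a
sketch only - device names are illustrative, not our actual layout):

   mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=64 \
         /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
   mkfs.xfs /dev/md0          # filesystem directly on md0, no lvm2 layer
   mount /dev/md0 /data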

When we first started working with Solaris ZFS, we were getting about
400-600 MB/s, and after working with the Solaris Engineering team we now get
rates approaching 2GB/s.  The updates needed to Solaris are part of the
Solaris 10 U3 available in October (and already in Solaris Express, aka
Solaris 11).

- Luke


On 9/15/06 5:43 AM, "Spiegelberg, Greg" <gspiegelberg@cranel.com> wrote:

> That's an all-PCI-X box, which makes sense.  There are 6 SATA controllers
> in that little beastie also.  You can always count on Sun to provide
> over-engineered boxes.



Re: RAID 0 not as fast as expected

From
"Bucky Jordan"
Date:
>When we first started working with Solaris ZFS, we were getting about
>400-600 MB/s, and after working with the Solaris Engineering team we now
>get rates approaching 2GB/s.  The updates needed to Solaris are part of
>the Solaris 10 U3 available in October (and already in Solaris Express,
>aka Solaris 11).

Luke,

What other file systems have you had good success with? Solaris would be
nice, but it looks like I'm stuck running on FreeBSD (6.1, amd64) so
UFS2 would be the default. Not sure about XFS on BSD, and I'm not sure
at the moment that ext2/3 provide enough benefit over UFS to be worth
spending much time on.

Also, has anyone had any experience with gmirror (good or bad)? I'm
thinking of trying to use it to stripe two hardware mirrored sets since
HW RAID10 wasn't doing as well as I had hoped (Dell Perc5/I controller).
For a 4 disk RAID 10 (10k rpm SAS/SCSI disks) what would be a good
target performance number? Right now, dd shows 224 MB/s.

And lastly, for a more OLAP style database, would I be correct in
assuming that sequential access speed would be more important than is
normally the case? (I have a relatively small number of connections, but
each running on pretty large data sets).

Thanks,

Bucky

Re: RAID 0 not as fast as expected

From
"Luke Lonergan"
Date:
Josh,

On 9/14/06 8:47 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote:

>> I've obtained 1,950 MB/s using Linux software RAID on SATA drives.
>
> With what? :)

Sun X4500 (aka Thumper) running stock RedHat 4.3 (actually CentOS 4.3) with
XFS and the linux md driver without lvm.  Here is a summary of the results:


 Read Test  (RA = max readahead on the md device; Chunk = RAID chunk size;
 Disk RA = max readahead on the component disks)

 RAID Level     RA (KB)   Chunk   Disk RA (KB)   Time (s)   BW (MB/s)
 0                65536      64            256     16.689    1,917.43
 0                 4096      64            256     21.269    1,504.54
 0                65536     256            256     17.967    1,781.04
 0                 2816     256            256     18.835    1,698.96
 0                65536    1024            256     18.538    1,726.18
 0                65536      64            512     18.295    1,749.11
 0                65536      64            256     18.931    1,690.35
 0                65536      64            256     18.873    1,695.54
 0                64768      64            256     18.545    1,725.53
 0               131172      64            256     18.548    1,725.25
 0               131172      64          65536     19.046    1,680.14
 0               131172      64         524288     18.125    1,765.52
 0               131172      64        1048576     18.701    1,711.14
 5                 2560      64            256     39.933      801.34
 5             16777216      64            256     37.760      847.46
 5               524288      64            256     53.497      598.16
 5                65536      32            256     38.472      831.77
 5                65536      32            256     38.004      842.02
 5                65536      32            256     37.884      844.68
 5                 2560      16            256     41.390      773.13
 5                65536      16            256     48.902      654.37
 10               65536      64            256     83.256      384.36
 1+0              65536      64            256     19.394    1,649.99
 1+0              65536      64            256     19.047    1,680.05
 1+0              65536      64            256     19.195    1,667.10
 1+0              65536      64            256     18.806    1,701.58
 1+0              65536      64            256     18.848    1,697.79
 1+0              65536      64            256     18.371    1,741.88
 1+0              65536      64            256     21.446    1,492.12
 1+0              65536      64            256     20.254    1,579.93




Re: RAID 0 not as fast as expected

From
"Luke Lonergan"
Date:
Bucky,

On 9/15/06 11:28 AM, "Bucky Jordan" <bjordan@lumeta.com> wrote:

> What other file systems have you had good success with? Solaris would be
> nice, but it looks like I'm stuck running on FreeBSD (6.1, amd64) so
> UFS2 would be the default. Not sure about XFS on BSD, and I'm not sure
> at the moment that ext2/3 provide enough benefit over UFS to spend much
> time on.

It won't matter much whether it's UFS2 or something else until you get
past about 350 MB/s.

> Also, has anyone had any experience with gmirror (good or bad)? I'm
> thinking of trying to use it to stripe two hardware mirrored sets since
> HW RAID10 wasn't doing as well as I had hoped (Dell Perc5/I controller).
> For a 4 disk RAID 10 (10k rpm SAS/SCSI disks) what would be a good
> target performance number? Right now, dd shows 224 MB/s.

Each disk should sustain somewhere between 60-80 MB/s (see
http://www.storagereview.com/ for a profile of your disk).

Your dd test sounds suspiciously fast unless you were running two
simultaneous dd processes.  Did you read from a file that was at least twice
the size of RAM?

A single dd stream would run between 120 and 160 MB/s on a RAID10, two
streams would be between 240 and 320 MB/s.
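
For example, something like this (sizes are illustrative - the file needs
to be at least twice the size of RAM, here assuming 8 GB):

   # write a 16 GB test file, then flush:
   dd if=/dev/zero of=/data/bigfile bs=8k count=2000000 && sync
   # unmount/remount (or reboot) to drop the cache, then time one stream:
   time dd if=/data/bigfile of=/dev/null bs=8k
   # two simultaneous streams over different halves of the file:
   time sh -c "dd if=/data/bigfile of=/dev/null bs=8k count=1000000 &
               dd if=/data/bigfile of=/dev/null bs=8k skip=1000000; wait"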

> And lastly, for a more OLAP style database, would I be correct in
> assuming that sequential access speed would be more important than is
> normally the case? (I have a relatively small number of connections, but
> each running on pretty large data sets).

Yes.  What's pretty large?  We've had to redefine large recently, now we're
talking about systems with between 100TB and 1,000TB.

- Luke



Re: RAID 0 not as fast as expected

From
"Steinar H. Gunderson"
Date:
On Sat, Sep 16, 2006 at 04:46:04PM -0700, Luke Lonergan wrote:
> Yes.  What's pretty large?  We've had to redefine large recently, now we're
> talking about systems with between 100TB and 1,000TB.

Do you actually have PostgreSQL databases in that size range?

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: RAID 0 not as fast as expected

From
"Joshua D. Drake"
Date:
Steinar H. Gunderson wrote:
> On Sat, Sep 16, 2006 at 04:46:04PM -0700, Luke Lonergan wrote:
>> Yes.  What's pretty large?  We've had to redefine large recently, now we're
>> talking about systems with between 100TB and 1,000TB.
>
> Do you actually have PostgreSQL databases in that size range?

No, they have databases in MPP that are that large :)

Joshua D. Drake


>
> /* Steinar */


--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/



Large tables (was: RAID 0 not as fast as expected)

From
"Bucky Jordan"
Date:
>Yes.  What's pretty large?  We've had to redefine large recently, now
>we're talking about systems with between 100TB and 1,000TB.
>
>- Luke

Well, I said large, not gargantuan :) - Largest would probably be around
a few TB, but the problem I'm having to deal with at the moment is large
numbers (potentially > 1 billion) of small records (hopefully I can get
it down to a few int4's and an int2 or so) in a single table. Currently
we're testing for and targeting the 500M-record range, but the
design needs to scale to 2-3 times that at least.
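
(Back-of-envelope, with rough assumptions on my part: ~28 bytes of tuple
header plus a 4-byte line pointer on top of ~14 bytes of column data per
row:)

   echo $(( (28 + 4 + 14) * 500000000 / 1073741824 ))   # ~21 GB of heap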

I read one of your presentations on very large databases in PG, and saw
mention of some tables over a billion rows, so that was encouraging. The
new table partitioning in 8.x will be very useful. What's the largest DB
you've seen to date on PG (in terms of total disk storage, and records
in largest table(s) )?

My question is at what point do I have to get fancy with those big
tables? From your presentation, it looks like PG can handle 1.2 billion
records or so as long as you write intelligent queries. (And normal PG
should be able to handle that, correct?)

Also, does anyone know if/when any of the MPP stuff will be ported to
Postgres, or is the plan to keep that separate?

Thanks,

Bucky

Re: Large tables (was: RAID 0 not as fast as expected)

From
"Merlin Moncure"
Date:
On 9/18/06, Bucky Jordan <bjordan@lumeta.com> wrote:
> My question is at what point do I have to get fancy with those big
> tables? From your presentation, it looks like PG can handle 1.2 billion
> records or so as long as you write intelligent queries. (And normal PG
> should be able to handle that, correct?)

I would rephrase that: large databases are less forgiving of
unintelligent queries, particularly of the kind produced by your average
stupid database-abstracting middleware :-).  seek times on a 1gb database
are going to be effectively zero all the time; not so on a 1tb+ database.

good normalization skills are really important for large databases,
along with materialization strategies for 'denormalized sets'.

regarding the number of rows, there is no limit to how much pg can
handle per se, just some practical limitations, especially vacuum and
reindex times.  these are important because they are required to keep
a handle on mvcc bloat and it's very nice to be able to vacuum bits of
your database at a time.

just another fyi, if you have a really big database, you can forget
about doing pg_dump for backups (unless you really don't care about
being x days behind)...you simply have to do some type of
replication/failover strategy.  i would start with pitr.
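
a minimal sketch of the pitr approach, just to illustrate (paths are made
up; see the 8.1 docs for the real procedure):

   # in postgresql.conf:
   #   archive_command = 'cp %p /mnt/backup/wal/%f'
   psql -c "SELECT pg_start_backup('base');"
   tar cf /mnt/backup/base.tar $PGDATA
   psql -c "SELECT pg_stop_backup();"
   # restore = untar the base backup, then replay the archived wal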

merlin

Re: Large tables (was: RAID 0 not as fast as expected)

From
Alan Hodgson
Date:
On Monday 18 September 2006 13:56, "Merlin Moncure" <mmoncure@gmail.com>
wrote:
> just another fyi, if you have a really big database, you can forget
> about doing pg_dump for backups (unless you really don't care about
> being x days behind)...you simply have to do some type of
> replication/failover strategy.  i would start with pitr.

And, of course, the biggest problem of all; upgrades.

--
Eat right. Exercise regularly. Die anyway.


Re: Large tables (was: RAID 0 not as fast as expected)

From
"Bucky Jordan"
Date:
> good normalization skills are really important for large databases,
> along with materialization strategies for 'denormalized sets'.

Good points- thanks. I'm especially curious what others have done for
the materialization. The matview project on gborg appears dead, and I've
only found a smattering of references on google. My guess is, you roll
your own for optimal performance...

> regarding the number of rows, there is no limit to how much pg can
> handle per se, just some practical limitations, especially vacuum and
> reindex times.  these are important because they are required to keep
> a handle on mvcc bloat and it's very nice to be able to vacuum bits of
> your database at a time.

I was hoping for some actual numbers on "practical". Hardware isn't too
much of an issue (within reason- we're not talking an amazon or google
here... the SunFire X4500 looks interesting... )- if a customer wants to
store that much data, and pay for it, we'll figure out how to do it. I'd
just rather not have to re-design the database. Say the requirement is
to keep 12 months of data accessible, each "scan" produces 100M records,
and I run one per month. What happens if the customer wants to run it
once a week? I was more trying to figure out at what point (ballpark)
I'm going to have to look into archive tables and things of that nature
(or at Bizgres/MPP). It's easier for us to add more/bigger hardware, but
not so easy to redesign/add history tables...

>
> just another fyi, if you have a really big database, you can forget
> about doing pg_dump for backups (unless you really don't care about
> being x days behind)...you simply have to do some type of
> replication/failover strategy.  i would start with pitr.
>
> merlin
I was originally thinking replication, but I did notice some nice pitr
features in 8.x - I'll have to look into that some more.

Thanks for the pointers though...

- Bucky

Re: Large tables (was: RAID 0 not as fast as expected)

From
"Alex Turner"
Date:
Do the basic math:

If you have a table with 100 million records, each of which is 200 bytes
long, that gives you roughly 20 gig of data (assuming it was all written
neatly and hasn't been updated much).  If you have to do a full table scan,
then it will take roughly 400 seconds with a single 10k RPM SCSI drive with
an average read speed of 50MB/sec.  If you are going to read indexes, figure
out how big your index is, how many blocks will be returned, and how many
blocks this will require transferring from the main table; make an estimate
of the seeks, add in the transfer total, and you have a time to get your
data.  A big array with a good controller can pass 1000MB/sec transfer on
the right bus if you buy the right technologies.  But be warned: if you buy
the wrong ones, your big array can end up being slower than a single drive
for sequential transfer.  At 1000MB/sec your scan would take 20 seconds.
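
Spelled out in shell arithmetic (numbers from above):

   echo $(( 100000000 * 200 / 1048576 ))   # ~19,073 MB, call it 20 gig
   echo $(( 20000 / 50 ))                  # 400 seconds at 50MB/sec
   echo $(( 20000 / 1000 ))                # 20 seconds at 1000MB/sec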

Be warned, the tech specs page:
http://www.sun.com/servers/x64/x4500/specs.xml#anchor3
doesn't mention RAID 10 as a possibility, and this is probably what most
would recommend for fast data access if you are doing both read and write
operations.  If you are doing mostly reads, then RAID 5 is passable, but its
redundancy with large numbers of drives is not so great.

Alex.


Re: Large tables (was: RAID 0 not as fast as expected)

From
Michael Stone
Date:
On Mon, Sep 18, 2006 at 07:14:56PM -0400, Alex Turner wrote:
>If you have a table with 100million records, each of which is 200bytes long,
>that gives you roughly 20 gig of data (assuming it was all written neatly
>and hasn't been updated much).

If you're in that range it doesn't even count as big or challenging--you
can keep it memory resident for not all that much money.

Mike Stone

Re: Large tables (was: RAID 0 not as fast as expected)

From
"Luke Lonergan"
Date:
Bucky,

On 9/18/06 7:37 AM, "Bucky Jordan" <bjordan@lumeta.com> wrote:

> My question is at what point do I have to get fancy with those big
> tables? From your presentation, it looks like PG can handle 1.2 billion
> records or so as long as you write intelligent queries. (And normal PG
> should be able to handle that, correct?)

PG has limitations that will confront you at sizes beyond about a couple
hundred GB of table size, as will Oracle and others.

You should be careful to implement very good disk hardware and leverage
Postgres 8.1 partitioning and indexes intelligently as you go beyond 100GB
per instance.  Also be sure to set the random_page_cost parameter in
postgresql.conf to 100 or even higher when you use indexes, as the actual
seek rate for random access ranges between 50 and 300 for modern disk
hardware.  If this parameter is left at the default of 4, indexes will often
be used inappropriately.
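
For example, you can test the effect per session before touching
postgresql.conf (the table here is hypothetical):

   psql -d test -c "SET random_page_cost = 100;
                    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;"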

> Also, does anyone know if/when any of the MPP stuff will be ported to
> Postgres, or is the plan to keep that separate?

The plan is to keep that separate for now, though we're contributing
technology like partitioning, faster sorting, bitmap index, adaptive nested
loop, and hybrid hash aggregation to make big databases work better in
Postgres.

- Luke



Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Alex,

On 9/18/06 4:14 PM, "Alex Turner" <armtuk@gmail.com> wrote:

> Be warned, the tech specs page:
> http://www.sun.com/servers/x64/x4500/specs.xml#anchor3
> doesn't mention RAID 10 as a possibility, and this is probably what most
> would recommend for fast data access if you are doing both read and write
> operations.  If you are doing mostly reads, then RAID 5 is passable, but
> its redundancy with large numbers of drives is not so great.

RAID10 works great on the X4500 -- we get 1.6 GB/s+ per X4500 using RAID10
in ZFS.  We worked with the Sun Solaris kernel team to make that happen and
the patches are part of Solaris 10 Update 3 due out in November.
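
A RAID10-style pool in ZFS is just striped mirrored pairs, e.g. (device
names illustrative):

   zpool create data mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0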

- Luke



Re: Large tables (was: RAID 0 not as fast as expected)

From
"Alex Turner"
Date:
Sweet - that's good - RAID 10 support seems like an odd thing to leave out.

Alex




Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Yep, Solaris ZFS kicks butt.  It does RAID10/5/6, etc., and implements most
of the high-end features available on high-end SANs...

- Luke






Re: Large tables (was: RAID 0 not as fast as expected)

From
mark@mark.mielke.cc
Date:
On Mon, Sep 18, 2006 at 06:10:13PM -0700, Luke Lonergan wrote:
> Also be sure to set the random_page_cost parameter in
> postgresql.conf to 100 or even higher when you use indexes, as the actual
> seek rate for random access ranges between 50 and 300 for modern disk
> hardware.  If this parameter is left at the default of 4, indexes will often
> be used inappropriately.

Does a tool exist yet to time this for a particular configuration?

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Mark,

On 9/18/06 8:45 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:

> Does a tool exist yet to time this for a particular configuration?

We're considering building this into ANALYZE on a per-table basis.  The
basic approach times sequential access as a page rate, then times random
seeks as a page rate, and takes the ratio of the two.

Since PG's heap scan is single threaded, the seek rate is equivalent to a
single disk (even though RAID arrays may have many spindles); typical
random seek rates are around 100-200 seeks per second from within the
backend.  That means that as sequential scan performance increases, such as
happens when using large RAID arrays, the random_page_cost will range from
50 to 300, growing linearly with the size of the RAID array.
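
To illustrate with representative (made-up) numbers: a RAID that scans
sequentially at 150 MB/s delivers about 19,200 8KB pages per second, while
a single backend seeking randomly might get 150 pages per second:

   echo $(( 150 * 1024 / 8 ))   # ~19200 sequential pages/sec
   echo $(( 19200 / 150 ))      # ratio ~128, in the range above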

- Luke



Re: Large tables (was: RAID 0 not as fast as expected)

From
"Bucky Jordan"
Date:
Mike,

> On Mon, Sep 18, 2006 at 07:14:56PM -0400, Alex Turner wrote:
> >If you have a table with 100 million records, each of which is 200 bytes
> >long, that gives you roughly 20 gig of data (assuming it was all written
> >neatly and hasn't been updated much).
>
I'll keep that in mind (minimizing updates during loads). My plan is that
updates will actually be implemented as inserts to a summary/history table,
followed by deletion of the old records. The OLTP part of this will be
limited to a particular set of tables that I anticipate will not be nearly
as large.

> If you're in that range it doesn't even count as big or challenging--you
> can keep it memory resident for not all that much money.
>
> Mike Stone
>
I'm aware of that; however, *each* scan could be 100m records, and we
need to keep a minimum of 12, and possibly 50 or more. So if I only have
100m records total, sure, but 500m, or 1b... According to Alex's
calculations, that'd be 100G for 500m records (just that one table, not
including indexes).

From what Luke was saying, there are some issues once you get over a
couple hundred GB in a single table, so in the case of 12 scans, it
looks like I can squeeze it in given sufficient hardware, but more than
that and I'll have to look at history tables or some other solution. I'd
also think doing some sort of summary table/materialized view for
count/sum operations would be a necessity at this point.

I'm not sure that this is a good topic for the list, but in the interest
of sharing info I'll ask, and if someone feels it warrants a private
response, we can discuss off list. Would Bizgres be able to handle
tables > 200GB or so, or is it still quite similar to Postgres (the same
single-threaded, one-process-per-query limitations)? What about Bizgres
MPP? And also, does switching from Postgres to Bizgres or Bizgres MPP
require any application changes?

Thanks for all the help,

Bucky

Re: Large tables (was: RAID 0 not as fast as

From
Markus Schaber
Date:
Hi, Luke,

Luke Lonergan wrote:

> Since PG's heap scan is single threaded, the seek rate is equivalent to a
> single disk (even though RAID arrays may have many spindles); typical
> random seek rates are around 100-200 seeks per second from within the
> backend.  That means that as sequential scan performance increases, such as
> happens when using large RAID arrays, the random_page_cost will range from
> 50 to 300, growing linearly with the size of the RAID array.

Do you think that adding some posix_fadvise() calls to the backend to
pre-fetch some blocks into the OS cache asynchronously could improve
that situation?

I could imagine that e.g. index bitmap scans could profit in the heap
fetching stage by fadvise()ing the next few blocks.

Maybe asynchronous I/O could be used for the same benefit, but
posix_fadvise() is less intrusive, and can easily be #define'd out on
platforms that don't support it.

Combine this with the Linux Kernel I/O Scheduler patches (readahead
improvements) that were discussed here in summer...

Regards,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Markus,

On 9/20/06 1:09 AM, "Markus Schaber" <schabi@logix-tt.com> wrote:

> Do you think that adding some posix_fadvise() calls to the backend to
> pre-fetch some blocks into the OS cache asynchronously could improve
> that situation?

Nope - this requires true multi-threading of the I/O; there need to be
multiple seek operations running simultaneously.  The current executor
blocks on each page request, waiting for the I/O to happen before requesting
the next page.  The OS can't predict what random page is to be requested
next.

We can implement multiple scanners (already present in MPP), or we could
implement AIO and fire off a number of simultaneous I/O requests for
fulfillment.

- Luke



Re: Large tables (was: RAID 0 not as fast as

From
Markus Schaber
Date:
Hi, Luke,

Luke Lonergan wrote:

>> Do you think that adding some posix_fadvise() calls to the backend to
>> pre-fetch some blocks into the OS cache asynchronously could improve
>> that situation?
>
> Nope - this requires true multi-threading of the I/O; there need to be
> multiple seek operations running simultaneously.  The current executor
> blocks on each page request, waiting for the I/O to happen before requesting
> the next page.  The OS can't predict what random page is to be requested
> next.

I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly
meant for this purpose?

My idea was that the executor could posix_fadvise() the blocks it will
need in the near future, and later, when it actually issues the blocking
read, the block is there already. This could even give speedups in the
single-spindle case, as the I/O scheduler could already fetch the next
blocks while the executor processes the current one.

But there must be some details in the executor that prevent this.

> We can implement multiple scanners (already present in MPP), or we could
> implement AIO and fire off a number of simultaneous I/O requests for
> fulfillment.

AIO is much more intrusive to implement, so I'd prefer to look at
whether posix_fadvise() could improve the situation.

Thanks,
Markus

Re: Large tables (was: RAID 0 not as fast as

From
Ron
Date:
IMHO, AIO is the architecturally cleaner and more elegant solution.

We in fact have a project on the boards to do this but funding (as
yet) has not been found.

My $.02,
Ron




Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Markus,

On 9/20/06 11:02 AM, "Markus Schaber" <schabi@logix-tt.com> wrote:

> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly
> meant for this purpose?

This is a good idea - I wasn't aware that this was possible.

We'll do some testing and see if it works as advertised on Linux and
Solaris.

- Luke



Re: Large tables (was: RAID 0 not as fast as

From
Markus Schaber
Date:
Hi, Luke,

Luke Lonergan wrote:

>> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly
>> meant for this purpose?
>
> This is a good idea - I wasn't aware that this was possible.

This possibility was the reason for me to propose it. :-)

> We'll do some testing and see if it works as advertised on Linux and
> Solaris.

Fine, I'm looking forward to the results.

According to my small test, it works at least on linux 2.6.17.4.

Btw, posix_fadvise() could even give a small improvement for
multi-threaded backends, given that the I/O subsystem is smart enough to
cope intelligently with large bunches of outstanding requests.

HTH,
Markus

Re: Large tables (was: RAID 0 not as fast as

From
"Bucky Jordan"
Date:
> > Do you think that adding some posix_fadvise() calls to the backend to
> > pre-fetch some blocks into the OS cache asynchronously could improve
> > that situation?
>
> Nope - this requires true multi-threading of the I/O; there need to be
> multiple seek operations running simultaneously.  The current executor
> blocks on each page request, waiting for the I/O to happen before
> requesting the next page.  The OS can't predict what random page is to
> be requested next.
>
> We can implement multiple scanners (already present in MPP), or we
> could implement AIO and fire off a number of simultaneous I/O requests
> for fulfillment.

So this might be a dumb question, but the above statements apply to the
cluster (e.g. postmaster) as a whole, not per postgres
process/transaction correct? So each transaction is blocked waiting for
the main postmaster to retrieve the data in the order it was requested
(i.e. not multiple scanners/aio)?

In this case, the only way to take full advantage of larger hardware
using normal postgres would be to run multiple instances? (Which might
not be a bad idea since it would set your application up to be able to
deal with databases distributed on multiple servers...)

- Bucky



Re: Large tables (was: RAID 0 not as fast as

From
Mark Lewis
Date:
> So this might be a dumb question, but the above statements apply to the
> cluster (e.g. postmaster) as a whole, not per postgres
> process/transaction correct? So each transaction is blocked waiting for
> the main postmaster to retrieve the data in the order it was requested
> (i.e. not multiple scanners/aio)?

Each connection runs its own separate back-end process, so these
statements apply per PG connection (=process).

Re: Large tables (was: RAID 0 not as fast as

From
Markus Schaber
Date:
Hi, Bucky,

Bucky Jordan wrote:

>> We can implement multiple scanners (already present in MPP), or we
>> could implement AIO and fire off a number of simultaneous I/O requests
>> for fulfillment.
>
> So this might be a dumb question, but the above statements apply to the
> cluster (e.g. postmaster) as a whole, not per postgres
> process/transaction correct? So each transaction is blocked waiting for
> the main postmaster to retrieve the data in the order it was requested
> (i.e. not multiple scanners/aio)?

No, that's a wrong assumption.

It applies per active backend. When connecting, the Postmaster forks a
new backend process. Each backend process has its own scanner and
executor. The main postmaster is only for coordination (forking, config
reload etc.), all the work is done in the forked per-connection backends.

Furthermore, the PostgreSQL MVCC system ensures that readers are never
blocked and never block other backends. Writers can block each other
due to the ACID transaction semantics, but MVCC keeps that to a
minimum.

> In this case, the only way to take full advantage of larger hardware
> using normal postgres would be to run multiple instances? (Which might
> not be a bad idea since it would set your application up to be able to
> deal with databases distributed on multiple servers...)

Typical OLTP applications (Web UIs, Booking systems, etc.) have multiple
connections, and those run fully parallel.

So if your application is of this type, it will take full advantage of
larger hardware. In the list archive, you should find some links to
benchmarks that prove this statement; PostgreSQL scales linearly up to
8 CPUs and 32 "hyperthreads" in those benchmarks.

Our discussion is about a different type of application, where you
have a single application issuing a single query at a time, dealing with
a large amount (several gigs up to teras) of data.

Now, when such a query is generating sequential disk access, the I/O
scheduler of the underlying OS can easily recognize that pattern, and
prefetch the data, thus giving the full speed benefit of the underlying
RAID.

The discussed problem arises when such large queries generate random
(non-contiguous) disk access (e.g. index scans). Here, the underlying
RAID cannot effectively prefetch data as it does not know what the
application will need next. This effectively limits the speed to that of
a single disk, regardless of the details of the underlying RAID, as it
can only process one request at a time, and has to wait for the
application for the next one.

Now, Bizgres MPP goes the way of having multiple threads per backend,
each one processing a fraction of the data. So there are always several
outstanding read requests that can be scheduled to the disks.

My proposal was to use posix_fadvise() in the single-threaded scanner,
so it can tell the OS "I will need those blocks in the near future". So
the OS can pre-fetch those blocks into the cache, while PostgreSQL still
processes the previous block of data.

Another proposal would be to use so-called asynchronous I/O. This is
definitely an interesting and promising idea, but requires much bigger
changes to the code compared to posix_fadvise().


I hope that this lengthy mail is enlightening, if not, don't hesitate to
ask.

Thanks for your patience,
Markus


--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Large tables (was: RAID 0 not as fast as

From
"Bucky Jordan"
Date:
Markus,

First, thanks - your email was very enlightening. But it does bring up a
few additional questions, so thanks for your patience also - I've listed
them below.

> It applies per active backend. When connecting, the Postmaster forks a
> new backend process. Each backend process has its own scanner and
> executor. The main postmaster is only for coordination (forking, config
> reload etc.), all the work is done in the forked per-connection
> backends.

Each postgres process also uses shared memory (aka the buffer cache) so
as to not fetch data that another process has already requested,
correct?

> Our discussion is about a different type of application, where you
> have a single application issuing a single query at a time, dealing with
> a large amount (several gigs up to teras) of data.
Commonly these are referred to as OLAP applications, correct? Which is
where I believe my application is more focused (it may be handling some
transactions in the future, but at the moment, it follows the "load lots
of data, then analyze it" pattern).

> The discussed problem arises when such large queries generate random
> (non-contiguous) disk access (e.g. index scans). Here, the underlying
> RAID cannot effectively prefetch data as it does not know what the
> application will need next. This effectively limits the speed to that of
> a single disk, regardless of the details of the underlying RAID, as it
> can only process one request at a time, and has to wait for the
> application for the next one.
Does this have anything to do with postgres indexes not storing data, as
some previous posts to this list have mentioned? (In otherwords, having
the index in memory doesn't help? Or are we talking about indexes that
are too large to fit in RAM?)

So this issue would be only on a per query basis? Could it be alleviated
somewhat if I ran multiple smaller queries? For example, I want to
calculate a summary table on 500m records- fire off 5 queries that count
100m records each and update the summary table, leaving MVCC to handle
update contention?

Actually, now that I think about it- that would only work if the
sections I mentioned above were on different disks right? So I would
actually have to do table partitioning with tablespaces on different
spindles to get that to be beneficial? (which is basically not feasible
with RAID, since I don't get to pick what disks the data goes on...)

Are there any other workarounds for current postgres?

Thanks again,

Bucky

Re: Large tables (was: RAID 0 not as fast as

From
Markus Schaber
Date:
Hi, Bucky,

Bucky Jordan wrote:

> Each postgres process also uses shared memory (aka the buffer cache) so
> as to not fetch data that another process has already requested,
> correct?

Yes.

Additionally, the OS caches disk blocks. Most unixoid ones like Linux use
(nearly) all unused memory for this block cache; I don't know about Windows.

> Commonly these are referred to as OLAP applications, correct? Which is
> where I believe my application is more focused (it may be handling some
> transactions in the future, but at the moment, it follows the "load lots
> of data, then analyze it" pattern).

Yes, most OLAP apps fall into this category. But I also think that most
OLAP apps mainly generate sequential data access (sequential scans), for
which the OS prefetching of data works fine.

Btw, some weeks ago, there was a patch mentioned here that improves the
Linux kernel I/O scheduler w.r.t. those prefetching capabilities.

> Does this have anything to do with postgres indexes not storing data, as
> some previous posts to this list have mentioned? (In otherwords, having
> the index in memory doesn't help? Or are we talking about indexes that
> are too large to fit in RAM?)

Yes, it has, but only for the cases where your query fetches only
columns in that index. In cases where you fetch other columns, PostgreSQL
has to access the heap anyway to fetch those.

The overhead for checking outdated row versions (those that were updated
or deleted, but not yet vacuumed) is zero, as those "load bulk, then
analyze" applications typically don't create invalid rows, so every row
fetched from the heap is valid. This is very different in OLTP applications.

> So this issue would be only on a per query basis? Could it be alleviated
> somewhat if I ran multiple smaller queries? For example, I want to
> calculate a summary table on 500m records- fire off 5 queries that count
> 100m records each and update the summary table, leaving MVCC to handle
> update contention?

Yes, you could do that, but only if you're CPU bound, and have a
multi-core machine. And you need table partitioning, as LIMIT/OFFSET is
expensive. Btw, the Bizgres people do exactly this under their hood, so
it may be worth a look.

If you're I/O bound, and your query is a full table scan, or something
else that results in (nearly) sequential disk access, the OS prefetch
algorithm will work.

You can use some I/O monitoring tools to measure the actual speed at which
data comes in when PostgreSQL does the sequential scan, and compare it
to dd'ing the database table files. For simple aggregates like sum(),
you usually get near the "raw" speed, and the real bottlenecks are the
disk I/O rate, bad RAID implementations, or PCI bus contention.
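
A rough sketch of that comparison (the table name is an example; fill in
your database OID and the table's relfilenode):

   time psql -d test -c "SELECT sum(amount) FROM orders;"
   # locate the table's heap file, then read it raw:
   psql -d test -c "SELECT relfilenode FROM pg_class WHERE relname = 'orders';"
   time dd if=$PGDATA/base/<db_oid>/<relfilenode> of=/dev/null bs=8k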

> Actually, now that I think about it- that would only work if the
> sections I mentioned above were on different disks right? So I would
> actually have to do table partitioning with tablespaces on different
> spindles to get that to be beneficial? (which is basically not feasible
> with RAID, since I don't get to pick what disks the data goes on...)

If you really need that much throughput, you can always put the
different partitions on different RAIDs. But hardware gets very
expensive in those dimensions, and it may be better to partition the
data on different machines altogether. AFAIK, Bizgres MPP does exactly that.

> Are there any other workarounds for current postgres?

Are your questions of a theoretical nature, or do you have a concrete
problem? In the latter case, you could post your details here, and we'll see
whether we can help.

Btw, I'm not related with Bizgres in any way. :-)

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Bucky,

On 9/21/06 2:16 PM, "Bucky Jordan" <bjordan@lumeta.com> wrote:

> Does this have anything to do with postgres indexes not storing data, as
> some previous posts to this list have mentioned? (In otherwords, having
> the index in memory doesn't help? Or are we talking about indexes that
> are too large to fit in RAM?)

Yes, if the index could be scanned without needing to scan the heap to
satisfy a query, that query would benefit from sequential access.  This is
true whether the index fits in RAM or not.

> So this issue would be only on a per query basis? Could it be alleviated
> somewhat if I ran multiple smaller queries? For example, I want to
> calculate a summary table on 500m records- fire off 5 queries that count
> 100m records each and update the summary table, leaving MVCC to handle
> update contention?

Clever, functional, and very painful way to do it, but yes, you would get 5
disks' worth of seeking.

My goal is to provide for as many disks seeking at the same time as are
available to the RAID.  Note that the Sun Appliance (X4500 based) has 11
disk drives available per CPU core.  Later it will drop to 5-6 disks per
core with the introduction of quad core CPUs, which is more the norm for
now.  Bizgres MPP will achieve one or two concurrent heap scanners per CPU
for a given query in the default configurations, so we're missing out on
lots of potential speedup for index scans in many cases.

With both MPP and stock Postgres you get more seek rate as you add users,
but it would take 44 users to use all of the drives in random seeking for
Postgres, where for MPP it would take more like 5.

> Actually, now that I think about it- that would only work if the
> sections I mentioned above were on different disks right? So I would
> actually have to do table partitioning with tablespaces on different
> spindles to get that to be beneficial? (which is basically not feasible
> with RAID, since I don't get to pick what disks the data goes on...)

On average, for random seeking we can assume that RAID will distribute the
data evenly.  The I/Os should balance out.

- Luke



Re: Large tables (was: RAID 0 not as fast as

From
Guy Thornley
Date:
> >> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly
> >> meant for this purpose?
> >
> > This is a good idea - I wasn't aware that this was possible.
>
> This possibility was the reason for me to propose it. :-)

posix_fadvise() features in the TODO list already; I'm not sure if any work
on it has been done for pg8.2.

Anyway, I understand that POSIX_FADV_DONTNEED on a Linux 2.6 kernel allows
pages to be discarded from memory earlier than usual. This is useful, since
it means you can prevent your seqscan from nuking the OS cache.
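
Something along these lines, to make the idea concrete (a minimal sketch
assuming the Linux 2.6 semantics described above; the 1 MB chunk size is
illustrative and error handling is omitted):

    /* Minimal sketch: sequential scan that prefetches ahead and drops the
     * pages behind it, so a large scan doesn't evict more useful pages
     * from the OS cache.  Assumes Linux 2.6 posix_fadvise() semantics. */
    #define _XOPEN_SOURCE 600           /* for posix_fadvise() */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)         /* advise the kernel in 1 MB chunks */

    static void scan_drop_behind(int fd)
    {
        char    buf[8192];
        off_t   off = 0, dropped = 0;
        ssize_t n;

        /* Hint that the first chunk will be needed soon (prefetch). */
        posix_fadvise(fd, 0, CHUNK, POSIX_FADV_WILLNEED);

        while ((n = read(fd, buf, sizeof(buf))) > 0)
        {
            off += n;
            if (off - dropped >= CHUNK)
            {
                /* Prefetch the next chunk... */
                posix_fadvise(fd, off, CHUNK, POSIX_FADV_WILLNEED);
                /* ...and discard the chunk we just finished reading. */
                posix_fadvise(fd, dropped, off - dropped, POSIX_FADV_DONTNEED);
                dropped = off;
            }
        }
    }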

Of course you could argue the OS should be able to detect this and prevent
it occurring anyway. I don't know anything about Linux's behaviour in this
area.

.Guy

Re: Large tables (was: RAID 0 not as fast as

From
Bruce Momjian
Date:
Guy Thornley wrote:
> > >> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly
> > >> meant for this purpose?
> > >
> > > This is a good idea - I wasn't aware that this was possible.
> >
> > This possibility was the reason for me to propose it. :-)
>
> posix_fadvise() features in the TODO list already; I'm not sure if any work
> on it has been done for pg8.2.
>
> Anyway, I understand that POSIX_FADV_DONTNEED on a Linux 2.6 kernel allows
> pages to be discarded from memory earlier than usual. This is useful, since
> it means you can prevent your seqscan from nuking the OS cache.
>
> Of course you could argue the OS should be able to detect this and prevent
> it occurring anyway. I don't know anything about Linux's behaviour in this
> area.

We tried posix_fadvise() during the 8.2 development cycle, but had
problems as outlined in a comment in xlog.c:

    /*
     * posix_fadvise is problematic on many platforms: on older x86 Linux
     * it just dumps core, and there are reports of problems on PPC platforms
     * as well.  The following is therefore disabled for the time being.
     * We could consider some kind of configure test to see if it's safe to
     * use, but since we lack hard evidence that there's any useful performance
     * gain to be had, spending time on that seems unprofitable for now.
     */

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Large tables (was: RAID 0 not as fast as

From
mark@mark.mielke.cc
Date:
On Fri, Sep 22, 2006 at 02:52:09PM +1200, Guy Thornley wrote:
> > >> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly
> > >> meant for this purpose?
> > > This is a good idea - I wasn't aware that this was possible.
> > This possibility was the reason for me to propose it. :-)
> posix_fadvise() features in the TODO list already; I'm not sure if any work
> on it has been done for pg8.2.
>
> Anyway, I understand that POSIX_FADV_DONTNEED on a Linux 2.6 kernel allows
> pages to be discarded from memory earlier than usual. This is useful, since
> it means you can prevent your seqscan from nuking the OS cache.
>
> Of course you could argue the OS should be able to detect this and prevent
> it occurring anyway. I don't know anything about Linux's behaviour in this
> area.

I recall either monitoring or participating in the discussion when this
call was added to Linux.

I don't believe the kernel can auto-detect that you do not need a page
any longer. It can only prioritize pages to keep when memory is fully
in use and a new page must be loaded. This is often some sort of LRU
scheme. If the page is truly useless, only the application can know.

I'm not convinced that PostgreSQL can know this. The case where it is
useful is if a single process is sequentially scanning a large file
(much larger than memory). As soon as it is more than one process,
or if it is not a sequential scan, or if it is not a large file, this
call hurts more than it gains. Just because I'm done with the page does
not mean that *you* are done with the page.

I'd advise against using this call unless it can be shown that the page
will not be used in the future, or at least, that the page is less useful
than all other pages currently in memory. This is what the call really means.
It means, "There is no value to keeping this page in memory".

Perhaps certain PostgreSQL loads fit this pattern. None of my uses fit
this pattern, and I have trouble believing that a majority of PostgreSQL
loads fits this pattern.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Mark,

On 9/21/06 8:40 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:

> I'd advise against using this call unless it can be shown that the page
> will not be used in the future, or at least, that the page is less useful
> than all other pages currently in memory. This is what the call really means.
> It means, "There is no value to keeping this page in memory".

Yes, it's a bit subtle.

I think the topic is similar to "cache bypass", used in cache capable vector
processors (Cray, Convex, Multiflow, etc) in the 90's.  When you are
scanning through something larger than the cache, it should be marked
"non-cacheable" and bypass caching altogether.  This avoids a copy, and
keeps the cache available for things that can benefit from it.

WRT the PG buffer cache, the rule would have to be: if the heap scan is
going to be larger than effective_cache_size, then issue posix_fadvise()
with POSIX_FADV_DONTNEED.  It doesn't sound very efficient to do this in
block/extent increments, though, and it would possibly mess with subsets
of the block space that would be re-used for other queries.
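
In code, the rule might look something like this (a hypothetical sketch,
not actual PostgreSQL source; the parameter names are illustrative):

    /* Hypothetical sketch of the rule above -- not actual PostgreSQL code.
     * After finishing a range of a heap scan, drop it from the OS cache,
     * but only when the whole scan is too big to fit there anyway. */
    #define _XOPEN_SOURCE 600
    #include <sys/types.h>
    #include <fcntl.h>

    static void advise_drop_behind(int fd,
                                   off_t scan_bytes,            /* total size of the scan */
                                   off_t effective_cache_bytes, /* effective_cache_size */
                                   off_t done_off, off_t done_len)
    {
        if (scan_bytes > effective_cache_bytes)
            (void) posix_fadvise(fd, done_off, done_len, POSIX_FADV_DONTNEED);
    }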

- Luke



Re: Large tables (was: RAID 0 not as fast as

From
Markus Schaber
Date:
Hi, Guy,

Guy Thornley wrote:

> Of course you could argue the OS should be able to detect this, and prevent
> it occuring anyway. I don't know anything about linux's behaviour in this
> area.

Yes, one can argue that way.

But a generic algorithm in the OS can never be as smart as the
application, which has more information about its semantics and access
patterns. Anything else would need a crystal ball device. :-)

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Large tables (was: RAID 0 not as fast as

From
"Jim C. Nasby"
Date:
On Thu, Sep 21, 2006 at 08:46:41PM -0700, Luke Lonergan wrote:
> Mark,
>
> On 9/21/06 8:40 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:
>
> > I'd advise against using this call unless it can be shown that the page
> > will not be used in the future, or at least, that the page is less useful
> > than all other pages currently in memory. This is what the call really means.
> > It means, "There is no value to keeping this page in memory".
>
> Yes, it's a bit subtle.
>
> I think the topic is similar to "cache bypass", used in cache capable vector
> processors (Cray, Convex, Multiflow, etc) in the 90's.  When you are
> scanning through something larger than the cache, it should be marked
> "non-cacheable" and bypass caching altogether.  This avoids a copy, and
> keeps the cache available for things that can benefit from it.
>
> WRT the PG buffer cache, the rule would have to be: if the heap scan is
> going to be larger than effective_cache_size, then issue posix_fadvise()
> with POSIX_FADV_DONTNEED.  It doesn't sound very efficient to do this in
> block/extent increments, though, and it would possibly mess with subsets
> of the block space that would be re-used for other queries.

Another issue is that if you start two large seqscans on the same table
at about the same time, right now you should only be issuing one set of
reads for both requests, because one of them will just pull the blocks
back out of cache. If we weren't caching, then each query would have to
physically read (which would be horrid).

There's been talk of adding code that would have a seqscan detect whether
another seqscan is happening on the table at the same time, and if so,
start its seqscan wherever the other seqscan is currently running. That
would probably ensure that we weren't reading from the table in two
different places, even if we weren't caching.
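
Conceptually the idea is something like this toy sketch (not actual
PostgreSQL code; shared-memory access and locking are hand-waved away):

    /* Toy sketch of the "piggyback" seqscan idea -- not PostgreSQL code.
     * Each scan publishes the block it is reading; a new scan starts at
     * the published position, wraps past the end of the table, and stops
     * where it started, so concurrent scans share one pass over the disk. */
    typedef struct
    {
        int nblocks;    /* size of the table in blocks */
        int current;    /* block an active scan is reading, or -1 if none */
    } ScanHint;

    /* Return the next block to read, or -1 when the scan has come full
     * circle.  Pass blk = -1 to start a new scan; *start records where
     * this scan began. */
    static int next_block(ScanHint *hint, int *start, int blk)
    {
        if (blk < 0)
            blk = *start = (hint->current >= 0) ? hint->current : 0;
        else if ((blk = (blk + 1) % hint->nblocks) == *start)
            return -1;
        hint->current = blk;    /* publish our position for later scans */
        return blk;
    }
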
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Large tables (was: RAID 0 not as fast as

From
"Jim C. Nasby"
Date:
On Thu, Sep 21, 2006 at 11:05:39PM -0400, Bruce Momjian wrote:
> We tried posix_fadvise() during the 8.2 development cycle, but had
> problems as outlined in a comment in xlog.c:
>
>     /*
>      * posix_fadvise is problematic on many platforms: on older x86 Linux
>      * it just dumps core, and there are reports of problems on PPC platforms
>      * as well.  The following is therefore disabled for the time being.
>      * We could consider some kind of configure test to see if it's safe to
>      * use, but since we lack hard evidence that there's any useful performance
>      * gain to be had, spending time on that seems unprofitable for now.
>      */

In case it's not clear, that's a call for someone to do some performance
testing. :)

Bruce, do you happen to have a URL for a patch that puts fadvise in?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Large tables (was: RAID 0 not as fast as

From
Ron Mayer
Date:
Luke Lonergan wrote:
>
> I think the topic is similar to "cache bypass", used in cache capable vector
> processors (Cray, Convex, Multiflow, etc) in the 90's.  When you are
> scanning through something larger than the cache, it should be marked
> "non-cacheable" and bypass caching altogether.  This avoids a copy, and
> keeps the cache available for things that can benefit from it.

And of course some file systems do this automatically when they detect a
sequential scan [1], though it can have unexpected (to some) negative side
effects [2].  For file systems that support freebehind as a configurable
parameter, it might be easier to experiment with the idea there.

[1]
http://www.ediaudit.com/doc_sol10/Solaris_10_Doc/common/SUNWaadm/reloc/sun_docs/C/solaris_10/SUNWaadm/SOLTUNEPARAMREF/p18.html
[2] http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6207772



Re: Large tables (was: RAID 0 not as fast as

From
"Luke Lonergan"
Date:
Jim,

On 9/22/06 7:01 AM, "Jim C. Nasby" <jim@nasby.net> wrote:

> There's been talk of adding code that would have a seqscan detect if
> another seqscan is happening on the table at the same time, and if it
> is, to start it's seqscan wherever the other seqscan is currently
> running. That would probably ensure that we weren't reading from the
> table in 2 different places, even if we weren't caching.

Right, aka "SyncScan".

The optimization you point out that we miss when bypassing the cache is a
pretty unlikely event in real-world apps, though it makes poorly designed
benchmarks go really fast.  It's much more likely that the second seqscan
will start after the block cache is exhausted, which will cause actuator
thrashing (depending on the readahead the OS uses).  SyncScan fixes that.

- Luke