Thread: Quad processor options
Hi,

I am curious whether there are any real-life production quad processor setups running PostgreSQL out there. Since PostgreSQL lacks a proper replication/cluster solution, we have to buy a bigger machine.

Right now we are running on a dual 2.4 GHz Xeon, 3 GB RAM and U160 SCSI hardware RAID 10.

Does anyone have experience with quad Xeon or quad Opteron setups? I am looking at the appropriate boards from Tyan, which would be the only option for us to buy such a beast. The 30k+ setups from Dell etc. don't fit our budget.

I am thinking of the following:

Quad processor (Xeon or Opteron)
5 x SCSI 15K RPM for RAID 10 + spare drive
2 x IDE for system
ICP-Vortex battery-backed U320 hardware RAID
4-8 GB RAM

Would be nice to hear from you.

Regards,
Bjoern
It's very good to understand the specific choke points you're trying to address by upgrading, so you don't get disappointed. Are you truly CPU constrained, or is it memory footprint or I/O throughput that makes you want to upgrade? IMO the best way to begin understanding system choke points is vmstat output. Would you mind forwarding the output of "vmstat 10 120" during a peak load period? (I'm assuming this is Linux or a Unix variant.) A brief description of what is happening during the vmstat sample would help a lot too. > I am curious if there are any real life production quad processor > setups running postgresql out there. Since postgresql lacks a proper > replication/cluster solution, we have to buy a bigger machine. > > Right now we are running on a dual 2.4 Xeon, 3 GB Ram and U160 SCSI > hardware-raid 10. > > Has anyone experiences with quad Xeon or quad Opteron setups? I am > looking at the appropriate boards from Tyan, which would be the only > option for us to buy such a beast. The 30k+ setups from Dell etc. > don't fit our budget. > > I am thinking of the following: > > Quad processor (xeon or opteron) > 5 x SCSI 15K RPM for Raid 10 + spare drive > 2 x IDE for system > ICP-Vortex battery backed U320 Hardware Raid > 4-8 GB Ram > > Would be nice to hear from you. > > Regards, > Bjoern
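For reference, a minimal way to collect what Paul is asking for during the peak window might look like the following sketch (iostat comes from the sysstat package; on some 2.4 setups the extended -x statistics stay empty, in which case plain "iostat 10" still shows aggregate numbers; file names are just examples):

    # sample every 10 seconds for 20 minutes during peak load
    vmstat 10 120    > vmstat-peak.txt &
    iostat -x 10 120 > iostat-peak.txt &
    # note down what the application is doing while the samples run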
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > Hi, > > I am curious if there are any real life production quad processor setups > running postgresql out there. Since postgresql lacks a proper > replication/cluster solution, we have to buy a bigger machine. > > Right now we are running on a dual 2.4 Xeon, 3 GB Ram and U160 SCSI > hardware-raid 10. > > Has anyone experiences with quad Xeon or quad Opteron setups? I am > looking at the appropriate boards from Tyan, which would be the only > option for us to buy such a beast. The 30k+ setups from Dell etc. don't > fit our budget. > > I am thinking of the following: > > Quad processor (xeon or opteron) > 5 x SCSI 15K RPM for Raid 10 + spare drive > 2 x IDE for system > ICP-Vortex battery backed U320 Hardware Raid > 4-8 GB Ram Well, from what I've read elsewhere on the internet, it would seem the Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the exception to this is SGI's altix, which uses their own chipset and runs the itanium with very good memory bandwidth. But, do you really need more CPU horsepower? Are you I/O or CPU or memory or memory bandwidth bound? If you're sitting at 99% idle, and iostat says your drives are only running at some small percentage of what you know they could, you might be memory or memory bandwidth limited. Adding two more CPUs will not help with that situation. If your I/O is saturated, then the answer may well be a better RAID array, with many more drives plugged into it. Do you have any spare drives you can toss on the machine to see if that helps? Sometimes going from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in performance. In short, don't expect 4 CPUs to solve the problem if the problem isn't really the CPUs being maxed out. Also, what type of load are you running? Mostly read, mostly written, few connections handling lots of data, lots of connections each handling a little data, lots of transactions, etc... If you are doing lots of writing, make SURE you have a controller that supports battery backed cache and is configured to write-back, not write-through.
scott.marlowe wrote: > Well, from what I've read elsewhere on the internet, it would seem the > Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the > exception to this is SGI's altix, which uses their own chipset and runs > the itanium with very good memory bandwidth. This is basically what I read too. But I cannot spend money on a quad opteron just for testing purposes :) > But, do you really need more CPU horsepower? > > Are you I/O or CPU or memory or memory bandwidth bound? If you're sitting > at 99% idle, and iostat says your drives are only running at some small > percentage of what you know they could, you might be memory or memory > bandwidth limited. Adding two more CPUs will not help with that > situation. Right now we have a dual xeon 2.4, 3 GB Ram, Mylex extremeraid controller, running 2 Compaq BD018122C0, 1 Seagate ST318203LC and 1 Quantum ATLAS_V_18_SCA. iostat shows between 20 and 60 % user avg-cpu. And this is not even peak time. I attached a "vmstat 10 120" output for perhaps 60-70% peak load. > If your I/O is saturated, then the answer may well be a better RAID > array, with many more drives plugged into it. Do you have any spare > drives you can toss on the machine to see if that helps? Sometimes going > from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in > performance. Next drives I'll buy will certainly be 15k scsi drives. > In short, don't expect 4 CPUs to solve the problem if the problem isn't > really the CPUs being maxed out. > > Also, what type of load are you running? Mostly read, mostly written, few > connections handling lots of data, lots of connections each handling a > little data, lots of transactions, etc... In peak times we can get up to 700-800 connections at the same time. There are quite a lot of updates involved; without having exact numbers, I think that we have about 70% selects and 30% updates/inserts. > If you are doing lots of writing, make SURE you have a controller that > supports battery backed cache and is configured to write-back, not > write-through. Could you recommend a certain controller type? The only battery backed one that I found on the net is the newest model from icp-vortex.com. Regards, Bjoern

~# vmstat 10 120
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff    cache  si  so    bi    bo    in     cs  us sy id
 1  1  0  24180  10584  32468  2332208   0   1     0     2     1      2   2  0  0
 0  2  0  24564  10480  27812  2313528   8   0  7506   574  1199   8674  30  7 63
 2  1  0  24692  10060  23636  2259176   0  18  8099   298  2074   6328  25  7 68
 2  0  0  24584  18576  21056  2299804   3   6 13208   305  1598   8700  23  6 71
 1 21  1  24504  16588  20912  2309468   4   0  1442  1107   754   6874  42 13 45
 6  1  0  24632  13148  19992  2319400   0   0  2627   499  1184   9633  37  6 58
 5  1  0  24488  10912  19292  2330080   5   0  3404   150  1466  10206  32  6 61
 4  1  0  24488  12180  18824  2342280   3   0  2934    40  1052   3866  19  3 78
 0  0  0  24420  14776  19412  2347232   6   0   403   216  1123   4702  22  3 74
 0  0  0  24548  14408  17380  2321780   4   0   522   715   965   6336  25  5 71
 4  0  0  24676  12504  17756  2322988   0   0   564   830   883   7066  31  6 63
 0  3  0  24676  14060  18232  2325224   0   0   483   388  1097   3401  21  3 76
 0  2  1  24676  13044  18700  2322948   0   0   701   195  1078   5187  23  3 74
 2  0  0  24676  21576  18752  2328168   0   0   467   177  1552   3574  18  3 78
Paul Tuckfield wrote: > Would you mind forwarding the output of "vmstat 10 120" under peak load > period? (I'm assuming this is a Linux or Unix variant) a brief > description of what is happening during the vmstat sample would help a > lot too. See my other mail. We are running Linux, Kernel 2.4. As soon as the next Debian version comes out, I'll happily switch to 2.6 :) Regards, Bjoern
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > scott.marlowe wrote: > > > Well, from what I've read elsewhere on the internet, it would seem the > > Opterons scale better to 4 CPUs than the basic Xeons do. Of course, the > > exception to this is SGI's altix, which uses their own chipset and runs > > the itanium with very good memory bandwidth. > > This is basically what I read too. But I cannot spent money on a quad > opteron just for testing purposes :) Wouldn't it be nice to just have a lab full of these things? > > If your I/O is saturated, then the answer may well be a better RAID > > array, with many more drives plugged into it. Do you have any spare > > drives you can toss on the machine to see if that helps? Sometimes going > > from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big boost in > > performance. > > Next drives I'll buy will certainly be 15k scsi drives. Better to buy more 10k drives than fewer 15k drives. Other than slightly faster select times, the 15ks aren't really any faster. > > In short, don't expect 4 CPUs to solve the problem if the problem isn't > > really the CPUs being maxed out. > > > > Also, what type of load are you running? Mostly read, mostly written, few > > connections handling lots of data, lots of connections each handling a > > little data, lots of transactions, etc... > > In peak times we can get up to 700-800 connections at the same time. > There are quite some updates involved, without having exact numbers I'll > think that we have about 70% selects and 30% updates/inserts. Wow, a lot of writes then. > > If you are doing lots of writing, make SURE you have a controller that > > supports battery backed cache and is configured to write-back, not > > write-through. > > Could you recommend a certain controller type? The only battery backed > one that I found on the net is the newest model from icp-vortex.com. Sure, Adaptec makes one, and so does LSI. Dell resells both of these; the PERC3DI and the PERC3DC are Adaptec and LSI respectively, I believe. We run the LSI MegaRAID with 64 MB of battery-backed cache. Intel also makes one, but I've heard nothing about it. If you get the LSI MegaRAID, make sure you're running the latest megaraid 2 driver, not the older, slower 1.18 series. If you are running Linux, look for the DKMS-packaged version. DKMS (Dynamic Kernel Module Support) automagically compiles and installs source RPMs for drivers when you install them, and configures the machine to use them at boot. Most drivers seem to be slowly headed that way in the Linux universe, and I really like the simplicity and power of DKMS. I haven't directly tested anything but the Adaptec and the LSI MegaRAID. Here at work we've had massive issues trying to get the Adaptec cards configured and installed, while the MegaRAID was a snap: installed RH, installed the dkms RPM, installed the dkms-enabled megaraid driver and rebooted. Literally, that's all it took.
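A quick way to check which megaraid driver series a box is actually running (a sketch; module names vary between driver generations, so adjust accordingly):

    lsmod | grep -i megaraid                    # is the module loaded, and under which name?
    modinfo megaraid 2>/dev/null | grep -i version
    dkms status                                 # lists DKMS-managed modules, if dkms is installed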
scott.marlowe wrote: >>Next drives I'll buy will certainly be 15k scsi drives. > > Better to buy more 10k drives than fewer 15k drives. Other than slightly > faster select times, the 15ks aren't really any faster. Good to know. I'll remember that. >>In peak times we can get up to 700-800 connections at the same time. >>There are quite some updates involved, without having exact numbers I'll >>think that we have about 70% selects and 30% updates/inserts. > > Wow, a lot of writes then. Yes, it could also be only 15-20% updates/inserts, but that is still not negligible. > Sure, adaptec makes one, so does lsi megaraid. Dell resells both of > these, the PERC3DI and the PERC3DC are adaptec, then lsi in that order, I > believe. We run the lsi megaraid with 64 megs battery backed cache. The LSI sounds good. > Intel also makes one, but I've heard nothing about it. It could well be the ICP Vortex one; ICP was bought by Intel some time ago. > I haven't directly tested anything but the adaptec and the lsi megaraid. > Here at work we've had massive issues trying to get the adaptec cards > configured and installed on, while the megaraid was a snap. Installed RH, > installed the dkms rpm, installed the dkms enabled megaraid driver and > rebooted. Literally, that's all it took. I haven't heard anything about DKMS for Debian, so I will be hand-patching as usual :) Regards, Bjoern
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > scott.marlowe wrote: > > Sure, adaptec makes one, so does lsi megaraid. Dell resells both of > > these, the PERC3DI and the PERC3DC are adaptec, then lsi in that order, I > > believe. We run the lsi megaraid with 64 megs battery backed cache. > > The LSI sounds good. > > > Intel also makes one, but I've heard nothing about it. > > It could well be the ICP Vortex one, ICP was bought by Intel some time ago.. There are also bigger, faster external RAID boxes that make the internal cards seem puny. They're nice because all you need in your main box is a good U320 controller to plug into the external RAID array. That URL I mentioned earlier that had prices has some of the external boxes listed. No price, not for sale on the web, get out the checkbook and write a blank check is my guess. I.e. they're not cheap. The other nice thing about the LSI cards is that you can install more than one and they act like one big RAID array. I.e. install two cards with a 20 drive RAID0 on each, then make a RAID1 across them, and if one or the other card itself fails, you've still got 100% of your data sitting there. Nice to know you can survive the complete failure of one half of your chain. > > I haven't directly tested anything but the adaptec and the lsi megaraid. > > Here at work we've had massive issues trying to get the adaptec cards > > configured and installed on, while the megaraid was a snap. Installed RH, > > installed the dkms rpm, installed the dkms enabled megaraid driver and > > rebooted. Literally, that's all it took. > > I didn't hear anything about dkms for debian, so I will be hand-patching > as usual :) Yeah, it seems to be an RPM kind of thing. But I'm thinking the 2.0 drivers got included in the latest 2.6 kernels, so no biggie. I was looking around on Google, and it definitely appears the 2.x and 1.x megaraid drivers were merged into a "unified" driver in the 2.6 kernel.
I'm confused why you say the system is 70% busy: the vmstat output shows 70% *idle*.

The vmstat you sent shows good things and ambiguous things:

- si and so are zero, so you're not paging/swapping. That's always step 1; you're fine.
- bi and bo (physical IO) show pretty high numbers for how many disks you have (assuming random IO), so please send an "iostat 10" sampling during peak.
- note that the CPU is only 30% busy. That should mean that adding CPUs will *not* help.
- the "cache" column shows that Linux is using 2.3G for cache (way too much). You generally want to give memory to postgres to keep it "close" to the user, not leave it unused to be claimed by the Linux cache (you need to leave *some* for Linux, though).

My recommendations:

- I'll bet you have a low value for shared buffers, like 10000. On your 3G system you should ramp up the value to at least 1G (125000 8k buffers) unless something else runs on the system. It's best not to do things too drastically, so if I'm right and you sit at 10000 now, try going to 30000, then 60000, then 125000 or above.
- if the above is off base, then I wonder why we see high run-queue numbers in spite of over 60% idle CPU. Maybe some serialization is happening somewhere. Also, depending on how you've laid out your 4 disk drives, you may see all IOs going to one drive; the 7M/sec is on the high side if that's the case. iostat numbers will reveal whether it's skewed and whether it's random, though Linux iostat doesn't seem to report response times (sigh). Response times are the golden metric when diagnosing IO throughput in an OLTP / stripe situation.

On May 11, 2004, at 1:41 PM, Bjoern Metzdorf wrote: > scott.marlowe wrote: > >> Well, from what I've read elsewhere on the internet, it would seem >> the Opterons scale better to 4 CPUs than the basic Xeons do. Of >> course, the exception to this is SGI's altix, which uses their own >> chipset and runs the itanium with very good memory bandwidth. > > This is basically what I read too. But I cannot spent money on a quad > opteron just for testing purposes :) > >> But, do you really need more CPU horsepower? >> Are you I/O or CPU or memory or memory bandwidth bound? If you're >> sitting at 99% idle, and iostat says your drives are only running at >> some small percentage of what you know they could, you might be >> memory or memory bandwidth limited. Adding two more CPUs will not >> help with that situation. > > Right now we have a dual xeon 2.4, 3 GB Ram, Mylex extremeraid > controller, running 2 Compaq BD018122C0, 1 Seagate ST318203LC and 1 > Quantum ATLAS_V_18_SCA. > > iostat show between 20 and 60 % user avg-cpu. And this is not even > peak time. > > I attached a "vmstat 10 120" output for perhaps 60-70% peak load. > >> If your I/O is saturated, then the answer may well be a better RAID >> array, with many more drives plugged into it. Do you have any spare >> drives you can toss on the machine to see if that helps? Sometimes >> going from 4 drives in a RAID 1+0 to 6 or 8 or more can give a big >> boost in performance. > > Next drives I'll buy will certainly be 15k scsi drives. > >> In short, don't expect 4 CPUs to solve the problem if the problem >> isn't really the CPUs being maxed out. >> Also, what type of load are you running? Mostly read, mostly >> written, few connections handling lots of data, lots of connections >> each handling a little data, lots of transactions, etc... > > In peak times we can get up to 700-800 connections at the same time.
> There are quite some updates involved, without having exact numbers > I'll think that we have about 70% selects and 30% updates/inserts. > >> If you are doing lots of writing, make SURE you have a controller >> that supports battery backed cache and is configured to write-back, >> not write-through. > > Could you recommend a certain controller type? The only battery backed > one that I found on the net is the newest model from icp-vortex.com. > > Regards, > Bjoern >
> ~# vmstat 10 120
>    procs                      memory      swap          io     system         cpu
>  r  b  w   swpd   free   buff    cache  si  so    bi    bo    in     cs  us sy id
>  1  1  0  24180  10584  32468  2332208   0   1     0     2     1      2   2  0  0
>  0  2  0  24564  10480  27812  2313528   8   0  7506   574  1199   8674  30  7 63
>  2  1  0  24692  10060  23636  2259176   0  18  8099   298  2074   6328  25  7 68
>  2  0  0  24584  18576  21056  2299804   3   6 13208   305  1598   8700  23  6 71
>  1 21  1  24504  16588  20912  2309468   4   0  1442  1107   754   6874  42 13 45
>  6  1  0  24632  13148  19992  2319400   0   0  2627   499  1184   9633  37  6 58
>  5  1  0  24488  10912  19292  2330080   5   0  3404   150  1466  10206  32  6 61
>  4  1  0  24488  12180  18824  2342280   3   0  2934    40  1052   3866  19  3 78
>  0  0  0  24420  14776  19412  2347232   6   0   403   216  1123   4702  22  3 74
>  0  0  0  24548  14408  17380  2321780   4   0   522   715   965   6336  25  5 71
>  4  0  0  24676  12504  17756  2322988   0   0   564   830   883   7066  31  6 63
>  0  3  0  24676  14060  18232  2325224   0   0   483   388  1097   3401  21  3 76
>  0  2  1  24676  13044  18700  2322948   0   0   701   195  1078   5187  23  3 74
>  2  0  0  24676  21576  18752  2328168   0   0   467   177  1552   3574  18  3 78
On Tue, 11 May 2004, Bjoern Metzdorf wrote: > I am curious if there are any real life production quad processor setups > running postgresql out there. Since postgresql lacks a proper > replication/cluster solution, we have to buy a bigger machine. Do you run the latest version of PG? I've read the thread but have not seen any information about which pg version you use. All I've seen was a reference to Debian, which might just as well mean that you run pg 7.2 (probably not, but I have to ask). Some classes of queries run much faster in pg 7.4 than in older versions, so if you are lucky that can help. -- /Dennis Björklund
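For completeness, the running server version can be confirmed with a one-liner (template1 exists in every installation):

    psql -d template1 -c 'SELECT version();'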
...and on Tue, May 11, 2004 at 03:02:24PM -0600, scott.marlowe used the keyboard: > > If you get the LSI megaraid, make sure you're running the latest megaraid > 2 driver, not the older, slower 1.18 series. If you are running linux, > look for the dkms packaged version. dkms, (Dynamic Kernel Module System) > automagically compiles and installs source rpms for drivers when you > install them, and configures the machine to use them to boot up. Most > drivers seem to be slowly headed that way in the linux universe, and I > really like the simplicity and power of dkms. > Hi, Given the fact that the LSI MegaRAID seems to be a popular solution around here, and many of you folks use Linux as well, I thought sharing this piece of info might be of use. Running the v2 megaraid driver on a 2.4 kernel is actually not a good idea _at_ _all_, as it will silently corrupt your data in the event of a disk failure. Sorry to have to say so, but we tested it (on kernels up to 2.4.25, not sure about 2.4.26 yet) and it turns out it doesn't do hotswap the way it should. Somehow the replaced disk drives are not _really_ added to the array, which continues to work in degraded mode for a while and (even worse than that) then starts to think the replaced disk is in order without actually having resynced it, thus beginning to issue writes to non-existent areas of it. The 2.6 megaraid driver indeed seems to be a merged version of the above driver and the old one, giving both improved performance and correct functionality in the event of a hotswap taking place. Hope this helped, -- Grega Bremec Senior Administrator Noviforum Ltd., Software & Media http://www.noviforum.si/
On Tue, 11 May 2004 15:46:25 -0700, Paul Tuckfield <paul@tuckfield.com> wrote: >- the "cache" column shows that linux is using 2.3G for cache. (way too >much) There is no such thing as "way too much cache". > you generally want to give memory to postgres to keep it "close" to >the user, Yes, but only a moderate amount of memory. > not leave it unused to be claimed by linux cache Cache is not unused memory. >- I'll bet you have a low value for shared buffers, like 10000. On >your 3G system > you should ramp up the value to at least 1G (125000 8k buffers) In most cases this is almost the worst thing you can do. The only thing even worse would be setting it to 1.5 G. Postgres is just happy with a moderate shared_buffers setting. We usually recommend something like 10000. You could try 20000, but don't increase it beyond that without strong evidence that it helps in your particular case. This has been discussed several times here, on -hackers and on -general. Search the archives for more information. Servus Manfred
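As a concrete sketch of the moderate setting described here (values are illustrative, not tuned recommendations):

    # postgresql.conf
    shared_buffers = 20000          # 20000 x 8 kB = roughly 160 MB of shared memory

    # the kernel must allow a shared memory segment at least that big,
    # e.g. in /etc/sysctl.conf (value in bytes, with some headroom)
    kernel.shmmax = 268435456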
On 12 May 2004, at 12:17 PM, Manfred Koizar wrote: > On Tue, 11 May 2004 15:46:25 -0700, Paul Tuckfield <paul@tuckfield.com> > wrote: > >> - I'll bet you have a low value for shared buffers, like 10000. On >> your 3G system >> you should ramp up the value to at least 1G (125000 8k buffers) > > In most cases this is almost the worst thing you can do. The only > thing > even worse would be setting it to 1.5 G. > > Postgres is just happy with a moderate shared_buffers setting. We > usually recommend something like 10000. You could try 20000, but don't > increase it beyond that without strong evidence that it helps in your > particular case. > > This has been discussed several times here, on -hackers and on > -general. > Search the archives for more information. We have definitely found this to be true here. We have some fairly complex queries running on a rather underpowered box (beautiful but steam-driven old Silicon Graphics Challenge DM). We ended up using a very slight increase to shared buffers, but gaining ENORMOUSLY through proper optimisation of queries, appropriate indices and the use of optimizer-bludgeons like "SET ENABLE_SEQSCAN = OFF" Hal
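Such bludgeons are usually best applied per session while comparing plans rather than globally; a sketch of the workflow (database name, table and predicate are purely hypothetical):

    psql -d yourdb <<'SQL'
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    SET enable_seqscan = off;    -- affects only this session, to compare plans
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    SQL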
It appears that your CPU is 'slow' while your disk subsystem is 'fast'. I once had such a situation with 15 kRPM drives and a ~500MHz Pentium III. On that system, the best solution was to either increase effective_cache_size or decrease random_page_cost (the latter obviously has to do with the fast disk, the former with the large amount of RAM). In any case, proper optimization of queries always helps. :-) Daniel >>>Halford Dace said: > > On 12 May 2004, at 12:17 PM, Manfred Koizar wrote: > > > On Tue, 11 May 2004 15:46:25 -0700, Paul Tuckfield <paul@tuckfield.com> > > wrote: > > > >> - I'll bet you have a low value for shared buffers, like 10000. On > >> your 3G system > >> you should ramp up the value to at least 1G (125000 8k buffers) > > > > In most cases this is almost the worst thing you can do. The only > > thing > > even worse would be setting it to 1.5 G. > > > > Postgres is just happy with a moderate shared_buffers setting. We > > usually recommend something like 10000. You could try 20000, but don't > > increase it beyond that without strong evidence that it helps in your > > particular case. > > > > This has been discussed several times here, on -hackers and on > > -general. > > Search the archives for more information. > > We have definitely found this to be true here. We have some fairly > complex queries running on a rather underpowered box (beautiful but > steam-driven old Silicon Graphics Challenge DM). We ended up using a > very slight increase to shared buffers, but gaining ENORMOUSLY through > proper optimisation of queries, appropriate indices and the use of > optimizer-bludgeons like "SET ENABLE_SEQSCAN = OFF" > > Hal
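For the box discussed in this thread, that advice might translate into something like the following (illustrative values; measure your own queries before and after):

    # postgresql.conf
    effective_cache_size = 250000   # in 8 kB pages; ~2 GB, roughly the Linux cache seen in the vmstat output
    random_page_cost = 2            # default is 4; lower it when random reads are comparatively cheap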
On Wed, 12 May 2004, Grega Bremec wrote: > ...and on Tue, May 11, 2004 at 03:02:24PM -0600, scott.marlowe used the keyboard: > > > > If you get the LSI megaraid, make sure you're running the latest megaraid > > 2 driver, not the older, slower 1.18 series. If you are running linux, > > look for the dkms packaged version. dkms, (Dynamic Kernel Module System) > > automagically compiles and installs source rpms for drivers when you > > install them, and configures the machine to use them to boot up. Most > > drivers seem to be slowly headed that way in the linux universe, and I > > really like the simplicity and power of dkms. > > > > Hi, > > Given the fact LSI MegaRAID seems to be a popular solution around here, and > many of you folx use Linux as well, I thought sharing this piece of info > might be of use. > > Running v2 megaraid driver on a 2.4 kernel is actually not a good idea _at_ > _all_, as it will silently corrupt your data in the event of a disk failure. > > Sorry to have to say so, but we tested it (on kernels up to 2.4.25, not sure > about 2.4.26 yet) and it comes out it doesn't do hotswap the way it should. > > Somehow the replaced disk drives are not _really_ added to the array, which > continues to work in degraded mode for a while and (even worse than that) > then starts to think the replaced disk is in order without actually having > resynced it, thus beginning to issue writes to non-existant areas of it. > > The 2.6 megaraid driver indeed seems to be a merged version of the above > driver and the old one, giving both improved performance and correct > functionality in the event of a hotswap taking place. This doesn't make any sense to me, since the hot swapping is handled by the card autonomously. I also tested it with a hot spare and pulled one drive and it worked fine during our acceptance testing. However, I've got a hot spare machine I can test on, so I'll try it again and see if I can make it fail. When testing it, was the problem present in certain RAID configurations or only one type or what? I'm curious to try and reproduce this problem, since I've never heard of it before. Also, what firmware version were those megaraid cards running? Ours is fairly new, as we got it at the beginning of this year, and I'm wondering if it is a firmware issue.
Hi,

at first, many thanks for your valuable replies. On my quest for the ultimate hardware platform I'll try to summarize the things I learned.

-------------------------------------------------------------

This is our current setup:

Hardware:
Dual Xeon DP 2.4 on a TYAN S2722-533 with HT enabled
3 GB RAM (2 x 1 GB + 2 x 512 MB)
Mylex Extremeraid Controller U160 running RAID 10 with 4 x 18 GB SCSI 10K RPM, no other drives involved (system, pgdata and wal are all on the same volume).

Software:
Debian 3.0 Woody
Postgresql 7.4.1 (self-compiled, no special optimizations)
Kernel 2.4.22 + fixes

Database specs:
Size of a gzipped -9 full dump is roughly 1 GB
70-80% selects, 20-30% updates (roughly estimated)
up to 700-800 connections during peak times
kernel.shmall = 805306368
kernel.shmmax = 805306368
max_connections = 900
shared_buffers = 20000
sort_mem = 16384
checkpoint_segments = 6
statistics collector is enabled (for pg_autovacuum)

Loads:
We are experiencing average CPU loads of up to 70% during peak hours. As Paul Tuckfield correctly pointed out, my vmstat output didn't support this. That output was not taken during peak times; it was freshly grabbed when I wrote my initial mail and resembles perhaps 50-60% of peak time load (30% CPU usage). iostat does not give results about disk usage, I don't know exactly why; the blk_read/wrtn columns are just empty (perhaps due to the Mylex rd driver, I don't know).

-------------------------------------------------------------

Suggestions and solutions given:

Anjan Dave reported that he is pretty confident with his quad Xeon setups, which will cost less than $20K at Dell with a reasonable hardware setup (Dell 6650 with 2.0GHz/1MB cache/8GB memory, 5 internal drives (4 in RAID 10, 1 spare) on U320, 128MB cache on the PERC controller).

Scott Marlowe pointed out that one should consider more than 4 drives (6 to 8; 10K rpm is enough, 15K is a rip-off) for a RAID 10 setup, because that can boost performance quite a lot. One should also be using a battery-backed RAID controller. Scott has good experiences with the LSI Megaraid single channel controller, which is reasonably priced at ~$500. He also stated that 20-30% writes on a database is quite a lot.

Next, Rob Sell told us about his research on more-than-2-way Intel based systems. The memory bandwidth on the Xeon platform is always shared between the CPUs. While a 2-way Xeon may perform quite well, a 4-way system will suffer due to the reduced memory bandwidth available for each processor.

J. Andrew Roberts supports this. He said that 4-way Opteron systems scale much better than a 4-way Xeon system; scaling limits begin at 6-8 CPUs on the Opteron platform. He also says that a fully equipped dual channel LSI Megaraid 320 with 256MB cache RAM will be less than $1K. A complete 4-way Opteron system will be at $10K-$12K.

Paul Tuckfield then gave the suggestion to bump up my shared_buffers. With a 3GB memory system, I could happily be using 1GB for shared buffers (125000). This was questioned by Andrew McMillian, Manfred Koizar and Halford Dace, who say that common tuning advice limits reasonable settings to 10000-20000 shared buffers, because the OS is better at caching than the database.

-------------------------------------------------------------

Conclusion:

After having read some comparisons between n-way Xeon and Opteron systems:

http://www.anandtech.com/IT/showdoc.html?i=1982
http://www.aceshardware.com/read.jsp?id=60000275

I was given the impression that an Opteron system is the way to go.
This is what I am considering the ultimate platform for postgresql:

Hardware:
Tyan Thunder K8QS board
2-4 x Opteron 848 in NUMA mode
4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
LSI Megaraid 320-2 with 256 MB cache RAM and battery backup
6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both channels (3 + 4) for pgdata including indexes and wal.
2 x 80 GB S-ATA IDE for system, running Linux software RAID 1 or the available onboard hardware RAID (perhaps also 2 x 36 GB SCSI)

Software:
Debian Woody in amd64 biarch mode, or perhaps Redhat/SuSE Enterprise 64bit distributions.
Kernel 2.6
Postgres 7.4.2 in 64bit mode
shared_buffers = 20000
a bumped-up effective_cache_size

Now the only problem left (besides my budget) is the availability of such a system.

I have found some vendors which ship similar systems, so I will have to talk to them about my dream configuration. I will not self-build this system, there are too many obstacles.

I expect this system to come in at about 12-15K Euro. Very optimistic, I know :)

These are the vendors I found up to now:

http://www.appro.com/product/server_4144h.asp
http://www.appro.com/product/server_4145h.asp
http://www.pyramid.de/d/builttosuit/server/4opteron.shtml
http://www.rainbow-it.co.uk/productslist.aspx?CategoryID=4&selection=2
http://www.quadopteron.com/

They all seem to sell more or less the same system. I also found some other vendors which build systems on Celestica or AMD boards, but they are way too expensive.

Buying such a machine is worth some good thought. If budget is a limit and such a machine might not be maxed out during the next few months, it would make more sense to go for a slightly slower system and an upgrade when more power is needed.

Thanks again for all your replies. I hope to have given a somewhat clear summary.

Regards,
Bjoern
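A first-pass postgresql.conf for such a box, extrapolated from the settings above, might look like this (illustrative only; every value should be benchmarked against the real workload):

    max_connections      = 900
    shared_buffers       = 20000      # ~160 MB, per the advice earlier in the thread
    sort_mem             = 16384
    checkpoint_segments  = 16         # more WAL headroom for the write-heavy load
    effective_cache_size = 750000     # 8 kB pages; ~6 GB of expected OS cache on an 8 GB box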
This is something I wish more of us did on the lists. The list archives have solutions and workarounds for every variety of problem but very few summary emails exist. A good example of this practice is in the sun-managers mailing list. The original poster sends a "SUMMARY" reply to the list with the original problem included and all solutions found. Also makes searching the list archives easier. Simply a suggestion for us all including myself. Greg Bjoern Metzdorf wrote: > Hi, > > at first, many thanks for your valuable replies. On my quest for the > ultimate hardware platform I'll try to summarize the things I learned. > > ------------------------------------------------------------- > > This is our current setup: > > Hardware: > Dual Xeon DP 2.4 on a TYAN S2722-533 with HT enabled > 3 GB Ram (2 x 1 GB + 2 x 512 MB) > Mylex Extremeraid Controller U160 running RAID 10 with 4 x 18 GB SCSI > 10K RPM, no other drives involved (system, pgdata and wal are all on the > same volume). > > Software: > Debian 3.0 Woody > Postgresql 7.4.1 (selfcompiled, no special optimizations) > Kernel 2.4.22 + fixes > > Database specs: > Size of a gzipped -9 full dump is roughly 1 gb > 70-80% selects, 20-30% updates (roughly estimated) > up to 700-800 connections during peak times > kernel.shmall = 805306368 > kernel.shmmax = 805306368 > max_connections = 900 > shared_buffers = 20000 > sort_mem = 16384 > checkpoint_segments = 6 > statistics collector is enabled (for pg_autovacuum) > > Loads: > We are experiencing average CPU loads of up to 70% during peak hours. As > Paul Tuckfield correctly pointed out, my vmstat output didn't support > this. This output was not taken during peak times, it was freshly > grabbed when I wrote my initial mail. It resembles perhaps 50-60% peak > time load (30% cpu usage). iostat does not give results about disk > usage, I don't know exactly why, the blk_read/wrtn columns are just > empty. (Perhaps due to the Mylex rd driver, I don't know). > > ------------------------------------------------------------- > > Suggestions and solutions given: > > Anjan Dave reported, that he is pretty confident with his Quad Xeon > setups, which will cost less than $20K at Dell with a reasonable > hardware setup. ( Dell 6650 with 2.0GHz/1MB cache/8GB Memory, 5 internal > drives (4 in RAID 10, 1 spare) on U320, 128MB cache on the PERC controller) > > Scott Marlowe pointed out, that one should consider more than 4 drives > (6 to 8, 10K rpm is enough, 15K is rip-off) for a Raid 10 setup, because > that can boost performance quite a lot. One should also be using a > battery backed raid controller. Scott has good experiences with the LSI > Megaraid single channel controller, which is reasonably priced at ~ > $500. He also stated, that 20-30% writes on a database is quite a lot. > > Next Rob Sell told us about his research on more-than-2-way Intel based > systems. The memory bandwidth on the xeon platform is always shared > between the cpus. While a 2way xeon may perform quite well, a 4way > system will be suffering due to the reduced memory bandwith available > for each processor. > > J. Andrew Roberts supports this. He said that 4way opteron systems scale > much better than a 4way xeon system. Scaling limits begin at 6-8 cpus on > the opteron platform. He also says that a fully equipped dual channel > LSI Megaraid 320 with 256MB cache ram will be less that $1K. A complete > 4way opteron system will be at $10K-$12K. > > Paul Tuckfield then gave the suggestion to bump up my shared_buffers.
> With a 3GB memory system, I could happily be using 1GB for shared > buffers (125000). This was questioned by Andrew McMillian, Manfred > Kolzar and Halford Dace, who say that common tuning advices limit > reasonable settings to 10000-20000 shared buffers, because the OS is > better at caching than the database. > > ------------------------------------------------------------- > > Conclusion: > > After having read some comparisons between n-way xeon and opteron systems: > > http://www.anandtech.com/IT/showdoc.html?i=1982 > http://www.aceshardware.com/read.jsp?id=60000275 > > I was given the impression, that an opteron system is the way to go. > > This is what I am considering the ultimate platform for postgresql: > > Hardware: > Tyan Thunder K8QS board > 2-4 x Opteron 848 in NUMA mode > 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor) > LSI Megaraid 320-2 with 256 MB cache ram and battery backup > 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both > channels (3 + 4) for pgdata including indexes and wal. > 2 x 80 GB S-ATA IDE for system, running linux software raid 1 or > available onboard hardware raid (perhaps also 2 x 36 GB SCSI) > > Software: > Debian Woody in amd64 biarch mode, or perhaps Redhat/SuSE Enterprise > 64bit distributions. > Kernel 2.6 > Postgres 7.4.2 in 64bit mode > shared_buffers = 20000 > a bumbed up effective_cache_size > > Now the only problem left (besides my budget) is the availability of > such a system. > > I have found some vendors which ship similar systems, so I will have to > talk to them about my dream configuration. I will not self build this > system, there are too many obstacles. > > I expect this system to come out on about 12-15K Euro. Very optimistic, > I know :) > > These are the vendors I found up to now: > > http://www.appro.com/product/server_4144h.asp > http://www.appro.com/product/server_4145h.asp > http://www.pyramid.de/d/builttosuit/server/4opteron.shtml > http://www.rainbow-it.co.uk/productslist.aspx?CategoryID=4&selection=2 > http://www.quadopteron.com/ > > They all seem to sell more or less the same system. I found also some > other vendors which built systems on celestica or amd boards, but they > are way too expensive. > > Buying such a machine is worth some good thoughts. If budget is a limit > and such a machine might not be maxed out during the next few months, it > would make more sense to go for a slightly slower system and an upgrade > when more power is needed. > > Thanks again for all your replies. I hope to have given a somehow clear > summary. > > Regards, > Bjoern -- Greg Spiegelberg Product Development Manager Cranel, Incorporated. Phone: 614.318.4314 Fax: 614.431.8388 Email: gspiegelberg@cranel.com Technology. Integrity. Focus. V-Solve!
Bjoern Metzdorf wrote: > Hi, > > at first, many thanks for your valuable replies. On my quest for the > ultimate hardware platform I'll try to summarize the things I learned. > > ------------------------------------------------------------- > This is what I am considering the ultimate platform for postgresql: > > Hardware: > Tyan Thunder K8QS board > 2-4 x Opteron 848 in NUMA mode > 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor) > LSI Megaraid 320-2 with 256 MB cache ram and battery backup > 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both > channels (3 + 4) for pgdata including indexes and wal. You might also consider configuring the Postgres data drives for a RAID 10 SAME configuration as described in the Oracle paper "Optimal Storage Configuration Made Easy" (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has anyone delved into this before? -- James Thornton ______________________________________________________ Internet Business Consultant, http://jamesthornton.com
James Thornton wrote: >> This is what I am considering the ultimate platform for postgresql: >> >> Hardware: >> Tyan Thunder K8QS board >> 2-4 x Opteron 848 in NUMA mode >> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor) >> LSI Megaraid 320-2 with 256 MB cache ram and battery backup >> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both >> channels (3 + 4) for pgdata including indexes and wal. > > You might also consider configuring the Postgres data drives for a RAID > 10 SAME configuration as described in the Oracle paper "Optimal Storage > Configuration Made Easy" > (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has > anyone delved into this before?

Ok, if I understand it correctly, the paper recommends the following:

1. Get many drives and stripe them into a RAID0 with a stripe width of 1MB. I am not quite sure whether this stripe width is to be controlled at the application level (does postgres support this?) or whether e.g. the "chunk size" of the Linux software RAID driver is meant. Normally a chunk size of 4KB is recommended, so 1MB sounds fairly large.

2. Mirror your RAID0 and get a RAID10.

3. Use primarily the fast, outer regions of your disks. In practice this might be achieved by putting only half of the disk (the outer half) into your stripe set, e.g. put only the outer 18GB of your 36GB disks into the stripe set. Btw, is it common for all drives that the outer region is at the higher block numbers? Or is it sometimes at the lower block numbers?

4. Subset data by partition, not disk. If you have 8 disks, then don't take a 4 disk RAID10 for data and the other one for log or indexes; instead make a global 8 drive RAID10 and partition it so that data and log + indexes are located on all drives. Interestingly, and quite contrary to what is normally recommended, they say it is as good as or better to have one big stripe set over all available disks than to put log + indexes on a separate stripe set. Having one big stripe set means that the speed of this big stripe set is available to all data. In practice this setup is as fast as or even faster than the "old" approach.

----------------------------------------------------------------

Bottom line for a normal, less than 10 disk setup:

Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it to the other 4 disks for a RAID10. Make sure to create the RAID on the outer half of the disks (the setup may depend on the disk model and RAID controller used), leaving the inner half empty. Use a logical volume manager (LVM), which always helps when adding disk space, and create 2 partitions on your RAID10: one for data and one for log + indexes. This should look like this:

 -----   -----   -----   -----
 | 1 |   | 1 |   | 1 |   | 1 |
 -----   -----   -----   -----   <- outer, faster half of the disk,
 | 2 |   | 2 |   | 2 |   | 2 |      part of the RAID10
 -----   -----   -----   -----
 |   |   |   |   |   |   |   |
 |   |   |   |   |   |   |   |   <- inner, slower half of the disk,
 |   |   |   |   |   |   |   |      not used at all
 -----   -----   -----   -----

Partition 1 for data, partition 2 for log + indexes. All mirrored to the other 4 disks, not shown.

If you take 36GB disks, this should end up like this:

RAID10 has a size of 36 / 2 * 4 = 72GB
Partition 1 is 36 GB
Partition 2 is 36 GB

If 36GB is not enough for your pgdata set, you might consider moving to 72GB disks, or (even better) make a 16 drive RAID10 out of 36GB disks; both will end up with 72GB for your data (but the 16 drive version will be faster).

Any comments?

Regards,
Bjoern
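A hedged sketch of how the two partitions could be carved out with LVM once the RAID 10 device exists (all device and volume names here are made up; /dev/md0 stands for the array built from the outer halves):

    pvcreate /dev/md0                     # the RAID 10 built from the outer halves
    vgcreate pgvg /dev/md0
    lvcreate -L 36G -n data pgvg          # partition 1: table data
    lvcreate -L 36G -n logidx pgvg        # partition 2: WAL + indexes
    # filesystems then go on /dev/pgvg/data and /dev/pgvg/logidx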
Bjoern Metzdorf wrote: >> You might also consider configuring the Postgres data drives for a >> RAID 10 SAME configuration as described in the Oracle paper "Optimal >> Storage Configuration Made Easy" >> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has >> anyone delved into this before? > > Ok, if I understand it correctly the papers recommends the following: > > 1. Get many drives and stripe them into a RAID0 with a stripe width of > 1MB. I am not quite sure if this stripe width is to be controlled at the > application level (does postgres support this?) or if e.g. the "chunk > size" of the linux software driver is meant. Normally a chunk size of > 4KB is recommended, so 1MB sounds fairly large. > > 2. Mirror your RAID0 and get a RAID10. Don't use RAID 0+1 -- use RAID 1+0 instead. Performance is the same, but if a disk fails in a RAID 0+1 configuration, you are left with a RAID 0 array. In a RAID 1+0 configuration, multiple disks can fail. A few weeks ago I called LSI asking about the Dell PERC4-Di card, which is actually an LSI Megaraid 320-2. Dell's documentation said that its support for RAID 10 was in the form of RAID-1 concatenated, but LSI said that this is incorrect and that it supports RAID 10 proper. > 3. Use primarily the fast, outer regions of your disks. In practice this > might be achieved by putting only half of the disk (the outer half) into > your stripe set. E.g. put only the outer 18GB of your 36GB disks into > the stripe set. You can still use the inner-half of the drives, just relegate it to less-frequently accessed data. You also need to consider the filesystem. SGI and IBM did a detailed study on Linux filesystem performance, which included XFS, ext2, ext3 (various modes), ReiserFS, and JFS, and the results are presented in a paper entitled "Filesystem Performance and Scalability in Linux 2.4.17" (http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf). The scaling and load are key factors when selecting a filesystem. Since Postgres data is stored in large files, ReiserFS is not the ideal choice since it has been optimized for small files. XFS is probably the best choice for a database server running on a quad processor box. However, Dr. Bert Scalzo of Quest argues that general file system benchmarks aren't ideal for benchmarking a filesystem for a database server. In a paper entitled "Tuning an Oracle8i Database running Linux" (http://otn.oracle.com/oramag/webcolumns/2002/techarticles/scalzo_linux02.html), he says, "The trouble with these tests-for example, Bonnie, Bonnie++, Dbench, Iobench, Iozone, Mongo, and Postmark-is that they are basic file system throughput tests, so their results generally do not pertain in any meaningful fashion to the way relational database systems access data files." Instead he suggests using these two well-known and widely accepted database benchmarks: * AS3AP: a scalable, portable ANSI SQL relational database benchmark that provides a comprehensive set of tests of database-processing power; has built-in scalability and portability for testing a broad range of systems; minimizes human effort in implementing and running benchmark tests; and provides a uniform, metric, straightforward interpretation of the results. * TPC-C: an online transaction processing (OLTP) benchmark that involves a mix of five concurrent transactions of various types and either executes completely online or queries for deferred execution. The database comprises nine types of tables, having a wide range of record and population sizes. 
This benchmark measures the number of transactions per second. In the paper, Scalzo benchmarks ext2, ext3, ReiserFS, and JFS, but not XFS. Surprisingly ext3 won, but Scalzo didn't address scaling/load. The results are surprising because most think ext3 is just ext2 with journaling, thus having extra overhead from journaling. If you read papers on ext3, you'll discover that it has some optimizations that reduce disk head movement. For example, Daniel Robbins' "Advanced filesystem implementor's guide, Part 7: Introducing ext3" (http://www-106.ibm.com/developerworks/library/l-fs7/) says:

"The approach that the [ext3 Journaling Block Device layer API] uses is called physical journaling, which means that the JBD uses complete physical blocks as the underlying currency for implementing the journal...the use of full blocks allows ext3 to perform some additional optimizations, such as "squishing" multiple pending IO operations within a single block into the same in-memory data structure. This, in turn, allows ext3 to write these multiple changes to disk in a single write operation, rather than many. In addition, because the literal block data is stored in memory, little or no massaging of the in-memory data is required before writing it to disk, greatly reducing CPU overhead."

I suspect that fewer writes may be the key factor in ext3 winning Scalzo's DB benchmark. But as I said, Scalzo didn't benchmark XFS and he didn't address scaling. XFS has a feature called delayed allocation that reduces IO (http://www-106.ibm.com/developerworks/library/l-fs9/), and it scales much better than ext3, so while I haven't tested it, I suspect that it may be the ideal choice for large Linux DB servers:

"XFS handles allocation by breaking it into a two-step process. First, when XFS receives new data to be written, it records the pending transaction in RAM and simply reserves an appropriate amount of space on the underlying filesystem. However, while XFS reserves space for the new data, it doesn't decide what filesystem blocks will be used to store the data, at least not yet. XFS procrastinates, delaying this decision to the last possible moment, right before this data is actually written to disk. By delaying allocation, XFS gains many opportunities to optimize write performance. When it comes time to write the data to disk, XFS can now allocate free space intelligently, in a way that optimizes filesystem performance. In particular, if a bunch of new data is being appended to a single file, XFS can allocate a single, contiguous region on disk to store this data. If XFS hadn't delayed its allocation decision, it may have unknowingly written the data into multiple non-contiguous chunks, reducing write performance significantly. But, because XFS delayed its allocation decision, it was able to write the data in one fell swoop, improving write performance as well as reducing overall filesystem fragmentation. Delayed allocation also has another performance benefit. In situations where many short-lived temporary files are created, XFS may never need to write these files to disk at all. Since no blocks are ever allocated, there's no need to deallocate any blocks, and the underlying filesystem metadata doesn't even get touched."

For further study, I have compiled a list of Linux filesystem resources at: http://jamesthornton.com/hotlist/linux-filesystems/. -- James Thornton ______________________________________________________ Internet Business Consultant, http://jamesthornton.com
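If XFS were chosen, creating and mounting a Postgres data volume might look like this sketch (the logical volume name continues the hypothetical LVM layout above, and the mount point is just an example):

    mkfs.xfs -f /dev/pgvg/data
    mount -t xfs -o noatime,logbufs=8 /dev/pgvg/data /var/lib/postgres/data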
I see you've got an LSI Megaraid card with oodles of Cache. However, don't underestimate the power of the software RAID implementation that Red Hat Linux comes with.
We're using RHE 2.1 and I can recommend Red Hat Enterprise Linux if you want an excellent implementation of software RAID. In fact we have found the software implementation more flexible than that of some expensive hardware controllers. In addition there are also tools to enhance the base implementation even further, making setup and maintenance even easier. An advantage of the software implementation is being able to RAID by partition, not necessarily entire disks.
To answer question 1, if you use software RAID the chunk size is part of the /etc/raidtab file that is used on initial container creation. 4KB is the standard, and a LARGE chunk size of 1MB may affect performance if you're not continuously writing blocks of that size. If you make it too big and you're constantly needing to write out smaller chunks of information, then you will find the disk "always" working, and it would be an inefficient use of the blocks. There is some free info around about calculating the ideal chunk size. Look for "Calculating chunk size for RAID" on Google.
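For illustration, a striped device described in /etc/raidtab with an explicit chunk-size would look roughly like this (device names and the 64 kB value are examples, not recommendations):

    raiddev /dev/md0
        raid-level            0
        nr-raid-disks         2
        persistent-superblock 1
        chunk-size            64          # in kB; benchmark 32/64/128 rather than guessing
        device                /dev/sda6
        raid-disk             0
        device                /dev/sdb6
        raid-disk             1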
In the software implementation, after setup the raidtab is unnecessary, as the superblocks of the disks now contain their relevant information.
As for the application knowing any of this, no, the application layers are entirely unaware of the lower implementation. They simply function as normal by writing to directories that are now mounted a different way. The kernel takes care of the underlying RAID writes and syncs.
3 is easy to implement with software RAID under Linux. You simply partition the drive like normal, mark the partitions you want to "raid" as 'fd' ('Linux raid autodetect'), then configure /etc/raidtab and do a mkraid /dev/mdXX, where mdXX is the matching md device for the RAID setup. You can map them any way you want, but it can get confusing if you're mapping /dev/sda6 > /dev/sdb8 and calling it /dev/md7.
We've found it easier to make them all line up, /dev/sda6 > /dev/sdb6 > /dev/md6
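Putting those steps together (illustrative, for a device already described in /etc/raidtab):

    # partitions already marked type 'fd' (Linux raid autodetect) with fdisk
    mkraid /dev/md6                 # builds the array described in raidtab
    cat /proc/mdstat                # shows array state and the initial resync progress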
FYI, if you want better performance, use 15K SCSI disks, and make sure you've got more than 8MB of cache per disk. Also, you're correct in splitting the drives across the channel, that's a trap for young players ;-)
Bjoern is right to recommend an LVM; it will allow you to dynamically allocate new space to the RAID volume when you add more disks. However, I've no experience implementing LVM on top of software RAID under Linux, though I believe it can be done.
The software RAID implementation allows you to stop and start software RAID devices as desired, add new hot spare disks to the containers as needed and rebuild containers on the fly. You can even change kernel options to speed up or slow down the sync speed when rebuilding the container.
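The resync speed knobs referred to are the md proc entries, for example (values in kB/s, purely illustrative):

    cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
    echo 100000 > /proc/sys/dev/raid/speed_limit_max   # let an off-peak rebuild run faster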
Anyway, have fun, cause striping is the hot rod of the RAID implementations ;-)
Regards.
Hadley
On Fri, 2004-05-14 at 09:53, Bjoern Metzdorf wrote:
We're using RHE 2.1 and I can recommend Red Hat Enterprise Linux if you want an excellent implementation of software RAID. In fact we have found the software implementation more flexible than that of some expensive hardware controllers. In addition there are also tools to enhance the base implementation even further, making setup and maintenance even easier. An advantage of the software implementation is being able to RAID by partition, not necessarily entire disks.
To answer question 1, if you use software raid the chunk size is part of the /etc/raidtab file that is used on initial container creation. 4KB is the standard and a LARGE chunk size of 1MB may affect performance if you're not writing down to blocks in that size continuously. If you make it to big and you're constantly needing to write out smaller chunks of information, then you will find the disk "always" working and would be an inefficient use of the blocks. There is some free info around about calculating the ideal chunk size. Looking for "Calculating chunk size for RAID" through google.
In the software implementation, after setup the raidtab is uncessary as the superblocks of the disks now contain their relevant information.
As for the application knowing any of this, no, the application layers are entirely unaware of the lower implementation. They simply function as normal by writing to directories that are now mounted a different way. The kernel takes care of the underlying RAID writes and syncs.
3 is easy to implement with software raid under linux. You simply partition the drive like normal, mark the partitions you want to "raid" as 'fd' 'linux raid autodetect', then configure the /etc/raidtab and do a mkraid /dev/mdxx where mdxx is the matching partition for the raid setup. You can map them anyway you want, but it can get confusing if you're mapping /dev/sda6 > /dev/sdb8 and calling it /dev/md7.
We've found it easier to make them all line up, /dev/sda6 > /dev/sdb6 > /dev/md6
FYI, if you want better performance, use 15K SCSI disks, and make sure you've got more than 8MB of cache per disk. Also, you're correct in splitting the drives across the channel, that's a trap for young players ;-)
Bjoern is right to recommend an LVM, it will allow you to dynamically allocate new size to the RAID volume when you add more disks. However I've no experience in implementation with an LVM under the software RAID for Linux, though I believe it can be done.
The software RAID implementation allows you to stop and start software RAID devices as desired, add new hot spare disks to the containers as needed and rebuild containers on the fly. You can even change kernel options to speed up or slow down the sync speed when rebuilding the container.
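For instance (example device names; these are the 2.4-era raidtools commands and /proc knobs):

raidstop /dev/md6                       # stop an array
raidstart /dev/md6                      # start it again
raidhotadd /dev/md6 /dev/sdc6           # add a disk as a hot spare / rebuild target

# throttle or unleash the resync speed (values in KB/s)
echo 1000   > /proc/sys/dev/raid/speed_limit_min
echo 100000 > /proc/sys/dev/raid/speed_limit_max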
Anyway, have fun, cause striping is the hot rod of the RAID implementations ;-)
Regards.
Hadley
On Fri, 2004-05-14 at 09:53, Bjoern Metzdorf wrote:
James Thornton wrote:

>> This is what I am considering the ultimate platform for postgresql:
>>
>> Hardware:
>> Tyan Thunder K8QS board
>> 2-4 x Opteron 848 in NUMA mode
>> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
>> LSI Megaraid 320-2 with 256 MB cache ram and battery backup
>> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both
>> channels (3 + 4) for pgdata including indexes and wal.
>
> You might also consider configuring the Postgres data drives for a RAID
> 10 SAME configuration as described in the Oracle paper "Optimal Storage
> Configuration Made Easy"
> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has
> anyone delved into this before?

Ok, if I understand it correctly, the paper recommends the following:

1. Get many drives and stripe them into a RAID 0 with a stripe width of
1 MB. I am not quite sure whether this stripe width is meant to be
controlled at the application level (does Postgres support this?) or
whether it refers to e.g. the "chunk size" of the Linux software RAID
driver. Normally a chunk size of 4 KB is recommended, so 1 MB sounds
fairly large.

2. Mirror your RAID 0 and get a RAID 10.

3. Use primarily the fast, outer regions of your disks. In practice this
might be achieved by putting only the outer half of each disk into your
stripe set, e.g. only the outer 18 GB of your 36 GB disks. Btw, is it
common for all drives that the outer region is at the higher block
numbers? Or is it sometimes at the lower block numbers?

4. Subset data by partition, not by disk. If you have 8 disks, don't
build a 4-disk RAID 10 for data and another one for log or indexes;
instead make a global 8-drive RAID 10 and partition it so that data and
log + indexes are spread over all drives.

Interestingly, and quite contrary to what is normally recommended, they
say it is as good or better to have one big stripe set over all
available disks than to put log + indexes on a separate stripe set.
Having one big stripe set means that its full speed is available to all
data. In practice this setup is as fast as or even faster than the "old"
approach.

----------------------------------------------------------------

Bottom line for a normal, less-than-10-disk setup:

Get many disks (8 + spare), create a RAID 0 with 4 disks and mirror it
to the other 4 disks for a RAID 10. Make sure to create the RAID on the
outer half of the disks (the setup may depend on the disk model and RAID
controller used), leaving the inner half empty. Use a logical volume
manager (LVM), which always helps when adding disk space, and create 2
partitions on your RAID 10: one for data and one for log + indexes. This
should look like this:

----- ----- ----- -----
| 1 | | 1 | | 1 | | 1 |
----- ----- ----- -----   <- outer, faster half of the disk
| 2 | | 2 | | 2 | | 2 |      part of the RAID10
----- ----- ----- -----
|   | |   | |   | |   |
|   | |   | |   | |   |   <- inner, slower half of the disk
|   | |   | |   | |   |      not used at all
----- ----- ----- -----

Partition 1 for data, partition 2 for log + indexes. All mirrored to the
other 4 disks not shown.

If you take 36 GB disks, this ends up like this:

RAID 10 has a size of 36 / 2 * 4 = 72 GB
Partition 1 is 36 GB
Partition 2 is 36 GB

If 36 GB is not enough for your pgdata set, you might consider moving to
72 GB disks, or (even better) make a 16-drive RAID 10 out of 36 GB
disks; both end up with 72 GB for your data (but the 16-drive version
will be faster).

Any comments?
Regards,
Bjoern

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
       http://archives.postgresql.org
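To make the bottom line above concrete, here is a rough sketch of how such a layout could be built with 2.4-era raidtools by striping four mirror pairs (all device names, the outer-half partitions sda1..sdh1 and the 1 MB chunk are assumptions drawn from the discussion, not a tested recipe):

# outer ~18 GB of each 36 GB disk partitioned as sda1..sdh1 (type 'fd'); inner half unused
# /etc/raidtab (abridged): four RAID 1 pairs...
raiddev /dev/md1
    raid-level            1
    nr-raid-disks         2
    persistent-superblock 1
    device                /dev/sda1
    raid-disk             0
    device                /dev/sde1
    raid-disk             1
# ...md2, md3, md4 built the same way from sdb1/sdf1, sdc1/sdg1, sdd1/sdh1...

# ...then one RAID 0 striped across the four mirrors with a 1 MB chunk
raiddev /dev/md10
    raid-level            0
    nr-raid-disks         4
    persistent-superblock 1
    chunk-size            1024        # KB, i.e. the 1 MB stripe the SAME paper suggests
    device                /dev/md1
    raid-disk             0
    device                /dev/md2
    raid-disk             1
    device                /dev/md3
    raid-disk             2
    device                /dev/md4
    raid-disk             3

Run mkraid on /dev/md1 through /dev/md4 first, then on /dev/md10, and carve the data and log + indexes areas (or an LVM) out of /dev/md10.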
Hadley Willan wrote:
> To answer question 1: with software RAID the chunk size is set in
> /etc/raidtab and is used when the array is first created. 4 KB is the
> standard; a large chunk size such as 1 MB may hurt performance if you
> are not consistently writing blocks of that size. If you make it too
> big while constantly writing out smaller pieces of data, you will find
> the disks "always" working, and the blocks will be used inefficiently.
> There is some free information around about calculating the ideal
> chunk size; search for "calculating chunk size for RAID" on Google.

The paper explains the reasoning behind the 1 MB figure:

"Why does the SAME configuration recommend a one megabyte stripe width?
Let's examine the reasoning behind this choice.

Why not use a stripe depth smaller than one megabyte? Smaller stripe
depths can improve disk throughput for a single process by spreading a
single IO across multiple disks. However, IOs that are much smaller than
a megabyte can cause seek time to become a large fraction of the total
IO time. Therefore, the overall efficiency of the storage system is
reduced. In some cases it may be worth trading off some efficiency for
the increased throughput that smaller stripe depths provide. In general
it is not necessary to do this, though. Parallel execution at the
database level achieves high disk throughput while keeping efficiency
high. Also, remember that the degree of parallelism can be dynamically
tuned, whereas the stripe depth is very costly to change.

Why not use a stripe depth bigger than one megabyte? One megabyte is
large enough that a sequential scan will spend most of its time
transferring data instead of positioning the disk head. A bigger stripe
depth will improve scan efficiency, but only modestly. One megabyte is
small enough that a large IO operation will not "hog" a single disk for
very long before moving to the next one. Further, one megabyte is small
enough that Oracle's asynchronous readahead operations access multiple
disks. One megabyte is also small enough that a single stripe unit will
not become a hot spot. Any access hot spot that is smaller than a
megabyte should fit comfortably in the database buffer cache, and
therefore it will not create a hot spot on disk."

The SAME paper also says that, to ensure large IO operations aren't
broken up between the database and the disk, the database's multi-block
read size (Oracle has a parameter called db_file_multiblock_read_count;
does Postgres have an equivalent?) should match the stripe width, and
the OS IO limits should be at least that size.

Also, it says, "Ideally we would like to stripe the log files using the
same one megabyte stripe width as the rest of the files. However, the
log files are written sequentially, and many storage systems limit the
maximum size of a single write operation to one megabyte (or even less).
If the maximum write size is limited, then using a one megabyte stripe
width for the log files may not work well. In this case, a smaller
stripe width such as 64K may work better. Caching RAID controllers are
an exception to this. If the storage subsystem can cache write
operations in nonvolatile RAM, then a one megabyte stripe width will
work well for the log files. In this case, the write operation will be
buffered in cache and the next log writes can be issued before the
previous write is destaged to disk."

--
James Thornton ______________________________________________________
Internet Business Consultant, http://jamesthornton.com
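To put rough numbers on the paper's seek-versus-transfer argument (the figures below are assumptions for a 2004-era 10K RPM SCSI disk, roughly 8 ms of combined seek and rotational latency and about 50 MB/s of sequential transfer, not measurements): a 64 KB IO transfers in about 1.3 ms, so roughly 85% of each IO is spent positioning the head, while a 1 MB IO transfers in about 20 ms, so positioning drops to under 30% of the IO time. That is the efficiency gap the paper is describing.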
One big caveat regarding the "SAME" striping strategy: readahead can
really hurt an OLTP workload. Mind you, if you're going from a few disks
to a caching array with many disks, it will be hard not to see a big
improvement. But if you push the envelope of the array with a "SAME"
configuration, readahead will hurt.

Readahead is good for sequential reads but bad for random reads, because
the various caches (array and filesystem) get flooded with all the
blocks that happen to come after whatever random blocks you're reading.
Because the reads are random, these extra blocks are generally *not*
read by subsequent queries if the database is much larger than the cache
itself. Of course, the readahead blocks are useful if you're doing
sequential scans, but you're not doing sequential scans because it's an
OLTP database, right?

So this will probably incite flames, but: in an OLTP environment of
decent size, readahead is bad. Ideally you would adjust it dynamically
until you find the optimum (likely no readahead at all), if the array
allows it. Most people are fooled by the good performance of readahead
in simple single-threaded or small-dataset tests and then get bitten by
it under concurrent loads or large datasets. (A small sketch of tuning
kernel-level readahead follows the quoted message below.)

> James Thornton wrote:
>
>>> This is what I am considering the ultimate platform for postgresql:
>>>
>>> Hardware:
>>> Tyan Thunder K8QS board
>>> 2-4 x Opteron 848 in NUMA mode
>>> 4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
>>> LSI Megaraid 320-2 with 256 MB cache ram and battery backup
>>> 6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over
>>> both channels (3 + 4) for pgdata including indexes and wal.
>>
>> You might also consider configuring the Postgres data drives for a
>> RAID 10 SAME configuration as described in the Oracle paper "Optimal
>> Storage Configuration Made Easy"
>> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has
>> anyone delved into this before?
>
> Ok, if I understand it correctly the papers recommends the following:
>
> 1. Get many drives and stripe them into a RAID0 with a stripe width of
> 1MB. I am not quite sure if this stripe width is to be controlled at
> the application level (does postgres support this?) or if e.g. the
> "chunk size" of the linux software driver is meant. Normally a chunk
> size of 4KB is recommended, so 1MB sounds fairly large.
>
> 2. Mirror your RAID0 and get a RAID10.
>
> 3. Use primarily the fast, outer regions of your disks. In practice
> this might be achieved by putting only half of the disk (the outer
> half) into your stripe set. E.g. put only the outer 18GB of your 36GB
> disks into the stripe set. Btw, is it common for all drives that the
> outer region is on the higher block numbers? Or is it sometimes on the
> lower block numbers?
>
> 4. Subset data by partition, not disk. If you have 8 disks, then don't
> take a 4 disk RAID10 for data and the other one for log or indexes,
> but make a global 8 drive RAID10 and have it partitioned the way that
> data and log + indexes are located on all drives.
>
> They say, which is very interesting, as it is really contrary to what
> is normally recommended, that it is good or better to have one big
> stripe set over all disks available, than to put log + indexes on a
> separated stripe set. Having one big stripe set means that the speed
> of this big stripe set is available to all data. In practice this
> setup is as fast as or even faster than the "old" approach.
>
> ----------------------------------------------------------------
>
> Bottom line for a normal, less than 10 disk setup:
>
> Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it
> to the other 4 disks for a RAID10. Make sure to create the RAID on the
> outer half of the disks (setup may depend on the disk model and raid
> controller used), leaving the inner half empty.
> Use a logical volume manager (LVM), which always helps when adding
> disk space, and create 2 partitions on your RAID10. One for data and
> one for log + indexes. This should look like this:
>
> ----- ----- ----- -----
> | 1 | | 1 | | 1 | | 1 |
> ----- ----- ----- -----   <- outer, faster half of the disk
> | 2 | | 2 | | 2 | | 2 |      part of the RAID10
> ----- ----- ----- -----
> |   | |   | |   | |   |
> |   | |   | |   | |   |   <- inner, slower half of the disk
> |   | |   | |   | |   |      not used at all
> ----- ----- ----- -----
>
> Partition 1 for data, partition 2 for log + indexes. All mirrored to
> the other 4 disks not shown.
>
> If you take 36GB disks, this should end up like this:
>
> RAID10 has size of 36 / 2 * 4 = 72GB
> Partition 1 is 36 GB
> Partition 2 is 36 GB
>
> If 36GB is not enough for your pgdata set, you might consider moving
> to 72GB disks, or (even better) make a 16 drive RAID10 out of 36GB
> disks, which both will end up in a size of 72GB for your data (but the
> 16 drive version will be faster).
>
> Any comments?
>
> Regards,
> Bjoern
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
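A minimal sketch of inspecting and reducing kernel-level readahead on Linux, as mentioned above (the device name is only an example, and readahead inside a caching array has to be tuned with the controller vendor's own tools):

blockdev --getra /dev/sda      # current readahead, in 512-byte sectors
blockdev --setra 0 /dev/sda    # try little or no readahead for random OLTP IO
# re-run the workload (and vmstat/iostat) after each change before settling on a value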
I would recommend trying out several stripe sizes and making your own
measurements. A while ago I was involved in building a data warehouse
system (Oracle, DB2), and after several file and database benchmark
exercises we settled on 256K stripes, as these gave the best overall
performance for both systems.

I am not saying "1M is wrong", but I am saying "1M may not be right" :-)
(A rough way to run such a comparison is sketched after the quote
below.)

regards

Mark

Bjoern Metzdorf wrote:
>
> 1. Get many drives and stripe them into a RAID0 with a stripe width of
> 1MB. I am not quite sure if this stripe width is to be controlled at
> the application level (does postgres support this?) or if e.g. the
> "chunk size" of the linux software driver is meant. Normally a chunk
> size of 4KB is recommended, so 1MB sounds fairly large.
>
>
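For what it's worth, a throwaway sketch of such a comparison using only dd and the old raidtools (it assumes /dev/md0 is a scratch test array, the only one described in /etc/raidtab, built on partitions whose data you can destroy; flag spellings may vary between raidtools versions, and a real benchmark would use bonnie++ or your own query mix rather than a streaming dd):

#!/bin/sh
# Rebuild a scratch RAID 0 with different chunk sizes and time a large streaming write.
for chunk in 4 64 256 1024; do                      # chunk sizes in KB
    sed -i "s/^[ \t]*chunk-size.*/    chunk-size $chunk/" /etc/raidtab
    mkraid --really-force /dev/md0                  # destroys data on the test partitions!
    mke2fs -q /dev/md0
    mount /dev/md0 /mnt/test
    echo "chunk size ${chunk}K:"
    time dd if=/dev/zero of=/mnt/test/bigfile bs=8k count=250000   # ~2 GB of 8 KB writes
    umount /mnt/test
    raidstop /dev/md0
done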