Thread: Re: Hardware/OS recommendations for large databases ( 5TB)

Re: Hardware/OS recommendations for large databases ( 5TB)

From
"Luke Lonergan"
Date:
Adam,

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Claus Guttesen
> Sent: Tuesday, November 15, 2005 12:29 AM
> To: Adam Weisberg
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Hardware/OS recommendations for large
> databases ( 5TB)
>
> > Does anyone have recommendations for hardware and/or OS to
> work with
> > around 5TB datasets?
>
> Hardware-wise I'd say dual-core Opterons. One
> dual-core Opteron performs better than two single-cores at the
> same speed. Tyan makes some boards that have four sockets,
> thereby giving you 8 CPUs (if you need that many). Sun and
> HP also make nice hardware, although the Tyan board is more
> competitively priced.
>
> OS-wise I would choose the FreeBSD amd64 port, but
> partitions larger than 2 TB need some special care, using
> gpt rather than disklabel etc., and tools like fsck may not be
> able to completely check partitions larger than 2 TB. Linux
> or Solaris with either LVM or Veritas FS sound like candidates.

I agree - you can get a very good one from www.acmemicro.com or
www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
performance on these (with tuning) on Linux using the xfs filesystem,
which is one of the most critical factors for large databases.

Note that you want to have your DBMS use all of the CPU and disk channel
bandwidth you have on each query, which takes a parallel database like
Bizgres MPP to achieve.

Regards,

- Luke


Re: Hardware/OS recommendations for large databases (

From
Dave Cramer
Date:
Luke,

Have you tried the Areca cards? They are slightly faster yet.

Dave

Re: Hardware/OS recommendations for large databases (

From
Alex Turner
Date:
On 11/15/05, Luke Lonergan <LLonergan@greenplum.com> wrote:
> Adam,
>
> I agree - you can get a very good one from www.acmemicro.com or
> www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
> RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
> on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
> performance on these (with tuning) on Linux using the xfs filesystem,
> which is one of the most critical factors for large databases.
>

Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
for most applications this system will be IO bound, and you will see a
nice lot of drive failures in the first year of operation with
consumer grade drives.

Spend your money on better Disks, and don't bother with Dual Core IMHO
unless you can prove the need for it.

Alex

Re: Hardware/OS recommendations for large databases (

From
Alex Turner
Date:
Not at random access in RAID 10 they aren't, and anyone with their
head screwed on right is using RAID 10.  The 9500S will still beat the
Areca cards at RAID 10 database access pattern.

Alex.

On 11/15/05, Dave Cramer <pg@fastcrypt.com> wrote:
> Luke,
>
> Have you tried the areca cards, they are slightly faster yet.
>
> Dave

Re: Hardware/OS recommendations for large databases (

From
William Yu
Date:
Alex Turner wrote:
> Not at random access in RAID 10 they aren't, and anyone with their
> head screwed on right is using RAID 10.  The 9500S will still beat the
> Areca cards at RAID 10 database access pattern.

The max 256MB onboard for 3ware cards is disappointing though. While
good enough for 95% of cases, there's that 5% that could use a gig or
two of onboard RAM for ultrafast updates. For example, I'm specing out
an upgrade to our current data processing server. Instead of the
traditional 6xFast-Server-HDs, we're gonna go for broke and do
32xConsumer-HDs. This will give us mega I/O bandwidth but we're
vulnerable to random access since consumer-grade HDs don't have the RPMs
or the queueing smarts. This means we're very dependent on the
controller using onboard RAM to do I/O scheduling. 256MB divided over
4/6/8 drives -- OK. 256MB divided over 32 drives -- ugh, only 8MB
apiece, so the HDs' own buffers are bigger than the RAM allotted to them.

At least this is how it seems it would work from thinking through all
the factors. Unfortunately, I haven't found anybody else who has gone
this route and reported their results so I guess we're the guinea pig.

Re: Hardware/OS recommendations for large databases (

From
"Joshua D. Drake"
Date:
>> I agree - you can get a very good one from www.acmemicro.com or
>> www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
>> RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
>> on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
>> performance on these (with tuning) on Linux using the xfs filesystem,
>> which is one of the most critical factors for large databases.
>>
>>
>
> Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
> for most applications this system will be IO bound, and you will see a
> nice lot of drive failures in the first year of operation with
> consumer grade drives.
>
There is nothing wrong with using SATA disks, and they perform very
well. The catch is: make sure you have a battery backup on the RAID
controller.

> Spend your money on better Disks, and don't bother with Dual Core IMHO
> unless you can prove the need for it.
>
The reason you want the dual-core CPUs is that PostgreSQL can only
execute one query per CPU at a time, so the application will see a big
boost in overall transactional velocity if you put two dual-core CPUs
into the machine.
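
To see the effect concretely, here is a minimal sketch -- assuming the psycopg2
driver, a reachable test database, and a throwaway CPU-heavy query, none of
which come from this thread -- that fires the same query from 1, 2 and 4
concurrent connections. Each connection is served by its own backend process,
so on a box with enough cores the total wall-clock time should stay roughly
flat as the client count grows:

# Hypothetical sketch: each connection gets its own PostgreSQL backend process,
# so independent queries can run on separate cores at the same time.
# The DSN and the toy query are assumptions; adjust them for your setup.
import threading
import time
import psycopg2

DSN = "dbname=test user=postgres"
QUERY = "SELECT count(*) FROM generate_series(1, 1000000)"  # CPU-heavy toy query

def worker():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute(QUERY)
    cur.fetchone()
    conn.close()

for clients in (1, 2, 4):
    start = time.time()
    threads = [threading.Thread(target=worker) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("%d clients: %.2fs" % (clients, time.time() - start))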


Joshua D. Drake




Re: Hardware/OS recommendations for large databases

From
Ron
Date:
Got some hard numbers to back your statement up?  IME, the Areca
1160's with  >= 1GB of cache beat any other commodity RAID
controller.  This seems to be in agreement with at least one
independent testing source:

http://print.tweakers.net/?reviews/557

RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any
commodity HW solution, but their price point is considerably higher.

...on another note, I completely agree with the poster who says we
need more cache on RAID controllers.  We should all be beating on the
RAID HW manufacturers to use standard DIMMs for their caches and to
provide 2 standard DIMM slots in their full height cards (allowing
for up to 8GB of cache using 2 4GB DIMMs as of this writing).

It should also be noted that 64-drive chassis are going to become
possible once 2.5" 10Krpm SATA II and FC HDs become the standard next
year (48s are the TOTL now).  We need controller technology to keep up.

Ron

At 12:16 AM 11/16/2005, Alex Turner wrote:
>Not at random access in RAID 10 they aren't, and anyone with their
>head screwed on right is using RAID 10.  The 9500S will still beat the
>Areca cards at RAID 10 database access pattern.
>
>Alex.




Re: Hardware/OS recommendations for large databases (

From
Alex Stapleton
Date:
On 16 Nov 2005, at 12:51, William Yu wrote:

> The max 256MB onboard for 3ware cards is disappointing though.
> While good enough for 95% of cases, there's that 5% that could use
> a gig or two of onboard ram for ultrafast updates. For example, I'm
> specing out an upgrade to our current data processing server.
> Instead of the traditional 6xFast-Server-HDs, we're gonna go for
> broke and do 32xConsumer-HDs. This will give us mega I/O bandwidth
> but we're vulnerable to random access since consumer-grade HDs
> don't have the RPMs or the queueing-smarts. This means we're very
> dependent on the controller using onboard RAM to do I/O scheduling.
> 256MB divided over 4/6/8 drives -- OK. 256MB divided over 32 drives
> -- ugh, the HD's buffers are bigger than the RAM alotted to it.
>
> At least this is how it seems it would work from thinking through
> all the factors. Unfortunately, I haven't found anybody else who
> has gone this route and reported their results so I guess we're the
> guinea pig.
>

You're going to have to factor the increased failure rate into your
cost measurements, including any downtime or performance degradation
whilst rebuilding parts of your RAID array. It depends on how long
you're planning for this system to be operational as well, of course.

Pick two: Fast, cheap, reliable.

Re: Hardware/OS recommendations for large databases (

From
Steve Wampler
Date:
Joshua D. Drake wrote:
> The reason you want the dual core cpus is that PostgreSQL can only
> execute 1 query per cpu at a time,...

Is that true?  I knew that PG only used one cpu per query, but how
does PG know how many CPUs there are to limit the number of queries?

--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Re: Hardware/OS recommendations for large databases (

From
David Boreham
Date:
Steve Wampler wrote:

>Joshua D. Drake wrote:
>
>
>>The reason you want the dual core cpus is that PostgreSQL can only
>>execute 1 query per cpu at a time,...
>>
>>
>
>Is that true?  I knew that PG only used one cpu per query, but how
>does PG know how many CPUs there are to limit the number of queries?
>
>
>
He means only one query can be executing on each cpu at any particular
instant.



Re: Hardware/OS recommendations for large databases (

From
William Yu
Date:
Alex Stapleton wrote:
> Your going to have to factor in the increased failure rate in your  cost
> measurements, including any downtime or performance degradation  whilst
> rebuilding parts of your RAID array. It depends on how long  your
> planning for this system to be operational as well of course.

If we go 32xRAID10, rebuild time should be the same as rebuild time in a
4xRAID10 system. Only the hard drive that was replaced needs rebuild --
not the entire array.

And yes, definitely need a bunch of drives lying around as spares.

Re: Hardware/OS recommendations for large databases (

From
Steve Wampler
Date:
David Boreham wrote:
> Steve Wampler wrote:
>
>> Joshua D. Drake wrote:
>>
>>
>>> The reason you want the dual core cpus is that PostgreSQL can only
>>> execute 1 query per cpu at a time,...
>>>
>>
>>
>> Is that true?  I knew that PG only used one cpu per query, but how
>> does PG know how many CPUs there are to limit the number of queries?
>>
>>
>>
> He means only one query can be executing on each cpu at any particular
> instant.

Got it - the cpu is only acting on one query in any instant but may be
switching between many 'simultaneous' queries.  PG isn't really involved
in the decision.  That makes sense.

--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Re: Hardware/OS recommendations for large databases (

From
David Boreham
Date:
 >Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
 >for most applications this system will be IO bound, and you will see a
 >nice lot of drive failures in the first year of operation with
 >consumer grade drives.

I guess I've never bought into the vendor story that there are
two reliability grades. Why would they bother making two
different kinds of bearing, motor etc ? Seems like it's more
likely an excuse to justify higher prices. In my experience the
expensive SCSI drives I own break frequently while the cheapo
desktop drives just keep chunking along (modulo certain products
that have a specific known reliability problem).

I'd expect that a larger number of hotter drives will give a less reliable
system than a smaller number of cooler ones.



Re: Hardware/OS recommendations for large databases (

From
William Yu
Date:
Alex Turner wrote:
> Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
> for most applications this system will be IO bound, and you will see a
> nice lot of drive failures in the first year of operation with
> consumer grade drives.
>
> Spend your money on better Disks, and don't bother with Dual Core IMHO
> unless you can prove the need for it.

I would say the opposite -- you always want Dual Core nowadays. DC
Opterons simply give you better bang for the buck than single core
Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will
be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC
versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because
those mega-CPU motherboards are astronomically expensive.

DC also gives you a better upgrade path. Let's say you do testing and
figure 2x246 is the right setup to handle the load. Well instead of
getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
DC/270. Now you have a server that can be upgraded to +80% more CPU by
popping in another DC/270 versus throwing out the entire thing to get a
4x1P setup.

The only questions would be:
(1) Do you need a SMP server at all? I'd claim yes -- you always need 2+
cores whether it's DC or 2P to avoid IO interrupts blocking other
processes from running.

(2) Does a DC system perform better than its Nx1P cousin? My experience
is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP,
etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.

(3) Do you need an insane amount of memory? Well here's the case where
the more expensive motherboard will serve you better since each CPU slot
has its own bank of memory. Spend more money on memory, get cheaper
single-core CPUs.

Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon
DCs, while cheaper than their corresponding single-core SMPs, don't have
the same performance profile as Opteron DCs. Basically, you're paying a
bit extra so your server can generate a ton more heat.

Re: Hardware/OS recommendations for large databases (

From
William Yu
Date:
David Boreham wrote:
>  >Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
>  >for most applications this system will be IO bound, and you will see a
>  >nice lot of drive failures in the first year of operation with
>  >consumer grade drives.
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearing, motor etc ? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).
>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.

Our SCSI drives have failed maybe a little less than our IDE drives.
Hell, some of the SCSIs even came bad when we bought them. Of course,
the IDE drive failure % is inflated by all the IBM Deathstars we got -- ugh.

Basically, I've found it's cooling that's most important. Packing the
drives together into really small rackmounts? Good for your density, not
good for the drives. Now we do larger rackmounts -- drives have more
space in between each other plus fans in front and back of the drives.

Re: Hardware/OS recommendations for large databases (

From
"Joshua D. Drake"
Date:
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearing, motor etc ? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).

I don't know if the reliability grade is true or not, but what I can tell
you is that I have SCSI drives that are 5+ years old and still work without
issue.

I have never had an IDE drive last longer than 3 years (when used in
production).

That being said, so what? That is what RAID is for. You lose a drive
and hot-swap it back in. Heck, keep a hotspare in the trays.

Joshua D. Drake




Re: Hardware/OS recommendations for large databases (

From
"Joshua D. Drake"
Date:
>
> The only questions would be:
> (1) Do you need a SMP server at all? I'd claim yes -- you always need
> 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other
> processes from running.

I would back this up, even for smaller installations (single RAID 1, 1
GB of RAM). Why? Well, because many applications are going to be CPU
bound. For example, we have a PHP application that is a CMS. On a
single-CPU machine with RAID 1 it takes about 300ms to deliver a single
page, point to point. We are not IO bound. So what happens is that under
reasonable load we are actually waiting for the CPU to process the code.

>
> (3) Do you need an insane amount of memory? Well here's the case where
> the more expensive motherboard will serve you better since each CPU
> slot has its own bank of memory. Spend more money on memory, get
> cheaper single-core CPUs.
Agreed. A lot of times the slowest dual-core is 5x what you actually
need. So get the slowest, and bulk up on memory. If nothing else memory
is cheap today and it might not be tomorrow.

> Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon
> DCs, while cheaper than their corresponding single-core SMPs, don't
> have the same performance profile of Opteron DCs. Basically, you're
> paying a bit extra so your server can generate a ton more heat.
>
Well if you are an Intel/Dell shop running PostgreSQL you have bigger
problems ;)

Sincerely,

Joshua D. Drake






Re: Hardware/OS recommendations for large databases (

From
Scott Marlowe
Date:
On Wed, 2005-11-16 at 08:51, David Boreham wrote:
>  >Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
>  >for most applications this system will be IO bound, and you will see a
>  >nice lot of drive failures in the first year of operation with
>  >consumer grade drives.
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearing, motor etc ? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).
>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.

My experience has mirrored this.

Anyone remember back when HP made their SureStore drives?  We built 8
drive RAID arrays to ship to customer sites, pre-filled with data.  Not
a single one arrived fully operational.  The failure rate on those
drives was something like 60% in the first year, and HP quit making hard
drives because of it.

Those were SCSI Server class drives, supposedly built to last 5 years.

OTOH, I remember putting a pair of 60 Gig IDEs into a server that had
lots of ventilation and fans and such, and having no problems
whatsoever.

There was a big commercial EMC style array in the hosting center at the
same place that had something like a 16 wide by 16 tall array of IDE
drives for storing pdf / tiff stuff on it, and we had at least one
failure a month in it.  Of course, that's 256 drives, so you're gonna
have failures, and it was configured with a spare on every other row or
some such.  We just had a big box of hard drives and it was smart enough
to rebuild automagically when you put a new one in, so the maintenance
wasn't really that bad.  The performance was quite impressive too.

Re: Hardware/OS recommendations for large databases (

From
Scott Marlowe
Date:
On Wed, 2005-11-16 at 09:33, William Yu wrote:
>
> I would say the opposite -- you always want Dual Core nowadays. DC
> Opterons simply give you better bang for the buck than single core
> Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will
> be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC
> versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because
> those mega-CPU motherboards are astronomically expensive.

The biggest gain is going from 1 to 2 CPUs (real cpus, like the DC
Opterons or genuine dual CPU mobo, not "hyperthreaded").  Part of the
issue isn't just raw CPU processing power.  The second CPU allows the
machine to be more responsive because it doesn't have to context switch
as much.

While I've seen plenty of single CPU servers start to bog under load
running one big query, the dual CPU machines always seem more than just
twice as snappy under similar loads.

Re: Hardware/OS recommendations for large databases

From
Alex Turner
Date:
Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10,
the 3ware controllers beat the Areca card.

Alex.

On 11/16/05, Ron <rjpeace@earthlink.net> wrote:
> Got some hard numbers to back your statement up?  IME, the Areca
> 1160's with  >= 1GB of cache beat any other commodity RAID
> controller.  This seems to be in agreement with at least one
> independent testing source:
>
> http://print.tweakers.net/?reviews/557

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Scott,

On 11/16/05 9:09 AM, "Scott Marlowe" <smarlowe@g2switchworks.com> wrote:

> The biggest gain is going from 1 to 2 CPUs (real cpus, like the DC
> Opterons or genuine dual CPU mobo, not "hyperthreaded").  Part of the
> issue isn't just raw CPU processing power.  The second CPU allows the
> machine to be more responsive because it doesn't have to context switch
> as much.
>
> While I've seen plenty of single CPU servers start to bog under load
> running one big query, the dual CPU machines always seem more than just
> twice as snappy under similar loads.

I agree, 2 CPUs are better than one in most cases.

The discussion was kicked off by the suggestion to get 8 dual core CPUs to process a large database with postgres.  Say your decision support query takes 15 minutes to run with one CPU.  Add another and it still takes 15 minutes.  Add 15 and the same ...

OLTP is so different from Business intelligence and Decision Support that very little of this thread’s discussion is relevant IMO.

The job is to design a system that can process sequential scan as fast as possible and uses all resources (CPUs, mem, disk channels) on each query.  Sequential scan is 100x more important than random seeks.

Here are the facts so far:
  • Postgres can only use 1 CPU on each query
  • Postgres I/O for sequential scan is CPU limited to 110-120 MB/s on the fastest modern CPUs
  • Postgres disk-based sort speed is 1/10 or more slower than commercial databases and memory doesn’t improve it (much)

These are the conclusions that follow about decision support / BI system architecture for normal Postgres:
  • I/O hardware with more than 110MB/s of read bandwidth is not useful
  • More than 1 CPU is not useful
  • More RAM than a nominal amount for small table caching is not useful

In other words, big SMP doesn’t address the problem at all.  By contrast, having all CPUs on multiple machines, or even on a big SMP with lots of I/O channels, solves all of the above issues.
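
If you want to check that sequential scan number on your own hardware, here is
a rough sketch -- psycopg2, the table name, and pg_relation_size() (in core
from 8.1, contrib/dbsize before that) are all assumptions, not part of this
message -- that times a full scan of a big table and reports the effective
MB/s a single backend achieves:

# Hypothetical sketch: measure single-backend sequential scan throughput.
# For a meaningful number, use a table much larger than RAM (or a cold cache).
import time
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # assumption: adjust DSN
cur = conn.cursor()

cur.execute("SELECT pg_relation_size('big_table')")
table_bytes = cur.fetchone()[0]

start = time.time()
cur.execute("SELECT count(*) FROM big_table")   # forces a full sequential scan
cur.fetchone()
elapsed = time.time() - start

print("scanned %.0f MB in %.1fs -> %.1f MB/s"
      % (table_bytes / 1e6, elapsed, table_bytes / 1e6 / elapsed))
conn.close()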

Regards,

- Luke

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Oops,

Last point should be worded: “All CPUs on all machines used by a parallel database”

- Luke



Re: Hardware/OS recommendations for large databases (

From
Scott Marlowe
Date:
On Wed, 2005-11-16 at 11:47, Luke Lonergan wrote:
> Scott,

Some cutting for clarity...  I agree on the OLTP versus OLAP
discussion.

> Here are the facts so far:
>       * Postgres can only use 1 CPU on each query
>       * Postgres I/O for sequential scan is CPU limited to 110-120
>         MB/s on the fastest modern CPUs
>       * Postgres disk-based sort speed is 1/10 or more slower than
>         commercial databases and memory doesn’t improve it (much)

But PostgreSQL only spills to disk if the data set won't fit into the
amount of memory allocated by work_mem / sort_mem.  And for most
business analysis stuff, this can be quite large, and you can even crank
it up for a single query.

I've written reports that were horrifically slow, hitting the disk and
all, and I upped sort_mem to hundreds of megabytes until it fit into
memory, and suddenly, a slow query is running factors faster than
before.

Or did you mean something else by "disk-based sort speed"???
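
A minimal sketch of that kind of experiment -- psycopg2, the big_table/val
names, and PostgreSQL 8.0+ (where the parameter is called work_mem; older
releases call it sort_mem) are all assumptions -- is to run the same ORDER BY
with a small and a large per-session setting and compare the EXPLAIN ANALYZE
timings:

# Hypothetical sketch: compare a disk-spilling sort against an in-memory sort
# by raising the per-session sort memory before running the query.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # assumption: adjust DSN
cur = conn.cursor()

for kb in (1024, 262144):                      # 1MB vs. 256MB, in kilobytes
    cur.execute("SET work_mem = %d" % kb)      # "SET sort_mem" on pre-8.0
    cur.execute("EXPLAIN ANALYZE SELECT val FROM big_table ORDER BY val")
    print("--- work_mem = %dkB ---" % kb)
    for (line,) in cur.fetchall():
        print(line)

conn.close()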

Re: Hardware/OS recommendations for large databases (

From
Matthew Nuzum
Date:
On 11/16/05, David Boreham <david_list@boreham.org> wrote:
>  >Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
>  >for most applications this system will be IO bound, and you will see a
>  >nice lot of drive failures in the first year of operation with
>  >consumer grade drives.
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearing, motor etc ? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).
>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.

Of all the SCSI and IDE drives I've used, and I've used a lot, there
is a definite difference in quality. The SCSI drives primarily use
higher quality components that are intended to last longer under 24/7
work loads. I've taken several SCSI and IDE drives apart and you can
tell from the guts that the SCSI drives are built with sturdier
components.

I haven't gotten my hands on the Raptor line of ATA drives yet, but
I've heard they share this in common with the SCSI drives - they are
built with components made to be used day and night for years straight
without ending.

That doesn't mean they will last longer than IDE drives, that just
means they've been designed to withstand higher amounts of heat and
sustained activity. I've got some IDE drives that have lasted years++
and I've got some IDE drives that have lasted months. However, my SCSI
drives I've had over the years all lasted longer than the server they
were installed in.

I will say that in the last 10 years, the MTBF of IDE/ATA drives has
improved dramatically, so I regularly use them in servers, however I
have also shifted my ideology so that a server should be replaced
after 3 years, where before I aimed for 5.

It seems to me that the least reliable components in servers these
days are the fans.

--
Matthew Nuzum
www.bearfruit.org

Re: Hardware/OS recommendations for large databases (

From
"Steinar H. Gunderson"
Date:
On Wed, Nov 16, 2005 at 11:06:25AM -0600, Scott Marlowe wrote:
> There was a big commercial EMC style array in the hosting center at the
> same place that had something like a 16 wide by 16 tall array of IDE
> drives for storing pdf / tiff stuff on it, and we had at least one
> failure a month in it.  Of course, that's 256 drives, so you're gonna
> have failures, and it was configured with a spare on every other row or
> some such.  We just had a big box of hard drives and it was smart enough
> to rebuild automagically when you put a new one in, so the maintenance
> wasn't really that bad.  The performance was quite impressive too.

If you have a cool SAN, it alerts you and removes all data off a disk
_before_ it starts giving hard failures :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Hardware/OS recommendations for large databases (

From
Matthew Nuzum
Date:
On 11/16/05, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> If you have a cool SAN, it alerts you and removes all data off a disk
> _before_ it starts giving hard failures :-)
>
> /* Steinar */
> --
> Homepage: http://www.sesse.net/

Good point. I have avoided data loss *twice* this year by using SMART
hard drive monitoring software.

I can't tell you how good it feels to replace a drive that is about to
die, as compared to restoring data because a drive died.
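
For reference, a tiny sketch of what that monitoring boils down to -- it
assumes the smartmontools package is installed and that the device names below
match your system; in practice you would just run the smartd daemon -- is to
poll each drive's overall SMART health assessment and complain when it is no
longer PASSED:

# Hypothetical sketch: poll smartctl (from smartmontools) for each drive and
# flag any drive whose overall health assessment is not PASSED.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]   # assumption: the drives you care about

for dev in DEVICES:
    out = subprocess.Popen(["smartctl", "-H", dev],
                           stdout=subprocess.PIPE).communicate()[0]
    text = out.decode("ascii", "replace") if isinstance(out, bytes) else out
    if "PASSED" in text:
        print("%s: OK" % dev)
    else:
        print("WARNING: %s did not pass its SMART health check" % dev)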
--
Matthew Nuzum
www.bearfruit.org

OT Re: Hardware/OS recommendations for large databases (

From
"Douglas J. Trainor"
Date:
AMD added quad-core processors to their public roadmap for 2007.

Beyond 2007, the quad-cores will scale up to 32 sockets!!!!!!!!
(using Direct Connect Architecture 2.0)

Expect Intel to follow.

    douglas

On Nov 16, 2005, at 9:38 AM, Steve Wampler wrote:

> [...]
>
> Got it - the cpu is only acting on one query in any instant but may be
> switching between many 'simultaneous' queries.  PG isn't really
> involved
> in the decision.  That makes sense.

Re: Hardware/OS recommendations for large databases (

From
Scott Marlowe
Date:
On Wed, 2005-11-16 at 12:51, Steinar H. Gunderson wrote:
> On Wed, Nov 16, 2005 at 11:06:25AM -0600, Scott Marlowe wrote:
> > There was a big commercial EMC style array in the hosting center at the
> > same place that had something like a 16 wide by 16 tall array of IDE
> > drives for storing pdf / tiff stuff on it, and we had at least one
> > failure a month in it.  Of course, that's 256 drives, so you're gonna
> > have failures, and it was configured with a spare on every other row or
> > some such.  We just had a big box of hard drives and it was smart enough
> > to rebuild automagically when you put a new one in, so the maintenance
> > wasn't really that bad.  The performance was quite impressive too.
>
> If you have a cool SAN, it alerts you and removes all data off a disk
> _before_ it starts giving hard failures :-)

Yeah, I forget who made the unit we used, but it was pretty much fully
automated.  It was something like a large RAID 5+0 (0+5???) and would
send an alert when a drive died or started getting errors, and the bad
drive's caddy would be flashing red instead of steady green.

I just remember thinking that I'd never used a drive array that was
taller than I was before that.

Re: Hardware/OS recommendations for large databases

From
Ron
Date:
You _ARE_ kidding right?  In what hallucination?

The performance numbers for the 1GB cache version of the Areca 1160
are the _grey_ line in the figures, and were added after the original
article was published:

"Note: Since the original Dutch article was published in late
January, we have finished tests of the 16-port Areca ARC-1160 using
128MB, 512MB and 1GB cache configurations and RAID 5 arrays of up to
12 drives. The ARC-1160 was using the latest 1.35 beta firmware. The
performance graphs have been updated to include the ARC-1160 results.
Discussions of the results have not been updated, however. "

With 1GB of cache, the 1160's beat everything else in almost all of
the tests they participated in.  For the few where they do not win
hands down, the Escalades (very occasionally) essentially tie.

These are very easy-to-read, full-color graphs where higher is better,
and the grey line representing the 1GB 1160's is almost always higher
on the graph than anything else.  Granted, the Escalades seem to give
them the overall best run for their money, but they are still clearly
second best when looking at all the graphs and the CPU utilization
numbers in aggregate.

Ron



At 12:08 PM 11/16/2005, Alex Turner wrote:
>Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10,
>the 3ware controllers beat the Areca card.
>
>Alex.




Re: Hardware/OS recommendations for large databases (

From
Ron Mayer
Date:
William Yu wrote:
>
> Our SCSI drives have failed maybe a little less than our IDE drives.

Microsoft in their database showcase terraserver project has
had the same experience.  They studied multiple configurations
including a SCSI/SAN solution as well as a cluster of SATA boxes.

They measured a
   6.4% average annual failure rate of their SATA version and a
   5.5% average annual failure rate on their SCSI implementation.

ftp://ftp.research.microsoft.com/pub/tr/TR-2004-107.pdf

   "We lost 9 drives out of 140 SATA drives on the Web and
    Storage Bricks in one year. This is a 6.4% annual failure rate.
    In contrast, the Compaq Storageworks SAN and Web servers lost
    approximately 32 drives in three years out of a total of 194
    drives. This is a 5.5% annual failure rate.

    The failure rates indicate that SCSI drives are more
    reliable than SATA. SATA drives are substantially
    cheaper than SCSI drives. Because the SATA failure rate
    is so close to the SCSI failure rate gives SATA a
    substantial return on investment advantage."

So unless your system is extremely sensitive to single drive
failures, the difference is pretty small.   And for the cost
it seems you can buy enough extra spindles of SATA drives to
easily make up for the performance difference.
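
To put rough numbers on that, a quick back-of-the-envelope sketch using the two
rates quoted above (the array sizes are arbitrary examples, not from the paper):

# Expected drive failures per year for a given array size, using the
# 6.4% (SATA) and 5.5% (SCSI) annual failure rates quoted above.
RATES = {"SATA": 0.064, "SCSI": 0.055}

for drives in (8, 32, 140):
    parts = ["%s %.1f" % (name, drives * rate)
             for name, rate in sorted(RATES.items())]
    print("%3d drives -> expected failures/year: %s" % (drives, ", ".join(parts)))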


> Basically, I've found it's cooling that's most important. Packing the
> drives together into really small rackmounts? Good for your density, not
> good for the drives.

Indeed that was their guess for their better-than-expected
life of their SATA drives as well.  From the same paper:

  "We were careful about disk cooling – SATA
   drives are rarely cooled with the same care that a SCSI
   array receives."

Re: Hardware/OS recommendations for large databases

From
Ron
Date:
Amendment: there are graphs where the 1GB Areca 1160's do not do as
well.  Given that those are mySQL-specific, and that similar usage
scenarios not involving mySQL (as well as most of the usage scenarios
involving mySQL; as I said, these did not follow the pattern of the
rest of the benchmarks) show the usual pattern of the 1GB 1160's in
1st place or tied for 1st place, it seems reasonable that mySQL has
something to do with the aberrant results in those 2 (IIRC) cases.

Ron





Re: Hardware/OS recommendations for large databases (

From
mudfoot@rawbw.com
Date:
Yeah, those big disk arrays are real sweet.

One day last week I was in a data center in Arizona when the big LSI/Storagetek
array in the cage next to mine had a hard drive failure.  So the alarm shrieked
at like 13225535 decibels continuously for hours.  BEEEP BEEEEP BEEEEP BEEEEP.
Of course since this was a colo facility it wasn't staffed on site by the idiots
who own the array.  BEEEEP BEEEEEEEP BEEEEEEEP for hours.  So I had to stand
next to this thing--only separated by a few feet and a little wire mesh--while
it shrieked for hours until a knuckle-dragger arrived on site to swap the drive.

Yay.

So if you're going to get a fancy array (they're worth it if somebody else is
paying) then make sure to *turn off the @#%@#SF'ing audible alarm* if you deploy
it in a colo facility.




Re: Hardware/OS recommendations for large databases (

From
Alex Turner
Date:
On 11/16/05, William Yu <wyu@talisys.com> wrote:
> I would say the opposite -- you always want Dual Core nowadays. DC
> Opterons simply give you better bang for the buck than single core
> Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will
> be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC
> versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because
> those mega-CPU motherboards are astronomically expensive.
>

Opteron 242 - $178.00
Opteron 242 - $178.00
Tyan S2882 - $377.50
Total: $733.50

Opteron 265 - $719.00
Tyan K8E - $169.00
Total: $888.00

Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples.
In fact I couldn't find a single-CPU board that did, so you pretty
much have to buy a dual-CPU board to get PCI-X.

1xDC is _not_ cheaper.

Our DB application does about 5 queries/second peak, plus a heavy
insert job once per day.  We only _need_ two CPUs, which is true for a
great many DB applications.  Unless you like EJB of course, which will
thrash the crap out of your system.

Consider the two most common workloads for DBs:

a) OLTP - probably IO bound, a large number of queries/sec updating info
on _disks_, not requiring much CPU activity except to retrieve item
information, which is well indexed and normalized.

b) Data warehouse - needs CPU, but probably still IO bound; a large
data set that won't fit in RAM will require large amounts of disk
reads.  CPU can easily keep up with disk reads.

I have yet to come across a DB system that wasn't IO bound.

> DC also gives you a better upgrade path. Let's say you do testing and
> figure 2x246 is the right setup to handle the load. Well instead of
> getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
> DC/270. Now you have a server that can be upgraded to +80% more CPU by
> popping in another DC/270 versus throwing out the entire thing to get a
> 4x1P setup.

No argument there.  But it's pointless if you are IO bound.

>
> The only questions would be:
> (1) Do you need a SMP server at all? I'd claim yes -- you always need 2+
> cores whether it's DC or 2P to avoid IO interrupts blocking other
> processes from running.

At least 2CPUs is always good for precisely those reasons.  More than
2CPUs gives diminishing returns.

>
> (2) Does a DC system perform better than it's Nx1P cousin? My experience
> is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
> and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP,
> etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.

Maybe true, but the 265 does have a 25% faster FSB than the 244, which
might perhaps play a role.

>
> (3) Do you need an insane amount of memory? Well here's the case where
> the more expensive motherboard will serve you better since each CPU slot
> has its own bank of memory. Spend more money on memory, get cheaper
> single-core CPUs.

Remember - large DB is going to be IO bound.  Memory will get thrashed
for file block buffers, even if you have large amounts, it's all gonna
be cycled in and out again.

>
> Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon
> DCs, while cheaper than their corresponding single-core SMPs, don't have
> the same performance profile of Opteron DCs. Basically, you're paying a
> bit extra so your server can generate a ton more heat.

Dell/Xeon/Postgres is just a bad combination any day of the week ;)

Alex.

Re: Hardware/OS recommendations for large databases (

From
Alex Turner
Date:
On 11/16/05, Joshua D. Drake <jd@commandprompt.com> wrote:
> >
> > The only questions would be:
> > (1) Do you need a SMP server at all? I'd claim yes -- you always need
> > 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other
> > processes from running.
>
> I would back this up. Even for smaller installations (single raid 1, 1
> gig of ram). Why? Well because many applications are going to be CPU
> bound. For example
> we have a PHP application that is a CMS. On a single CPU machine, RAID 1
> it takes about 300ms to deliver a single page, point to point. We are
> not IO bound.
> So what happens is that under reasonable load we are actually waiting
> for the CPU to process the code.
>

This is the performance profile for PHP, not for Postgresql.  This is
the postgresql mailing list.

> A simple upgrade to an SMP machine literally doubles our performance
> because we are still not IO bound. I strongly suggest that everyone use
> at least a single dual core because of this experience.
>

Performance of PHP, not postgresql.

> >
> > (3) Do you need an insane amount of memory? Well here's the case where
> > the more expensive motherboard will serve you better since each CPU
> > slot has its own bank of memory. Spend more money on memory, get
> > cheaper single-core CPUs.
> Agreed. A lot of times the slowest dual-core is 5x what you actually
> need. So get the slowest, and bulk up on memory. If nothing else memory
> is cheap today and it might not be tomorrow.
[snip]

Running postgresql on a single drive RAID 1 with PHP on the same
machine is not a typical installation.

300ms for PHP in CPU time?  wow dude - that's quite a page.  PHP can
typically handle 30-50 pages per second for a typical OLTP
application on a single CPU box.  Something is really wrong with that
system if it takes 300ms per page.

Alex.

Re: Hardware/OS recommendations for large databases (

From
William Yu
Date:
Alex Turner wrote:
> Opteron 242 - $178.00
> Opteron 242 - $178.00
> Tyan S2882 - $377.50
> Total: $733.50
>
> Opteron 265 - $719.00
> Tyan K8E - $169.00
> Total: $888.00

You're comparing the wrong CPUs. The 265 is the dual-core counterpart of
the 244, so you'll have to bump up the price more, although not enough to
make a difference.

Looks like the price of the 2X MBs have dropped since I last looked at
it. Just a few months back, Tyan duals were $450-$500 which is what I
was basing my "priced less" statement from.

> Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples.
> Infact I couldn't find a single CPU slot board that did, so you pretty
> much have to buy a dual CPU board to get PCI-X.

You can get single CPU boards w/ PCIe and use PCIe controller cards.
Probably expensive right now because they're so bleeding-edge new but
definitely on the downswing.

> a) OLTP - probably IO bound, large number of queries/sec updating info
> on _disks_, not requiring much CPU activity except to retrieve item
> infomration which is well indexed and normalized.

Not in my experience. I find on our OLTP servers, we run 98% in RAM and
hence are 100% CPU-bound. Our DB is about 50GB in size now, servers run
w/ 8GB of RAM. We were *very* CPU limited running 2x244. During busy
hours of the day, our avg "user transaction" times were jumping from
0.8sec to 1.3+sec. Did the 2x265 and now we're always in the 0.7sec to
0.9sec range.

>>DC also gives you a better upgrade path. Let's say you do testing and
>>figure 2x246 is the right setup to handle the load. Well instead of
>>getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
>>DC/270. Now you have a server that can be upgraded to +80% more CPU by
>>popping in another DC/270 versus throwing out the entire thing to get a
>>4x1P setup.
>
>
> No argument there.  But it's pointless if you are IO bound.

Why would you just accept "we're IO bound, nothing we can do"? I'd do
everything in my power to make my app go from IO bound to CPU bound --
whether by optimizing my code or buying more hardware. I can tell you if
our OLTP servers were IO bound, it would run like crap. Instead of < 1
sec, we'd be looking at 5-10 seconds per "user transaction" and our
users would be screaming bloody murder.

In theory, you can always convert your IO bound DB to CPU bound by
stuffing more and more RAM into your server. (Or partitioning the DB
across multiple servers.) Whether it's cost effective depends on the DB
and how much your users are paying you -- and that's a case-by-case
analysis. Not a global statement of "IO-bound, pointless".

>>(2) Does a DC system perform better than it's Nx1P cousin? My experience
>>is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
>>and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP,
>>etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.
>
>
> Maybe true, but the 265 does have a 25% faster FSB than the 244, which
> might perhaps play a role.

Nope. There's no such thing as FSB on Opterons. On-die memory controller
runs @ CPU speed and hence connects at whatever the memory runs at
(rounded off to some multiplier math). There's the HT speed that
controls the max IO bandwidth but that's based on the motherboard, not
the CPU. Plus the 265 and 244 both run at 1.8GHz, so the memory
multiplier & HT IO are both the same.

Re: Hardware/OS recommendations for large databases (

From
"Joshua D. Drake"
Date:
>> So what happens is that under reasonable load we are actually waiting
>> for the CPU to process the code.
>>
>>
>
> This is the performance profile for PHP, not for Postgresql.  This is
> the post
And your point? PostgreSQL benefits directly from what I am speaking
about as well.

>> Performance of PHP, not postgresql.
>>
>>
Actually both.

> [snip]
>
> Running postgresql on a single drive RAID 1 with PHP on the same
> machine is not a typical installation.
>
Want to bet? What do you think the majority of people hosting at
rackshack, rackspace,
superrack etc... are doing? Or how about all those virtual hosts?

> 300ms for PHP in CPU time?  wow dude - that's quite a page.  PHP
> typical can handle up to 30-50 pages per second for a typical OLTP
> application on a single CPU box.  Something is really wrong with that
> system if it takes 300ms per page.
>
There is wait time associated with that because we are hitting it with
50-100 connections at a time.

Joshua D. Drake

> Alex.
>


Re: Hardware/OS recommendations for large databases (

From
Joshua Marsh
Date:

On 11/17/05, William Yu <wyu@talisys.com> wrote:
> > No argument there.  But it's pointless if you are IO bound.
>
> Why would you just accept "we're IO bound, nothing we can do"? I'd do
> everything in my power to make my app go from IO bound to CPU bound --
> whether by optimizing my code or buying more hardware. I can tell you if
> our OLTP servers were IO bound, it would run like crap. Instead of < 1
> sec, we'd be looking at 5-10 seconds per "user transaction" and our
> users would be screaming bloody murder.
>
> In theory, you can always convert your IO bound DB to CPU bound by
> stuffing more and more RAM into your server. (Or partitioning the DB
> across multiple servers.) Whether it's cost effective depends on the DB
> and how much your users are paying you -- and that's a case-by-case
> analysis. Not a global statement of "IO-bound, pointless".

We all want our systems to be CPU bound, but it's not always possible.
Remember, he is managing a 5 TB database.  That's quite a bit different
than a 100 GB or even 500 GB database.

Re: Hardware/OS recommendations for large databases (

From
William Yu
Date:
Joshua Marsh wrote:
>
> On 11/17/05, *William Yu* <wyu@talisys.com <mailto:wyu@talisys.com>> wrote:
>
>      > No argument there.  But it's pointless if you are IO bound.
>
>     Why would you just accept "we're IO bound, nothing we can do"? I'd do
>     everything in my power to make my app go from IO bound to CPU bound --
>     whether by optimizing my code or buying more hardware. I can tell you if
>     our OLTP servers were IO bound, it would run like crap. Instead of < 1
>     sec, we'd be looking at 5-10 seconds per "user transaction" and our
>     users would be screaming bloody murder.
>
>     In theory, you can always convert your IO bound DB to CPU bound by
>     stuffing more and more RAM into your server. (Or partitioning the DB
>     across multiple servers.) Whether it's cost effective depends on the DB
>     and how much your users are paying you -- and that's a case-by-case
>     analysis. Not a global statement of "IO-bound, pointless".
>
>
> We all want our systems to be CPU bound, but it's not always possible.
> Remember, he is managing a 5 TB Databse.  That's quite a bit different
> than a 100 GB or even 500 GB database.

I did say "in theory". :) I'm pretty sure google is more CPU bound than
IO bound -- they just spread their DB over 50K servers or whatever. Not
everybody is willing to pay for that but it's always in the realm of
plausibility.

Plus we have to go back to the statement I was replying to which was "I
have yet to come across a DB system that wasn't IO bound".


Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Joshua Marsh <icub3d@gmail.com> writes:

> We all want our systems to be CPU bound, but it's not always possible.

Sure it is, let me introduce you to my router, a 486DX100...



Ok, I guess that wasn't very helpful, I admit.


--
greg

Re: Hardware/OS recommendations for large databases (

From
Greg Stark
Date:
Joshua Marsh <icub3d@gmail.com> writes:

> We all want our systems to be CPU bound, but it's not always possible.
> Remember, he is managing a 5 TB Databse. That's quite a bit different than a
> 100 GB or even 500 GB database.

Ok, a more productive point: it's not really the size of the database that
controls whether you're I/O bound or CPU bound. It's the available I/O
bandwidth versus your CPU speed.

If your I/O subsystem can feed data to your CPU as fast as it can consume it
then you'll be CPU bound no matter how much data you have in total. It's
harder to scale up I/O subsystems than CPUs, instead of just replacing a CPU
it tends to mean replacing the whole system to get a better motherboard with a
faster, better bus, as well as adding more controllers and more disks.
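
A rough way to tell which side of that line a given box sits on is to
compare the raw sequential read rate of the I/O subsystem against the rate
Postgres achieves on a full table scan; a minimal sketch (paths, table and
database names here are only placeholders):

# Raw sequential read rate of the filesystem holding the database.
# The file should be larger than RAM so the cache doesn't hide the disks.
dd if=/dev/zero of=/data/scratch/bigfile bs=8k count=2000000
time dd if=/data/scratch/bigfile of=/dev/null bs=8k

# Rate Postgres achieves scanning a table of comparable size.
time psql -c "SELECT count(*) FROM bigtable" testdb

# If the scan runs well below the raw dd rate, the query is CPU bound;
# if the two rates are close, it is I/O bound.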

--
greg

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Greg,


On 11/17/05 9:17 PM, "Greg Stark" <gsstark@mit.edu> wrote:

> Ok, a more productive point: it's not really the size of the database that
> controls whether you're I/O bound or CPU bound. It's the available I/O
> bandwidth versus your CPU speed.

Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.

A $1,000 system with one CPU and two SATA disks in a software RAID0 will
perform exactly the same as a $80,000 system with 8 dual core CPUs and the
world's best SCSI RAID hardware on a large database for decision support
(what the poster asked about).

Regards,

- Luke



Re: Hardware/OS recommendations for large databases (

From
Dave Cramer
Date:
On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:

> Greg,
>
>
> On 11/17/05 9:17 PM, "Greg Stark" <gsstark@mit.edu> wrote:
>
>> Ok, a more productive point: it's not really the size of the
>> database that
>> controls whether you're I/O bound or CPU bound. It's the available
>> I/O
>> bandwidth versus your CPU speed.
>
> Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound
> after
> 110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.
>
> A $1,000 system with one CPU and two SATA disks in a software RAID0
> will
> perform exactly the same as a $80,000 system with 8 dual core CPUs
> and the
> world's best SCSI RAID hardware on a large database for decision
> support
> (what the poster asked about).

Now there's an interesting line drawn in the sand. I presume you have
numbers to back this up ?

This should draw some interesting posts.

Dave
>
> Regards,
>
> - Luke
>
>
>


Re: Hardware/OS recommendations for large databases (

From
Richard Huxton
Date:
Dave Cramer wrote:
>
> On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:
>
>> Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound  after
>> 110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.
>>
>> A $1,000 system with one CPU and two SATA disks in a software RAID0  will
>> perform exactly the same as a $80,000 system with 8 dual core CPUs
>> and the
>> world's best SCSI RAID hardware on a large database for decision  support
>> (what the poster asked about).
>
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
>
> This should draw some interesting posts.

Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000
system if he's got one going :-)

--
   Richard Huxton
   Archonet Ltd

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Dave,

On 11/18/05 5:00 AM, "Dave Cramer" <pg@fastcrypt.com> wrote:
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
>
> This should draw some interesting posts.

Part 2: The answer

System A:
This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.

On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that’s 66 MB/s.  Not the efficiency I’d hoped for from the onboard SATA controller; I would have expected to get 85% of the 100MB/s raw read performance.

So that’s $1,200 / 66 MB/s (without adjusting for 2003 price versus now) = 18.2 $/MB/s

Raw data:
[llonergan@kite4 IVP]$ cat scan.sh
#!/bin/bash

time psql -c "select count(*) from ivp.bigtable1" dgtestdb
[llonergan@kite4 IVP]$ cat sysout1
  count   
----------
 10000000
(1 row)


real    0m32.565s
user    0m0.002s
sys     0m0.003s

Size of the table data:
[llonergan@kite4 IVP]$ du -sk dgtestdb/base
2121648 dgtestdb/base

System B:
This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead.  It’s running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel.

Same test as above, but with 17GB of data takes 69.7 seconds to scan (!)  That’s 244.2MB/s, which is obviously double my earlier point of 110-120MB/s.  This system is running with a 16MB Linux readahead setting, let’s try it with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s.

So, using the tuned setting of “blockdev --setra 16384” we get $6,000 / 244MB/s = 24.6 $/MB/s
If we use the default Linux setting it’s 2.5x worse.
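
For reference, that readahead setting is per block device and can be
inspected or changed with blockdev; a minimal sketch (the device path is
an assumption):

# Current readahead for the device, in 512-byte sectors.
blockdev --getra /dev/sda

# Raise it to the value used in this test.
blockdev --setra 16384 /dev/sda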

Raw data:
[llonergan@modena2 IVP]$ cat scan.sh
#!/bin/bash

time psql -c "select count(*) from ivp.bigtable1" dgtestdb
[llonergan@modena2 IVP]$ cat sysout3
  count   
----------
 80000000
(1 row)


real    1m9.875s
user    0m0.000s
sys     0m0.004s
[llonergan@modena2 IVP]$ !du
du -sk dgtestdb/base
17021260        dgtestdb/base

Summary:

<cough, cough> OK – you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead.  This is a cheap alternative to overhauling the executor to use asynch I/O.

Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings.   If the filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.

- Luke

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Dave,


On 11/18/05 5:00 AM, "Dave Cramer" <pg@fastcrypt.com> wrote:
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
>
> This should draw some interesting posts.

OK, here we go:

The $1,000 system (System A):

- I bought 16 of these in 2003 for $1,200 each. They have Intel or Asus motherboards, Intel P4 3.0GHz CPUs with an 800MHz FSB.  They have a system drive and two RAID0 SATA drives, the Western Digital 74GB Raptor (10K RPM).  They have 1GB of RAM.

  • A test of write and read performance on the RAID0:

[llonergan@kite4 raid0]$ time dd if=/dev/zero of=bigfile bs=8k count=250000
250000+0 records in
250000+0 records out

real    0m17.453s
user    0m0.249s
sys     0m10.246s

[llonergan@kite4 raid0]$ time dd if=bigfile of=/dev/null bs=8k
250000+0 records in
250000+0 records out

real    0m18.930s
user    0m0.130s
sys     0m3.590s

So, the write performance is 114MB/s and read performance is 106MB/s.

The $6,000 system (System B):

  • I just bought 5 of these systems for $6,000 each.  They are dual Opteron systems with 8GB of RAM and 2x 250 model CPUs, which are close to the fastest.  They have the new 3Ware 9550SX SATA RAID adapters coupled to  Western Digital 400GB RE2 model hard drives.  They are organized as a RAID5.

  • A test of write and read performance on the RAID5:

[root@modena2 dbfast1]# time dd if=/dev/zero of=bigfile bs=8k count=2000000
2000000+0 records in
2000000+0 records out

real    0m51.441s
user    0m0.288s
sys     0m29.119s

[root@modena2 dbfast1]# time dd if=bigfile of=/dev/null bs=8k
2000000+0 records in
2000000+0 records out

real    0m39.605s
user    0m0.244s
sys     0m19.207s

So, the write performance is 314MB/s and read performance is 404MB/s (!)  This is the fastest I’ve seen 8 disk drives perform.

So, the question is: which of these systems (A or B) can scan a large table faster using non-MPP postgres?  How much faster would you wager?

Send your answer, and I’ll post the result.

Regards,

- Luke

Re: Hardware/OS recommendations for large databases

From
Ron
Date:
While I agree with you in principle that pg becomes CPU bound
relatively easily compared to other DB products (at ~110-120MBps
according to a recent thread), there's a bit of hyperbole in your post.

a. There's a big difference between the worst performing 1C x86 ISA
CPU available and the best performing 2C one (IIRC, that's the
2.4GHz, 1MB L2 cache AMDx2 4800+ as of this writing)

b. Two 2C CPU's vs one 1C CPU means that a pg process will almost
never be waiting on other non pg processes.  It also means that 3-4
pg processes, CPU bound or not, can execute in parallel.  Not an
option with one 1C CPU.

c. Mainboards with support for multiple CPUs and lots of RAM are
_not_ the cheap ones.

d.  No one should ever use RAID 0 for valuable data.  Ever.  So at
the least you need 4 HD's for a RAID 10 set (RAID 5 is not a good
option unless write performance is unimportant.  4HD RAID 5 is
particularly not a good option.)

e. The server usually needs to talk to things over a network
connection.  Often performance here matters.  Mainboards with 2 1GbE
NICs and/or PCI-X (or PCI-E) slots for 10GbE cards are not the cheap ones.

f. Trash HDs mean poor IO performance and lower reliability.  While
TOTL 15Krpm 4Gb FC HDs are usually overkill (Not always.  It depends
on context.),
you at least want SATA II HDs with NCQ or TCQ support.  And you want
them to have a decent media warranty- preferably a 5 year one if you
can get it.  Again, these are not the cheapest HD's available.

g. Throughput limitations say nothing about latency
considerations.  OLTP-like systems _want_ HD spindles.  AMAP.  Even
non OLTP-like systems need a fair number of spindles to optimize HD
IO: dedicated WAL set, multiple dedicated DB sets, dedicated OS and
swap space set, etc, etc.  At 50MBps ASTR, you need 16 HD's operating
in parallel to saturate the bandwidth of a PCI-X channel (16 x 50MBps =
800MBps, the bandwidth of a 64-bit/100MHz PCI-X bus).
That's ~8 independent pg tasks (queries using different tables,
dedicated WAL IO, etc) running in parallel.  Regardless of application domain.

h. Decent RAID controllers and HBAs are not cheap either.  Even SW
RAID benefits from having a big dedicated RAM buffer to talk to.

While the above may not cost you $80K, it sure isn't costing you $1K either.
Maybe ~$15-$20K, but not $1K.

Ron


At 01:07 AM 11/18/2005, Luke Lonergan wrote:
>Greg,
>
>
>On 11/17/05 9:17 PM, "Greg Stark" <gsstark@mit.edu> wrote:
>
> > Ok, a more productive point: it's not really the size of the database that
> > controls whether you're I/O bound or CPU bound. It's the available I/O
> > bandwidth versus your CPU speed.
>
>Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
>110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.
>
>A $1,000 system with one CPU and two SATA disks in a software RAID0 will
>perform exactly the same as a $80,000 system with 8 dual core CPUs and the
>world's best SCSI RAID hardware on a large database for decision support
>(what the poster asked about).
>
>Regards,
>
>- Luke
>
>
>




Re: Hardware/OS recommendations for large databases (

From
Vivek Khera
Date:

On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote:

> Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings.   If the filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.

Yeah, and mysql would probably be faster on your trivial queries.  Try concurrent large joins and updates and see which system is faster.

Re: Hardware/OS recommendations for large databases (

From
Vivek Khera
Date:
On Nov 18, 2005, at 1:07 AM, Luke Lonergan wrote:

> A $1,000 system with one CPU and two SATA disks in a software RAID0
> will
> perform exactly the same as a $80,000 system with 8 dual core CPUs
> and the
> world's best SCSI RAID hardware on a large database for decision
> support
> (what the poster asked about).

Hahahahahahahahahahahahaha! Whooo... needed to fall out of my chair
laughing this morning.

I can tell you from direct personal experience that you're just plain
wrong.

I've had to move my primary DB server from a dual P3 1GHz with 4-disk
RAID10 SCSI, to Dual P3 2GHz with 14-disk RAID10 and faster drives,
to Dual Opteron 2GHz with 8-disk RAID10 and even faster disks to keep
up with my load on a 60+ GB database.  The Dual opteron system has
just a little bit of extra capacity if I offload some of the
reporting operations to a replicated copy (via slony1).  If I run all
the queries on the one DB it can't keep up.

The most telling point about the difference in speed is that the 14-
disk array system cannot keep up with the replication being generated
by the dual opteron, even when it is not doing any other queries of
its own.  The I/O system just ain't fast enough.


Re: Hardware/OS recommendations for large databases (

From
Bill McGonigle
Date:
On Nov 18, 2005, at 08:00, Dave Cramer wrote:

>> A $1,000 system with one CPU and two SATA disks in a software RAID0
>> will
>> perform exactly the same as a $80,000 system with 8 dual core CPUs
>> and the
>> world's best SCSI RAID hardware on a large database for decision
>> support
>> (what the poster asked about).
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
> This should draw some interesting posts.

There is some truth to it.  For an app I'm currently running (full-text
search using tsearch2 on ~100MB of data) on:

Dev System:
Asus bare-bones bookshelf case/mobo
3GHz P4 w/ HT
800MHz memory Bus
Fedora Core 3 (nightly update)
1GB RAM
1 SATA Seagate disk (7200RPM, 8MB Cache)
$800
worst-case query: 7.2 seconds

now, the machine I'm deploying to:

Dell SomthingOrOther
(4) 2.4GHz Xeons
533MHz memory bus
RedHat Enterprise 3.6
1GB RAM
(5) 15000 RPM Ultra SCSI 320 on an Adaptec RAID 5 controller
 > $10000
same worst-case query: 9.6 seconds

Now it's not apples-to-apples.  There's a kernel 2.4 vs. 2.6 difference
and the memory bus is much faster and I'm not sure what kind of context
switching hit you get with the Xeon MP memory controller.  On a
previous postgresql app I did I ran nearly identically spec'ed machines
except for the memory bus and saw about a 30% boost in performance just
with the 800MHz bus.  I imagine the Opteron bus does even better.

So the small machine is probably slower on disk but makes up for it in
single-threaded access to CPU and memory speed. But if this app were to
be scaled it would make much more sense to cluster several $800
machines than it would to buy 'big-iron'.

-Bill
-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
bill@bfccomputing.com           Mobile: 603.252.2606
http://www.bfccomputing.com/    Pager: 603.442.1833
Jabber: flowerpt@gmail.com      Text: bill+text@bfccomputing.com
Blog: http://blog.bfccomputing.com/


Re: Hardware/OS recommendations for large databases (

From
Alan Stange
Date:
Richard Huxton wrote:
> Dave Cramer wrote:
>>
>> On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:
>>
>>> Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound
>>> after
>>> 110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.
>>>
>>> A $1,000 system with one CPU and two SATA disks in a software RAID0
>>> will
>>> perform exactly the same as a $80,000 system with 8 dual core CPUs
>>> and the
>>> world's best SCSI RAID hardware on a large database for decision
>>> support
>>> (what the poster asked about).
>>
>>
>> Now there's an interesting line drawn in the sand. I presume you
>> have  numbers to back this up ?
>>
>> This should draw some interesting posts.

That's interesting, as I occasionally see more than 110MB/s of
postgresql IO on our system.  I'm using a 32KB block size, which has
been a huge win in performance for our usage patterns.   300GB database
with a lot of turnover.  A vacuum analyze now takes about 3 hours, which
is much shorter than before.  Postgresql 8.1, dual opteron, 8GB memory,
Linux 2.6.11, FC drives.
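
For reference, changing the block size means rebuilding Postgres from
source; a rough sketch for releases of this vintage, assuming BLCKSZ is
defined in src/include/pg_config_manual.h (worth checking against the
actual tree), with a fresh initdb since the on-disk format changes:

# In the PostgreSQL source tree: raise the block size from 8KB to 32KB,
# then rebuild and initdb a new cluster.
sed -i 's/^#define BLCKSZ.*/#define BLCKSZ 32768/' src/include/pg_config_manual.h
./configure --prefix=/usr/local/pgsql-32k
make && make install
/usr/local/pgsql-32k/bin/initdb -D /data/pgdata-32k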

-- Alan

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Richard,

On 11/18/05 5:22 AM, "Richard Huxton" <dev@archonet.com> wrote:

> Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000
> system if he's got one going :-)

Finally, a game worth playing!

Except it’s backward – I’ll show you 80 $1,000 systems performing 80 times faster than one $80,000 system.

On your proposition – I don’t have any $80,000 systems for trade, do you?

- Luke

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Bill,

On 11/18/05 7:55 AM, "Bill McGonigle" <bill@bfccomputing.com> wrote:
>
> There is some truth to it.  For an app I'm currently running (full-text
> search using tsearch2 on ~100MB of data) on:

Do you mean 100GB?  Sounds like you are more of a decision support /
warehousing application.

> Dev System:
> Asus bare-bones bookshelf case/mobo
> 3GHz P4 w/ HT
> 800MHz memory Bus
> Fedora Core 3 (nightly update)
> 1GB RAM
> 1 SATA Seagate disk (7200RPM, 8MB Cache)
> $800
> worst-case query: 7.2 seconds

About the same machine I posted results for, except I had two faster disks.

> now, the machine I'm deploying to:
>
> Dell SomthingOrOther
> (4) 2.4GHz Xeons
> 533MHz memory bus
> RedHat Enterprise 3.6
> 1GB RAM
> (5) 150000 RPM Ultra SCSI 320 on an Adaptec RAID 5 controller
>> $10000
> same worst-case query: 9.6 seconds

Your problem here is the HW RAID controller - if you dump it and use the
onboard SCSI channels and Linux RAID you will see a jump from 40MB/s to
about 220MB/s in read performance and from 20MB/s to 110MB/s write
performance.  It will use less CPU too.
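
A minimal sketch of that kind of software RAID setup (device names, RAID
level and filesystem here are assumptions, not a prescription):

# Build a software RAID array from drives on the onboard channels,
# then put an xfs filesystem on it for the database.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.xfs /dev/md0
mount /dev/md0 /data/pgdata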

> Now it's not apples-to-apples.  There's a kernel 2.4 vs. 2.6 difference
> and the memory bus is much faster and I'm not sure what kind of context
> switching hit you get with the Xeon MP memory controller.  On a
> previous postgresql app I did I ran nearly identically spec'ed machines
> except for the memory bus and saw about a 30% boost in performance just
> with the 800MHz bus.  I imagine the Opteron bus does even better.

Memory bandwidth is so high on both that it's not a factor.  Context
switching / memory bus contention isn't either.

> So the small machine is probably slower on disk but makes up for it in
> single-threaded access to CPU and memory speed. But if this app were to
> be scaled it would make much more sense to cluster several $800
> machines than it would to buy 'big-iron'.

Yes it does - by a lot too.  Also, having a multiprocessing executor gets
the most out of each machine by having multiple CPUs scan simultaneously.

- Luke



Re: Hardware/OS recommendations for large databases (

From
Alex Turner
Date:
Ok - so I ran the same test on my system and get a total speed of
113MB/sec.  Why is this?  Why is the system so limited to around just
110MB/sec?  I tuned read ahead up a bit, and my results improve a
bit.

Alex


On 11/18/05, Luke Lonergan <llonergan@greenplum.com> wrote:
>  Dave,
>
>  On 11/18/05 5:00 AM, "Dave Cramer" <pg@fastcrypt.com> wrote:
>  >
>  > Now there's an interesting line drawn in the sand. I presume you have
>  > numbers to back this up ?
>  >
>  > This should draw some interesting posts.
>
>  Part 2: The answer
>
>  System A:
>
> This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.
>
>  On a single table with 15 columns (the Bizgres IVP) at a size double memory
> (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan
> the table: that's 66 MB/s.  Not the efficiency I'd hope from the onboard
> SATA controller that I'd like, I would have expected to get 85% of the
> 100MB/s raw read performance.
>
>  So that's $1,200 / 66 MB/s (without adjusting for 2003 price versus now) =
> 18.2 $/MB/s
>
>  Raw data:
>  [llonergan@kite4 IVP]$ cat scan.sh
>  #!/bin/bash
>
>  time psql -c "select count(*) from ivp.bigtable1" dgtestdb
>  [llonergan@kite4 IVP]$ cat sysout1
>    count
>  ----------
>   10000000
>  (1 row)
>
>
>  real    0m32.565s
>  user    0m0.002s
>  sys     0m0.003s
>
>  Size of the table data:
>  [llonergan@kite4 IVP]$ du -sk dgtestdb/base
>  2121648 dgtestdb/base
>
>  System B:
>
> This system is running an XFS filesystem, and has been tuned to use very
> large (16MB) readahead.  It's running the Centos 4.1 distro, which uses a
> Linux 2.6.9 kernel.
>
>  Same test as above, but with 17GB of data takes 69.7 seconds to scan (!)
> That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s.
>  This system is running with a 16MB Linux readahead setting, let's try it
> with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds
> or 99.3MB/s.
>
>  So, using the tuned setting of "blockdev --setra 16384" we get $6,000 /
> 244MB/s = 24.6 $/MB/s
>  If we use the default Linux setting it's 2.5x worse.
>
>  Raw data:
>  [llonergan@modena2 IVP]$ cat scan.sh
>  #!/bin/bash
>
>  time psql -c "select count(*) from ivp.bigtable1" dgtestdb
>  [llonergan@modena2 IVP]$ cat sysout3
>    count
>  ----------
>   80000000
>  (1 row)
>
>
>  real    1m9.875s
>  user    0m0.000s
>  sys     0m0.004s
>  [llonergan@modena2 IVP]$ !du
>  du -sk dgtestdb/base
>  17021260        dgtestdb/base
>
>  Summary:
>
>  <cough, cough> OK – you can get more I/O bandwidth out of the current I/O
> path for sequential scan if you tune the filesystem for large readahead.
> This is a cheap alternative to overhauling the executor to use asynch I/O.
>
>  Still, there is a CPU limit here – this is not I/O bound, it is CPU limited
> as evidenced by the sensitivity to readahead settings.   If the filesystem
> could do 1GB/s, you wouldn't go any faster than 244MB/s.
>
>  - Luke

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Vivek,

On 11/18/05 8:05 AM, "Vivek Khera" <vivek@khera.org> wrote:

> I can tell you from direct personal experience that you're just plain
> wrong.
>
> up with my load on a 60+ GB database.  The Dual opteron system has

I’m always surprised by what passes for a large database.  The poster is talking about 5,000GB, or almost 100 times the data you have.

Post your I/O numbers on sequential scan.  Sequential scan is critical for Decision Support / Data Warehousing.

- Luke

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Vivek,

On 11/18/05 8:07 AM, "Vivek Khera" <vivek@khera.org> wrote:


> On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote:
>
>> Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings.   If the filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.
>
> Yeah, and mysql would probably be faster on your trivial queries.  Try concurrent large joins and updates and see which system is faster.

That’s what we do to make a living.  And it’s Oracle that’s a lot faster, because they implemented a much tighter, optimized I/O path to disk than Postgres.

Since you asked, we bought the 5 systems as a cluster – and with Bizgres MPP we get close to 400MB/s per machine on complex queries.

- Luke  

Re: Hardware/OS recommendations for large databases (

From
Dave Cramer
Date:
Luke,

Interesting numbers. I'm a little concerned about the use of blockdev --setra 16384. If I understand this correctly, it assumes that the table is contiguous on the disk, does it not?


Dave
On 18-Nov-05, at 10:13 AM, Luke Lonergan wrote:

Dave,

On 11/18/05 5:00 AM, "Dave Cramer" <pg@fastcrypt.com> wrote:
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
>
> This should draw some interesting posts.

Part 2: The answer

System A:
This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.

On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that’s 66 MB/s.  Not the efficiency I’d hoped for from the onboard SATA controller; I would have expected to get 85% of the 100MB/s raw read performance.

So that’s $1,200 / 66 MB/s (without adjusting for 2003 price versus now) = 18.2 $/MB/s

Raw data:
[llonergan@kite4 IVP]$ cat scan.sh
#!/bin/bash

time psql -c "select count(*) from ivp.bigtable1" dgtestdb
[llonergan@kite4 IVP]$ cat sysout1
  count   
----------
 10000000
(1 row)


real    0m32.565s
user    0m0.002s
sys     0m0.003s

Size of the table data:
[llonergan@kite4 IVP]$ du -sk dgtestdb/base
2121648 dgtestdb/base

System B:
This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead.  It’s running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel.

Same test as above, but with 17GB of data takes 69.7 seconds to scan (!)  That’s 244.2MB/s, which is obviously double my earlier point of 110-120MB/s.  This system is running with a 16MB Linux readahead setting, let’s try it with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s.

So, using the tuned setting of “blockdev --setra 16384” we get $6,000 / 244MB/s = 24.6 $/MB/s
If we use the default Linux setting it’s 2.5x worse.

Raw data:
[llonergan@modena2 IVP]$ cat scan.sh
#!/bin/bash

time psql -c "select count(*) from ivp.bigtable1" dgtestdb
[llonergan@modena2 IVP]$ cat sysout3
  count   
----------
 80000000
(1 row)


real    1m9.875s
user    0m0.000s
sys     0m0.004s
[llonergan@modena2 IVP]$ !du
du -sk dgtestdb/base
17021260        dgtestdb/base

Summary:

<cough, cough> OK – you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead.  This is a cheap alternative to overhauling the executor to use asynch I/O.

Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings.   If the filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.

- Luke

Re: Hardware/OS recommendations for large databases

From
Ron
Date:
Breaking the ~120MBps pg IO ceiling by any means
is an important result.  Particularly when you
get a ~2x improvement.  I'm curious how far we
can get using simple approaches like this.

At 10:13 AM 11/18/2005, Luke Lonergan wrote:
>Dave,
>
>On 11/18/05 5:00 AM, "Dave Cramer" <pg@fastcrypt.com> wrote:
> >
> > Now there's an interesting line drawn in the sand. I presume you have
> > numbers to back this up ?
> >
> > This should draw some interesting posts.
>
>Part 2: The answer
>
>System A:
>This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.
>
>On a single table with 15 columns (the Bizgres
>IVP) at a size double memory (2.12GB), Postgres
>8.0.3 with Bizgres enhancements takes 32 seconds
>to scan the table: that’s 66 MB/s.  Not the
>efficiency I’d hope from the onboard SATA
>controller that I’d like, I would have expected
>to get 85% of the 100MB/s raw read performance.
Have you tried the large read ahead trick with
this system?  It would be interesting to see how
much it would help.  It might even be worth doing
the experiment at each of [default, 2x default,
4x default, 8x default, etc] read ahead until
either a) you run out of resources to support the
desired read ahead, or b) performance levels
off.  I can imagine the results being very enlightening.
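
A rough sketch of that sweep as a loop (the device path is an assumption;
the table and database names are the ones from the posted test):

#!/bin/bash
# Sweep readahead upward, timing the same scan at each setting.
for ra in 256 512 1024 2048 4096 8192 16384; do
    blockdev --setra $ra /dev/sda
    echo "readahead = $ra sectors"
    time psql -c "SELECT count(*) FROM ivp.bigtable1" dgtestdb
done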


>System B:
>This system is running an XFS filesystem, and
>has been tuned to use very large (16MB)
>readahead.  It’s running the Centos 4.1 distro,
>which uses a Linux 2.6.9 kernel.
>
>Same test as above, but with 17GB of data takes
>69.7 seconds to scan (!)  That’s 244.2MB/s,
>which is obviously double my earlier point of
>110-120MB/s.  This system is running with a 16MB
>Linux readahead setting, let’s try it with the
>default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s.
The above experiment would seem useful here as well.


>Summary:
>
><cough, cough> OK – you can get more I/O
>bandwidth out of the current I/O path for
>sequential scan if you tune the filesystem for
>large readahead.  This is a cheap alternative to
>overhauling the executor to use asynch I/O.
>
>Still, there is a CPU limit here – this is not
>I/O bound, it is CPU limited as evidenced by the
>sensitivity to readahead settings.   If the
>filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.
>
>- Luke

I respect your honesty in reporting results that
were different from your expectations or
previously taken stance.  Alan Stange's comment
re: the use of direct IO, along with your comments
re: async IO and mem copies plus the results of
these experiments, could very well point us
directly at how to most easily solve pg's CPU boundness during IO.

[HACKERS] are you watching this?

Ron



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:

>     (mass snippage)
>     time psql -c "select count(*) from ivp.bigtable1" dgtestdb
>     [llonergan@modena2 IVP]$ cat sysout3
>       count
>     ----------
>      80000000
>     (1 row)
>
>
>     real    1m9.875s
>     user    0m0.000s
>     sys     0m0.004s
>     [llonergan@modena2 IVP]$ !du
>     du -sk dgtestdb/base
>     17021260        dgtestdb/base
>
>
> Summary:
>
> <cough, cough> OK – you can get more I/O bandwidth out of the current
> I/O path for sequential scan if you tune the filesystem for large
> readahead.  This is a cheap alternative to overhauling the executor to
> use asynch I/O.
>
> Still, there is a CPU limit here – this is not I/O bound, it is CPU
> limited as evidenced by the sensitivity to readahead settings.   If the
> filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.
>
>

Luke,

Interesting - but possibly only representative of a workload consisting
entirely of one executor doing "SELECT ... FROM my_single_table".

If you alter this to involve more complex joins (e.g. a 4-way star) and
maybe add a small number of concurrent executors too - is it still the
case?
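
For concreteness, a 4-way star here means a query shaped roughly like the
following (table and column names are hypothetical):

psql dgtestdb <<'EOF'
-- One fact table joined to four dimension tables, with a grouped
-- aggregate: several scans and joins competing for CPU at once.
SELECT d1.region, d2.product, sum(f.amount)
FROM   fact f
JOIN   dim_customer d1 ON f.cust_id  = d1.cust_id
JOIN   dim_product  d2 ON f.prod_id  = d2.prod_id
JOIN   dim_store    d3 ON f.store_id = d3.store_id
JOIN   dim_date     d4 ON f.date_id  = d4.date_id
WHERE  d4.year = 2005
GROUP  BY d1.region, d2.product;
EOF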

Cheers

Mark

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Mark,

On 11/18/05 3:46 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> If you alter this to involve more complex joins (e.g 4. way star) and
> (maybe add a small number of concurrent executors too) - is it still the
> case?

4-way star, same result, that's part of my point.  With Bizgres MPP, the
4-way star uses 4 concurrent scanners, though not all are active all the
time.  And that's per segment instance - we normally use one segment
instance per CPU, so our concurrency is NCPUs plus some.

The trick is the "small number of concurrent executors" part.  The only way
to get this with normal postgres is to have concurrent users, and normally
they are doing different things, scanning different parts of the disk.
These are competing things, and for concurrency enhancement something like
"sync scan" would be an effective optimization.

But in reporting, business analytics and warehousing in general, there are
reports that take hours to run.  If you can knock that down by factors of 10
using parallelism, it's a big win.  That's the reason that Teradata did $1.5
Billion in business last year.

More importantly - that's the kind of work that everyone using internet data
for analytics wants right now.

- Luke



Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Mark,

On 11/18/05 3:46 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> If you alter this to involve more complex joins (e.g 4. way star) and
> (maybe add a small number of concurrent executors too) - is it still the
> case?

I may not have listened to you - are you asking about whether the readahead works for these cases?

I’ll be running some massive TPC-H benchmarks on these machines soon – we’ll see then.

- Luke

Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:
> Mark,
>
> On 11/18/05 3:46 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
>
>     If you alter this to involve more complex joins (e.g 4. way star) and
>     (maybe add a small number of concurrent executors too) - is it still the
>     case?
>
>
> I may not have listened to you - are you asking about whether the
> readahead works for these cases?
>
> I’ll be running some massive TPC-H benchmarks on these machines soon –
> we’ll see then.


That too, meaning the business of 1 executor random reading a given
relation file whilst another is sequentially scanning (some other) part
of it....

Cheers

Mark

Re: Hardware/OS recommendations for large databases (

From
"Luke Lonergan"
Date:
Mark,

On 11/18/05 6:27 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> That too, meaning the business of 1 executor random reading a given
> relation file whilst another is sequentially scanning (some other) part
> of it....

I think it should actually improve things - each I/O will read 16MB into the
I/O cache, then the next scanner will seek for 10ms to get the next 16MB
into cache, etc.  It should minimize the seek/data ratio nicely.  As long as
the tables are contiguous it should rock and roll.

- Luke



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Luke Lonergan wrote:
> Mark,
>
> On 11/18/05 3:46 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
>
>
>>If you alter this to involve more complex joins (e.g 4. way star) and
>>(maybe add a small number of concurrent executors too) - is it still the
>>case?
>
>
> 4-way star, same result, that's part of my point.  With Bizgres MPP, the
> 4-way star uses 4 concurrent scanners, though not all are active all the
> time.  And that's per segment instance - we normally use one segment
> instance per CPU, so our concurrency is NCPUs plus some.
>

Luke - I don't think I was clear enough about what I was asking, sorry.

I added the more "complex joins" comment because:

- I am happy that seqscan is cpu bound after ~110MB/s (It's cpu bound on
my old P3 system even earlier than that....)
- I am curious if the *other* access methods (indexscan, nested loop,
hash, merge, bitmap) also suffer the same fate.

I'm guessing from your comment that you have tested this too, but I
think it's worth clarifying!
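
One rough way to check is to time the same query with individual plan
types switched off one at a time; a sketch (table names are hypothetical,
the enable_* settings are the standard planner switches):

psql dgtestdb <<'EOF'
EXPLAIN ANALYZE
  SELECT count(*) FROM fact f JOIN dim_date d ON f.date_id = d.date_id;

SET enable_hashjoin = off;   -- push the planner toward merge or nested loop
EXPLAIN ANALYZE
  SELECT count(*) FROM fact f JOIN dim_date d ON f.date_id = d.date_id;
RESET enable_hashjoin;

SET enable_seqscan = off;    -- push the planner toward index/bitmap scans
EXPLAIN ANALYZE
  SELECT count(*) FROM fact f JOIN dim_date d ON f.date_id = d.date_id;
RESET enable_seqscan;
EOF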

With respect to Bizgres MPP, scan parallelism is a great addition...
very nice! (BTW - is that in 0.8, or are we talking a new product variant?)

regards

Mark



Re: Hardware/OS recommendations for large databases (

From
Mark Kirkwood
Date:
Mark Kirkwood wrote:

>
> - I am happy that seqscan is cpu bound after ~110M/s (It's cpu bound on
> my old P3 system even earlier than that....)

Ahem - after reading Alan's postings I am not so sure; ISTM that there
is some more investigation required here too :-).