Thread: Re: explain hashAggregate

Re: explain hashAggregate

From
martial.bizel@free.fr
Date:
Good morning,

I try to understand how optimizer uses HashAggregate instead of GroupAggregate
and I want to know what is exactly this two functionnality (benefits
/inconvenients)

In my case, I've this explain plan.
-----------------------
 Nested Loop  (cost=2451676.23..2454714.73 rows=1001 width=34)
   ->  Subquery Scan "day"  (cost=2451676.23..2451688.73 rows=1000 width=16)
         ->  Limit  (cost=2451676.23..2451678.73 rows=1000 width=12)
               ->  Sort  (cost=2451676.23..2451684.63 rows=3357 width=12)
                     Sort Key: sum(occurence)
                     ->  HashAggregate  (cost=2451471.24..2451479.63 rows=3357
width=12)
                           ->  Index Scan using test_date on
queries_detail_statistics  (cost=0.00..2449570.55 rows=380138 width=12)
                                 Index Cond: ((date >= '2006-01-01'::date) AND
(date <= '2006-01-30'::date))
                                 Filter: (((portal)::text = '1'::text) OR
((portal)::text = '2'::text))
   ->  Index Scan using query_string_pkey on query_string  (cost=0.00..3.01
rows=1 width=34)
         Index Cond: ("outer".query = query_string.id)
----------------------------
How to get necessary memory RAM for this explain plan ?



thanks a lot


Reliability recommendations

From
"Jeremy Haile"
Date:
We are a small company looking to put together the most cost effective
solution for our production database environment.  Currently in
production Postgres 8.1 is running on this machine:

Dell 2850
2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
4 GB DDR2 400 Mhz
2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
4 x 146 GB 10K SCSI RAID 10 (for postgres data)
Perc4ei controller

The above is a standard Dell box with nothing added or modified beyond
the options available directly through Dell. We had a bad processor last
week that effectively put us down for an entire weekend. Though it was
the web server that failed, the experience has caused us to step back
and spend time coming up with a more reliable/fail-safe solution that
can reduce downtime.

Our load won't be substantial so extreme performance and load balancing
are not huge concerns. We are looking for good performance, at a good
price, configured in the most redundant, high availability manner
possible. Availability is the biggest priority.

I sent our scenario to our sales team at Dell and they came back with
all manner of SAN, DAS, and configuration costing as much as $50k.

We have the budget to purchase 2-3 additional machines along the lines
of the one listed above. As a startup with a limited budget, what would
this list suggest as options for clustering/replication or setting our
database up well in general?

Re: Reliability recommendations

From
"Craig A. James"
Date:
Jeremy Haile wrote:
> We are a small company looking to put together the most cost effective
> solution for our production database environment.  Currently in
> production Postgres 8.1 is running on this machine:
>
> Dell 2850
> 2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
> 4 GB DDR2 400 Mhz
> 2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
> 4 x 146 GB 10K SCSI RAID 10 (for postgres data)
> Perc4ei controller
>
> ... I sent our scenario to our sales team at Dell and they came back with
> all manner of SAN, DAS, and configuration costing as much as $50k.

Given what you've told us, a $50K machine is not appropriate.

Instead, think about a simple system with several clones of the database and a load-balancing web server, even if one
machinecould handle your load.  If a machine goes down, the load balancer automatically switches to the other. 

Look at the MTBF figures of two hypothetical machines:

 Machine 1: Costs $2,000, MTBF of 2 years, takes two days to fix on average.
 Machine 2: Costs $50,000, MTBF of 100 years (!), takes one hour to fix on average.

Now go out and buy three of the $2,000 machines.  Use a load-balancer front end web server that can send requests
round-robinfashion to a "server farm".  Clone your database.  In fact, clone the load-balancer too so that all three
machineshave all software and databases installed.  Call these A, B, and C machines. 

At any given time, your Machine A is your web front end, serving requests to databases on A, B and C.  If B or C goes
down,no problem - the system keeps running.  If A goes down, you switch the IP address of B or C and make it your web
frontend, and you're back in business in a few minutes. 

Now compare the reliability -- in order for this system to be disabled, you'd have to have ALL THREE computers fail at
thesame time.  With the MTBF and repair time of two days, each machine has a 99.726% uptime.  The "MTBF", that is, the
expectedtime until all three machines will fail simultaneously, is well over 100,000 years!  Of course, this is silly,
machinesdon't last that long, but it illustrates the point:  Redundancy is beats reliability (which is why RAID is so
useful). 

All for $6,000.

Craig

Re: Reliability recommendations

From
Mark Lewis
Date:
Machine 1: $2000
Machine 2: $2000
Machine 3: $2000

Knowing how to rig them together and maintain them in a fully fault-
tolerant way: priceless.


(Sorry for the off-topic post, I couldn't resist).

-- Mark Lewis

On Wed, 2006-02-15 at 09:19 -0800, Craig A. James wrote:
> Jeremy Haile wrote:
> > We are a small company looking to put together the most cost effective
> > solution for our production database environment.  Currently in
> > production Postgres 8.1 is running on this machine:
> >
> > Dell 2850
> > 2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
> > 4 GB DDR2 400 Mhz
> > 2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
> > 4 x 146 GB 10K SCSI RAID 10 (for postgres data)
> > Perc4ei controller
> >
> > ... I sent our scenario to our sales team at Dell and they came back with
> > all manner of SAN, DAS, and configuration costing as much as $50k.
>
> Given what you've told us, a $50K machine is not appropriate.
>
> Instead, think about a simple system with several clones of the database and a load-balancing web server, even if one
machinecould handle your load.  If a machine goes down, the load balancer automatically switches to the other. 
>
> Look at the MTBF figures of two hypothetical machines:
>
>  Machine 1: Costs $2,000, MTBF of 2 years, takes two days to fix on average.
>  Machine 2: Costs $50,000, MTBF of 100 years (!), takes one hour to fix on average.
>
> Now go out and buy three of the $2,000 machines.  Use a load-balancer front end web server that can send requests
round-robinfashion to a "server farm".  Clone your database.  In fact, clone the load-balancer too so that all three
machineshave all software and databases installed.  Call these A, B, and C machines. 
>
> At any given time, your Machine A is your web front end, serving requests to databases on A, B and C.  If B or C goes
down,no problem - the system keeps running.  If A goes down, you switch the IP address of B or C and make it your web
frontend, and you're back in business in a few minutes. 
>
> Now compare the reliability -- in order for this system to be disabled, you'd have to have ALL THREE computers fail
atthe same time.  With the MTBF and repair time of two days, each machine has a 99.726% uptime.  The "MTBF", that is,
theexpected time until all three machines will fail simultaneously, is well over 100,000 years!  Of course, this is
silly,machines don't last that long, but it illustrates the point:  Redundancy is beats reliability (which is why RAID
isso useful).  
>
> All for $6,000.
>
> Craig
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

Re: Reliability recommendations

From
"Joshua D. Drake"
Date:
Jeremy Haile wrote:
> We are a small company looking to put together the most cost effective
> solution for our production database environment.  Currently in
> production Postgres 8.1 is running on this machine:
>
> Dell 2850
> 2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
> 4 GB DDR2 400 Mhz
> 2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
> 4 x 146 GB 10K SCSI RAID 10 (for postgres data)
> Perc4ei controller
>
> The above is a standard Dell box with nothing added or modified beyond
> the options available directly through Dell.
You should probably review the archives for PostgreSQL user
experience with Dell's before you purchase one.

> I sent our scenario to our sales team at Dell and they came back with
> all manner of SAN, DAS, and configuration costing as much as $50k.
>
HAHAHAHAHA.... Don't do that. Dell is making the assumption
you won't do your homework. Make sure you cross quote with IBM,
Compaq and Penguin Computing...

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/


Re: Reliability recommendations

From
Greg Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> Jeremy Haile wrote:
> > We are a small company looking to put together the most cost effective
> > solution for our production database environment.  Currently in
> > production Postgres 8.1 is running on this machine:
> >
> > Dell 2850
> > 2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
> > 4 GB DDR2 400 Mhz
> > 2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
> > 4 x 146 GB 10K SCSI RAID 10 (for postgres data)
> > Perc4ei controller

You don't say how this box is performing. There's no way to give
recommendations in a vacuum. Some users really need $50k boxes and others
(most) don't. This looks like a pretty good setup for Postgres and you would
have to be pushing things pretty hard to need much more.

That said some users have reported problems with Dell's raid controllers even
when the same brand's regular controllers worked well. That's what Joshua is
referring to.

> > The above is a standard Dell box with nothing added or modified beyond
> > the options available directly through Dell.
>
> You should probably review the archives for PostgreSQL user
> experience with Dell's before you purchase one.


--
greg

Re: Reliability recommendations

From
Christopher Browne
Date:
After takin a swig o' Arrakan spice grog, jd@commandprompt.com ("Joshua D. Drake") belched out:
> Jeremy Haile wrote:
>> We are a small company looking to put together the most cost effective
>> solution for our production database environment.  Currently in
>> production Postgres 8.1 is running on this machine:
>>
>> Dell 2850
>> 2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
>> 4 GB DDR2 400 Mhz
>> 2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
>> 4 x 146 GB 10K SCSI RAID 10 (for postgres data)
>> Perc4ei controller
>>
>> The above is a standard Dell box with nothing added or modified beyond
>> the options available directly through Dell.

> You should probably review the archives for PostgreSQL user
> experience with Dell's before you purchase one.

Hear, hear!  We found Dell servers were big-time underperformers.

Generic hardware put together with generally the same brand names of
components (e.g. - for SCSI controllers and such) would generally play
much better.

For the cheapo desktop boxes they obviously have to buy the "cheapest
hardware available this week;" it sure seems as though they engage in
the same sort of thing with the "server class" hardware.

I don't think anyone has been able to forcibly point out any
completely precise shortcoming; just that they underperform what the
specs suggest they ought to be able to provide.

>> I sent our scenario to our sales team at Dell and they came back
>> with all manner of SAN, DAS, and configuration costing as much as
>> $50k.

> HAHAHAHAHA.... Don't do that. Dell is making the assumption you
> won't do your homework. Make sure you cross quote with IBM, Compaq
> and Penguin Computing...

Indeed.
--
(format nil "~S@~S" "cbbrowne" "gmail.com")
http://linuxdatabases.info/info/rdbms.html
Rules of  the Evil  Overlord #141.  "As  an alternative to  not having
children, I  will have _lots_ of  children.  My sons will  be too busy
jockeying for  position to  ever be a  real threat, and  the daughters
will all sabotage each other's attempts to win the hero."
<http://www.eviloverlord.com/>

Re: Reliability recommendations

From
Scott Marlowe
Date:
On Wed, 2006-02-15 at 12:44, Christopher Browne wrote:
> After takin a swig o' Arrakan spice grog, jd@commandprompt.com ("Joshua D. Drake") belched out:

> > You should probably review the archives for PostgreSQL user
> > experience with Dell's before you purchase one.
>
> Hear, hear!  We found Dell servers were big-time underperformers.
>
> Generic hardware put together with generally the same brand names of
> components (e.g. - for SCSI controllers and such) would generally play
> much better.

My experience has been that:

A:  Their rebranded LSI and Adaptec RAID controllers underperform.
B:  Their BIOS updates for said cards and the mobos for the 26xx series
comes in a format that requires you to have a friggin bootable DOS
floppy.  What is this, 1987???
C:  They use poorly performing mobo chipsets.

We had a dual P-III-750 with a REAL LSI RAID card and an intel mobo, and
replaced it with a dual P-IV 2800 Dell 2600 with twice the RAM.  As a
database server the P-III-750 was easily a match for the new dell, and
in some ways (i/o) outran it.

We also had a dual 2400 PIV Intel generic box, and it spanked the Dell
handily at everything, was easier to work on, the parts cost less, and
it used bog standard RAID cards and such.  I would highly recommend the
Intel Generic hardware over Dell any day.

Re: Reliability recommendations

From
Ron
Date:
At 11:21 AM 2/15/2006, Jeremy Haile wrote:
>We are a small company looking to put together the most cost effective
>solution for our production database environment.  Currently in
>production Postgres 8.1 is running on this machine:
>
>Dell 2850
>2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
>4 GB DDR2 400 Mhz
>2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
>4 x 146 GB 10K SCSI RAID 10 (for postgres data)
>Perc4ei controller
>
>The above is a standard Dell box with nothing added or modified beyond
>the options available directly through Dell. We had a bad processor last
>week that effectively put us down for an entire weekend. Though it was
>the web server that failed, the experience has caused us to step back
>and spend time coming up with a more reliable/fail-safe solution that
>can reduce downtime.
>
>Our load won't be substantial so extreme performance and load balancing
>are not huge concerns. We are looking for good performance, at a good
>price, configured in the most redundant, high availability manner
>possible. Availability is the biggest priority.
>
>I sent our scenario to our sales team at Dell and they came back with
>all manner of SAN, DAS, and configuration costing as much as $50k.
>
>We have the budget to purchase 2-3 additional machines along the lines
>of the one listed above. As a startup with a limited budget, what would
>this list suggest as options for clustering/replication or setting our
>database up well in general?

1= Tell Dell "Thanks but no thanks." and do not buy any more
equipment from them.  Their value per $$ is less than other options
available to you.

2= The current best bang for the buck HW (and in many cases, best
performing as well) for pg:
   a= AMD K8 and K9 (dual core) CPUs.  Particularly the A64 X2 3800+
when getting the most for your $$ matters a lot
        pg gets a nice performance boost from running in 64b.
   b= Decent Kx server boards are available from Gigabyte, IWill,
MSI, Supermicro, and Tyan to name a few.
        IWill has a 2P 16 DIMM slot board that is particularly nice
for a server that needs lots of RAM.
   c= Don't bother with SCSI or FC HD's unless you are doing the most
demanding kind of OLTP.  SATA II HD's provide better value.
   d= HW RAID controllers are only worth it in certain
scenarios.  Using RAID 5 almost always means you should use a HW RAID
controller.
   e= The only HW RAID controllers worth the $$ for you are 3ware
Escalade 9550SX's and Areca ARC-11xx or ARC-12xx's.
   *For the vast majority of throughput situations, the ARC-1xxx's
with >= 1GB of battery backed WB cache are the best value*
   f= 1GB RAM sticks are cheap enough and provide enough value that
you should max out any system you get with them.
   g= for +high+ speed fail over, Chelsio and others are now making
PCI-X and PCI-E 10GbE NICs at reasonable prices.
The above should serve as a good "pick list" for the components of
any servers you need.

3= The most economically sound HW and SW architecture that best suits
your performance and reliability needs is context dependent to your
specific circumstances.


Where are you located?
Ron




Re: Reliability recommendations

From
"Jeremy Haile"
Date:
Thanks for everyone's feedback.  I will definitely take the hardware
comments into consideration when purchasing future hardware.  I am
located in Atlanta, GA.  If Dell has such a bad reputation with this
list, does anyone have good vendor recommendations?

Although most of the responses were hardware-oriented (which was
probably my fault for not clearly stating my question), I am mostly
interested in replication/clustering ways of solving the issue.  My
example of Dell quoting us $50k for a SAN was meant to sound ridiculous
and is definitely not something we are considering.

What we are really after is a good clustering or replication solution
where we can run PostgreSQL on a small set of servers and have failover
capabilities.  While RAID is great, our last failure was a CPU failure
so a multi-server approach is something we want.  Does anyone have any
recommendations as far as a clustering/replication solutions, regardless
of hardware?  I know there are several open-source and commercial
postgres replication solutions - any good or bad experiences?  Also, any
opinions on shared storage and clustering vs separate internal storage.

Since performance is not our current bottleneck, I would imagine
Master->Slave replication would be sufficient, although performance
gains are always welcome.  I don't have much experience with setting
PostgreSQL in a replicated or clustered manner, so anything to point me
in the right direction both hardware and software wise would be
appreciated!

Thanks for all of the responses!

On Wed, 15 Feb 2006 14:53:28 -0500, "Ron" <rjpeace@earthlink.net> said:
> At 11:21 AM 2/15/2006, Jeremy Haile wrote:
> >We are a small company looking to put together the most cost effective
> >solution for our production database environment.  Currently in
> >production Postgres 8.1 is running on this machine:
> >
> >Dell 2850
> >2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
> >4 GB DDR2 400 Mhz
> >2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
> >4 x 146 GB 10K SCSI RAID 10 (for postgres data)
> >Perc4ei controller
> >
> >The above is a standard Dell box with nothing added or modified beyond
> >the options available directly through Dell. We had a bad processor last
> >week that effectively put us down for an entire weekend. Though it was
> >the web server that failed, the experience has caused us to step back
> >and spend time coming up with a more reliable/fail-safe solution that
> >can reduce downtime.
> >
> >Our load won't be substantial so extreme performance and load balancing
> >are not huge concerns. We are looking for good performance, at a good
> >price, configured in the most redundant, high availability manner
> >possible. Availability is the biggest priority.
> >
> >I sent our scenario to our sales team at Dell and they came back with
> >all manner of SAN, DAS, and configuration costing as much as $50k.
> >
> >We have the budget to purchase 2-3 additional machines along the lines
> >of the one listed above. As a startup with a limited budget, what would
> >this list suggest as options for clustering/replication or setting our
> >database up well in general?
>
> 1= Tell Dell "Thanks but no thanks." and do not buy any more
> equipment from them.  Their value per $$ is less than other options
> available to you.
>
> 2= The current best bang for the buck HW (and in many cases, best
> performing as well) for pg:
>    a= AMD K8 and K9 (dual core) CPUs.  Particularly the A64 X2 3800+
> when getting the most for your $$ matters a lot
>         pg gets a nice performance boost from running in 64b.
>    b= Decent Kx server boards are available from Gigabyte, IWill,
> MSI, Supermicro, and Tyan to name a few.
>         IWill has a 2P 16 DIMM slot board that is particularly nice
> for a server that needs lots of RAM.
>    c= Don't bother with SCSI or FC HD's unless you are doing the most
> demanding kind of OLTP.  SATA II HD's provide better value.
>    d= HW RAID controllers are only worth it in certain
> scenarios.  Using RAID 5 almost always means you should use a HW RAID
> controller.
>    e= The only HW RAID controllers worth the $$ for you are 3ware
> Escalade 9550SX's and Areca ARC-11xx or ARC-12xx's.
>    *For the vast majority of throughput situations, the ARC-1xxx's
> with >= 1GB of battery backed WB cache are the best value*
>    f= 1GB RAM sticks are cheap enough and provide enough value that
> you should max out any system you get with them.
>    g= for +high+ speed fail over, Chelsio and others are now making
> PCI-X and PCI-E 10GbE NICs at reasonable prices.
> The above should serve as a good "pick list" for the components of
> any servers you need.
>
> 3= The most economically sound HW and SW architecture that best suits
> your performance and reliability needs is context dependent to your
> specific circumstances.
>
>
> Where are you located?
> Ron
>
>
>

Re: Reliability recommendations

From
Josh Rovero
Date:
Jeremy Haile wrote:
> Thanks for everyone's feedback.  I will definitely take the hardware
> comments into consideration when purchasing future hardware.  I am
> located in Atlanta, GA.  If Dell has such a bad reputation with this
> list, does anyone have good vendor recommendations?
>
I can recommend Penguin Computing (even our windows weenies like them),
ASA, and HP Proliant.  AMD Opteron *is* the way to go.



Re: Reliability recommendations

From
Bruce Momjian
Date:
Christopher Browne wrote:
> After takin a swig o' Arrakan spice grog, jd@commandprompt.com ("Joshua D. Drake") belched out:
> > Jeremy Haile wrote:
> >> We are a small company looking to put together the most cost effective
> >> solution for our production database environment.  Currently in
> >> production Postgres 8.1 is running on this machine:
> >>
> >> Dell 2850
> >> 2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
> >> 4 GB DDR2 400 Mhz
> >> 2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
> >> 4 x 146 GB 10K SCSI RAID 10 (for postgres data)
> >> Perc4ei controller
> >>
> >> The above is a standard Dell box with nothing added or modified beyond
> >> the options available directly through Dell.
>
> > You should probably review the archives for PostgreSQL user
> > experience with Dell's before you purchase one.
>
> Hear, hear!  We found Dell servers were big-time underperformers.
>
> Generic hardware put together with generally the same brand names of
> components (e.g. - for SCSI controllers and such) would generally play
> much better.
>
> For the cheapo desktop boxes they obviously have to buy the "cheapest
> hardware available this week;" it sure seems as though they engage in
> the same sort of thing with the "server class" hardware.
>
> I don't think anyone has been able to forcibly point out any
> completely precise shortcoming; just that they underperform what the
> specs suggest they ought to be able to provide.

Dell often says part X is included, but part X is not the exact same as
part X sold by the original manufacturer.  To hit a specific price
point, Dell is willing to strip thing out of commodity hardware, and
often does so even when performance suffers.  For many people, this is
unacceptable.

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Reliability recommendations

From
"Craig A. James"
Date:
Bruce Momjian wrote:
> Dell often says part X is included, but part X is not the exact same as
> part X sold by the original manufacturer.  To hit a specific price
> point, Dell is willing to strip thing out of commodity hardware, and
> often does so even when performance suffers.  For many people, this is
> unacceptable.

I find this strains credibility, that this major manufacturer of PC's would do something deceptive that hurts
performance,when it would be easily detected and widely reported.  Can anyone cite a specific instances where this has
happened? Such as, "I bought Dell model XYZ, which was advertised to have these parts and these specs, but in fact had
theseother parts and here are the actual specs." 

Dell seems to take quite a beating in this forum, and I don't recall seeing any other manufacturer blasted this way.
Isit that they are deceptive, or simply that their "servers" are designed to be office servers, not database servers? 

There's nothing wrong with Dell designing their servers for a different market than ours; they need to go for the
profits,and that may not include us.  But it's not fair for us to claim Dell is being deceptive unless we have concrete
evidence.

Craig


Re: Reliability recommendations

From
"Luke Lonergan"
Date:
Bruce,

On 2/24/06 6:29 AM, "Bruce Momjian" <pgman@candle.pha.pa.us> wrote:

> Christopher Browne wrote:
>> After takin a swig o' Arrakan spice grog, jd@commandprompt.com ("Joshua D.
>> Drake") belched out:

Always more fun to read drunken posts :-)

>>>> Dell 2850
>>>> 2 x 3.0 Ghz Xeon 800Mhz FSB 2MB Cache
>>>> 4 GB DDR2 400 Mhz
>>>> 2 x 73 GB 10K SCSI RAID 1 (for xlog and OS)
>>>> 4 x 146 GB 10K SCSI RAID 10 (for postgres data)
>>>> Perc4ei controller

> Dell often says part X is included, but part X is not the exact same as
> part X sold by the original manufacturer.  To hit a specific price
> point, Dell is willing to strip thing out of commodity hardware, and
> often does so even when performance suffers.  For many people, this is
> unacceptable.

I'll register the contrarian's point of view on this: we just had a customer
(an airline) buy the exact machines that Jeremy lists here, including the
Perc4 controller (an LSI MPT RAID controller).

Besides the complete lack of real RAID 10 support, the machines are actually
performing very well in hardware RAID5 mode.  They are running Redhat 4
linux, and we did have to tune the I/O readahead to get the performance we
needed - about 250MB/s on 2 banks of RAID5, not bad at all for only 4 active
disks.

I think the build quality and the fit/finish/features of their chassis were
all pretty good.

So, if you want RAID5, these machines work for me.  The lack of RAID 10
could knock them out of contention for people.

The problem with their RAID10 is actually hidden from view BTW.  You can
configure the controller with a "spanning" and other options that sound like
RAID10, but in actuality it's spanning pairs of RAID1 disks, which does not
provide the performance benefit of RAID10 (see James Thornton's post here:
http://openacs.org/forums/message-view?message_id=178447), the important bit
of which says:

"RAID-10 on PERC 2/SC, 2/DC, 3/SC, 3/DCL, 3/DC, 3/QC, 4/Di, and CERC
ATA100/4ch controllers is implemented as RAID Level 1-Concatenated. RAID-1
Concatenated is a RAID-1 array that spans across more than a single pair of
array disks. This combines the advantages of concatenation with the
redundancy of RAID-1. No striping is involved in this RAID type. Also,
RAID-1 Concatenated can be implemented on hardware that supports only RAID-1
by creating multiple RAID-1 virtual disks, upgrading the virtual disks to
dynamic disks, and then using spanning to concatenate all of the RAID-1
virtual disks into one large dynamic volume. In a concatenation (spanned
volume), when an array disk in a concatenated or spanned volume fails, the
entire volume becomes unavailable."

- Luke



Re: Reliability recommendations

From
"Luke Lonergan"
Date:
Bruce,

On 2/24/06 7:14 AM, "Luke Lonergan" <llonergan@greenplum.com> wrote:

> So, if you want RAID5, these machines work for me.  The lack of RAID 10
> could knock them out of contention for people.

Sorry in advance for the double post, but there's some more information on
this, which altogether demonstrates why people get frustrated with Dell IMO.

Later on in the article I cited, the same person who listed Dell's technical
description of their strange, not really RAID10 support, he mentions that he
called LSI directly to find out the real scoop.  He says that they (LSI)
told him that the LSI controller does support real RAID10, so he concluded
that the Dell could do it too.

However, Dell's documentation seems unambiguous to me, and matches our
direct experience.  Also, more online documentation from Dell reinforces
this.  See their definition of both RAID10 and spanning in this
backgrounder:
http://support.dell.com/support/edocs/storage/RAID/RAIDbk0.pdf.  Seems that
they are clear about what striping and spanning do, and what RAID10 does,
making their delineation of the standard RAID10 support versus what some of
their controllers do (all of the PERC 4 series) pretty clear.

So, I'd conclude at this point that Dell seems to have implemented a RAID
BIOS that does not allow true RAID10 or RAID50 on their embedded LSI
adapters, which could otherwise support RAID10/50.  They have done this
intentionally, for some reason unknown to me and it seems to some people at
LSI as well.

- Luke



Re: Reliability recommendations

From
Vivek Khera
Date:
On Feb 24, 2006, at 9:29 AM, Bruce Momjian wrote:

> Dell often says part X is included, but part X is not the exact
> same as
> part X sold by the original manufacturer.  To hit a specific price
> point, Dell is willing to strip thing out of commodity hardware, and
> often does so even when performance suffers.  For many people, this is
> unacceptable.

The last dell box I bought, a PE1850, came with a PERC 4e/Si card,
which I believe is the same as the card the OP was looking at.  It is
very fast in RAID1 with two U320 disks.

For real DB work, I'd look more to a dual channel card and have 1/2
of each mirror pair on opposing channels.  Dell can configure that
for you, I'm sure.

I think the well tossed-around notion of Dells being underperforming
needs to be re-evaluated with the EM64T Xeon based systems.  They are
quite fast.  I haven't put a very large db with extreme loads on any
of these systems, but the simple benchmarking I did on them shows
them to be acceptable performers.  The high-end RAID cards they sell
these days do not seem to me to be skimpy.

Now if they'd only get on the Opteron bandwagon....


Attachment

Re: Reliability recommendations

From
Scott Marlowe
Date:
On Fri, 2006-02-24 at 10:27, Vivek Khera wrote:
> On Feb 24, 2006, at 9:29 AM, Bruce Momjian wrote:
>
> > Dell often says part X is included, but part X is not the exact
> > same as
> > part X sold by the original manufacturer.  To hit a specific price
> > point, Dell is willing to strip thing out of commodity hardware, and
> > often does so even when performance suffers.  For many people, this is
> > unacceptable.
>
> The last dell box I bought, a PE1850, came with a PERC 4e/Si card,
> which I believe is the same as the card the OP was looking at.  It is
> very fast in RAID1 with two U320 disks.

My bad experiences were with the 2600 series machines.  We now have some
2800 and they're much better than the 2600/2650s I've used in the past.

That said, I've not tried to do anything too high end with them, like
RAID 1+0 or anything.

My past experience has been that Intel and SuperMicro make much better
"white boxes" you can get from almost any halfway decent local
supplier.  They're faster, more reliable, and the parts are easy to get.

Re: Reliability recommendations

From
Vivek Khera
Date:
On Feb 24, 2006, at 11:32 AM, Scott Marlowe wrote:

> My bad experiences were with the 2600 series machines.  We now have
> some
> 2800 and they're much better than the 2600/2650s I've used in the
> past.

Yes, the 2450 and 2650 were CRAP disk performers.  I haven't any 2850
to compare, just an 1850.


Attachment

Re: Reliability recommendations

From
Scott Marlowe
Date:
On Fri, 2006-02-24 at 10:40, Vivek Khera wrote:
> On Feb 24, 2006, at 11:32 AM, Scott Marlowe wrote:
>
> > My bad experiences were with the 2600 series machines.  We now have
> > some
> > 2800 and they're much better than the 2600/2650s I've used in the
> > past.
>
> Yes, the 2450 and 2650 were CRAP disk performers.  I haven't any 2850
> to compare, just an 1850.
>
And the real problem with the 2650s we have now is that under very heavy
load, they just lock up.  all of them, latest BIOS updates, etc...
nothing makes them stable.  Some things make them a little less dodgy,
but they are never truly reliable.  Something the Intel and Supermicro
white boxes have always been for me.

I don't want support from a company like Dell, I want a reliable
machine.

Re: Reliability recommendations

From
"Joshua D. Drake"
Date:
> I find this strains credibility, that this major manufacturer of PC's
> would do something deceptive that hurts performance, when it would be
> easily detected and widely reported.  Can anyone cite a specific
> instances where this has happened?  Such as, "I bought Dell model XYZ,
> which was advertised to have these parts and these specs, but in fact
> had these other parts and here are the actual specs."
I can :)

Feb 20 07:33:52 master kernel: [4294682.803000]   Vendor: MegaRAID
Model: LD 0 RAID1   51G  Rev: 196T
Feb 20 07:33:52 master kernel: [4294682.803000]   Type:
Direct-Access                      ANSI SCSI revision: 02
Feb 20 07:33:52 master kernel: [4294682.818000] tg3.c:v3.31 (June 8, 2005)
Feb 20 07:33:52 master kernel: [4294682.818000] ACPI: PCI Interrupt
0000:0a:01.0[A] -> GSI 17 (level, low) -> IRQ 17
Feb 20 07:33:52 master kernel: [4294683.510000] eth0: Tigon3
[partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000BaseT Ethernet 00:0f:1f:6e:01:f
Feb 20 07:33:52 master kernel: [4294683.510000] eth0: RXcsums[1]
LinkChgREG[1] MIirq[1] ASF[0] Split[0] WireSpeed[0] TSOcap[0]
Feb 20 07:33:52 master kernel: [4294683.510000] eth0: dma_rwctrl[76ff000f]
Feb 20 07:33:52 master kernel: [4294683.510000] ACPI: PCI Interrupt
0000:0a:02.0[A] -> GSI 18 (level, low) -> IRQ 18
Feb 20 07:33:52 master kernel: [4294684.203000] eth1: Tigon3
[partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000BaseT Ethernet 00:0f:1f:6e:01:f
Feb 20 07:33:52 master kernel: [4294684.203000] eth1: RXcsums[1]
LinkChgREG[1] MIirq[1] ASF[0] Split[0] WireSpeed[0] TSOcap[0]
Feb 20 07:33:52 master kernel: [4294684.203000] eth1: dma_rwctrl[76ff000f]
Feb 20 07:33:52 master kernel: [4294686.228000] SCSI device sda:
106168320 512-byte hdwr sectors (54358 MB)
Feb 20 07:33:52 master kernel: [4294686.228000] SCSI device sda:
106168320 512-byte hdwr sectors (54358 MB)
Feb 20 07:33:52 master kernel: [4294686.228000]
/dev/scsi/host0/bus2/target0/lun0: p1 p2 < p5 > p3
Feb 20 07:33:52 master kernel: [4294686.243000] Attached scsi disk sda
at scsi0, channel 2, id 0, lun 0
Feb 20 07:33:52 master kernel: [4294686.578000] ACPI: CPU0 (power
states: C1[C1])
Feb 20 07:33:52 master kernel: [4294686.655000] Attempting manual resume
Feb 20 07:33:52 master kernel: [4294686.659000] swsusp: Suspend
partition has wrong signature?
Feb 20 07:33:52 master kernel: [4294686.671000] kjournald starting.
Commit interval 5 seconds
Feb 20 07:33:52 master kernel: [4294686.671000] EXT3-fs: mounted
filesystem with ordered data mode.
Feb 20 07:33:52 master kernel: [4294687.327000] md: md driver 0.90.1
MAX_MD_DEVS=256, MD_SB_DISKS=27
Feb 20 07:33:52 master kernel: [4294688.633000] Adding 3903784k swap on
/dev/sda5.  Priority:-1 extents:1
Feb 20 07:33:52 master kernel: [4294688.705000] EXT3 FS on sda3,
internal journal
Feb 20 07:33:52 master kernel: [4294692.533000] lp: driver loaded but no
devices found
Feb 20 07:33:52 master kernel: [4294692.557000] mice: PS/2 mouse device
common for all mice
Feb 20 07:33:52 master kernel: [4294695.263000] device-mapper:
4.4.0-ioctl (2005-01-12) initialised: dm-devel@redhat.com
Feb 20 07:33:52 master kernel: [4294695.479000] kjournald starting.
Commit interval 5 seconds
Feb 20 07:33:52 master kernel: [4294695.479000] EXT3 FS on sda1,
internal journal
Feb 20 07:33:52 master kernel: [4294695.479000] EXT3-fs: mounted
filesystem with ordered data mode.
Feb 20 07:33:52 master kernel: [4294695.805000] Linux agpgart interface
v0.101 (c) Dave Jones
Feb 20 07:33:52 master kernel: [4294696.071000] piix4_smbus
0000:00:0f.0: Found 0000:00:0f.0 device
Feb 20 07:33:52 master kernel: [4294696.584000] pci_hotplug: PCI Hot
Plug PCI Core version: 0.5


processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) MP CPU 2.20GHz
stepping        : 6
cpu MHz         : 2194.056
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 4325.37



This machine... if you run it in raid 5 will only get 7-9 megabytes a
second READ! performance. That is with 6 SCSI drives.
If you run it in RAID 10 you get a more reasonable 50-55 megabytes per
second.

I don't have it sitting in front of me or I would give you an exact
model number.

This machine also uses the serverworks chipset which is known to be a
catastrophe.

Joshua D. Drake




>
> Dell seems to take quite a beating in this forum, and I don't recall
> seeing any other manufacturer blasted this way.  Is it that they are
> deceptive, or simply that their "servers" are designed to be office
> servers, not database servers?
>
> There's nothing wrong with Dell designing their servers for a
> different market than ours; they need to go for the profits, and that
> may not include us.  But it's not fair for us to claim Dell is being
> deceptive unless we have concrete evidence.
>
> Craig
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq


--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/


Re: Reliability recommendations

From
"Craig A. James"
Date:
Joshua D. Drake wrote:
>> I find this strains credibility, that this major manufacturer of PC's
>> would do something deceptive that hurts performance, when it would be
>> easily detected and widely reported.  Can anyone cite a specific
>> instances where this has happened?  Such as, "I bought Dell model XYZ,
>> which was advertised to have these parts and these specs, but in fact
>> had these other parts and here are the actual specs."
>
> I can :)
>
> Feb 20 07:33:52 master kernel: [4294682.803000]   Vendor: MegaRAID
> Model: LD 0 RAID1   51G  Rev: 196T
> --- snip ---
> This machine... if you run it in raid 5 will only get 7-9 megabytes a
> second READ! performance. That is with 6 SCSI drives.
> If you run it in RAID 10 you get a more reasonable 50-55 megabytes per
> second.

But you don't say how this machine was advertised.  Are there components in that list that were not as advertised?  Was
themachine advertised as capable of RAID 5?  Were performance figures published for RAID 5? 

If Dell advertised that the machine could do what you asked, then you're right -- they screwed you.  But if it was
designedfor and advertised to a different market, then I've made my point: People are blaming Dell for something that's
nottheir fault. 

Craig

Re: Reliability recommendations

From
Scott Marlowe
Date:
On Fri, 2006-02-24 at 17:12, Craig A. James wrote:
> Joshua D. Drake wrote:
> >> I find this strains credibility, that this major manufacturer of PC's
> >> would do something deceptive that hurts performance, when it would be
> >> easily detected and widely reported.  Can anyone cite a specific
> >> instances where this has happened?  Such as, "I bought Dell model XYZ,
> >> which was advertised to have these parts and these specs, but in fact
> >> had these other parts and here are the actual specs."
> >
> > I can :)
> >
> > Feb 20 07:33:52 master kernel: [4294682.803000]   Vendor: MegaRAID
> > Model: LD 0 RAID1   51G  Rev: 196T
> > --- snip ---
> > This machine... if you run it in raid 5 will only get 7-9 megabytes a
> > second READ! performance. That is with 6 SCSI drives.
> > If you run it in RAID 10 you get a more reasonable 50-55 megabytes per
> > second.
>
> But you don't say how this machine was advertised.  Are there components in that list that were not as advertised?
Wasthe machine advertised as capable of RAID 5?  Were performance figures published for RAID 5? 
>
> If Dell advertised that the machine could do what you asked, then you're right -- they screwed you.  But if it was
designedfor and advertised to a different market, then I've made my point: People are blaming Dell for something that's
nottheir fault. 

IT was advertised as a rackmount server with dual processors and a RAID
controller with 6 drive bays.  I would expect such a machine to perform
well in both RAID 5 and RAID 1+0 configurations.

It certainly didn't do what we expected of a machine with the specs it
had.  For the same price, form factor and basic setup, i.e. dual P-IV 2
to 4 gig ram, 5 or 6 drive bays, I'd expect the same thing.  They were
crap.  Honestly.  Did you see the post where I mentioned that under
heavy I/O load they lock up about once every month or so.  They all do,
every one I've ever seen.  Some take more time than others, but they all
eventually lock up while running.

I was pretty much agnostic as to which servers we bought at my last job,
until someone started ordering from Dell and we got 2600 series
machines.  No one should have to live with these things.

Re: Reliability recommendations

From
"Luke Lonergan"
Date:
Joshua,

On 2/24/06 9:19 AM, "Joshua D. Drake" <jd@commandprompt.com> wrote:

> This machine... if you run it in raid 5 will only get 7-9 megabytes a
> second READ! performance. That is with 6 SCSI drives.
> If you run it in RAID 10 you get a more reasonable 50-55 megabytes per
> second.
>
> I don't have it sitting in front of me or I would give you an exact
> model number.
>
> This machine also uses the serverworks chipset which is known to be a
> catastrophe.

I'd be more shocked if this weren't also true of nearly all SCSI HW RAID
adapters of this era.  If you had ordered an HP DL380 server you'd get about
the same performance.

BTW - I don't think there's anything reasonable about 50-55 MB/s from 6
disks, I'd put the minimum for this era machine at 5 x 30 = 150MB/s.

- Luke



Re: Reliability recommendations

From
Philippe Marzin
Date:
Do you have a hw reference that runs that fast (5 x 30 = 150MB/s) ?


Luke Lonergan a écrit :
Joshua,

On 2/24/06 9:19 AM, "Joshua D. Drake" <jd@commandprompt.com> wrote:
 
This machine... if you run it in raid 5 will only get 7-9 megabytes a
second READ! performance. That is with 6 SCSI drives.
If you run it in RAID 10 you get a more reasonable 50-55 megabytes per
second.

I don't have it sitting in front of me or I would give you an exact
model number.

This machine also uses the serverworks chipset which is known to be a
catastrophe.   
I'd be more shocked if this weren't also true of nearly all SCSI HW RAID
adapters of this era.  If you had ordered an HP DL380 server you'd get about
the same performance.

BTW - I don't think there's anything reasonable about 50-55 MB/s from 6
disks, I'd put the minimum for this era machine at 5 x 30 = 150MB/s.

- Luke



---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate      subscribe-nomail command to majordomo@postgresql.org so that your      message can get through to the mailing list cleanly


 

Re: Reliability recommendations

From
Mark Kirkwood
Date:
Luke Lonergan wrote:

>
> I'd be more shocked if this weren't also true of nearly all SCSI HW RAID
> adapters of this era.  If you had ordered an HP DL380 server you'd get about
> the same performance.
>
> BTW - I don't think there's anything reasonable about 50-55 MB/s from 6
> disks, I'd put the minimum for this era machine at 5 x 30 = 150MB/s.
>

He was quoting for 6 disk RAID 10 - I'm thinking 3 x 30MB/s = 90MB/s is
probably more correct? Having aid that, your point is still completely
correct - the performance @55MB/s is poor (e.g. my *ata* system with 2
disk RAID0 does reads @110MB/s).

cheers

Mark

Re: Reliability recommendations

From
Dan Gorman
Date:
All,

Was that sequential reads? If so, yeah you'll get 110MB/s? How big
was the datafile size? 8MB? Yeah, you'll get 110MB/s. 2GB? No, they
can't sustain that. There are so many details missing from this test
that it's hard to have any context around it :)

I was getting about 40-50MB/s on a PV with 14 disks on a RAID10 in
real world usage. (random IO and fully saturating a Dell 1850 with 4
concurrent threads (to peg the cpu on selects) and raw data files)

Best Regards,
Dan Gorman

On Feb 24, 2006, at 4:29 PM, Mark Kirkwood wrote:

> Luke Lonergan wrote:
>
>> I'd be more shocked if this weren't also true of nearly all SCSI
>> HW RAID
>> adapters of this era.  If you had ordered an HP DL380 server you'd
>> get about
>> the same performance.
>> BTW - I don't think there's anything reasonable about 50-55 MB/s
>> from 6
>> disks, I'd put the minimum for this era machine at 5 x 30 = 150MB/s.
>
> He was quoting for 6 disk RAID 10 - I'm thinking 3 x 30MB/s = 90MB/
> s is probably more correct? Having aid that, your point is still
> completely correct - the performance @55MB/s is poor (e.g. my *ata*
> system with 2 disk RAID0 does reads @110MB/s).
>
> cheers
>
> Mark
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>               http://archives.postgresql.org


Re: Reliability recommendations

From
Mark Kirkwood
Date:
Dan Gorman wrote:
> All,
>
> Was that sequential reads? If so, yeah you'll get 110MB/s? How big  was
> the datafile size? 8MB? Yeah, you'll get 110MB/s. 2GB? No, they  can't
> sustain that. There are so many details missing from this test  that
> it's hard to have any context around it :)
>

Actually they can. Datafile size was 8G, machine had 2G RAM (i.e.
datafile 4 times memory). The test was for a sequential read with 8K
blocks. I believe this is precisely the type of test that the previous
posters were referring to - while clearly, its not a real-world measure,
we are comparing like to like, and as such terrible results on such a
simple test are indicative of something 'not right'.

regards

Mark

P.s. FWIW - I'm quoting a test from a few years ago - the (same) machine
now has 4 RAID0 ata disks and does 175MB/s on the same test....




Re: Reliability recommendations

From
"Luke Lonergan"
Date:
Dan,

On 2/24/06 4:47 PM, "Dan Gorman" <dgorman@hi5.com> wrote:

> Was that sequential reads? If so, yeah you'll get 110MB/s? How big
> was the datafile size? 8MB? Yeah, you'll get 110MB/s. 2GB? No, they
> can't sustain that. There are so many details missing from this test
> that it's hard to have any context around it :)
>
> I was getting about 40-50MB/s on a PV with 14 disks on a RAID10 in
> real world usage. (random IO and fully saturating a Dell 1850 with 4
> concurrent threads (to peg the cpu on selects) and raw data files)

OK, how about some proof?

In a synthetic test that writes 32GB of sequential 8k pages on a machine
with 16GB of RAM:
========================= Write test results ==============================
time bash -c "dd if=/dev/zero of=/dbfast1/llonergan/bigfile bs=8k
count=2000000 && sync" &
time bash -c "dd if=/dev/zero of=/dbfast3/llonergan/bigfile bs=8k
count=2000000 && sync" &

2000000 records in
2000000 records out
2000000 records in
2000000 records out

real    1m0.046s
user    0m0.270s
sys     0m30.008s

real    1m0.047s
user    0m0.287s
sys     0m30.675s

So that's 32,000 MB written in 60.05 seconds, which is 533MB/s sustained
with two threads.

Now to read the same files in parallel:
========================= Read test results ==============================
sync
time dd of=/dev/null if=/dbfast1/llonergan/bigfile bs=8k &
time dd of=/dev/null if=/dbfast3/llonergan/bigfile bs=8k &

2000000 records in
2000000 records out

real    0m39.849s
user    0m0.282s
sys     0m22.294s
2000000 records in
2000000 records out

real    0m40.410s
user    0m0.251s
sys     0m22.515s

And that's 32,000MB in 40.4 seconds, or 792MB/s sustained from disk (not
memory).

These are each RAID5 arrays of 8 internal SATA disks on 3Ware HW RAID
controllers.

Now for real usage, let's run a simple sequential scan query on 123,434 MB
of data in a single table on 4 of these machines in parallel.  All tables
are distributed evenly by Bizgres MPP over all 8 filesystems:

============= Bizgres MPP sequential scan results =========================

[llonergan@salerno0 ~]$ !psql
psql -p 9999 -U mppdemo1 demo
Welcome to psql 8.1.1 (server 8.1.3), the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

demo=# \timing
Timing is on.
demo=# select version();

version
----------------------------------------------------------------------------
----------------------------------------------------------------------------
-----
 PostgreSQL 8.1.3 (Bizgres MPP 2.1) on x86_64-unknown-linux-gnu, compiled by
GCC gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2) compiled on Feb 23 2006
11:34:06
(1 row)

Time: 0.570 ms
demo=# select relname,8*relpages/128 as MB from pg_class order by relpages
desc limit 6;
            relname             |   mb
---------------------------------------
 lineitem                       | 123434
 orders                         |  24907
 partsupp                       |  14785
 part                           |   3997
 customer                       |   3293
 supplier                       |    202
(6 rows)

Time: 1.824 ms
demo=# select count(*) from lineitem;
   count
-----------
 600037902
(1 row)

Time: 60300.960 ms

So that's 123,434 MB of data scanned in 60.3 seconds, or 2,047 MB/s on 4
machines, which uses 512MB/s of disk bandwidth on each machine.

Now let's do a query that uses a this big table (a two way join) using all 4
machines:
============= Bizgres MPP Query results =========================
demo=# select
demo-#         sum(l_extendedprice* (1 - l_discount)) as revenue
demo-# from
demo-#         lineitem,
demo-#         part
demo-# where
demo-#         (
demo(#                 p_partkey = l_partkey
demo(#                 and p_brand = 'Brand#42'
demo(#                 and p_container in ('SM CASE', 'SM BOX', 'SM PACK',
'SM PKG')
demo(#                 and l_quantity >= 7 and l_quantity <= 7  10
demo(#                 and p_size between 1 and 5
demo(#                 and l_shipmode in ('AIR', 'AIR REG')
demo(#                 and l_shipinstruct = 'DELIVER IN PERSON'
demo(#         )
demo-#         or
demo-#         (
demo(#                 p_partkey = l_partkey
demo(#                 and p_brand = 'Brand#15'
demo(#                 and p_container in ('MED BAG', 'MED BOX', 'MED PKG',
'MED PACK')
demo(#                 and l_quantity >= 14 and l_quantity <= 14  10
demo(#                 and p_size between 1 and 10
demo(#                 and l_shipmode in ('AIR', 'AIR REG')
demo(#                 and l_shipinstruct = 'DELIVER IN PERSON'
demo(#         )
demo-#         or
demo-#         (
demo(#                 p_partkey = l_partkey
demo(#                 and p_brand = 'Brand#53'
demo(#                 and p_container in ('LG CASE', 'LG BOX', 'LG PACK',
'LG PKG')
demo(#                 and l_quantity >= 22 and l_quantity <= 22  10
demo(#                 and p_size between 1 and 15
demo(#                 and l_shipmode in ('AIR', 'AIR REG')
demo(#                 and l_shipinstruct = 'DELIVER IN PERSON'
demo(#         );
    revenue
----------------
 356492404.3164
(1 row)

Time: 114908.149 ms


And now a 6-way join among 4 tables in this same schema:

demo=# SELECT
demo-#
s.s_acctbal,s.s_name,n.n_name,p.p_partkey,p.p_mfgr,s.s_address,s.s_phone,s.s
_comment
demo-# FROM
demo-#         supplier s,partsupp ps,nation n,region r,
demo-#         part p,  (
demo(#                         SELECT p_partkey,  min(ps_supplycost) as
min_ps_cost from part, partsupp ,
demo(#                                 supplier,nation, region
demo(#                         WHERE
demo(#                                 p_partkey=ps_partkey
demo(#                                 and s_suppkey = ps_suppkey
demo(#                                 and s_nationkey = n_nationkey
demo(#                                 and n_regionkey = r_regionkey
demo(#                                 and r_name = 'EUROPE'
demo(#                         GROUP BY
demo(#                                 p_partkey
demo(#                  ) g
demo-# WHERE
demo-#         p.p_partkey = ps.ps_partkey
demo-#         and g.p_partkey = p.p_partkey
demo-#         and g. min_ps_cost = ps.ps_supplycost
demo-#         and s.s_suppkey = ps.ps_suppkey
demo-#         and p.p_size = 15
demo-#         and p.p_type like '%BRASS'
demo-#         and s.s_nationkey = n.n_nationkey
demo-#         and n.n_regionkey = r.r_regionkey
demo-#         and r.r_name = 'EUROPE'
demo-# ORDER BY
demo-#         s. s_acctbal desc,n.n_name,s.s_name,p.p_partkey
demo-# LIMIT 100;
 s_acctbal |          s_name           |          n_name           |
p_partkey |          p_mfgr           |
          s_address                 |     s_phone     |
s_comment

----------------------------------------------------------------------
----------------------------------
------------------------------------------------------------------------
--------------------------------------
-------------------------------------------
   9999.70 | Supplier#000239544        | UNITED KINGDOM            |
6739531 | Manufacturer#4            | 1UCMu
3TLyUThghoeZ8arg6cV3Mr              | 33-509-584-9496 | carefully ironic
asymptotes cajole quickly. slyly silent a
ccounts sleep. fl
...
...
   9975.53 | Supplier#000310136        | ROMANIA                   |
10810115 | Manufacturer#5            | VNWON
A5Sr B                              | 29-977-903-6199 | pending deposits
wake permanently; final accounts sleep ab
out the pending deposits.
(100 rows)

Time: 424981.813 ms


- Luke



Re: Reliability recommendations

From
Mark Kirkwood
Date:
Luke Lonergan wrote:

>
> OK, how about some proof?
>
> In a synthetic test that writes 32GB of sequential 8k pages on a machine
> with 16GB of RAM:
> ========================= Write test results ==============================
> time bash -c "dd if=/dev/zero of=/dbfast1/llonergan/bigfile bs=8k
> count=2000000 && sync" &
> time bash -c "dd if=/dev/zero of=/dbfast3/llonergan/bigfile bs=8k
> count=2000000 && sync" &
>
> 2000000 records in
> 2000000 records out
> 2000000 records in
> 2000000 records out
>
> real    1m0.046s
> user    0m0.270s
> sys     0m30.008s
>
> real    1m0.047s
> user    0m0.287s
> sys     0m30.675s
>
> So that's 32,000 MB written in 60.05 seconds, which is 533MB/s sustained
> with two threads.
>

Well, since this is always fun (2G memory, 3Ware 7506, 4xPATA), writing:

$ dd if=/dev/zero of=/data0/dump/bigfile bs=8k count=500000
500000 records in
500000 records out
4096000000 bytes transferred in 32.619208 secs (125570185 bytes/sec)

> Now to read the same files in parallel:
> ========================= Read test results ==============================
> sync
> time dd of=/dev/null if=/dbfast1/llonergan/bigfile bs=8k &
> time dd of=/dev/null if=/dbfast3/llonergan/bigfile bs=8k &
>
> 2000000 records in
> 2000000 records out
>
> real    0m39.849s
> user    0m0.282s
> sys     0m22.294s
> 2000000 records in
> 2000000 records out
>
> real    0m40.410s
> user    0m0.251s
> sys     0m22.515s
>
> And that's 32,000MB in 40.4 seconds, or 792MB/s sustained from disk (not
> memory).
>

Reading:

$ dd of=/dev/null if=/data0/dump/bigfile bs=8k count=500000
500000 records in
500000 records out
4096000000 bytes transferred in 24.067298 secs (170189442 bytes/sec)

Ok - didn't quite get my quoted 175MB/s, (FWIW if bs=32k I get exactly
175MB/s).

Hmmm - a bit humbled by Luke's machinery :-), however, mine is probably
competitive on (MB/s)/$....


It would be interesting to see what Dan's system would do on a purely
sequential workload - as 40-50MB of purely random IO is high.

Cheers

Mark

Re: Reliability recommendations

From
"Luke Lonergan"
Date:
Mark,

On 2/24/06 10:10 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> Well, since this is always fun (2G memory, 3Ware 7506, 4xPATA), writing:
>
> $ dd if=/dev/zero of=/data0/dump/bigfile bs=8k count=500000
> 500000 records in
> 500000 records out
> 4096000000 bytes transferred in 32.619208 secs (125570185 bytes/sec)

> Reading:
>
> $ dd of=/dev/null if=/data0/dump/bigfile bs=8k count=500000
> 500000 records in
> 500000 records out
> 4096000000 bytes transferred in 24.067298 secs (170189442 bytes/sec)

Not bad at all!  I have one of these cards in my home machine running WinXP
and it's not nearly this fast.

> Hmmm - a bit humbled by Luke's machinery :-), however, mine is probably
> competitive on (MB/s)/$....

Not sure - the machines I cite are about $10K each.  The machine you tested
was probably about $1500 a few years ago (my guess), and with a 5:1 ratio in
speed versus about a 6:1 ratio in price, we're not too far off in MB/s/$
after all :-)

> It would be interesting to see what Dan's system would do on a purely
> sequential workload - as 40-50MB of purely random IO is high.

Yeah - that is really high if the I/O is really random.  I'd normally expect
maybe 500-600 iops / second and if each IO is 8KB, that would be 4MB/s.  The
I/O is probably not really completely random, or it's random over cachable
bits of the occupied disk area.

- Luke



Re: Reliability recommendations

From
Mark Kirkwood
Date:
Luke Lonergan wrote:
> Mark,
>
>>Hmmm - a bit humbled by Luke's machinery :-), however, mine is probably
>>competitive on (MB/s)/$....
>
>
> Not sure - the machines I cite are about $10K each.  The machine you tested
> was probably about $1500 a few years ago (my guess), and with a 5:1 ratio in
> speed versus about a 6:1 ratio in price, we're not too far off in MB/s/$
> after all :-)
>

Wow - that is remarkable performance for $10K!

Cheers

Mark

Re: Reliability recommendations

From
Ron
Date:
At 11:41 PM 2/24/2006, Luke Lonergan wrote:
>Dan,
>
>On 2/24/06 4:47 PM, "Dan Gorman" <dgorman@hi5.com> wrote:
>
> > Was that sequential reads? If so, yeah you'll get 110MB/s? How big
> > was the datafile size? 8MB? Yeah, you'll get 110MB/s. 2GB? No, they
> > can't sustain that. There are so many details missing from this test
> > that it's hard to have any context around it :)
> >
> > I was getting about 40-50MB/s on a PV with 14 disks on a RAID10 in
> > real world usage. (random IO and fully saturating a Dell 1850 with 4
> > concurrent threads (to peg the cpu on selects) and raw data files)
>
>OK, how about some proof?
>
>In a synthetic test that writes 32GB of sequential 8k pages on a machine
>with 16GB of RAM:
>========================= Write test results ==============================
>time bash -c "dd if=/dev/zero of=/dbfast1/llonergan/bigfile bs=8k
>count=2000000 && sync" &
>time bash -c "dd if=/dev/zero of=/dbfast3/llonergan/bigfile bs=8k
>count=2000000 && sync" &
>
>2000000++0 records in
>2000000++0 records out
>2000000++0 records in
>2000000++0 records out
>
>real    1m0.046s
>user    0m0.270s
>sys     0m30.008s
>
>real    1m0.047s
>user    0m0.287s
>sys     0m30.675s
>
>So that's 32,000 MB written in 60.05 seconds, which is 533MB/s sustained
>with two threads.
>
>Now to read the same files in parallel:
>========================= Read test results ==============================
>sync
>time dd of=/dev/null if=/dbfast1/llonergan/bigfile bs=8k &
>time dd of=/dev/null if=/dbfast3/llonergan/bigfile bs=8k &
>
>2000000++0 records in
>2000000++0 records out
>
>real    0m39.849s
>user    0m0.282s
>sys     0m22.294s
>2000000++0 records in
>2000000++0 records out
>
>real    0m40.410s
>user    0m0.251s
>sys     0m22.515s
>
>And that's 32,000MB in 40.4 seconds, or 792MB/s sustained from disk (not
>memory).
>
>These are each RAID5 arrays of 8 internal SATA disks on 3Ware HW RAID
>controllers.

Impressive IO rates.  A more detailed HW list would help put them in context.

Which 3Ware?  The 9550SX?  How much cache on it (AFAIK, the only
options are 128MB and 256MB?)?

Which HDs?

What CPUs (looks like Opterons, but which flavor?) and mainboard?

What's CPU utilization when hammering the physical IO subsystem this hard?


>Now for real usage, let's run a simple sequential scan query on 123,434 MB
>of data in a single table on 4 of these machines in parallel.  All tables
>are distributed evenly by Bizgres MPP over all 8 filesystems:
>
>============= Bizgres MPP sequential scan results =========================
>
>[llonergan@salerno0 +AH4]$ !psql
>psql -p 9999 -U mppdemo1 demo
>Welcome to psql 8.1.1 (server 8.1.3), the PostgreSQL interactive terminal.
>
>Type:  +AFw-copyright for distribution terms
>        +AFw-h for help with SQL commands
>        +AFw? for help with psql commands
>        +AFw-g or terminate with semicolon to execute query
>        +AFw-q to quit
>
>demo=# +AFw-timing
>Timing is on.
>demo=# select version();
>
>version
>----------------------------------------------------------------------------
>----------------------------------------------------------------------------
>-----
>  PostgreSQL 8.1.3 (Bizgres MPP 2.1) on x86_64-unknown-linux-gnu, compiled by
>GCC gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2) compiled on Feb 23 2006
>11:34:06
>(1 row)
>
>Time: 0.570 ms
>demo=# select relname,8*relpages/128 as MB from pg_class order by relpages
>desc limit 6;
>             relname             |   mb
>--------------------------------++--------
>  lineitem                       | 123434
>  orders                         |  24907
>  partsupp                       |  14785
>  part                           |   3997
>  customer                       |   3293
>  supplier                       |    202
>(6 rows)
>
>Time: 1.824 ms
>demo=# select count(*) from lineitem;
>    count
>-----------
>  600037902
>(1 row)
>
>Time: 60300.960 ms
>
>So that's 123,434 MB of data scanned in 60.3 seconds, or 2,047 MB/s on 4
>machines, which uses 512MB/s of disk bandwidth on each machine.
>
>Now let's do a query that uses a this big table (a two way join) using all 4
>machines:
>============= Bizgres MPP Query results =========================
>demo=# select
>demo-#         sum(l_extendedprice* (1 - l_discount)) as revenue
>demo-# from
>demo-#         lineitem,
>demo-#         part
>demo-# where
>demo-#         (
>demo(#                 p_partkey = l_partkey
>demo(#                 and p_brand = 'Brand#42'
>demo(#                 and p_container in ('SM CASE', 'SM BOX', 'SM PACK',
>'SM PKG')
>demo(#                 and l_quantity >= 7 and l_quantity <= 7 ++ 10
>demo(#                 and p_size between 1 and 5
>demo(#                 and l_shipmode in ('AIR', 'AIR REG')
>demo(#                 and l_shipinstruct = 'DELIVER IN PERSON'
>demo(#         )
>demo-#         or
>demo-#         (
>demo(#                 p_partkey = l_partkey
>demo(#                 and p_brand = 'Brand#15'
>demo(#                 and p_container in ('MED BAG', 'MED BOX', 'MED PKG',
>'MED PACK')
>demo(#                 and l_quantity >= 14 and l_quantity <= 14 ++ 10
>demo(#                 and p_size between 1 and 10
>demo(#                 and l_shipmode in ('AIR', 'AIR REG')
>demo(#                 and l_shipinstruct = 'DELIVER IN PERSON'
>demo(#         )
>demo-#         or
>demo-#         (
>demo(#                 p_partkey = l_partkey
>demo(#                 and p_brand = 'Brand#53'
>demo(#                 and p_container in ('LG CASE', 'LG BOX', 'LG PACK',
>'LG PKG')
>demo(#                 and l_quantity >= 22 and l_quantity <= 22 ++ 10
>demo(#                 and p_size between 1 and 15
>demo(#                 and l_shipmode in ('AIR', 'AIR REG')
>demo(#                 and l_shipinstruct = 'DELIVER IN PERSON'
>demo(#         );
>     revenue
>----------------
>  356492404.3164
>(1 row)
>
>Time: 114908.149 ms
Hmmm.  ~115secs @ ~500MBps => ~57.5GB of data manipulated.


>And now a 6-way join among 4 tables in this same schema:
>
>demo=# SELECT
>demo-#
>s.s_acctbal,s.s_name,n.n_name,p.p_partkey,p.p_mfgr,s.s_address,s.s_phone,s.s
>_comment
>demo-# FROM
>demo-#         supplier s,partsupp ps,nation n,region r,
>demo-#         part p,  (
>demo(#                         SELECT p_partkey,  min(ps_supplycost) as
>min_ps_cost from part, partsupp ,
>demo(#                                 supplier,nation, region
>demo(#                         WHERE
>demo(#                                 p_partkey=ps_partkey
>demo(#                                 and s_suppkey = ps_suppkey
>demo(#                                 and s_nationkey = n_nationkey
>demo(#                                 and n_regionkey = r_regionkey
>demo(#                                 and r_name = 'EUROPE'
>demo(#                         GROUP BY
>demo(#                                 p_partkey
>demo(#                  ) g
>demo-# WHERE
>demo-#         p.p_partkey = ps.ps_partkey
>demo-#         and g.p_partkey = p.p_partkey
>demo-#         and g. min_ps_cost = ps.ps_supplycost
>demo-#         and s.s_suppkey = ps.ps_suppkey
>demo-#         and p.p_size = 15
>demo-#         and p.p_type like '%BRASS'
>demo-#         and s.s_nationkey = n.n_nationkey
>demo-#         and n.n_regionkey = r.r_regionkey
>demo-#         and r.r_name = 'EUROPE'
>demo-# ORDER BY
>demo-#         s. s_acctbal desc,n.n_name,s.s_name,p.p_partkey
>demo-# LIMIT 100;
>  s_acctbal |          s_name           |          n_name           |
>p_partkey |          p_mfgr           |
>           s_address                 |     s_phone     |
>s_comment
>
>-----------++---------------------------++---------------------------++--------
>---++---------------------------++------
>------------------------------------++-----------------++---------------------
>--------------------------------------
>-------------------------------------------
>    9999.70 | Supplier#000239544        | UNITED KINGDOM            |
>6739531 | Manufacturer#4            | 1UCMu
>3TLyUThghoeZ8arg6cV3Mr              | 33-509-584-9496 | carefully ironic
>asymptotes cajole quickly. slyly silent a
>ccounts sleep. fl
>...
>...
>    9975.53 | Supplier#000310136        | ROMANIA                   |
>10810115 | Manufacturer#5            | VNWON
>A5Sr B                              | 29-977-903-6199 | pending deposits
>wake permanently; final accounts sleep ab
>out the pending deposits.
>(100 rows)
>
>Time: 424981.813 ms
...and this implies ~425secs @ ~500MBps => 212.5GB

What are the IO rates during these joins?

How much data is being handled to complete these joins?

How much data is being exchanged between these machines to complete the joins?

What is the connectivity between these 4 machines?

Putting these numbers in context may help the advocacy effort
considerably as well as help us improve things even further. ;-)

TiA,
Ron



Re: Reliability recommendations

From
Ron
Date:
At 01:22 AM 2/25/2006, Luke Lonergan wrote:
>Mark,
>
>On 2/24/06 10:10 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
>
> > Well, since this is always fun (2G memory, 3Ware 7506, 4xPATA), writing:
> >
> > $ dd if=/dev/zero of=/data0/dump/bigfile bs=8k count=500000
> > 500000 records in
> > 500000 records out
> > 4096000000 bytes transferred in 32.619208 secs (125570185 bytes/sec)
>
> > Reading:
> >
> > $ dd of=/dev/null if=/data0/dump/bigfile bs=8k count=500000
> > 500000 records in
> > 500000 records out
> > 4096000000 bytes transferred in 24.067298 secs (170189442 bytes/sec)
>
>Not bad at all!  I have one of these cards in my home machine running WinXP
>and it's not nearly this fast.
>
> > Hmmm - a bit humbled by Luke's machinery :-), however, mine is probably
> > competitive on (MB/s)/$....
>
>Not sure - the machines I cite are about $10K each.  The machine you tested
>was probably about $1500 a few years ago (my guess), and with a 5:1 ratio in
>speed versus about a 6:1 ratio in price, we're not too far off in MB/s/$
>after all :-)
>
> > It would be interesting to see what Dan's system would do on a purely
> > sequential workload - as 40-50MB of purely random IO is high.
>
>Yeah - that is really high if the I/O is really random.  I'd normally expect
>maybe 500-600 iops / second and if each IO is 8KB, that would be 4MB/s.  The
>I/O is probably not really completely random, or it's random over cachable
>bits of the occupied disk area.
Side note: the new WD 150GB Raptors (10Krpm 1.5Gbps SATA w/ NCQ
support) have benched at ~1000 IOps _per drive_

http://www.storagereview.com/articles/200601/WD1500ADFD_4.html

(Now if we can just get WD to make a 300GB Raptor, increase that
wimpy 16MB buffer, and implement 6Gbps SATA...;-) )

An array of these things plugged into a PCI-E <-> SATA RAID
controller with 1-2GB of cache should set a new bar for performance
as well as making that performance more resilient than ever to
variations in usage patterns.





Re: Reliability recommendations

From
"Luke Lonergan"
Date:
Ron,

On 2/25/06 3:24 AM, "Ron" <rjpeace@earthlink.net> wrote:

>> These are each RAID5 arrays of 8 internal SATA disks on 3Ware HW RAID
>> controllers.
>
> Impressive IO rates.  A more detailed HW list would help put them in context.
>
> Which 3Ware?  The 9550SX?  How much cache on it (AFAIK, the only
> options are 128MB and 256MB?)?
>
> Which HDs?
>
> What CPUs (looks like Opterons, but which flavor?) and mainboard?
>
> What's CPU utilization when hammering the physical IO subsystem this hard?

OK.  There are four machines. Each machine has:
Qty 2 of 3Ware 9550SX/PCIX/128MB cache SATAII RAID controllers
Qty 2 of AMD Opteron 250 CPUs (2.4 GHz)
Qty 16 of 1GB PC3200 RAM (16GB total)
Qty 1 of Tyan 2882S Motherboard
Qty 16 Western Digital 400GB Raid Edition 2 SATA disks
Cost: About $10,000 each

They are connected together using a Netgear 48 port Gigabit Ethernet switch
and copper CAT6 cables.
Cost: About $1,500

Total of all machines:
8 CPUs
64GB RAM
64 Disks
20TB storage in RAID5
Total HW cost: About $45,000.

CPU utilization is apparently (adding up usr+system and dividing by real):
Writing at 2,132MB/s: 51%
Reading at 3,168MB/s: 56%

>>     revenue
>> ----------------
>>  356492404.3164
>> (1 row)
>>
>> Time: 114908.149 ms
> Hmmm.  ~115secs @ ~500MBps => ~57.5GB of data manipulated.

Actually, this query sequential scans all of both lineitem and part, so it¹s
accessing 127.5GB of data, and performing the work of the hash join, in
about double the scan time.  That¹s possible because we¹re using all 8 CPUs,
4 network interfaces and the 64 disks while we are performing the query:

 Aggregate  (cost=3751996.97..3751996.99 rows=1 width=22)
   ->  Gather Motion  (cost=3751996.97..3751996.99 rows=1 width=22)
         ->  Aggregate  (cost=3751996.97..3751996.99 rows=1 width=22)
               ->  Hash Join  (cost=123440.49..3751993.62 rows=1339
width=22)
                     Hash Cond: ("outer".l_partkey = "inner".p_partkey)
                     Join Filter: ((("inner".p_brand = 'Brand#42'::bpchar)
AND (("inner".p_container = 'SM CASE'::bpchar) OR ("inner".p_container = 'SM
BOX'::bpchar) OR ("inner".p_container = 'SM PACK'::bpchar) OR
("inner".p_container = 'SM PKG'::bpchar)) AND ("outer".l_quantity >=
7::numeric) AND ("outer".l_quantity <= 17::numeric) AND ("inner".p_size <=
5)) OR (("inner".p_brand = 'Brand#15'::bpchar) AND (("inner".p_container =
'MED BAG'::bpchar) OR ("inner".p_container = 'MED BOX'::bpchar) OR
("inner".p_container = 'MED PKG'::bpchar) OR ("inner".p_container = 'MED
PACK'::bpchar)) AND ("outer".l_quantity >= 14::numeric) AND
("outer".l_quantity <= 24::numeric) AND ("inner".p_size <= 10)) OR
(("inner".p_brand = 'Brand#53'::bpchar) AND (("inner".p_container = 'LG
CASE'::bpchar) OR ("inner".p_container = 'LG BOX'::bpchar) OR
("inner".p_container = 'LG PACK'::bpchar) OR ("inner".p_container = 'LG
PKG'::bpchar)) AND ("outer".l_quantity >= 22::numeric) AND
("outer".l_quantity <= 32::numeric) AND ("inner".p_size <= 15)))
                     ->  Redistribute Motion  (cost=0.00..3340198.25
rows=2611796 width=36)
                           Hash Key: l_partkey
                           ->  Seq Scan on lineitem  (cost=0.00..3287962.32
rows=2611796 width=36)
                                 Filter: (((l_shipmode = 'AIR'::bpchar) OR
(l_shipmode = 'AIR REG'::bpchar)) AND (l_shipinstruct = 'DELIVER IN
PERSON'::bpchar))
                     ->  Hash  (cost=95213.75..95213.75 rows=2500296
width=36)
                           ->  Seq Scan on part  (cost=0.00..95213.75
rows=2500296 width=36)
                                 Filter: (p_size >= 1)
(13 rows)

>> Time: 424981.813 ms
> ...and this implies ~425secs @ ~500MBps => 212.5GB

I've attached this explain plan because it's large.  Same thing applies as
in previous, all of the involved tables are scanned, and in this case we've
got all manner of CPU work being performed: sorts, hashes, aggregations,...

> What are the IO rates during these joins?

They burst then disappear during the progression of the queries.  These are
now CPU bound because the I/O available to each CPU is so fast.

> How much data is being handled to complete these joins?

See above, basically all of it because there are no indexes.

> How much data is being exchanged between these machines to complete the joins?

Good question, you can infer it from reading the EXPLAIN plans, but the
stats are not what they appear - generally multiply them by 8.

> What is the connectivity between these 4 machines?

See above.

> Putting these numbers in context may help the advocacy effort
> considerably as well as help us improve things even further. ;-)

Thanks!  (bait bait) I can safely say that there have never been query
speeds with Postgres this fast, not even close. (bait bait)

What's next is a machine 4 or maybe 8 times this fast.

Explain for second query is attached, compressed with gzip (it¹s kinda
huge).

- Luke



Attachment