Question about Hardware & Configuration for Massive Database

From: Kevin & Jessica Hermansen
I'm looking to set up an informational database which will be accessed by the general public on our website (there will be no internal users).  The database will likely have about 10 million records with a total database size of 200-300 GB.  All 10 million records will be very similar (the same data fields), so I anticipate that the entire database will be just 1 table.  This is the main application that will be running on our server.  Users will be querying the database by searching either a particular field (or multiple fields) or by doing a full text search.
 
My primary concern is that wait times (between the user doing a search query & then being able to view the results) will be long (I consider "long" to be more than 4 or 5 seconds).  I was hoping someone could advise me on what dedicated server hardware to purchase (# of processors (physical and/or virtual), type of processors (Xeon, P4, 32-bit, 64-bit, HyperThreading, etc.), cache, RAM, etc.).  We're on a very, very tight budget and don't want to OVER-deploy, but the speed of the database is also a major consideration.
 
Going along with that, can you also advise (in general terms) on how to optimally configure the enormous, 1-table database (and how to make it run optimally with the hardware configuration)?
 
Thank you very much!!!

Re: Question about Hardware & Configuration for Massive Database

From: Scott Marlowe
On Thu, 2006-01-19 at 12:39, Kevin & Jessica Hermansen wrote:
> I'm looking to set up an informational database which will be accessed
> by the general public on our website (there will be no internal
> users).  The database will likely have about 10 million records with a
> total database size of 200-300 GB.  All 10 million records will be
> very similar (the same data fields), so I anticipate that the entire
> database will be just 1 table.  This is the main application that will
> be running on our server.  Users will be querying the database by
> searching either a particular field (or multiple fields) or by doing a
> full text search.
>
> My primary concern is that wait times (between the user doing a search
> query & then being able to view the results) will be long (I consider
> "long" to be more than 4 or 5 seconds).  I was hoping someone could
> advise me on what dedicated server hardware to purchase (# of
> processors (physical and/or virtual), type of processors (Xeon, P4,
> 32-bit, 64-bit, HyperThreading, etc.), cache, RAM, etc.).  We're on a
> very, very tight budget and don't want to OVER-deploy, but the speed
> of the database is also a major consideration.

That very much depends on how you approach this problem...

> Going along with that, can you also advise (in general terms) on how
> to optimally configure the enormous, 1-table database (and how to make
> it run optimally with the hardware configuration)?

Use full text searching.  Look for tsearch2 in the contrib directory of
the tarball distribution, or a contrib package for your OS.

A Pentium III 750 with 512 megs of RAM running tsearch2 will likely be
faster than a Pentium IV 2.8 GHz with 2 to 4 gigs of RAM doing the same
searches without it.  Seriously, using tsearch2 is your first step here,
assuming you're doing text searches.
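
For reference, a minimal tsearch2 setup looks something like this (load
contrib's tsearch2.sql into the database first; the table and column
names below are made up for illustration):

  -- 'records', 'title', and 'body' are hypothetical names.
  ALTER TABLE records ADD COLUMN fti tsvector;
  UPDATE records
     SET fti = to_tsvector('default',
                           coalesce(title,'') || ' ' || coalesce(body,''));
  CREATE INDEX records_fti_idx ON records USING gist (fti);
  VACUUM ANALYZE records;

  -- Full text queries then hit the GiST index instead of scanning 200+ GB:
  SELECT id, title
    FROM records
   WHERE fti @@ to_tsquery('default', 'mortgage & rates');

If the data ever changes, tsearch2 also ships a trigger function
(tsearch2()) to keep the fti column current automatically.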

If all the searches are on plain numeric fields, then performance should
be good with almost any system.
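
For those plain field searches, ordinary btree indexes on the searched
columns are about the whole game, e.g.:

  -- Column names are hypothetical; index only fields users actually search.
  CREATE INDEX records_state_idx ON records (state);
  CREATE INDEX records_state_city_idx ON records (state, city);
  ANALYZE records;

On a table this size every extra index costs disk space and load time, so
don't index speculatively.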

I'd do some test runs on smaller boxes you probably already have lying
about to get a baseline of how fast this is gonna be on them; then
you'll know how much, if anything, you need to step things up.
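
When you run those tests, EXPLAIN ANALYZE is the tool for getting
baseline numbers:

  -- Hypothetical query; check that the plan shows an index scan and note
  -- the actual total runtime so you can compare boxes.
  EXPLAIN ANALYZE
  SELECT id, title
    FROM records
   WHERE state = 'UT' AND price < 100000;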

Hyperthreading doesn't help, and often hurts, PostgreSQL performance,
especially with a lot of users.

AMD 64-bit CPUs are currently the kings of 2- to 8-way systems for
running PostgreSQL (or Oracle, for that matter).

Throw money at a good RAID card and RAID 10 while you're at it.  Areca,
LSI, and a few others are pretty good.  Avoid Dell: even though they
make some servers with LSI controllers in them (the PERC4C and PERC3C,
and who knows what else), they lobotomize them somehow, and they just
aren't as fast as the OEM cousins they were bred from.

If this database will be mostly read, then look at running a large array
(say 8 to 12 drives) of medium / small disks (70 to 150 gigs... yeah,
that's medium to small nowadays).  In fact, if this is a reporting
database (mostly read), you may find better performance from older
drives, just a lot of them.  I'm just guessing you may have a desk
drawer full of 30 gig drives lying about.
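
On the configuration side, the stock postgresql.conf is very
conservative.  A rough 8.1-era starting point for a mostly-read box
with, say, 4 gigs of RAM might look like this (every number here is a
guess to benchmark against your own workload, not gospel):

  shared_buffers = 25000          # 8 kB pages, ~200 MB (8.1 counts pages)
  effective_cache_size = 375000   # ~3 GB expected in the OS cache, in pages
  work_mem = 16384                # kB per sort/hash; watch concurrent users
  random_page_cost = 3            # drop toward 2 on a fast RAID 10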

Re: Question about Hardware & Configuration for Massive Database

From: Dan Sugalski
At 11:39 AM -0700 1/19/06, Kevin & Jessica Hermansen wrote:
>I'm looking to set up an informational database which will be
>accessed by the general public on our website (there will be no
>internal users).  The database will likely have about 10 million
>records with a total database size of 200-300 GB.  All 10 million
>records will be very similar (the same data fields), so I anticipate
>that the entire database will be just 1 table.  This is the main
>application that will be running on our server.  Users will be
>querying the database by searching either a particular field (or
>multiple fields) or by doing a full text search.

Is the public access going to be entirely read-only? You may find it
advantageous to have multiple database machines kept synchronized
with slony, and distribute the queries from the website across the
read-only systems. If a reasonably priced machine can manage, say,
five simultaneous nasty queries in your timeframe, it's usually a
*lot* cheaper to buy ten of those systems to handle fifty queries
than it is to beef up the system enough so that one machine can
handle fifty queries.

It also makes it a lot easier to add on more capacity as you need it
(or as the budget allows) this way -- throw an extra machine or three
into the cluster as time and cash permit and need demands.

This may not work for you, depending on your needs, but it's worth
some consideration, in case it does.
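
For the archives, a bare-bones two-node Slony-I setup looks roughly like
the slonik script below.  The cluster, host, and table names are all
invented, and exact slonik syntax shifts between Slony-I versions, so
treat it as a sketch:

  # names below ('mycluster', 'records', the hosts) are hypothetical
  cluster name = mycluster;
  node 1 admin conninfo = 'dbname=mydb host=master user=slony';
  node 2 admin conninfo = 'dbname=mydb host=replica1 user=slony';

  init cluster (id = 1, comment = 'master');
  create set (id = 1, origin = 1, comment = 'all tables');
  set add table (set id = 1, origin = 1, id = 1,
                 fully qualified name = 'public.records');
  store node (id = 2, comment = 'read-only replica');
  store path (server = 1, client = 2,
              conninfo = 'dbname=mydb host=master user=slony');
  store path (server = 2, client = 1,
              conninfo = 'dbname=mydb host=replica1 user=slony');
  subscribe set (id = 1, provider = 1, receiver = 2, forward = no);

A slon daemon then runs for each node, and the web tier just spreads its
read-only connections across the subscribers.  Note that Slony requires
a primary key on every replicated table.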
--
                Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                       teddy bears get drunk

Re: Question about Hardware & Configuration for Massive Database

From: "Jim C. Nasby"
On Thu, Jan 19, 2006 at 02:26:03PM -0500, Dan Sugalski wrote:
> Is the public access going to be entirely read-only? You may find it
> advantageous to have multiple database machines kept synchronized
> with slony, and distribute the queries from the website across the
> read-only systems. If a reasonably priced machine can manage, say,
> five simultaneous nasty queries in your timeframe, it's usually a
> *lot* cheaper to buy ten of those systems to handle fifty queries
> than it is to beef up the system enough so that one machine can
> handle fifty queries.
>
> It also makes it a lot easier to add on more capacity as you need it
> (or as the budget allows) this way -- throw an extra machine or three
> into the cluster as time and cash permit and need demands.

Actually, it doesn't need to be read-only for that to work, though it
does simplify things. But if you make your application smart enough to
know that read queries go to one set of hosts while updates have to go
to one specific host, you can use Slony to scale. You'd need to be able
to tolerate the replication delay as well.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461