Thread: Hardware needed for 15,000,000 record DB?

Hardware needed for 15,000,000 record DB?

From
Date:
Hello,  :-)

Can someone help me spec out the hardware needed for a simple web-based
database app?

Basically, that application needs to lookup a single row by its primary key.
This would be fairly straightforward, except that the table needs to contain
up to 15 million records.  Each row will contain approximately 25  variable
length CHAR fields with perhaps a total of 3 to 4 kilobytes of data per row.

Updates will be done nightly via some sort of a batch process.

What kind of hardware would be needed for this sort of application? The
queries are not complex, it's just a lot of data.

Would a midrange Celeron processor with 256 MB RAM be sufficient? How would
backups work for a database this large?

Is PostgreSQL even the best database engine for this app? Perhaps MySQL? Or
maybe a Microsoft solution?

Thanks so much,
David


Re: Hardware needed for 15,000,000 record DB?

From
Jeremy Buchmann
Date:
pdg@stratos.net wrote:
> Hello,  :-)
>
> Can someone help me spec out the hardware needed for a simple web-based
> database app?
>
> Basically, that application needs to lookup a single row by its primary key.
> This would be fairly straightforward, except that the table needs to contain
> up to 15 million records.  Each row will contain approximately 25  variable
> length CHAR fields with perhaps a total of 3 to 4 kilobytes of data per row.
>
> Updates will be done nightly via some sort of a batch process.
>
> What kind of hardware would be needed for this sort of application? The
> queries are not complex, it's just a lot of data.

Databases are I/O bound...if your I/O is slow, your database will be slow.
So your goal is to minimize the amount of I/O needed and the time it takes
to do the I/O.  You minimize the amount of I/O by getting things with big
caches.  You minimize the amount of time the I/O takes by using fast
storage devices.

This means SCSI.  If you have limited funds, spend them on fast SCSI disks
with big caches and a good controller instead the latest and greatest
processor.

If you have the funds, look into getting a RAID card with a big cache on it.
A RAID 1 or 5 also helps out if a disk crashes.

> Would a midrange Celeron processor with 256 MB RAM be sufficient? How would
> backups work for a database this large?

I would not use the Celeron.  The Celeron lacks cache, and you want all the
cache you can afford.  One user posted benchmarks to this list showing the
Celeron to be twice as slow as a PIII at the same clock speed.

Buy a lot of memory.  And make sure the motherboard you get can support a lot
of memory (i.e., it has 4 or more slots for RAM).

Backups: DDS3/4 tape is a good, but expensive choice.  It's one of the few backup
technologies that has cheap media and is vendor-neutral.  Stay away from
"Big floppies" like Jaz drives.  They're too flaky.

> Is PostgreSQL even the best database engine for this app? Perhaps MySQL? Or
> maybe a Microsoft solution?

MySQL is traditionally faster at pumping out web pages with pure speed, but
PostgreSQL has been catching up very quickly.  Also, PostgreSQL has
traditionally been able to handle many more concurrent users, but I think
MySQL has been getting better there, too.  I haven't seen any banchmarks
with recent versions of either database.  They're both free, so try some
tests with both of them (but don't forget to tune them properly).

If you care anything about cost or flexibility, I wouldn't go for a Microsoft
"solution".


--Jeremy


Re: Hardware needed for 15,000,000 record DB?

From
Curt Sampson
Date:
On Mon, 22 Apr 2002, Jeremy Buchmann wrote:

> Databases are I/O bound...if your I/O is slow, your database will be slow.
> So your goal is to minimize the amount of I/O needed and the time it takes
> to do the I/O.  You minimize the amount of I/O by getting things with big
> caches.

Assuming you freqently access the same data. If you're doing
essentially random queries on a 50 GB database, an extra few hundred
megabytes of cache will probably make little difference.

> You minimize the amount of time the I/O takes by using fast
> storage devices.
> This means SCSI.

Not necessarially. More disk arms is also a big help, so much so
that I would take two IDE drives (assuming that they're fast modern
ones) over one SCSI drive any day.

You probably want to at least make sure that your log files and
data files are on separate disks.

> If you have the funds, look into getting a RAID card with a big cache on it.
> A RAID 1 or 5 also helps out if a disk crashes.

A RAID 5 will slow down write performance considerably, but the
added reliability might be worth the tradeoff. But mirroring is
better, if you can afford it. Almost certainly you want your log
file on a mirrored volume rather than on a RAID 5 volume.

As far as a "big cache" goes, well, see above. Also note that if
your OS caches blocks in system RAM, you will: a) have two caches
in series, which is a bit of a waste, and b) n MB of cache on a
raid controller will often be more expensive than the same amount
as system RAM.

> > Is PostgreSQL even the best database engine for this app? Perhaps MySQL? Or
> > maybe a Microsoft solution?
>
> MySQL is traditionally faster at pumping out web pages with pure speed, but
> PostgreSQL has been catching up very quickly.  Also, PostgreSQL has
> traditionally been able to handle many more concurrent users, but I think
> MySQL has been getting better there, too.  I haven't seen any banchmarks
> with recent versions of either database.  They're both free, so try some
> tests with both of them (but don't forget to tune them properly).

For this simple application, I'd say MySQL is likely to be a bit
faster, simply because the database will be smaller. (MySQL has
considerably less row overhead.) However, the speed difference is
not likely to be too great, and certain other things might make
MySQL even worse.

> If you care anything about cost or flexibility, I wouldn't go for a
> Microsoft "solution".

Au contraire, MS SQL Server is cheap (compared to other commerical
products) and is a pretty good database.  I would happily use it
again. But the fact that it runs under Windows only makes administration
a pain, and "cheap" in the commerical world is still a heck of a
lot more expensive than "free."

However, given the size of your project, and what you're looking
at spending, I can't see MS SQL Server or other products in that
class being worthwhile. An SQL Server installation is typically a
$10,000-$50,000 kind of thing. (You can do it for a lot less, but
most such products that can get by with such small installations
would probably be just as well off with PostgreSQL.)

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Hardware needed for 15,000,000 record DB?

From
Jeremy Buchmann
Date:
Curt Sampson wrote:
> On Mon, 22 Apr 2002, Jeremy Buchmann wrote:
>
>>You minimize the amount of time the I/O takes by using fast
>>storage devices.
>>This means SCSI.
>
>
> Not necessarially. More disk arms is also a big help, so much so
> that I would take two IDE drives (assuming that they're fast modern
> ones) over one SCSI drive any day.

SCSI also has several other advantages...it doesn't suck up all your
CPU time when there's a lot of disk activity, and it supports
command queueing.  IDE was designed to be cheap, SCSI was designed to
be good.  I've used both and from my experience, SCSI is The Way.

--Jeremy


Re: Hardware needed for 15,000,000 record DB?

From
David Gilbert
Date:
>>>>> "Jeremy" == Jeremy Buchmann <jeremy@wellsgaming.com> writes:

Jeremy> Curt Sampson wrote:
>> On Mon, 22 Apr 2002, Jeremy Buchmann wrote:
>>
>>> You minimize the amount of time the I/O takes by using fast
>>> storage devices.  This means SCSI.
>>
>>
>> Not necessarially. More disk arms is also a big help, so much so
>> that I would take two IDE drives (assuming that they're fast modern
>> ones) over one SCSI drive any day.

Jeremy> SCSI also has several other advantages...it doesn't suck up
Jeremy> all your CPU time when there's a lot of disk activity, and it
Jeremy> supports command queueing.  IDE was designed to be cheap, SCSI
Jeremy> was designed to be good.  I've used both and from my
Jeremy> experience, SCSI is The Way.

Well... I wouldn't phrase the problem that 1-dimensionally.  There are
IDE drives that support queueing (IBM Deskstar something-or-other) and
some OSs have ATA controller code that really kicks but (FreeBSD, for
instance), so from a software point of view, IDE can work well.

It's even likely that the maufacturing process is nearly identical for
some makes and models.

But the RMA rates for _extreme_ usage don't match.  My theory:

I find that IDE drives placed in situations where their request queue
is populated completely for days at a time will fail between 3 and 6
months where SCSI drives can last years in this condition.  Pay
attention to the qualification: we're talking not even a milisecond of
unused time for at least hours on end out of a day.

I believe this is due to the SCSI disk having software that will
momentarily stop processing requests to perform a recalibration
... where IDE drives may only do this when they are idle for some
small period of time.

Dave.

--
============================================================================
|David Gilbert, Velocet Communications.       | Two things can only be     |
|Mail:       dgilbert@velocet.net             |  equal if and only if they |
|http://daveg.ca                              |   are precisely opposite.  |
=========================================================GLO================

Re: Hardware needed for 15,000,000 record DB?

From
Ragnar Kjørstad
Date:
On Tue, Apr 23, 2002 at 01:44:40PM -0400, David Gilbert wrote:
> I believe this is due to the SCSI disk having software that will
> momentarily stop processing requests to perform a recalibration
> ... where IDE drives may only do this when they are idle for some
> small period of time.

Another reason is that IDE drives often have higher densities. A 160 GB
SCSI disk have a lot of plates (8?), while IDE drives with the same
capacity typicly have only one or two. It basicly means that the disk is
more sensitive to errors.

(Numbers above may not be correct for all IDE and SCSI-disks, they are
just meant as an illustration)

This, like many other IDE/SCSI issues is not a IDE-problem per-say -
it's just (like it was written earlier in the thread) that SCSI-disks
are typicly designed for servers, while IDE disks are typicly designed
for workstations.


--
Ragnar Kjørstad
Big Storage

Re: Hardware needed for 15,000,000 record DB?

From
dalgoda@ix.netcom.com (Mike Castle)
Date:
In article <15557.40328.672259.475024@trooper.velocet.net>,
David Gilbert  <dgilbert@velocet.ca> wrote:
>IDE drives that support queueing (IBM Deskstar something-or-other) and

Aren't these the same drives that are dying left and right?

mrc

--
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc