One particular large database application - Mailing list pgsql-general

From Curt Sampson
Subject One particular large database application
Msg-id Pine.NEB.4.43.0204211417200.6249-100000@angelic.cynic.net
In response to Re: Backup very large databases  (Francisco Reyes <lists@natserv.com>)
List pgsql-general
On Sat, 20 Apr 2002, Francisco Reyes wrote:

> On Sat, 20 Apr 2002, Curt Sampson wrote:
>
> > I anticipate having a 30-50 GB or larger database in the next few months.
>
> I think that with so much data having good hardware is key.
> What type of data is this? How many people concurrently connect?

Well, "I'm not sure." The client is being a bit obscure about what he
really wants, and in fact may not quite know exactly what he really
wants. Welcome to the joys of being a consultant. :-)

My contract is to do what seems to be a proof-of-concept. They are
giving me a pile of sample data (about half a billion records) which
includes a record ID, date, user ID, and a bit of other stuff. I need to
do selects by user ID across a range of dates. I'm obviously going to
have to work out more details about what they really want as I go along,
which is no problem: that's one of the things I'm hired to do.
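To make that concrete, the table would look something like this.
(This is just a sketch: the table and column names are my own
placeholders, and psycopg2 is just one possible Python driver.)

    import psycopg2

    conn = psycopg2.connect("dbname=sampledata")
    cur = conn.cursor()

    # Placeholder schema for the sample data: a record ID, a date,
    # a user ID, and the "bit of other stuff" (omitted here).
    cur.execute("""
        CREATE TABLE events (
            record_id   bigint,
            event_date  date,
            user_id     integer
        )
    """)

    # A composite index lets a "user ID across a range of dates"
    # select be answered with a single index scan.
    cur.execute("CREATE INDEX events_user_date ON events (user_id, event_date)")
    conn.commit()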

Anyway, it appears that this is going to be used as a query-only
copy of a database where the master data is stored elsewhere, so
that's why I'm not too worried about updates. I probably don't need
backups, either, since I can always just re-import the data if I
lose it.

One of the things they want me to try is partitioning the data
across multiple machines, and submitting queries in parallel. So
I'll be writing software that will take a query, figure out which
tables it applies to, run it against those tables (choosing the
servers appropriately as well), and consolidate the results.
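A rough sketch of what that dispatcher might look like, assuming
the data is partitioned by date range with one partition per server.
(The host names, the partition map, and the thread-pool approach
are all placeholders, not design decisions yet.)

    import datetime
    import psycopg2
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical partition map: date range covered -> connection string.
    PARTITIONS = {
        (datetime.date(2001, 1, 1), datetime.date(2001, 6, 30)):
            "host=db1 dbname=sampledata",
        (datetime.date(2001, 7, 1), datetime.date(2001, 12, 31)):
            "host=db2 dbname=sampledata",
    }

    def query_one(dsn, user_id, start, end):
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(
                "SELECT record_id, event_date, user_id FROM events"
                " WHERE user_id = %s AND event_date BETWEEN %s AND %s",
                (user_id, start, end))
            return cur.fetchall()
        finally:
            conn.close()

    def query_all(user_id, start, end):
        # Choose only the servers whose partitions overlap the
        # requested date range.
        dsns = [dsn for (lo, hi), dsn in PARTITIONS.items()
                if lo <= end and start <= hi]
        # Submit the query to each server in parallel...
        with ThreadPoolExecutor(max_workers=len(dsns) or 1) as pool:
            futures = [pool.submit(query_one, dsn, user_id, start, end)
                       for dsn in dsns]
        # ...and consolidate the per-server results.
        rows = []
        for f in futures:
            rows.extend(f.result())
        return rows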

For hardware, it seems that a bunch of cheap, basic PCs would do
the trick. I'm thinking of a system with a 1-2 GHz CPU, 512 MB of
memory, a 20-40 GB IDE disk for the system, log and temporary space,
and an 80 GB or larger IDE disk for the data. If reliability is a
real concern, mirroring the disks is probably the best option.

These systems, in 1U cases, should come in under 200,000 yen each
(under US$1500) and we can scale by adding more of them. But even
the performance of a single one should be fairly good, since I'm
avoiding RAID-5, which tends to be a serious performance killer.

This, BTW, is why I've been going on about import speed recently.
It's expensive (both in terms of cost and performance) to avoid
disk system failure with RAID or whatever, but if you can live with
recovering from failure, rather than avoiding failure, you can use
cheaper and faster hardware. So I'm wondering if I can set this up
to be able to recover from a failure fast enough that I don't need
to worry so much about preventing one.
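For instance, if a node's disk dies, rebuilding it could be as
simple as recreating the table and bulk-loading the data file again
with COPY; something along these lines (file path and driver again
just placeholders):

    import psycopg2

    def rebuild_node(dsn, datafile):
        """Recreate a node's table from a flat file of master data."""
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        cur.execute("""
            CREATE TABLE events (
                record_id   bigint,
                event_date  date,
                user_id     integer
            )
        """)
        # COPY is the fast bulk-load path; assumes tab-separated input.
        with open(datafile) as f:
            cur.copy_expert("COPY events FROM STDIN", f)
        # Build the index after loading: much cheaper than maintaining
        # it row by row during the import.
        cur.execute("CREATE INDEX events_user_date ON events (user_id, event_date)")
        conn.commit()
        conn.close()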

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

