Re: Fast Inserts and Hardware Questions - Mailing list pgsql-general

From Orion Henry
Subject Re: Fast Inserts and Hardware Questions
Msg-id 3AB00E01.6C4F7333@trustcommerce.com
In response to Fast Inserts and Hardware Questions  (Orion Henry <orion@trustcommerce.com>)
List pgsql-general
Richard H wrote:
>
> On 3/14/01, 7:08:48 PM, Orion Henry <orion@trustcommerce.com> wrote
> regarding [GENERAL] Fast Inserts and Hardware Questions:
>
> Please bear in mind that you are operating well above anything I do, so
> I'm not speaking from experience.

I appreciate the advice in any case.

> > I am speccing out a database for my company and I need some advice.  The
> > database in mind will consist of one table, with about 300 bytes per
> > record.  The table will be getting two or three million inserts a week
> > and it would be nice if it could handle a sustained 30 to 50 a second.
> > The table will have (ACK!) 4 indexes (if anyone can explain to me how I
> > can get away with less please let me know)
>
> Three million inserts a week is about 5 per second, so you want to
> sustain 10 times the average rate (sounds reasonable). Have you
> considered separating inserts from reads? The inserts wouldn't even need
> to go into a database initially, just log them to a file and feed them in
> at a steady 10 per second (allows for downtime).

Already done.  It's just that the less time between getting the data and
it being available to the customers the better we look.
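For what it's worth, the feeder could push each accumulated batch with a
single COPY in one transaction rather than row-at-a-time INSERTs, which
avoids per-statement overhead.  A rough sketch - the table name and file
path are made up, adjust to the real schema:

```sql
-- Load one batch of logged records in a single transaction.
-- 'transactions' and the spool path are hypothetical names.
BEGIN;
COPY transactions FROM '/var/spool/txlog/batch.dat';
COMMIT;
```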

> > The indexes will be
> > int8 (primary key)
> > int4 (group number)
> > timestamp (creation date)
> > int4 (customer id)
>
> > The customers want to be able to query their data whenever and have it
> > be snappy.
>
> And they don't see why they need to pay so much for something so simple,
> either ;-)

Tell me about it ;)

> > So here is my question:
> > * Is there an OS that is best suited for postgres. All things being
> > equal I would like to run this on Linux.
>
> Can't comment - I've only used PG on Linux. People who use the various
> xxxBSDs swear by them (rather than at them) and Solaris has some
> excellent support (at a cost)
>
> Of course, you could try MS-Access on Win-ME ;-)

Or... not. ;)

> > * Is there an architecture best suited for this.  Should I shell out the
> > extra cash for a 64 bit box over a 32bit one.  Quad Xeon or Quad Alpha?
> > Quad Alpha or Quad UltraSparc?
>
> Again hearsay, but Sun's hardware has excellent I/O throughput and it's
> difficult to get people to abandon it in favour of Intel once they've
> tried it.

What I was hoping to hear was something along the lines of "An int8
index?  Dear God!  You better be using a 64 bit machine!"  or  "The only
way you will be able to handle a load like that will be with
UltraSparc+Solaris!!" or "Some Alpha fan did tons of assembly level
optimizations for the Alpha processors, they are the fastest!"  From the
sounds of it what I need is lots of fast disks and to hell with the
CPU.  I might as well get an Athlon and a Network Appliance.

> > * Since most of what I am doing is inserts I will assume that the disk
> > will be my bottleneck over anything else.  Is this correct?
>
> Yes.
>
> > * Will the 7.1 WAL save me, when it comes to insert times?
> > * I read something about disabling fsync() to speed up inserts.  How
> > much will this speed things up?  I would consider it as I COULD rebuild
> > lost data from my logs in the event of a system crash and one night a
> > year of panic on my part is worth saving $100,000 in drive arrays.
>
> Turning fsync off gave about a factor of 10 increase (!) before 7.1 - the
> WAL stuff means you can do so safely. Basically PG should rebuild itself
> from the WAL file in the event of system failure. Have to admit I've not
> tried it myself yet.

A tenfold increase in insert speed over what I have now would make me a
very happy man.  That would put me in the order of magnitude of speed I
need with the hardware I have now.
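For reference, in 7.1 the knob lives in postgresql.conf - a sketch only,
I haven't benchmarked it myself, and any data committed but not yet
flushed can be lost if the OS crashes:

```
# postgresql.conf - trade crash safety for insert speed.
# Committed-but-unflushed data may be lost on an OS crash.
fsync = off
```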

> > Oh, and if any of you SQL guru's are still reading I'll show you the
> > queries I will be running to see if I can really ditch an index.
>
> > select * from table where customer_id = ? and primary_key = ?::int8
>
> Well if primary_key is one, you don't need customer_id.

true.

> > select * from table where customer_id = ? and group_number = ?

The groups are less than 10 objects each so the customer index here
does no good.

> > select * from table where customer_id = ? and creation > ? and creation < ?

This is where the customer index comes in handy.  I'll explain below.

> How many customers do you have? Are the customers going to want a
> specific time period? If so, you could have one table per week (say) and
> use a view to conceal the fact for the occasional query that crosses week
> boundaries.
>
> If these are actually going to be GROUP BY queries totalling figures,
> perhaps try calculating totals beforehand.
>
> If these are representative of your main queries, I'd be tempted to buy
> 10 cheap machines (+ a spare) and split the customers among the machines.
> Proxy this stuff at the application level and they'll never know. For
> your management stats you'd have to write a script to summarise stuff
> from several machines, but that shouldn't be too difficult. The spare
> machine can be over-specced and have the data from all the others ready
> for a hot-swapover.
>
> The beauty of multiple machines is it should scale well (so long as you
> are sufficiently fascist about keeping the configs the same).

I have about 100 customers but one of them is responsible for generating
over half the database entries.  Without the customer index the queries
of the smaller customers who have a dozen entries are slow because the
database is sorting through millions of entries from other customers.  I
suppose the thing to do is to get the big customer their own box and
drop the customer index altogether.
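If anyone goes the table-per-week route suggested above, the view trick
might look something like this - all table and column names here are
hypothetical, matching the four indexed fields described earlier:

```sql
-- One physical table per week; a view hides the split for the
-- occasional query that crosses week boundaries.
CREATE TABLE tx_week_11 (id int8 PRIMARY KEY, customer_id int4,
                         group_number int4, creation timestamp);
CREATE TABLE tx_week_12 (id int8 PRIMARY KEY, customer_id int4,
                         group_number int4, creation timestamp);

CREATE VIEW tx_all AS
    SELECT * FROM tx_week_11
    UNION ALL
    SELECT * FROM tx_week_12;

-- Date-range queries for a known week can hit the small per-week
-- table directly; everything else goes through the view.
SELECT * FROM tx_all
 WHERE customer_id = 42
   AND creation > '2001-03-12' AND creation < '2001-03-19';
```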
