Re: Reading data in bulk - help? - Mailing list pgsql-performance

From Chris Huston
Subject Re: Reading data in bulk - help?
Date
Msg-id 91D7A447-E361-11D7-A142-000393011B1A@bangjafwac.com
In response to Re: Reading data in bulk - help?  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Reading data in bulk - help?  (Dennis Bjorklund <db@zigo.dhs.org>)
Re: Reading data in bulk - help?  ("Magnus Naeslund(w)" <mag@fbab.net>)
Re: Reading data in bulk - help?  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
Thanks Josh, that helped. I had gone looking for some kind of cluster
option but was looking under CREATE DATABASE, CREATE INDEX and
initlocation - I didn't see CLUSTER index ON table.
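
For the archives, the command is something like this - the index name
below is just a placeholder for whatever the index on group_id is
actually called:

    -- reorder the table on the group_id index so each group's records
    -- end up physically contiguous on disk (this locks and rewrites
    -- the whole table, so it takes a while)
    CLUSTER datatable_group_id_idx ON datatable;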

I ran the CLUSTER, which took about 2 1/2 hours to complete. That
improved the query performance about 6x - which is great - but it's
still taking 26 minutes to do what a serial read does in about 2 1/2
minutes.

At this point I'm OK because each fetch takes around 200 milliseconds
from the call to the time the data is ready. The processing takes
300-600ms per batch. I've got the fetch and the processing running in
separate threads, so even if postgres were running faster it wouldn't
help this implementation.

However, "iostat" is still reporting average size per transfer of about
10kB and total thru-put of about 1MB/s. The transfers per second went
from >200/s to about 80/s. It still seams like it ought to be a faster.

The system is currently running on a single-processor 500MHz G4. We're
likely to move to a two-processor 2GHz G5 in the next few months. Then
each block may take only 30-60 milliseconds to complete, and there can
be two blocks processing concurrently.

Sometime before then I need to figure out how to cut the fetch times
from the current 200ms to something like 10ms. There are currently
1,628,800 records in the single data table, representing 6817 groups.
Each group has 2 to 284 records - with 79% having the max of 284 (the
max grows by 1 every day - although the value may change throughout
the day). Each record is maybe 1 or 2k, so ideally each batch/group
should require 284-568k - at 10MB/s that'd be roughly 30-60ms per
group.

RELATED QUESTION: Now, how do I speed up the following query: "select
distinct group_id from datatable"? It results in a sequential scan of
the db. Why doesn't it use the group_id index? I only do this once per
run, so it's not as critical as the fetch speed, which is done 6817
times.
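
In case it's useful to anyone reading the archives, here's what I'm
running, plus the only workaround I've come up with so far (untested -
the separate "groups" table is just an idea, and I'm assuming group_id
is an integer):

    -- shows the plan; here it's a sequential scan over all 1.6M rows
    EXPLAIN SELECT DISTINCT group_id FROM datatable;

    -- possible workaround: keep the list of group ids in a small
    -- table of its own
    CREATE TABLE groups (group_id integer PRIMARY KEY);
    INSERT INTO groups SELECT DISTINCT group_id FROM datatable;
    -- the once-per-run query then becomes:
    SELECT group_id FROM groups;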

Thanks for the help!
- Chris
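
P.S. For step 1 below (postgresql.conf), a rough starting point might
look something like this - the numbers are just guesses for a machine
with ~512MB of RAM, not recommendations:

    # postgresql.conf (7.3-era parameter names)
    shared_buffers = 8192           # 8kB buffers, ~64MB total
    sort_mem = 16384                # per-sort memory in kB, ~16MB
    effective_cache_size = 32768    # OS cache estimate in 8kB pages, ~256MB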

On Tuesday, Sep 9, 2003, at 18:11 America/Denver, Josh Berkus wrote:

> Chris,
>
>> I've got an application that needs to chunk through ~2GB of data. The
>> data is ~7000 different sets of 300 records each. I put all of the
>> data into a postgres database but that doesn't look like it's going
>> to work because of how the data lives on the disk.
>
> Your problem is curable through 4 steps:
>
> 1) adjust your postgresql.conf to appropriate levels for memory usage.
>
> 2) if those sets of 300 are blocks in some contiguous order, then
> cluster them to force their physical ordering on disk to be the same
> order you want to read them in.  This will require you to re-cluster
> whenever you change a significant number of records, but from the
> sound of it that happens in batches.
>
> 3) Get better disks, preferably a RAID array, or just very fast SCSI
> if the database is small.  If you're budget-constrained, Linux
> software RAID (or BSD RAID) on IDE disks is cheap.  What kind of RAID
> depends on what else you'll be doing with the app; RAID 5 is better
> for read-only access, RAID 1+0 is better for read-write.
>
> 4) Make sure that you aren't dumping the data to the same disk
> PostgreSQL lives on!  Preferably, make sure that your swap partition
> is on a different disk/array from PostgreSQL.  If the computing app
> is complex and requires disk reads aside from postgres data, you
> should make sure that it lives on yet another disk.  Or you can
> simplify this with a good, really large multi-channel RAID array.
>
> --
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
>
>

