Re: Reading data in bulk - help? - Mailing list pgsql-performance
From | Chris Huston
---|---
Subject | Re: Reading data in bulk - help?
Date |
Msg-id | 91D7A447-E361-11D7-A142-000393011B1A@bangjafwac.com
In response to | Re: Reading data in bulk - help? (Josh Berkus <josh@agliodbs.com>)
Responses | Re: Reading data in bulk - help? (×3)
List | pgsql-performance
Thanks Josh, that helped. I had gone looking for some kind of cluster option but was looking under create database, create index and initlocation - I didn't see the CLUSTER index ON table command. I ran the CLUSTER, which took about 2 1/2 hours to complete. That improved the query performance about 6x - which is great - but it is still taking 26 minutes to do what a serial read does in about 2 1/2 minutes.

At this point I'm OK because each fetch takes around 200 milliseconds from the call to the time the data is ready. The processing takes 300-600ms per batch. I've got the fetch and the processing running in separate threads, so even if postgres were running faster it wouldn't help this implementation.

However, "iostat" is still reporting an average size per transfer of about 10kB and total throughput of about 1MB/s. The transfers per second went from >200/s to about 80/s. It still seems like it ought to be faster.

The system is currently running on a single-processor 500MHz G4. We're likely to move to a two-processor 2GHz G5 in the next few months. Then each block may take only 30-60 milliseconds to complete, and there can be two blocks processing concurrently. Sometime before then I need to figure out how to cut the fetch times from the current 200ms to something like 10ms.

There are currently 1,628,800 records in the single data table, representing 6817 groups. Each group has 2 to 284 records, with 79% having the max of 284 (the max grows by 1 every day, although the value may change throughout the day). Each record is maybe 1 or 2k, so ideally each batch/group should require 284-568k - at 10MB/s that'd be on the order of 30-60ms per group.

RELATED QUESTION: Now, how do I speed up the following query: "select distinct group_id from datatable"? It results in a sequential scan of the db. Why doesn't it use the group_id index? I only do this once per run, so it's not as critical as the fetch speed, which is exercised 6817 times.

Thanks for the help!

- Chris

On Tuesday, Sep 9, 2003, at 18:11 America/Denver, Josh Berkus wrote:

> Chris,
>
>> I've got an application that needs to chunk through ~2GB of data. The
>> data is ~7000 different sets of 300 records each. I put all of the data
>> into a postgres database but that doesn't look like its going to work
>> because of how the data lives on the disk.
>
> Your problem is curable through 4 steps:
>
> 1) adjust your postgresql.conf to appropriate levels for memory usage.
>
> 2) if those sets of 300 are blocks in some contiguous order, then cluster them
> to force their physical ordering on disk to be the same order you want to
> read them in. This will require you to re-cluster whenever you change a
> significant number of records, but from the sound of it that happens in
> batches.
>
> 3) Get better disks, preferrably a RAID array, or just very fast scsi if the
> database is small. If you're budget-constrained, Linux software raid (or
> BSD raid) on IDE disks is cheap. What kind of RAID depends on what else
> you'll be doing with the app; RAID 5 is better for read-only access, RAID 1+0
> is better for read-write.
>
> 4) Make sure that you aren't dumping the data to the same disk postgreSQL
> lives on! Preferably, make sure that your swap partition is on a different
> disk/array from postgresql. If the computing app is complex and requires
> disk reads aside from postgres data, you should make sure that it lives on
> yet another disk. Or you can simplify this with a good, really large
> multi-channel RAID array.
>
> --
> -Josh Berkus
> Aglio Database Solutions
> San Francisco
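
As a point of reference for the clustering step discussed above, on a 7.x-era PostgreSQL the commands would look roughly like the sketch below. Only the table name (datatable) and column (group_id) appear in the thread; the index name and the literal group value are hypothetical placeholders.

    -- minimal sketch, assuming an existing b-tree index on group_id;
    -- the name "datatable_group_id_idx" is hypothetical
    -- 7.x syntax: CLUSTER indexname ON tablename
    -- rewrites the table in index order (the step that gave the ~6x improvement)
    CLUSTER datatable_group_id_idx ON datatable;

    -- the per-group fetch that the ~200ms figure presumably refers to
    SELECT * FROM datatable WHERE group_id = 1234;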
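For the related question about "select distinct group_id from datatable", a first step is simply to look at the plan. The sketch below is just the standard EXPLAIN form plus a GROUP BY variant to compare against; it is not a claim about why the planner skips the index.

    EXPLAIN SELECT DISTINCT group_id FROM datatable;

    -- a GROUP BY formulation of the same question, worth comparing in EXPLAIN
    EXPLAIN SELECT group_id FROM datatable GROUP BY group_id;

With only 6817 distinct values across 1.6 million rows, both forms may still end up reading the whole table, but EXPLAIN at least shows what the planner estimates for each.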
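Josh's step 1 ("adjust your postgresql.conf to appropriate levels for memory usage") usually means the handful of memory parameters below. The parameter names are the 7.x-era ones; the values are illustrative assumptions for a machine of that vintage, not settings taken from this thread.

    # postgresql.conf sketch - values are assumptions, tune to the machine's RAM
    shared_buffers = 4096            # in 8kB buffers, roughly 32MB of shared cache
    sort_mem = 8192                  # in kB, memory available per sort operation
    effective_cache_size = 65536     # in 8kB pages, rough size of the OS disk cache (~512MB)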