Re: Help with bulk read performance - Mailing list pgsql-performance
From | Andy Colson |
---|---|
Subject | Re: Help with bulk read performance |
Date | |
Msg-id | 4D078CD7.7030308@squeakycode.net Whole thread Raw |
In response to | Help with bulk read performance (Dan Schaffer <Daniel.S.Schaffer@noaa.gov>) |
Responses |
Re: Help with bulk read performance
|
List | pgsql-performance |
On 11/1/2010 9:15 AM, Dan Schaffer wrote: > Hello > > We have an application that needs to do bulk reads of ENTIRE Postgres > tables very quickly (i.e. select * from table). We have observed that > such sequential scans run two orders of magnitude slower than observed > raw disk reads (5 MB/s versus 100 MB/s). Part of this is due to the > storage overhead we have observed in Postgres. In the example below, it > takes 1 GB to store 350 MB of nominal data. However that suggests we > would expect to get 35 MB/s bulk read rates. > > Observations using iostat and top during these bulk reads suggest that > the queries are CPU bound, not I/O bound. In fact, repeating the queries > yields similar response times. Presumably if it were an I/O issue the > response times would be much shorter the second time through with the > benefit of caching. > > We have tried these simple queries using psql, JDBC, pl/java stored > procedures, and libpq. In all cases the client code ran on the same box > as the server. > We have experimented with Postgres 8.1, 8.3 and 9.0. > > We also tried playing around with some of the server tuning parameters > such as shared_buffers to no avail. > > Here is uname -a for a machine we have tested on: > > Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 > 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux > > A sample dataset that reproduces these results looks like the following > (there are no indexes): > > Table "bulk_performance.counts" > Column | Type | Modifiers > --------+---------+----------- > i1 | integer | > i2 | integer | > i3 | integer | > i4 | integer | > > There are 22 million rows in this case. > > We HAVE observed that summation queries run considerably faster. In this > case, > > select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts > > runs at 35 MB/s. > > > Our business logic does operations on the resulting data such that the > output is several orders of magnitude smaller than the input. So we had > hoped that by putting our business logic into stored procedures (and > thus drastically reducing the amount of data flowing to the client) our > throughput would go way up. This did not happen. > > So our questions are as follows: > > Is there any way using stored procedures (maybe C code that calls SPI > directly) or some other approach to get close to the expected 35 MB/s > doing these bulk reads? Or is this the price we have to pay for using > SQL instead of some NoSQL solution. (We actually tried Tokyo Cabinet and > found it to perform quite well. However it does not measure up to > Postgres in terms of replication, data interrogation, community support, > acceptance, etc). > > Thanks > > Dan Schaffer > Paul Hamer > Nick Matheson > > > > Whoa... Deja Vu Is this the same thing Nick is working on? How'd he get along? http://archives.postgresql.org/message-id/4CD1853F.2010800@noaa.gov -Andy
pgsql-performance by date: