Re: Simple (hopefully) throughput question? - Mailing list pgsql-performance

From: Andy Colson
Subject: Re: Simple (hopefully) throughput question?
Msg-id: 4CD19E84.7070204@squeakycode.net
In response to: Simple (hopefully) throughput question? (Nick Matheson <Nick.D.Matheson@noaa.gov>)
Responses: Re: Simple (hopefully) throughput question? (Nick Matheson <Nick.D.Matheson@noaa.gov>)
           Re: Simple (hopefully) throughput question? (Nick Matheson <Nick.D.Matheson@noaa.gov>)
List: pgsql-performance
On 11/3/2010 10:52 AM, Nick Matheson wrote:
> Hello
>
> We have an application that needs to do bulk reads of ENTIRE
> Postgres tables very quickly (i.e. select * from table). We have
> observed that such sequential scans run roughly twenty times slower
> than raw disk reads (5 MB/s versus 100 MB/s). Part of this is due to
> the storage overhead we have observed in Postgres: in the example
> below, it takes 1 GB to store 350 MB of nominal data. Even allowing
> for that overhead, we would expect bulk read rates of about 35 MB/s
> (100 MB/s of disk throughput times 350/1000).
>
> Observations using iostat and top during these bulk reads suggest
> that the queries are CPU bound, not I/O bound. In fact, repeating the
> queries yields similar response times. Presumably if it were an I/O
> issue the response times would be much shorter the second time through
> with the benefit of caching.
>
> We have tried these simple queries using psql, JDBC, pl/java stored
> procedures, and libpq. In all cases the client code ran on the same
> box as the server. We have experimented with Postgres 8.1, 8.3 and 9.0.
>
> We also tried playing around with some of the server tuning parameters
> such as shared_buffers to no avail.
>
> Here is uname -a for a machine we have tested on:
>
> Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20
> 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> A sample dataset that reproduces these results looks like the following
> (there are no indexes):
>
> Table "bulk_performance.counts"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  i1     | integer |
>  i2     | integer |
>  i3     | integer |
>  i4     | integer |
> There are 22 million rows in this case.
>
> We HAVE observed that summation queries run considerably faster. In this
> case,
>
> select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts
>
> runs at 35 MB/s.
>
> Our business logic does operations on the resulting data such that
> the output is several orders of magnitude smaller than the input. So
> we had hoped that by putting our business logic into stored procedures
> (and thus drastically reducing the amount of data flowing to the
> client) our throughput would go way up. This did not happen.
>
> So our questions are as follows:
>
> Is there any way using stored procedures (maybe C code that calls
> SPI directly) or some other approach to get close to the expected 35
> MB/s doing these bulk reads? Or is this the price we have to pay for
> using SQL instead of some NoSQL solution? (We actually tried Tokyo
> Cabinet and found it to perform quite well. However, it does not
> measure up to Postgres in terms of replication, data interrogation,
> community support, acceptance, etc.)
>
> Thanks
>
> Dan Schaffer
> Paul Hamer
> Nick Matheson
>
>

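If you want to reproduce the dataset described above, something like
this should be close (a sketch, untested; the schema and row count are
taken from the post, the generated values are made up):

  create schema bulk_performance;
  create table bulk_performance.counts (
    i1 integer,
    i2 integer,
    i3 integer,
    i4 integer
  );
  -- 22 million rows, as in the original post
  insert into bulk_performance.counts
    select g, g, g, g
    from generate_series(1, 22000000) g;
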
I have no idea if this would be helpful or not, never tried it, but when
you fire off "select * from bigtable", the client library (libpq, and
JDBC in its default configuration) buffers the entire result set in
memory before your code sees the first row.  You might try a cursor and
fetch 100-1000 rows at a time from it.  No idea if it would be faster or
slower.
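
Something along these lines (untested; the cursor name is arbitrary):

  begin;
  declare c no scroll cursor for
    select * from bulk_performance.counts;
  fetch 1000 from c;  -- repeat until it returns no rows
  close c;
  commit;

That keeps the client from having to hold all 22 million rows at once,
at the cost of a network round trip per fetch.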

-Andy
