Simple (hopefully) throughput question? - Mailing list pgsql-performance

From: Nick Matheson
Subject: Simple (hopefully) throughput question?
Msg-id: 4CD1853F.2010800@noaa.gov
Responses:
    Re: Simple (hopefully) throughput question?  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
    Re: Simple (hopefully) throughput question?  (Marti Raudsepp <marti@juffo.org>)
    Re: Simple (hopefully) throughput question?  (Andy Colson <andy@squeakycode.net>)
    Re: Simple (hopefully) throughput question?  ("Pierre C" <lists@peufeu.com>)
    Re: Simple (hopefully) throughput question?  (Robert Klemme <shortcutter@googlemail.com>)
List: pgsql-performance
Hello

We have an application that needs to do bulk reads of ENTIRE
Postgres tables very quickly (i.e. select * from table).  We have
observed that such sequential scans run roughly 20 times slower than
raw disk reads (5 MB/s versus 100 MB/s).  Part of this is due to
storage overhead: in the example below, Postgres takes 1 GB on disk
to store 350 MB of nominal data.  But that overhead alone would only
account for about a 3x slowdown, so we would still expect bulk reads
to deliver roughly 35 MB/s of nominal data (100 MB/s * 350 MB / 1 GB),
not 5 MB/s.

Observations using iostat and top during these bulk reads suggest
that the queries are CPU bound, not I/O bound.  In fact, repeating
the queries yields similar response times; if it were an I/O issue,
the second pass should be much faster with the benefit of caching.

We have tried these simple queries using psql, JDBC, PL/Java stored
procedures, and libpq.  In all cases the client code ran on the same
box as the server.  We have experimented with Postgres 8.1, 8.3, and 9.0.
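
For reference, our libpq test amounted to something like the following
(a minimal sketch, not our exact code; the connection string is a
placeholder, and the throughput arithmetic assumes the 16-byte rows of
the sample table described below):

/*
 * Sketch: time a full read of bulk_performance.counts through libpq
 * and report nominal MB/s.  Build with something like:
 *     cc -o bulkread bulkread.c -lpq
 */
#include <stdio.h>
#include <sys/time.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn         *conn;
    PGresult       *res;
    struct timeval  t0, t1;
    double          secs, mb;
    int             rows;

    conn = PQconnectdb("dbname=postgres");      /* placeholder */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    gettimeofday(&t0, NULL);

    /* PQexec buffers the whole result set in client memory; for
     * 22 million rows a cursor would be gentler on memory. */
    res = PQexec(conn, "SELECT * FROM bulk_performance.counts");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }

    gettimeofday(&t1, NULL);
    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;

    rows = PQntuples(res);
    mb   = rows * 16.0 / (1024 * 1024);          /* 4 ints = 16 bytes */
    printf("%d rows, %.0f MB nominal, %.1f MB/s\n", rows, mb, mb / secs);

    PQclear(res);
    PQfinish(conn);
    return 0;
}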

We also tried playing around with some of the server tuning parameters such as shared_buffers to no avail.

Here is uname -a for a machine we have tested on:

Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

A sample dataset that reproduces these results looks like the following (there are no indexes):

Table "bulk_performance.counts"
 Column |  Type   | Modifiers
--------+---------+-----------
 i1     | integer |
 i2     | integer |
 i3     | integer |
 i4     | integer |

There are 22 million rows in this case.  At 16 bytes of integer data
per row that is about 350 MB of nominal data, and the roughly 1 GB
on-disk footprint works out to about 48 bytes per row, consistent
with Postgres's per-tuple and per-page overhead.
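
For anyone who wants to reproduce the dataset, something like this
builds it (a sketch; the schema and row count follow the description
above, and the connection string is a placeholder):

/* Sketch: build the sample dataset via libpq. */
#include <stdio.h>
#include <libpq-fe.h>

static void run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
    PQclear(res);
}

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");   /* placeholder */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    run(conn, "CREATE SCHEMA bulk_performance");
    run(conn, "CREATE TABLE bulk_performance.counts "
              "(i1 integer, i2 integer, i3 integer, i4 integer)");
    /* 22 million rows of four integers = ~350 MB of nominal data */
    run(conn, "INSERT INTO bulk_performance.counts "
              "SELECT g, g, g, g FROM generate_series(1, 22000000) g");

    PQfinish(conn);
    return 0;
}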

We HAVE observed that summation queries run considerably faster.  In this case,

select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts

runs at 35 MB/s.

Our business logic does operations on the resulting data such that
the output is several orders of magnitude smaller than the input.  So
we had hoped that by putting our business logic into stored procedures
(and thus drastically reducing the amount of data flowing to the
client) our throughput would go way up.  This did not happen.

So our questions are as follows:

Is there any way, using stored procedures (maybe C code that calls
SPI directly; a rough sketch follows) or some other approach, to get
close to the expected 35 MB/s for these bulk reads?  Or is this the
price we have to pay for using SQL instead of some NoSQL solution?
(We actually tried Tokyo Cabinet and found that it performs quite
well; however, it does not measure up to Postgres in terms of
replication, data interrogation, community support, acceptance, etc.)
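
To make the SPI question concrete, the kind of server-side function we
have in mind looks roughly like this (a sketch only, not tested code;
the function name is ours, error handling is omitted, and for 22
million rows SPI_cursor_fetch in batches would keep memory bounded):

/* Sketch: a C function using SPI to scan bulk_performance.counts
 * and return sum(i1), so only one value crosses the connection. */
#include "postgres.h"
#include "executor/spi.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(counts_sum_i1);

Datum
counts_sum_i1(PG_FUNCTION_ARGS)
{
    int64   total = 0;
    uint64  i;

    SPI_connect();

    /* Read-only; count = 0 materializes all rows in SPI_tuptable. */
    SPI_execute("SELECT i1 FROM bulk_performance.counts", true, 0);

    for (i = 0; i < SPI_processed; i++)
    {
        bool  isnull;
        Datum d = SPI_getbinval(SPI_tuptable->vals[i],
                                SPI_tuptable->tupdesc, 1, &isnull);

        if (!isnull)
            total += DatumGetInt32(d);
    }

    SPI_finish();
    PG_RETURN_INT64(total);
}

It would be registered with something like CREATE FUNCTION
counts_sum_i1() RETURNS bigint AS '/path/to/module', 'counts_sum_i1'
LANGUAGE C; and the question is whether this path can approach the
35 MB/s we see with the built-in sum().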

Thanks

Dan Schaffer
Paul Hamer
Nick Matheson

