Critical performance problems on large databases - Mailing list pgsql-general

Hi,

It has been repeatedly noted that pgsql is rather slow on
selects on large tables. For instance, a colleague of
mine asked about this recently, and in the Google archives I
found the following pertinent question and response.

http://archives.postgresql.org/pgsql-sql/2000-03/msg00031.php

Basically if you have a large table with, say, a million rows
and you do

SELECT * FROM Bigtable;

it takes a long time for the first page of results to come
up, using lots of computer resources in the process, and
after the first page is returned the backend basically goes
idle for the rest of the result retrieval.
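One workaround I have read about, though we have not tried it
ourselves yet, is to wrap the query in an explicit cursor so
that the backend only ships as many rows as are FETCHed at a
time; the cursor name and the page size of 100 below are
arbitrary:

  BEGIN;
  DECLARE big_cur CURSOR FOR SELECT * FROM Bigtable;
  FETCH 100 FROM big_cur;   -- first page only
  -- ...FETCH further pages as the user scrolls...
  CLOSE big_cur;
  COMMIT;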

We also noted that a

SELECT COUNT(*) FROM Bigtable;

can take quite a long time and again use a lot of resources,
whereas

SELECT COUNT(smallcolumn) FROM Bigtable;

may be faster and less resource consuming.
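(Note that the two counts are not strictly equivalent anyway:
COUNT(smallcolumn) only counts rows where smallcolumn is not
NULL, whereas COUNT(*) counts every row, so the comparison is
only fair if smallcolumn is declared NOT NULL:

  SELECT COUNT(*) FROM Bigtable;            -- counts all rows
  SELECT COUNT(smallcolumn) FROM Bigtable;  -- counts non-NULL values only

)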

Off the bat, this indicates to me that there is something
sub-optimal about how PostgreSQL handles simple queries. In
a database that should perform well in online user transactions,
one would want the query processing to be streamed as much
as possible; i.e., since a SELECT * FROM Bigtable; requires
no work other than retrieving the tuples from physical
storage, the response should be immediate and resource
usage low. There should not be large buffer allocations.

Conversely it looks as if PostgreSQL will always read a sizeable
piece (if not all?) of the result set into some buffer area before
returning a single row. This would explain the slow startup on
the SELECT * FROM Bigtable; query as well as the fact that
COUNT(smallcolumn) behaves much faster than COUNT(*).

Again, count should be streamed as well, so as to use no
significant memory resources other than the counter. Apparently
a COUNT(*) in postgres is executed as

SELECT * FROM Bigtable INTO $somebuffer
COUNT(tuples in $somebuffer)

Admittedly I am conjecturing here, but the evidence is strong,
especially because I can make a 3-level subquery with GROUP
BY and all kinds of other work run almost faster than a
SELECT * FROM Bigtable;

Any enlightenment? Am I wrong? Will this be fixed soon?
Is it hard to change pgsql to do better streaming of its
operations? As a corollary, I would presume that pgsql
won't benefit much from multi-CPU machines because it cannot
parallelize its activities. It may be less hard to make all
of pgsql thread-safe and threaded than it is to rewrite the
execution engine to stream between the different tasks
and do less buffering, right?

Or maybe all is fine with pgsql and we just haven't figured
out how to set up the configuration right?

thanks
-Gunther


PS: we are seriously looking into using pgsql as the core
of a BIG medical record system, but we already know that
if we can't get quick online responses (< 2 s) on
large result sets (10,000 records), at least for the first
page (~ 100 records), we are in trouble.
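For that first-page case, a query along these lines is what
we would want to be fast (the table and column names here are
made up just for illustration):

  SELECT * FROM patient_record
  ORDER BY record_id
  LIMIT 100;   -- just the first screenful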

--
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org


