Critical performance problems on large databases - Mailing list pgsql-general
| From | Gunther Schadow |
|---|---|
| Subject | Critical performance problems on large databases |
| Date | |
| Msg-id | 3CB4BF29.1000700@aurora.regenstrief.org |
| Responses | Re: Critical performance problems on large databases (4 replies) |
| List | pgsql-general |
Hi,

It has been repeatedly noted that pgsql is somewhat slow on selects from large tables. A colleague of mine asked about this recently, and in the Google archives I found the following pertinent question and response:

http://archives.postgresql.org/pgsql-sql/2000-03/msg00031.php

Basically, if you have a large table with, say, a million rows and you do

    SELECT * FROM Bigtable;

it takes a long time to come up with the first page of results, using up lots of computer resources, etc., and after the first page is returned the backend basically goes idle for the rest of the result retrieval.

We also noted that a

    SELECT COUNT(*) FROM Bigtable;

can take quite a long time and again use a lot of resources, whereas

    SELECT COUNT(smallcolumn) FROM Bigtable;

may be faster and less resource-consuming.

Off the bat, this indicates to me that there is something sub-optimal about PostgreSQL's handling of simple queries. In a database that should perform well in online user transactions, one would want query processing to be streamed as much as possible. That is, since a SELECT * FROM Bigtable; requires no work other than retrieving the tuples from physical storage, the response should be immediate and resource usage low; there should not be large buffer allocations.

Conversely, it looks as if PostgreSQL always reads a sizeable piece (if not all?) of the result set into some buffer area before returning a single row. This would explain the slow startup of the SELECT * FROM Bigtable; query, as well as the fact that COUNT(smallcolumn) behaves much faster than COUNT(*). Again, count should be streamed as well, so as to use no significant memory resources other than the counter. Apparently a COUNT(*) in postgres is executed as

    SELECT * FROM Bigtable INTO $somebuffer
    COUNT(tuples in $somebuffer)

Admittedly I am conjecturing here, but the evidence is strong.
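One way to probe this buffering conjecture is to retrieve the result through an explicit cursor, which PostgreSQL streams on demand, and compare first-row latency against the bare SELECT. A minimal sketch, assuming the hypothetical Bigtable from the examples above:

```sql
-- Sketch: fetch the first page through a cursor instead of a bare SELECT.
-- Bigtable is the hypothetical table from above; cursors only live
-- inside a transaction block.
BEGIN;
DECLARE big_cur CURSOR FOR SELECT * FROM Bigtable;
FETCH 100 FROM big_cur;   -- first page of ~100 records
FETCH 100 FROM big_cur;   -- next page, retrieved on demand
CLOSE big_cur;
COMMIT;
```

If the first FETCH returns quickly while the bare SELECT does not, the delay lies in result buffering and transfer rather than in tuple retrieval itself.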
Especially because I can make a 3-level subquery with GROUP BY and all kinds of stuff run almost faster than a SELECT * FROM Bigtable;

Any enlightenments? Am I wrong? Will this be fixed soon? Is it hard to change pgsql to do better streaming of its operations?

As a corollary, I would presume that pgsql won't benefit much from multi-CPU machines because it cannot parallelize its activities. It may be less hard to make all of pgsql thread-safe and threaded than to rewrite the execution engine to stream between the different tasks and do less buffering, right? Or maybe all is fine with pgsql and we just didn't figure out how to set up the configuration right?

thanks
-Gunther

PS: we are seriously looking into using pgsql as the core of a BIG medical record system, but we already know that if we can't get quick online responses (< 2 s) on large result sets (10000 records), at least for the first page (~100 records), we are in trouble.

--
Gunther Schadow, M.D., Ph.D.        gschadow@regenstrief.org
Medical Information Scientist       Regenstrief Institute for Health Care
Adjunct Assistant Professor         Indiana University School of Medicine
tel:1(317)630-7960                  http://aurora.regenstrief.org
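The first-page requirement in the PS (~100 records, quickly) can also be approached with LIMIT-based paging rather than fetching the whole result. A sketch, assuming the hypothetical Bigtable and an assumed indexed key column id:

```sql
-- Sketch: page through Bigtable keyed on an (assumed) indexed column "id",
-- so each request touches little more than the rows it returns.
SELECT * FROM Bigtable ORDER BY id LIMIT 100;   -- first page

SELECT * FROM Bigtable
WHERE id > 100   -- last id seen on the previous page (illustrative value)
ORDER BY id LIMIT 100;                          -- next page, keyset style
```

Keying on the last id seen, rather than using a growing OFFSET, keeps later pages as cheap as the first.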