Re: Critical performance problems on large databases - Mailing list pgsql-general

From Shaun Thomas
Subject Re: Critical performance problems on large databases
Date
Msg-id Pine.LNX.4.44.0204110759120.6628-100000@hamster.lee.net
In response to Critical performance problems on large databases  (Gunther Schadow <gunther@aurora.regenstrief.org>)
Responses Re: Critical performance problems on large databases
List pgsql-general
On Wed, 10 Apr 2002, Gunther Schadow wrote:

> SELECT * FROM Bigtable;
>
> it takes a long time for it to come up with the first page
> of results using up lots of computer resources etc and after
> the first page is returned the backend basically goes into
> idle mode for all the rest of the query results retrieval.

Very simple.  PHP and Perl DBI, when doing non-cursor queries, will try
to buffer the *entire* result set into memory before letting you have it.
In PHP, when you run a pg_exec, or mysql_query, or whatever, it actually
executes the query, and stores all of the results in memory.  Don't
believe me?  Try closing the database connection, and then do a fetch on
the result identifier.  Heck, fetch until you reach the end.

You'll get all of your results.  Even though the database is closed.
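To make that buffering behavior concrete, here's a toy sketch in Python (the class is mine, not real driver code; it just mimics what pg_exec does at execute time):

```python
class BufferedResult:
    """Mimics a driver that copies the whole result set at execute time."""
    def __init__(self, rows):
        self.rows = list(rows)  # every row pulled into client memory here
        self.pos = 0

    def fetch(self):
        """Return the next buffered row, or None when exhausted."""
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

server_rows = iter([(1, 'a'), (2, 'b')])   # stand-in for the backend
result = BufferedResult(server_rows)       # "pg_exec": transfer happens now
del server_rows                            # "close" the connection
print(result.fetch())                      # (1, 'a') -- still works
print(result.fetch())                      # (2, 'b')
```

Once the constructor runs, the "database" is out of the picture; every fetch is served from client memory, which is exactly why the closed-connection trick above still returns rows.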

So when you are doing a "SELECT * FROM bigtable", you're telling PHP not
only to buffer over a million rows, but to transfer them from the
database as well.  "SELECT COUNT(*) FROM bigtable" doesn't have
this overhead.  The database just returns a single row, which you
can view basically as soon as the database is done.

There's also something to be said for the LIMIT clause.  Somehow I
doubt you need every single row in the entire table in order to do
your work on that page.  Just adding LIMIT/OFFSET will increase your
speed significantly.

Or, as someone else mentioned, use a cursor, and let the database buffer
your query.  It'll be slow too, but you can at least fetch the rows
individually and get the perception of speed in your application.
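In PostgreSQL a cursor lives inside a transaction and is fetched from in chunks; a minimal sketch (the cursor name and fetch size here are just for illustration):

```sql
BEGIN;
DECLARE big_cur CURSOR FOR SELECT * FROM Bigtable;
FETCH 20 FROM big_cur;  -- first batch of rows
FETCH 20 FROM big_cur;  -- next batch; the backend remembers its position
CLOSE big_cur;
COMMIT;
```

Each FETCH pulls only that many rows over the wire, so the client never holds more than one batch at a time.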

> no work needed other than to retrieve the tuples out of
> physical storage, the response should be immediate and resource
> usage low. There should not be large buffer allocations.

I see how you can get into this kind of thinking.  Since select * has no
ordering, no calculations, nothing but just returning raw results from
the table in whatever format they may be in, how could it possibly be
resource intensive?  But the logic is wrong.  If you want all the data,
it'll give it to you, whether you're at the console, which has to
throw everything at the screen buffer, or in your application, which has
to hold it all in temporary storage until it has what it needs and
deallocates the memory.  That data has to go *somewhere*; it can't just
sit in limbo until you decide you want to use some of it.

You can use cursors to get around this, because the database basically
runs the query and doesn't send back squat until you actually ask it to.
But I'll tell you, a well-selected LIMIT clause will work almost as well,
and reduce the need for the database to maintain a cursor.

Say you're on page 5, and you are showing 20 results per page.  Knowing
that results start at 0, you can get the offset just by doing:

(PAGE - 1) * PER_PAGE = (5-1) * 20 = 4*20 = 80.  So your query becomes:

SELECT * FROM Bigtable LIMIT 20 OFFSET 80.
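That page arithmetic is easy to wrap in a helper.  A quick sketch in Python (the function name is my own, not from the thread):

```python
def page_clause(page, per_page):
    """Build a LIMIT/OFFSET clause for a 1-based page number."""
    offset = (page - 1) * per_page
    return "LIMIT %d OFFSET %d" % (per_page, offset)

print(page_clause(5, 20))  # LIMIT 20 OFFSET 80
```

Page 1 gives OFFSET 0, so the same helper covers the first page too.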

And voila.  You'll get back 20 rows right where you want them.  Also,
take care with what you select from the table.  Maybe you don't actually
need all of the data, but only certain columns.  The less data that has to
be transferred from database to application, the faster you can get/show
your data.  Especially if your database is on a separate machine from
your application server.  Network transfers are *not* instantaneous.

Try:

SELECT col1,col2,col3,etc FROM Bigtable LIMIT x OFFSET y

instead.  You'll save yourself some time, save the database the effort
of loading all million rows of every column in the table into the
network interface, and save your application the need to fetch them.

> COUNT(smallcolumn) behaves much faster than COUNT(*).

Or you can do what every database optimization book in the entire
universe says, and do COUNT(1), since all you're testing is the
existence of the row, not any value of anything in it.

> Again, count should be streamed as well such as to use no
> significant memory resources other than the counter.

Actually, it's converted into: "*, eh?  I think not.  Let me rewrite
this query and put a 1 here instead."

The database tries to keep people from shooting themselves in the foot,
but even the most Herculean effort will fail when the user is bound and
determined to blow off their big toe.

> Any enlightenments? Am I wrong? Will this be fixed soon?
> Is it hard to change pgsql to do better streaming of its
> operations.

Read a few books on database design and optimization, the basics of
application memory allocation, and the TCP/IP stack.  If you want to
understand how to make an application fast from front to back, you have
to understand the components that make it work.  Your knowledge of
application memory performance and network latency seems inherently
flawed, and until you get over the assumption that network transfers are
free and optimizing queries is a fool's errand, you'll continue to have
problems in any database you choose.

--
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
| Shaun M. Thomas                INN Database Administrator           |
| Phone: (309) 743-0812          Fax  : (309) 743-0830                |
| Email: sthomas@townnews.com    AIM  : trifthen                      |
| Web  : www.townnews.com                                             |
|                                                                     |
|     "Most of our lives are about proving something, either to       |
|      ourselves or to someone else."                                 |
|                                           -- Anonymous              |
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+


