Thread: Re: [PERFORM] Need help in setting optimal configuration for a huge database.

Hi Craig,
 
Does the number of tables and the size of each table affect the performance of a join operation?
 
Why is PostgreSQL taking so long to execute the join of the two big tables that I described above?
 
Thank you,
Kishore.

 
On 10/23/05, Craig A. James <cjames@modgraph-usa.com> wrote:

> We are facing a critical situation because of the performance of the
> database.  Even a basic query like select count(*) from
> bigger_table is taking about 4 minutes to return.

Several other replies have mentioned that COUNT() requires a full table scan, but this point can't be emphasized enough: Don't do it!  People who are migrating from other environments (Oracle or MySQL) are used to COUNT(), MIN() and MAX() returning almost instantaneously, certainly on indexed columns.  But for reasons that have something to do with transactions, these operations are unbelievably slow in PostgreSQL.

Here are the alternatives that I've learned.

COUNT() -- There is no good substitute.  What I do is create a new column, "ROW_NUM" with an auto-incrementing sequence.  Every time I insert a row, it gets a new value.  Unfortunately, this doesn't work if you ever delete a row.  The alternative is a more complex pair of triggers, one for insert and one for delete, that maintains the count in a separate one-row table.  It's a nuisance, but it's a lot faster than doing a full table scan for every COUNT().
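A sketch of that trigger approach (the table, function, and trigger names below are made up for illustration, and the PL/pgSQL should be checked against your PostgreSQL version):

```sql
-- One-row table that caches the count of bigtable.
CREATE TABLE bigtable_count (n bigint NOT NULL);
INSERT INTO bigtable_count SELECT count(*) FROM bigtable;

-- Bump the cached count on every insert.
CREATE FUNCTION bigtable_count_inc() RETURNS trigger AS $$
BEGIN
  UPDATE bigtable_count SET n = n + 1;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Decrement it on every delete.
CREATE FUNCTION bigtable_count_dec() RETURNS trigger AS $$
BEGIN
  UPDATE bigtable_count SET n = n - 1;
  RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bigtable_ins AFTER INSERT ON bigtable
  FOR EACH ROW EXECUTE PROCEDURE bigtable_count_inc();
CREATE TRIGGER bigtable_del AFTER DELETE ON bigtable
  FOR EACH ROW EXECUTE PROCEDURE bigtable_count_dec();

-- Now COUNT(*) is just:
SELECT n FROM bigtable_count;
```

Be aware that the single-row counter table becomes a serialization point for concurrent writers, so this trades some write concurrency for cheap counts.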

MIN() and MAX() -- These are surprisingly slow, because they seem to do a full table scan EVEN ON AN INDEXED COLUMN!  I don't understand why, but happily there is an effective substitute:

  select mycolumn from mytable order by mycolumn limit 1;  -- same as MIN()

  select mycolumn from mytable order by mycolumn desc limit 1;  -- same as MAX()

For a large table, MIN or MAX can take 5-10 minutes, where the above "select..." replacements can return in one millisecond.

You should carefully examine your entire application for COUNT, MIN, and MAX, and get rid of them EVERYWHERE.  This may be the entire source of your problem.  It was in my case.

This is, in my humble opinion, the only serious flaw in PostgreSQL.  I've been totally happy with it in every other way, and once I understood these shortcomings, my application runs faster than ever on PostgreSQL.

Craig

Re: [PERFORM] Need help in setting optimal configuration

From: Scott Marlowe
Date:
(Please don't top reply... your response has been moved to the bottom)

On Sun, 2005-10-23 at 16:55, Kishore B wrote:
>
> On 10/23/05, Craig A. James <cjames@modgraph-usa.com> wrote:
>         > We are facing a critical situation because of the
>         > performance of the database.  Even a basic query like
>         > select count(*) from bigger_table is taking about 4
>         > minutes to return.
>
>         Several other replies have mentioned that COUNT() requires a
>         full table scan,

This isn't wholly correct.  A query like this:

select count(*) from locatorcodes where locatorcode like 'ZZZ%';

can use an index.  However, since tuple visibility info is NOT stored in
indexes, ALL these tuples must be looked up in the actual table.
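For what it's worth, for a LIKE 'ZZZ%' query to use an index at all, a plain btree index on that column generally only helps if the database was initialized in the C locale; otherwise you'd want an index built with a pattern-matching operator class, something like this (the index name is illustrative):

```sql
CREATE INDEX locatorcodes_code_idx
  ON locatorcodes (locatorcode text_pattern_ops);
```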

>          but this point can't be emphasized enough: Don't do it!
>         People who are migrating from other environments (Oracle or
>         MySQL) are used to COUNT(), MIN() and MAX() returning almost
>         instantaneously, certainly on indexed columns.

While I'll admit that min and max are a bit faster in Oracle than in
postgresql, count doesn't seem much faster in my testing.  Of course, on
a wider table Oracle probably is faster, but I'm used to normalizing out
my tables so that there's no advantage for Oracle there.

>           But for reasons that have something to do with transactions,
>         these operations are unbelievably slow in PostgreSQL.

It's because of visibility in the MVCC system PostgreSQL uses.

>         MIN() and MAX() -- These are surprisingly slow, because they
>         seem to do a full table scan EVEN ON AN INDEXED COLUMN!  I
>         don't understand why, but happily there is an effective
>         substitute:

It's because aggregates in PostgreSQL are abstract things.  To make these
two faster would require short-circuiting the query planner to use
something other than the abstracted methods PostgreSQL was built around.

On the other hand, select with limit and order by can use the indexes
because they are not aggregates.

>         You should carefully examine your entire application for
>         COUNT, MIN, and MAX, and get rid of them EVERYWHERE.  This may
>         be the entire source of your problem.  It was in my case.

You're right on here.  The problem is that people often use aggregates
where they shouldn't.  Aggregates really are meant to operate across a
whole set of data.  An aggregate like sum() or avg() seems obviously
designed to hit every tuple.  Well, while min, max, and count may not
look like they should, they, in fact, do hit every tuple covered by the
where clause.

>         This is, in my humble opinion, the only serious flaw in
>         PostgreSQL.  I've been totally happy with it in every other
>         way, and once I understood these shortcomings, my application
>         is runs faster than ever on PostgreSQL.

I wouldn't fully qualify it as a flaw.  It's a design quirk, caused by
the extensible model PostgreSQL is built on.  While it costs you in
one way, like slower min / max / count in some circumstances, it
benefits you in others, like the ability to make your own aggregate
functions.
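For example, you can roll your own aggregate out of an ordinary function.  Here's a rough sketch of a "first non-null value" aggregate (the names first_agg and first are invented for this example):

```sql
-- State transition function: keep the first non-null value seen.
CREATE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement AS
  'SELECT coalesce($1, $2)'
LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE first (anyelement) (
  SFUNC = first_agg,
  STYPE = anyelement
);

-- Usage: SELECT first(mycolumn) FROM mytable;
```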

> Hi Craig,
>
> Does the number of tables and the size of each table affect the
> performance of a join operation?

Of course they do.  The more information your query has to process, the
slower it will run.  It's usually pretty much a linear increase in
time required, unless you go from everything fitting into buffers to
spilling to disk.  Then things will slow down noticeably.

>
> Why is PostgreSQL taking so long to execute the join of the two big
> tables that I described above?

Hard to say.  There are many ways to tune PostgreSQL.  I strongly
suggest you take this thread to the performance list, and post your
postgresql.conf file, and the output of "explain analyze <your query
here>" and ask for help.  That list is much better equipped to help with
these things.
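That is, run something along these lines and include the full output in your post (the table and column names here are just placeholders for your actual query):

```sql
EXPLAIN ANALYZE
SELECT b.id, s.value
FROM bigger_table b
JOIN smaller_table s ON s.big_id = b.id
WHERE b.some_column = 'some value';
```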