Thread: bug in query planning?

bug in query planning?

From
Steven D.Arnold
Date:
I have a query that does not use indexes it should be using.  I
have discovered some interesting behavior in Postgres which may
indicate a bug in the database's query planning.

Take a look at the query below.  There is a btree index on both
m.account_id and a.account_id.  Query (1) does not use the index on the
messages table, instead opting for a full table scan, thus killing
performance.  The messages table can contain potentially hundreds of
thousands or millions of rows.  Even at 50,000, it's murder.

Query (2) below is the same query, but with the order of the tables
reversed.  It's obviously not quite the same query semantically, even
though in my case it should always produce the same result.  It is
interesting to note that it does use the indexes, though.

Finally, query (3) below uses traditional joining (non-ANSI).  Indexes
are correctly used in that query.  The suggestion is that Postgres does
not correctly analyze queries using ANSI joins.  Indexes are
occasionally skipped when they should be used.  This seems like a bug
in Postgres.  I'm using version 7.3.4 of Postgres.

Thanks in advance for any comments...
steve


Query (1)
=========
defender=# explain analyze
defender-# select    count(message_id)
defender-# from      messages m
defender-# left join accounts a
defender-# on        m.account_id::bigint = a.account_id::bigint
defender-# where     a.email = 'stevena@neosynapse.net';
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20461.10..20461.10 rows=1 width=47) (actual time=1420.09..1420.09 rows=1 loops=1)
   ->  Hash Join  (cost=30.77..20334.38 rows=50687 width=47) (actual time=0.51..1319.69 rows=51419 loops=1)
         Hash Cond: ("outer".account_id = "inner".account_id)
         Filter: ("inner".email = 'stevena@neosynapse.net'::text)
         ->  Seq Scan on messages m  (cost=0.00..19289.87 rows=50687 width=16) (actual time=0.06..703.89 rows=52541 loops=1)
         ->  Hash  (cost=30.76..30.76 rows=3 width=31) (actual time=0.40..0.40 rows=0 loops=1)
               ->  Index Scan using accounts_pkey on accounts a  (cost=0.00..30.76 rows=3 width=31) (actual time=0.17..0.38 rows=3 loops=1)
 Total runtime: 1420.25 msec
(8 rows)


Query (2)
=========
defender=# explain analyze
defender-# select    count(message_id)
defender-# from      accounts a
defender-# left join messages m
defender-# on        a.account_id::bigint = m.account_id::bigint
defender-# where     a.email = 'stevena@neosynapse.net';

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6806.54..6806.54 rows=1 width=24) (actual time=792.14..792.14 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..6764.30 rows=16896 width=24) (actual time=0.38..718.12 rows=51419 loops=1)
         ->  Index Scan using accounts_email on accounts a  (cost=0.00..8.98 rows=1 width=8) (actual time=0.22..0.25 rows=1 loops=1)
               Index Cond: (email = 'stevena@neosynapse.net'::text)
         ->  Index Scan using messages_account_id on messages m  (cost=0.00..6544.13 rows=16896 width=16) (actual time=0.15..593.15 rows=51419 loops=1)
               Index Cond: ("outer".account_id = m.account_id)
 Total runtime: 792.33 msec
(7 rows)


Query (3)
=========
defender=# explain analyze
defender-# select    count(message_id)
defender-# from      messages m, accounts a
defender-# where     m.account_id::bigint = a.account_id::bigint
defender-# and       a.email = 'stevena@neosynapse.net';

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6806.54..6806.54 rows=1 width=24) (actual time=782.30..782.30 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..6764.30 rows=16896 width=24) (actual time=0.33..708.52 rows=51422 loops=1)
         ->  Index Scan using accounts_email on accounts a  (cost=0.00..8.98 rows=1 width=8) (actual time=0.15..0.18 rows=1 loops=1)
               Index Cond: (email = 'stevena@neosynapse.net'::text)
         ->  Index Scan using messages_account_id on messages m  (cost=0.00..6544.13 rows=16896 width=16) (actual time=0.15..578.23 rows=51422 loops=1)
               Index Cond: (m.account_id = "outer".account_id)
 Total runtime: 782.46 msec
(7 rows)


Re: bug in query planning?

From
Tom Lane
Date:
"Steven D.Arnold" <stevena@neosynapse.net> writes:
> Query (2) below is the same query, but we reverse the order of the
> tables.  It's obviously not quite the same query semantically, even
> though in my case it should always produce the same result.

Since it is in fact not the same query, I'm unclear on why you expect
it to produce the same plan.

> I'm using version 7.3.4 of Postgres.

FWIW, I believe that 7.4 will recognize that (1) and (3) are
semantically equivalent.

            regards, tom lane

Re: bug in query planning?

From
Steven D.Arnold
Date:
On Dec 21, 2003, at 11:47 PM, Tom Lane wrote:

> "Steven D.Arnold" <stevena@neosynapse.net> writes:
>> Query (2) below is the same query, but we reverse the order of the
>> tables.  It's obviously not quite the same query semantically, even
>> though in my case it should always produce the same result.
>
> Since it is in fact not the same query, I'm unclear on why you expect
> it to produce the same plan.

What I expect is for both queries to use the index on the messages
table!  Why is it not doing that?

> FWIW, I believe that 7.4 will recognize that (1) and (3) are
> semantically equivalent.

I will try 7.4 and report back.

steve


Re: bug in query planning?

From
DeJuan Jackson
Date:
The queries are listed here for the referentially (yes that's a pun)
challenged.

Query 1:
SELECT COUNT(message_id)
 FROM messages m
      LEFT JOIN accounts a
       ON  m.account_id::bigint = a.account_id::bigint
 WHERE a.email = 'stevena@neosynapse.net';

Query 2:
SELECT COUNT(message_id)
 FROM accounts a
      LEFT JOIN messages m
       ON  a.account_id::bigint = m.account_id::bigint
 WHERE a.email = 'stevena@neosynapse.net';

Query 3:
SELECT COUNT(message_id)
 FROM messages m, accounts a
 WHERE m.account_id::bigint = a.account_id::bigint
       AND a.email = 'stevena@neosynapse.net';

From what I can see, they are not the same query, and therefore they
shouldn't use the same plan.

The first query says: go get all the messages (best done with a seq
scan, since there is no WHERE clause to limit the results from the
messages table; an index scan would only add the overhead of reading the
index pages and resolving the index entries, and would turn the table
access into random reads instead of sequential ones, without actually
limiting what gets returned); match them with as many accounts as you
can, returning a row for every message (note the LEFT JOIN).  Next,
filter all of the results on the account email (which eliminates only
about 1100 of the 52000 messages).  Now count how many messages are
left, which returns 51419.

The second query says: get all of the accounts and filter by email
address (this time the planner can use the WHERE clause for this),
giving 1 row.  Now match that row to every message for this account_id,
returning at least one row even if there are no messages for this
account (note again the LEFT JOIN); this uses the index scan because the
planner expects the index selectivity to be approximately 1/4 of the
full table (it's wrong).  Now count how many messages there are, which
returns 51419.

The third query says: give me all of the messages for the accounts
where email = 'stevena@neosynapse.net', and I don't care where you
start from.  The optimizer, after considering various possible plans,
is smart enough to realize that email = 'blah' is indexed and its
selectivity is 1 row.  That brings us back to the situation in query 2,
with one small change: if there are no messages for the account in
question, no row is returned at all, leading to a more efficient
aggregation step.

Steven D.Arnold wrote:

>
> On Dec 21, 2003, at 11:47 PM, Tom Lane wrote:
>
>> "Steven D.Arnold" <stevena@neosynapse.net> writes:
>>
>>> Query (2) below is the same query, but we reverse the order of the
>>> tables.  It's obviously not quite the same query semantically, even
>>> though in my case it should always produce the same result.
>>
You are correct that the queries produce the same results, but they are
telling the planner to do completely different things.  If the behavior
you are asking for happened in Postgres (unless someone can show the
relational algebra that makes it work), I would have to start looking
for a new database (that's a disturbing thought).

>>
>> Since it is in fact not the same query, I'm unclear on why you expect
>> it to produce the same plan.
>
>
> What I expect is for both queries to use the index on the messages
> table!  Why is it not doing that?

Because of the table ordering and the LEFT JOIN in 7.3.x.
Because of the LEFT JOIN in 7.4.

>> FWIW, I believe that 7.4 will recognize that (1) and (3) are
>> semantically equivalent.
>
>
> I will try 7.4 and report back.

I don't believe the optimiser (in any database that cares about giving
you the correct results) can determine that a non-constrained primary
table in a left join can be rewritten as either of your other two
queries (but there are smarter people than me working on Postgres, so I
could be wrong).

>
> steve

My suggestion would be to place the more selective table first in a
JOIN, and to get rid of the LEFT JOINs unless that is exactly what you
want.  For more information about the different JOIN methods, RTFM.

I would also suggest tuning random_page_cost toward 1, because random
access is obviously being overestimated for your hardware.  (You might
want to look at tuning your parameters in general.)
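
For example (a hypothetical session; 4 is the default random_page_cost
in this era of PostgreSQL, and your best value depends on your disks):

```sql
SHOW random_page_cost;      -- default is 4
SET random_page_cost = 2;   -- cheaper random I/O makes index scans more attractive
EXPLAIN ANALYZE
SELECT count(message_id)
  FROM messages m
  LEFT JOIN accounts a ON m.account_id::bigint = a.account_id::bigint
 WHERE a.email = 'stevena@neosynapse.net';
```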

And in the future you should run a query at least one extra time to see
the difference caching makes (the second run of an EXPLAIN ANALYZE is
usually quite different from the first for tables of this size).
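
Something like this (a sketch; run the same statement back to back and
compare the two reported runtimes):

```sql
-- First run: may pay for reading cold pages from disk.
EXPLAIN ANALYZE SELECT count(message_id) FROM messages;
-- Second run: the same pages are now in the buffer/OS cache, so the
-- reported runtime is usually noticeably lower.
EXPLAIN ANALYZE SELECT count(message_id) FROM messages;
```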

DeJuan


Re: bug in query planning?

From
Tom Lane
Date:
DeJuan Jackson <djackson@speedfc.com> writes:
> Query 1:
> SELECT COUNT(message_id)
>  FROM messages m
>       LEFT JOIN accounts a
>        ON  m.account_id::bigint = a.account_id::bigint
>  WHERE a.email = 'stevena@neosynapse.net';

> Query 2:
> SELECT COUNT(message_id)
>  FROM accounts a
>       LEFT JOIN messages m
>        ON  a.account_id::bigint = m.account_id::bigint
>  WHERE a.email = 'stevena@neosynapse.net';

> Query 3:
> SELECT COUNT(message_id)
>  FROM messages m, accounts a
>  WHERE m.account_id::bigint = a.account_id::bigint
>        AND a.email = 'stevena@neosynapse.net';

>  From what I can see they are not the same query and therefore shouldn't
> use the same plan.

Actually, queries 1 and 3 are equivalent, and I believe PG 7.4 will
recognize them as such.  The reason is that the WHERE clause "a.email =
'something'" cannot succeed when a.email is NULL; therefore, there is no
point in the JOIN being a LEFT JOIN --- any null-extended rows added by
the left join will be thrown away again by the WHERE clause.  We may as
well reduce the LEFT JOIN to a plain inner JOIN, whereupon query 1 is
obviously the same as query 3.  PG 7.4's optimizer can make exactly this
sequence of deductions.  The bit of knowledge it needs for this is that
the '=' operator involved is STRICT, ie, yields NULL for NULL input.
All the standard '=' operators are strict and are so marked in the
catalogs.  (If you are defining a user-defined type, don't forget to
mark your operators strict where applicable.)
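
In other words, the planner can treat the first query as though it had
been written with a plain JOIN.  Written out by hand (a sketch using the
thread's tables), the reduction looks like this:

```sql
-- Original form: the LEFT JOIN's null-extended rows have a.email = NULL ...
SELECT count(message_id)
  FROM messages m
  LEFT JOIN accounts a ON m.account_id::bigint = a.account_id::bigint
 WHERE a.email = 'stevena@neosynapse.net';

-- ... but a strict '=' in the WHERE clause yields NULL for NULL input, so
-- those rows are thrown away anyway, and the query reduces to an inner join:
SELECT count(message_id)
  FROM messages m
  JOIN accounts a ON m.account_id::bigint = a.account_id::bigint
 WHERE a.email = 'stevena@neosynapse.net';
```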

I believe that query 2 is really equivalent to the others as well, but
proving it is more subtle.  The reason is that COUNT(message_id) does
not count rows where message_id is NULL, and so any null-extended rows
added by the LEFT JOIN won't be counted, and so we might as well reduce
the LEFT JOIN to a plain inner JOIN.  PG's optimizer will not recognize
this, however.  Possibly it could if anyone wanted to figure out how.
Right now we make very few assumptions about the behavior of aggregate
functions, but I think you could prove that this is safe based on the
behavior of nodeAgg.c for strict transition functions.  Next question
is whether the case would come up often enough to be worth testing
for ...
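
A quick way to see the aggregate behavior in question (a minimal
illustration; the VALUES-in-FROM syntax here requires a newer PostgreSQL
than the 7.x versions discussed in this thread):

```sql
-- COUNT(expr) skips rows where expr is NULL, so null-extended rows
-- produced by a LEFT JOIN would never contribute to the count:
SELECT count(x) FROM (VALUES (1), (NULL), (3)) AS t(x);  -- count = 2
-- COUNT(*) by contrast counts every row:
SELECT count(*) FROM (VALUES (1), (NULL), (3)) AS t(x);  -- count = 3
```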

            regards, tom lane

Re: bug in query planning?

From
Steven D.Arnold
Date:
Thanks to all for the detailed replies.  I just wanted to let everyone
know -- for future google searches as much as anything else -- that
dumping the database, upgrading to 7.4.1 and reloading did solve the
problem.  All the queries I mentioned now use the available indices,
except for understandable cases such as the number of rows in a table
being really small.

Thanks for the tip and thanks for the improvements in 7.4.1 that fixed
this problem.

steve