Thread: bug in query planning?
I have a query which does not use column indexes that it should use. I
have discovered some interesting behaviors of Postgres which may
indicate a bug in the database's query planning.

Take a look at the queries below. There is a btree index on both
m.account_id and a.account_id. Query (1) does not use the index on the
messages table, instead opting for a full table scan, which kills
performance. The messages table can potentially contain hundreds of
thousands or millions of rows; even at 50,000, it's murder. Query (2)
is the same query, but with the order of the tables reversed. It's
obviously not quite the same query semantically, even though in my case
it should always produce the same result. It is interesting to note
that it does use the indexes, though. Finally, query (3) uses
traditional (non-ANSI) join syntax, and the indexes are correctly used
in that query.

The suggestion is that Postgres does not correctly analyze queries
using ANSI joins: indexes are occasionally skipped when they should be
used. This seems like a bug in Postgres. I'm using version 7.3.4 of
Postgres.

Thanks in advance for any comments...

steve

Query (1)
=========

defender=# explain analyze
defender-# select count(message_id)
defender-# from messages m
defender-# left join accounts a
defender-# on m.account_id::bigint = a.account_id::bigint
defender-# where a.email = 'stevena@neosynapse.net';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=20461.10..20461.10 rows=1 width=47) (actual time=1420.09..1420.09 rows=1 loops=1)
   ->  Hash Join  (cost=30.77..20334.38 rows=50687 width=47) (actual time=0.51..1319.69 rows=51419 loops=1)
         Hash Cond: ("outer".account_id = "inner".account_id)
         Filter: ("inner".email = 'stevena@neosynapse.net'::text)
         ->  Seq Scan on messages m  (cost=0.00..19289.87 rows=50687 width=16) (actual time=0.06..703.89 rows=52541 loops=1)
         ->  Hash  (cost=30.76..30.76 rows=3 width=31) (actual time=0.40..0.40 rows=0 loops=1)
               ->  Index Scan using accounts_pkey on accounts a  (cost=0.00..30.76 rows=3 width=31) (actual time=0.17..0.38 rows=3 loops=1)
 Total runtime: 1420.25 msec
(8 rows)

Query (2)
=========

defender=# explain analyze
defender-# select count(message_id)
defender-# from accounts a
defender-# left join messages m
defender-# on a.account_id::bigint = m.account_id::bigint
defender-# where a.email = 'stevena@neosynapse.net';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=6806.54..6806.54 rows=1 width=24) (actual time=792.14..792.14 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..6764.30 rows=16896 width=24) (actual time=0.38..718.12 rows=51419 loops=1)
         ->  Index Scan using accounts_email on accounts a  (cost=0.00..8.98 rows=1 width=8) (actual time=0.22..0.25 rows=1 loops=1)
               Index Cond: (email = 'stevena@neosynapse.net'::text)
         ->  Index Scan using messages_account_id on messages m  (cost=0.00..6544.13 rows=16896 width=16) (actual time=0.15..593.15 rows=51419 loops=1)
               Index Cond: ("outer".account_id = m.account_id)
 Total runtime: 792.33 msec
(7 rows)

Query (3)
=========

defender=# explain analyze
defender-# select count(message_id)
defender-# from messages m, accounts a
defender-# where m.account_id::bigint = a.account_id::bigint
defender-# and a.email = 'stevena@neosynapse.net';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=6806.54..6806.54 rows=1 width=24) (actual time=782.30..782.30 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..6764.30 rows=16896 width=24) (actual time=0.33..708.52 rows=51422 loops=1)
         ->  Index Scan using accounts_email on accounts a  (cost=0.00..8.98 rows=1 width=8) (actual time=0.15..0.18 rows=1 loops=1)
               Index Cond: (email = 'stevena@neosynapse.net'::text)
         ->  Index Scan using messages_account_id on messages m  (cost=0.00..6544.13 rows=16896 width=16) (actual time=0.15..578.23 rows=51422 loops=1)
               Index Cond: (m.account_id = "outer".account_id)
 Total runtime: 782.46 msec
(7 rows)
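For reference, the relevant schema looks roughly like this. This is a
simplified sketch, not my exact DDL: the column types are a best guess,
but the index names are the ones that appear in the plans above.

create table accounts (
    account_id  bigint primary key,   -- backs accounts_pkey (btree)
    email       text
);

create table messages (
    message_id  bigint primary key,
    account_id  bigint                -- references accounts
);

-- the btree indexes mentioned above
create index messages_account_id on messages (account_id);
create index accounts_email on accounts (email);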
"Steven D.Arnold" <stevena@neosynapse.net> writes: > Query (2) below is the same query, but we reverse the order of the > tables. It's obviously not quite the same query semantically, even > though in my case it should always produce the same result. Since it is in fact not the same query, I'm unclear on why you expect it to produce the same plan. > I'm using version 7.3.4 of Postgres. FWIW, I believe that 7.4 will recognize that (1) and (3) are semantically equivalent. regards, tom lane
On Dec 21, 2003, at 11:47 PM, Tom Lane wrote:

> "Steven D.Arnold" <stevena@neosynapse.net> writes:
>> Query (2) below is the same query, but we reverse the order of the
>> tables. It's obviously not quite the same query semantically, even
>> though in my case it should always produce the same result.
>
> Since it is in fact not the same query, I'm unclear on why you expect
> it to produce the same plan.

What I expect is for both queries to use the index on the messages
table! Why is it not doing that?

> FWIW, I believe that 7.4 will recognize that (1) and (3) are
> semantically equivalent.

I will try 7.4 and report back.

steve
The queries are listed here for the referentially (yes, that's a pun)
challenged:

Query 1:
SELECT COUNT(message_id)
FROM messages m
LEFT JOIN accounts a
ON m.account_id::bigint = a.account_id::bigint
WHERE a.email = 'stevena@neosynapse.net';

Query 2:
SELECT COUNT(message_id)
FROM accounts a
LEFT JOIN messages m
ON a.account_id::bigint = m.account_id::bigint
WHERE a.email = 'stevena@neosynapse.net';

Query 3:
SELECT COUNT(message_id)
FROM messages m, accounts a
WHERE m.account_id::bigint = a.account_id::bigint
AND a.email = 'stevena@neosynapse.net';

From what I can see they are not the same query and therefore shouldn't
use the same plan.

The first query says: go get all the messages. That's best done with a
seq scan, since there is no WHERE clause limiting the results from the
messages table (an index scan would just add the overhead of reading
the index pages and resolving the index entries, and would turn the
table access into random reads instead of sequential ones, without
actually limiting what gets returned). Match that with as many accounts
as you can, returning a row for every message (note the LEFT JOIN).
Next, filter all of the results on the account email (which only
eliminates about 1,100 messages out of 52,000). Now count how many
messages are left, which returns 51419.

The second query says: get all of the accounts filtered by email
address (the planner can use the WHERE clause this time), giving 1 row.
Now match that to every message for this account_id, returning at least
one row even if there are no messages for this account (note, again,
the LEFT JOIN). This uses the index scan because the planner expects
the index selectivity to be approximately 1/4 of the full table (it's
wrong). Now count how many messages there are, which returns 51419.

The third query says: give me all of the messages for the accounts
where email = 'stevena@neosynapse.net', and I don't care where you
start. The optimizer, after considering various possible plans, is
smart enough to realize that email = 'blah' is indexed and that its
selectivity is 1 row. That brings us back to the situation in query 2,
with one small change: if there are no messages for the account in
question, no row is returned at all, leading to a more efficient
aggregation step.

Steven D.Arnold wrote:
> On Dec 21, 2003, at 11:47 PM, Tom Lane wrote:
>> "Steven D.Arnold" <stevena@neosynapse.net> writes:
>>> Query (2) below is the same query, but we reverse the order of the
>>> tables. It's obviously not quite the same query semantically, even
>>> though in my case it should always produce the same result.

You are correct that the queries produce the same results, but they are
telling the planner to do completely different things. The queries
don't show it, but if the behavior you are desiring happened in
Postgres (unless someone can show the relational algebra that makes it
work), I would have to start looking for a new database (that's a
disturbing thought).

>> Since it is in fact not the same query, I'm unclear on why you expect
>> it to produce the same plan.
>
> What I expect is for both queries to use the index on the messages
> table! Why is it not doing that?

Because of the table ordering and the left join in 7.3.x.
Because of the left join in 7.4.

>> FWIW, I believe that 7.4 will recognize that (1) and (3) are
>> semantically equivalent.
>
> I will try 7.4 and report back.
>
> steve

I don't believe the optimizer (in any database that cares about giving
you the correct results) can determine that a non-constrained primary
table in a left join can be rewritten as either of your other two
queries (but there are smarter people than me working on Postgres, so I
could be wrong).

My suggestion would be to place the more selective table first in a
JOIN, and get rid of the LEFT JOINs unless that's exactly what you
want. For more information about the different JOIN methods, RTFM.

I would also suggest tuning your random_page_cost toward 1, because
random access is obviously being overestimated for your hardware. (You
might want to look at tuning your parameters in general.) And in the
future you should run a query at least one extra time, to account for
the difference caching makes (the second run of an EXPLAIN ANALYZE is
usually quite different from the first for tables of this size). See
the sketch at the end of this message for a concrete illustration of
the LEFT JOIN point.

DeJuan
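P.S. To make the LEFT JOIN point concrete, here is a minimal sketch.
The table definitions are simplified stand-ins for your schema, not
your actual DDL:

-- toy versions of the two tables; the real columns/types may differ
CREATE TABLE accounts (account_id bigint PRIMARY KEY, email text);
CREATE TABLE messages (message_id bigint PRIMARY KEY, account_id bigint);

-- one account that has no messages at all
INSERT INTO accounts VALUES (1, 'nobody@example.com');

-- Query-2 shape: the LEFT JOIN must emit a null-extended row,
-- so this returns one row with a NULL message_id...
SELECT a.account_id, m.message_id
FROM accounts a
LEFT JOIN messages m ON a.account_id = m.account_id
WHERE a.email = 'nobody@example.com';

-- ...while the inner-join (query-3) shape returns zero rows:
SELECT a.account_id, m.message_id
FROM messages m, accounts a
WHERE m.account_id = a.account_id
  AND a.email = 'nobody@example.com';

COUNT(message_id) hides the difference, because NULLs aren't counted
and a count over zero rows is also 0. That is why all three queries
agree on the final number even though they ask for different things.

And for the random_page_cost experiment, try something like this
(per-session, so it is easy to undo):

SET random_page_cost = 2;  -- the default is 4; try values toward 1
-- then re-run EXPLAIN ANALYZE on the queries above and compare plans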
DeJuan Jackson <djackson@speedfc.com> writes:
> Query 1:
> SELECT COUNT(message_id)
> FROM messages m
> LEFT JOIN accounts a
> ON m.account_id::bigint = a.account_id::bigint
> WHERE a.email = 'stevena@neosynapse.net';
>
> Query 2:
> SELECT COUNT(message_id)
> FROM accounts a
> LEFT JOIN messages m
> ON a.account_id::bigint = m.account_id::bigint
> WHERE a.email = 'stevena@neosynapse.net';
>
> Query 3:
> SELECT COUNT(message_id)
> FROM messages m, accounts a
> WHERE m.account_id::bigint = a.account_id::bigint
> AND a.email = 'stevena@neosynapse.net';
>
> From what I can see they are not the same query and therefore shouldn't
> use the same plan.

Actually, queries 1 and 3 are equivalent, and I believe PG 7.4 will
recognize them as such.

The reason is that the WHERE clause "a.email = 'something'" cannot
succeed when a.email is NULL; therefore, there is no point in the JOIN
being a LEFT JOIN --- any null-extended rows added by the left join
will be thrown away again by the WHERE clause. We may as well reduce
the LEFT JOIN to a plain inner JOIN, whereupon query 1 is obviously the
same as query 3.

PG 7.4's optimizer can make exactly this sequence of deductions. The
bit of knowledge it needs for this is that the '=' operator involved is
STRICT, ie, yields NULL for NULL input. All the standard '=' operators
are strict and are so marked in the catalogs. (If you are defining a
user-defined type, don't forget to mark your operators strict where
applicable.)

I believe that query 2 is really equivalent to the others as well, but
proving it is more subtle. The reason is that COUNT(message_id) does
not count rows where message_id is NULL, and so any null-extended rows
added by the LEFT JOIN won't be counted, and so we might as well reduce
the LEFT JOIN to a plain inner JOIN. PG's optimizer will not recognize
this, however. Possibly it could if anyone wanted to figure out how.
Right now we make very few assumptions about the behavior of aggregate
functions, but I think you could prove that this is safe based on the
behavior of nodeAgg.c for strict transition functions. Next question
is whether the case would come up often enough to be worth testing
for ...

                        regards, tom lane
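For anyone defining a user-defined type, the strictness marking Tom
mentions looks roughly like the sketch below. The type and function
names are placeholders; the parts that matter are STRICT on the
function and the operator definition tying it together:

-- hypothetical user-defined type "mytype" with a C-level equality
-- function; names are for illustration only
CREATE FUNCTION mytype_eq(mytype, mytype) RETURNS boolean
    AS 'MODULE_PATHNAME', 'mytype_eq'
    LANGUAGE C IMMUTABLE STRICT;  -- STRICT: NULL input yields NULL output

CREATE OPERATOR = (
    LEFTARG    = mytype,
    RIGHTARG   = mytype,
    PROCEDURE  = mytype_eq,
    COMMUTATOR = =,
    RESTRICT   = eqsel,      -- standard equality selectivity estimators
    JOIN       = eqjoinsel
);

Without STRICT on the underlying function, the planner cannot assume
that the WHERE clause rejects null-extended rows, so the LEFT JOIN
reduction described above would not apply.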
Thanks to all for the detailed replies. I just wanted to let everyone
know -- for future Google searches as much as anything else -- that
dumping the database, upgrading to 7.4.1, and reloading did solve the
problem. All the queries I mentioned now use the available indexes,
except in understandable cases such as when the number of rows in a
table is really small. Thanks for the tips, and thanks for the
improvements in 7.4.1 that fixed this problem.

steve