Thread: Inner join vs where-clause subquery
I have the following query, which performs extremely slowly:

select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > ( select max(a.end_nlogid) from activity_log_import_history a)
and dtCreateDate < '2006-12-18 9:10'

If I replace the subquery in the where clause with the value it returns, the query runs very fast:

select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > 402123456
and dtCreateDate < '2006-12-18 9:10'

If I change the query to the following, it also runs fast:

select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid
             from activity_log_import_history) as a
  on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < ${IMPORT_TIMESTAMP}

I am running PG 8.2. Why is this the case? Shouldn't the query planner be smart enough to see that the first query is equivalent to the second and third? The inner query does not refer to any columns outside of itself. I personally find the first query the easiest to read and wish it performed well.

Jeremy Haile
Jeremy Haile wrote:
> I have the following query, which performs extremely slowly:
>
> select min(nlogid) as start_nlogid,
>        max(nlogid) as end_nlogid,
>        min(dtCreateDate) as start_transaction_timestamp,
>        max(dtCreateDate) as end_transaction_timestamp
> from activity_log_facts
> where nlogid > ( select max(a.end_nlogid) from activity_log_import_history a)
> and dtCreateDate < '2006-12-18 9:10'

Can you post the EXPLAIN ANALYSE for this one please? That'll show us exactly what it's doing.

--
Richard Huxton
Archonet Ltd
Here is the explain analyze output:

Result  (cost=9.45..9.46 rows=1 width=0) (actual time=156589.390..156589.391 rows=1 loops=1)
  InitPlan
    ->  Result  (cost=0.04..0.05 rows=1 width=0) (actual time=0.034..0.034 rows=1 loops=1)
          InitPlan
            ->  Limit  (cost=0.00..0.04 rows=1 width=4) (actual time=0.029..0.030 rows=1 loops=1)
                  ->  Index Scan Backward using activity_log_import_history_end_nlogid_idx on activity_log_import_history a  (cost=0.00..113.43 rows=2877 width=4) (actual time=0.027..0.027 rows=1 loops=1)
                        Filter: (end_nlogid IS NOT NULL)
    ->  Limit  (cost=0.00..1.19 rows=1 width=12) (actual time=0.052..0.052 rows=0 loops=1)
          ->  Index Scan using activity_log_facts_pkey on activity_log_facts  (cost=0.00..1831613.82 rows=1539298 width=12) (actual time=0.050..0.050 rows=0 loops=1)
                Index Cond: (nlogid > $1)
                Filter: ((nlogid IS NOT NULL) AND (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone))
    ->  Limit  (cost=0.00..1.19 rows=1 width=12) (actual time=0.006..0.006 rows=0 loops=1)
          ->  Index Scan Backward using activity_log_facts_pkey on activity_log_facts  (cost=0.00..1831613.82 rows=1539298 width=12) (actual time=0.004..0.004 rows=0 loops=1)
                Index Cond: (nlogid > $1)
                Filter: ((nlogid IS NOT NULL) AND (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone))
    ->  Limit  (cost=0.00..3.51 rows=1 width=12) (actual time=100221.955..100221.955 rows=0 loops=1)
          ->  Index Scan using activity_log_facts_dtcreatedate_idx on activity_log_facts  (cost=0.00..5406927.50 rows=1539298 width=12) (actual time=100221.953..100221.953 rows=0 loops=1)
                Index Cond: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
                Filter: ((dtcreatedate IS NOT NULL) AND (nlogid > $1))
    ->  Limit  (cost=0.00..3.51 rows=1 width=12) (actual time=56367.367..56367.367 rows=0 loops=1)
          ->  Index Scan Backward using activity_log_facts_dtcreatedate_idx on activity_log_facts  (cost=0.00..5406927.50 rows=1539298 width=12) (actual time=56367.364..56367.364 rows=0 loops=1)
                Index Cond: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
                Filter: ((dtcreatedate IS NOT NULL) AND (nlogid > $1))
Total runtime: 156589.605 ms
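As background, the four Limit / Index Scan subplans here are the planner's min()/max()-to-index rewrite: each InitPlan is roughly equivalent to a standalone query of the shape below (a hypothetical sketch, shown for min(nlogid); the 100-second and 56-second entries are the same pattern using the dtcreatedate index):

select nlogid
from activity_log_facts
where nlogid is not null
  and nlogid > $1                          -- $1 = max(end_nlogid) from the InitPlan above
  and dtcreatedate < '2006-12-18 09:10'
order by nlogid asc                        -- "order by nlogid desc" for max(nlogid)
limit 1;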
Jeremy Haile wrote:
> Here is the explain analyze output:

Well, the row estimates are about as far out as you can get:

>           ->  Index Scan using activity_log_facts_pkey on activity_log_facts
>               (cost=0.00..1831613.82 rows=1539298 width=12) (actual time=0.050..0.050 rows=0 loops=1)

>           ->  Index Scan Backward using activity_log_facts_pkey on activity_log_facts
>               (cost=0.00..1831613.82 rows=1539298 width=12) (actual time=0.004..0.004 rows=0 loops=1)

>           ->  Index Scan using activity_log_facts_dtcreatedate_idx on activity_log_facts
>               (cost=0.00..5406927.50 rows=1539298 width=12) (actual time=100221.953..100221.953 rows=0 loops=1)

>           ->  Index Scan Backward using activity_log_facts_dtcreatedate_idx on activity_log_facts
>               (cost=0.00..5406927.50 rows=1539298 width=12) (actual time=56367.364..56367.364 rows=0 loops=1)

Hmm - it's using the indexes on dtCreateDate and nlogid which seems broadly sensible, and then plans to limit the results for min()/max(). However, it's clearly wrong about how many rows will satisfy

  nlogid > (select max(a.end_nlogid) from activity_log_import_history a)

>>> select min(nlogid) as start_nlogid,
>>>        max(nlogid) as end_nlogid,
>>>        min(dtCreateDate) as start_transaction_timestamp,
>>>        max(dtCreateDate) as end_transaction_timestamp
>>> from activity_log_facts
>>> where nlogid > ( select max(a.end_nlogid) from activity_log_import_history a)
>>> and dtCreateDate < '2006-12-18 9:10'

If you run explain on the other forms of your query, I'd guess it's much more accurate. There's a simple way to see if that is the issue. Run the sub-query and substitute the actual value returned into the query above. Then, try the same but with a prepared query. If it's down to nlogid estimates then the first should be fast and the second slow.

--
Richard Huxton
Archonet Ltd
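A minimal sketch of the prepared-query half of that test (the statement name here is invented, and 478287801 stands in for whatever the sub-query actually returns; a prepared statement is planned without knowing its parameter values, so if the estimates are the culprit it should behave like the original subquery form):

prepare window_test (integer, timestamp) as
  select min(nlogid) as start_nlogid,
         max(nlogid) as end_nlogid,
         min(dtCreateDate) as start_transaction_timestamp,
         max(dtCreateDate) as end_transaction_timestamp
  from activity_log_facts
  where nlogid > $1
  and dtCreateDate < $2;

explain analyze execute window_test (478287801, '2006-12-18 9:10');

(If EXPLAIN ANALYZE EXECUTE isn't accepted on your version, a plain EXECUTE timed with psql's \timing gives the same comparison.)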
Here's the query and explain analyze using the result of the sub-query substituted:

QUERY
explain analyze select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > 478287801
and dtCreateDate < '2006-12-18 9:10'

EXPLAIN ANALYZE
Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual time=0.018..0.019 rows=1 loops=1)
  ->  Index Scan using activity_log_facts_nlogid_idx on activity_log_facts  (cost=0.00..652.64 rows=472 width=12) (actual time=0.014..0.014 rows=0 loops=1)
        Index Cond: (nlogid > 478287801)
        Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
Total runtime: 0.076 ms

Sorry if the reason should be obvious, but I'm not the best at interpreting explain output. Why is this plan so much simpler than the one for the subquery version?
Jeremy Haile wrote:
> Here's the query and explain analyze using the result of the sub-query
> substituted:
>
> QUERY
> explain analyze select min(nlogid) as start_nlogid,
>        max(nlogid) as end_nlogid,
>        min(dtCreateDate) as start_transaction_timestamp,
>        max(dtCreateDate) as end_transaction_timestamp
> from activity_log_facts
> where nlogid > 478287801
> and dtCreateDate < '2006-12-18 9:10'
>
> EXPLAIN ANALYZE
> Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual time=0.018..0.019 rows=1 loops=1)
>   ->  Index Scan using activity_log_facts_nlogid_idx on activity_log_facts
>       (cost=0.00..652.64 rows=472 width=12) (actual time=0.014..0.014 rows=0 loops=1)
>         Index Cond: (nlogid > 478287801)
>         Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
> Total runtime: 0.076 ms
>
> Sorry if the reason should be obvious, but I'm not the best at
> interpreting explain output. Why is this plan so much simpler than the
> one for the subquery version?

Because it's planning it with knowledge of what "nlogid"s it's filtering by. It knows it isn't going to get many rows back with nlogid > 478287801. In your previous explain it thought a large number of rows would match and was trying not to sequentially scan the activity_log_facts table.

Ideally, the planner would evaluate the subquery in your original form (it should know it's only getting one row back from max()). Then it could plan the query as above. I'm not sure how tricky that is to do though.

--
Richard Huxton
Archonet Ltd
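A hypothetical workaround that follows from this: fetch the boundary value in a statement of its own, then run the big query with that value inlined as a literal so the planner can see it. A minimal PL/pgSQL sketch (the function and variable names are invented, nlogid is assumed to be an integer, and dynamic EXECUTE is used precisely so the value is a literal at plan time):

create or replace function import_window(p_cutoff timestamp,
    out start_nlogid integer, out end_nlogid integer,
    out start_ts timestamp, out end_ts timestamp) as $$
declare
  v_last integer;
begin
  -- step 1: fetch the boundary on its own, as the fast manual test did
  select max(end_nlogid) into v_last from activity_log_import_history;
  -- step 2: run the main query with the boundary inlined as a literal
  execute 'select min(nlogid), max(nlogid), min(dtCreateDate), max(dtCreateDate)
           from activity_log_facts
           where nlogid > ' || v_last
      || ' and dtCreateDate < ' || quote_literal(p_cutoff::text)
    into start_nlogid, end_nlogid, start_ts, end_ts;
end;
$$ language plpgsql;

It would be called as "select * from import_window('2006-12-18 9:10')"; doing the same two steps from the application works just as well, and the inner-join form from the original post gets close to the same effect without any of this.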
Makes sense. It is NOT executing the subquery more than once, is it?
I'm still confused as to why the inner join version ran so much faster than the where-clause version. Here's the inner join query and its explain output:

select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid
             from activity_log_import_history) as a
  on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'

Aggregate  (cost=246226.95..246226.96 rows=1 width=12)
  ->  Nested Loop  (cost=49233.27..231209.72 rows=1501722 width=12)
        ->  Result  (cost=0.04..0.05 rows=1 width=0)
              InitPlan
                ->  Limit  (cost=0.00..0.04 rows=1 width=4)
                      ->  Index Scan Backward using activity_log_import_history_end_nlogid_idx on activity_log_import_history  (cost=0.00..114.97 rows=2913 width=4)
                            Filter: (end_nlogid IS NOT NULL)
        ->  Bitmap Heap Scan on activity_log_facts  (cost=49233.23..210449.44 rows=1660817 width=12)
              Recheck Cond: (activity_log_facts.nlogid > a.previous_nlogid)
              Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
              ->  Bitmap Index Scan on activity_log_facts_nlogid_idx  (cost=0.00..49233.23 rows=1660817 width=0)
                    Index Cond: (activity_log_facts.nlogid > a.previous_nlogid)

Since the inner join is basically the same thing as the where-clause subquery, why does it generate such a different plan?
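One further check that may be worth making, as a suggestion: the plan above is plain EXPLAIN, so it only shows estimates. Running the same statement under EXPLAIN ANALYZE would show whether the rows=1660817 estimate on the bitmap scan is as far from the actual row count as the estimates in the first plan were:

explain analyze
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid
             from activity_log_import_history) as a
  on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10';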