Re: Inner join vs where-clause subquery - Mailing list pgsql-performance
From: Jeremy Haile
Subject: Re: Inner join vs where-clause subquery
Date:
Msg-id: 1166561276.10609.281240077@webmail.messagingengine.com
In response to: Re: Inner join vs where-clause subquery (Richard Huxton <dev@archonet.com>)
List: pgsql-performance
I'm still confused as to why the inner join version ran so much faster
than the where-clause version. Here's the inner join query and explain
output:

select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid
             from activity_log_import_history) as a
        on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'

Aggregate  (cost=246226.95..246226.96 rows=1 width=12)
  ->  Nested Loop  (cost=49233.27..231209.72 rows=1501722 width=12)
        ->  Result  (cost=0.04..0.05 rows=1 width=0)
              InitPlan
                ->  Limit  (cost=0.00..0.04 rows=1 width=4)
                      ->  Index Scan Backward using
                          activity_log_import_history_end_nlogid_idx on
                          activity_log_import_history
                          (cost=0.00..114.97 rows=2913 width=4)
                            Filter: (end_nlogid IS NOT NULL)
        ->  Bitmap Heap Scan on activity_log_facts
            (cost=49233.23..210449.44 rows=1660817 width=12)
              Recheck Cond: (activity_log_facts.nlogid > a.previous_nlogid)
              Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp
                      without time zone)
              ->  Bitmap Index Scan on activity_log_facts_nlogid_idx
                  (cost=0.00..49233.23 rows=1660817 width=0)
                    Index Cond: (activity_log_facts.nlogid >
                                 a.previous_nlogid)

Since the inner join is basically the same thing as the where-clause
subquery, why does it generate a far different plan?


On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> >        max(nlogid) as end_nlogid,
> >        min(dtCreateDate) as start_transaction_timestamp,
> >        max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> >   and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> >   ->  Index Scan using activity_log_facts_nlogid_idx on
> >       activity_log_facts  (cost=0.00..652.64 rows=472 width=12)
> >       (actual time=0.014..0.014 rows=0 loops=1)
> >         Index Cond: (nlogid > 478287801)
> >         Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp
> >                 without time zone)
> > Total runtime: 0.076 ms
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains. Why is this explain so much simpler than
> > the other query plan (with the subquery)?
>
> Because it's planning it with knowledge of what "nlogid"s it's
> filtering by. It knows it isn't going to get many rows back with
> nlogid > 478287801. In your previous explain it thought a large number
> of rows would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
>   Richard Huxton
>   Archonet Ltd
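
For reference, the where-clause version being compared is presumably of
the following shape (a sketch only; the exact original statement isn't
quoted in this message, so the subquery is inferred from the join
version above and from the substituted constant 478287801):

-- Sketch of the where-clause subquery form discussed in the thread
-- (inferred, not quoted from the original post):
select min(nlogid)       as start_nlogid,
       max(nlogid)       as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > (select max(end_nlogid)
                from activity_log_import_history)
  and dtCreateDate < '2006-12-18 9:10';

Per Richard's explanation, with the literal 478287801 written in the
query text the planner can consult the index statistics and see that
very few rows qualify; with the subquery (in either the join or the
where-clause form) the comparison value is only known at run time, so
the planner falls back on a generic row estimate (the rows=1660817
guess in the plan above).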