Re: Inner join vs where-clause subquery - Mailing list pgsql-performance
| From | Jeremy Haile |
|---|---|
| Subject | Re: Inner join vs where-clause subquery |
| Date | |
| Msg-id | 1166561276.10609.281240077@webmail.messagingengine.com |
| In response to | Re: Inner join vs where-clause subquery (Richard Huxton <dev@archonet.com>) |
| List | pgsql-performance |
I'm still confused as to why the inner join version ran so much faster
than the where-clause version.

Here's the inner join query and explain output:
```sql
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid
             from activity_log_import_history) as a
        on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'
```
```
Aggregate  (cost=246226.95..246226.96 rows=1 width=12)
  ->  Nested Loop  (cost=49233.27..231209.72 rows=1501722 width=12)
        ->  Result  (cost=0.04..0.05 rows=1 width=0)
              InitPlan
                ->  Limit  (cost=0.00..0.04 rows=1 width=4)
                      ->  Index Scan Backward using activity_log_import_history_end_nlogid_idx
                          on activity_log_import_history  (cost=0.00..114.97 rows=2913 width=4)
                            Filter: (end_nlogid IS NOT NULL)
        ->  Bitmap Heap Scan on activity_log_facts
            (cost=49233.23..210449.44 rows=1660817 width=12)
              Recheck Cond: (activity_log_facts.nlogid > a.previous_nlogid)
              Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
              ->  Bitmap Index Scan on activity_log_facts_nlogid_idx
                  (cost=0.00..49233.23 rows=1660817 width=0)
                    Index Cond: (activity_log_facts.nlogid > a.previous_nlogid)
```
Since the inner join is basically the same thing as the where-clause
subquery, why does it generate such a different plan?
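(For reference, the where-clause form under discussion would look roughly
like the sketch below. It is reconstructed from the inner-join version
above rather than quoted from the earlier message, so the exact original
wording may differ:)

```sql
-- Same aggregates, but with the watermark pulled in via a
-- scalar subquery in the WHERE clause instead of a join.
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > (select max(end_nlogid)
                from activity_log_import_history)
  and dtCreateDate < '2006-12-18 9:10';
```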
On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> > max(nlogid) as end_nlogid,
> > min(dtCreateDate) as start_transaction_timestamp,
> > max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> > -> Index Scan using activity_log_facts_nlogid_idx on
> > activity_log_facts (cost=0.00..652.64 rows=472 width=12) (actual
> > time=0.014..0.014 rows=0 loops=1)
> > Index Cond: (nlogid > 478287801)
> > Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> > time zone)
> > Total runtime: 0.076 ms
> >
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains. Why is this explain so much simpler than the
> > other query plan (with the subquery)?
>
> Because it's planning it with knowledge of what "nlogid"s it's filtering
> by. It knows it isn't going to get many rows back with nlogid >
> 478287801. In your previous explain it thought a large number of rows
> would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
> Richard Huxton
> Archonet Ltd
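A common workaround for this planner limitation, and effectively what the
explain analyze above did, is to fetch the boundary value in a separate
statement and then run the aggregate with a concrete literal or parameter.
A minimal sketch (table and column names taken from this thread; the
literal 478287801 is just the example value from the earlier message):

```sql
-- Step 1: fetch the watermark on its own.
select max(end_nlogid) as previous_nlogid
from activity_log_import_history;

-- Step 2: re-run the aggregate with the concrete value, so the
-- planner can estimate the selectivity of (nlogid > ...) accurately
-- and choose the cheap index scan instead of the bitmap scan.
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > 478287801
  and dtCreateDate < '2006-12-18 9:10';
```

The two-step form trades a second round trip for a plan built with real
row-count knowledge, which the thread shows is dramatically cheaper here.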