Re: Inner join vs where-clause subquery - Mailing list pgsql-performance

From Jeremy Haile
Subject Re: Inner join vs where-clause subquery
Date
Msg-id 1166561276.10609.281240077@webmail.messagingengine.com
In response to Re: Inner join vs where-clause subquery  (Richard Huxton <dev@archonet.com>)
List pgsql-performance
I'm still confused as to why the inner join version ran so much faster
than the where-clause version.

Here's the inner join query and explain output:
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid from
activity_log_import_history) as a
on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'

Aggregate  (cost=246226.95..246226.96 rows=1 width=12)
  ->  Nested Loop  (cost=49233.27..231209.72 rows=1501722 width=12)
        ->  Result  (cost=0.04..0.05 rows=1 width=0)
              InitPlan
                ->  Limit  (cost=0.00..0.04 rows=1 width=4)
                      ->  Index Scan Backward using activity_log_import_history_end_nlogid_idx on activity_log_import_history  (cost=0.00..114.97 rows=2913 width=4)
                            Filter: (end_nlogid IS NOT NULL)
        ->  Bitmap Heap Scan on activity_log_facts  (cost=49233.23..210449.44 rows=1660817 width=12)
              Recheck Cond: (activity_log_facts.nlogid > a.previous_nlogid)
              Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
              ->  Bitmap Index Scan on activity_log_facts_nlogid_idx  (cost=0.00..49233.23 rows=1660817 width=0)
                    Index Cond: (activity_log_facts.nlogid > a.previous_nlogid)


Since the inner join is essentially the same thing as the where-clause
subquery, why does it generate such a different plan?
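
For reference, the where-clause version I'm comparing against is
essentially the same scalar subquery moved into the WHERE clause,
roughly:

select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > ( select max(end_nlogid)
                 from activity_log_import_history )
and dtCreateDate < '2006-12-18 9:10'

Logically it should return the same rows as the join form, since the
derived table produces exactly one previous_nlogid value.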



On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> >        max(nlogid) as end_nlogid,
> >        min(dtCreateDate) as start_transaction_timestamp,
> >        max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual time=0.018..0.019 rows=1 loops=1)
> >   ->  Index Scan using activity_log_facts_nlogid_idx on activity_log_facts  (cost=0.00..652.64 rows=472 width=12) (actual time=0.014..0.014 rows=0 loops=1)
> >         Index Cond: (nlogid > 478287801)
> >         Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without time zone)
> > Total runtime: 0.076 ms
> >
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains.  Why is this explain so much simpler than the
> > other query plan (with the subquery)?
>
> Because it's planning it with knowledge of what "nlogid"s it's filtering
> by. It knows it isn't going to get many rows back with nlogid >
> 478287801. In your previous explain it thought a large number of rows
> would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
>    Richard Huxton
>    Archonet Ltd
