Inner join vs where-clause subquery

From: "Jeremy Haile"
I have the following query which performs extremely slow:
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > ( select max(a.end_nlogid) from
activity_log_import_history a)
and dtCreateDate < '2006-12-18 9:10'


If I change the where clause to use the subquery's return value directly, it
runs very fast:
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > 402123456
and dtCreateDate < '2006-12-18 9:10'


If I change the query to the following, it runs fast:
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid from
activity_log_import_history) as a
on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'


I am running PG 8.2.  Why is this the case?  Shouldn't the query
planner be smart enough to know that the first query is the same as the
second and third?  The inner query does not refer to any columns outside
of itself.  I personally find the first query easiest to read and wish
it performed well.

Jeremy Haile

Re: Inner join vs where-clause subquery

From: Richard Huxton
Date: Tue, 19 Dec 2006 16:31:41 +0000
Jeremy Haile wrote:
> I have the following query which performs extremely slow:
> select min(nlogid) as start_nlogid,
>        max(nlogid) as end_nlogid,
>        min(dtCreateDate) as start_transaction_timestamp,
>        max(dtCreateDate) as end_transaction_timestamp
> from activity_log_facts
> where nlogid > ( select max(a.end_nlogid) from
> activity_log_import_history a)
> and dtCreateDate < '2006-12-18 9:10'

Can you post the EXPLAIN ANALYSE for this one please? That'll show us
exactly what it's doing.

--
   Richard Huxton
   Archonet Ltd

Re: Inner join vs where-clause subquery

From: "Jeremy Haile"
Here is the explain analyze output:

Result  (cost=9.45..9.46 rows=1 width=0) (actual
time=156589.390..156589.391 rows=1 loops=1)
  InitPlan
    ->  Result  (cost=0.04..0.05 rows=1 width=0) (actual
    time=0.034..0.034 rows=1 loops=1)
          InitPlan
            ->  Limit  (cost=0.00..0.04 rows=1 width=4) (actual
            time=0.029..0.030 rows=1 loops=1)
                  ->  Index Scan Backward using
                  activity_log_import_history_end_nlogid_idx on
                  activity_log_import_history a  (cost=0.00..113.43
                  rows=2877 width=4) (actual time=0.027..0.027 rows=1
                  loops=1)
                        Filter: (end_nlogid IS NOT NULL)
    ->  Limit  (cost=0.00..1.19 rows=1 width=12) (actual
    time=0.052..0.052 rows=0 loops=1)
          ->  Index Scan using activity_log_facts_pkey on
          activity_log_facts  (cost=0.00..1831613.82 rows=1539298
          width=12) (actual time=0.050..0.050 rows=0 loops=1)
                Index Cond: (nlogid > $1)
                Filter: ((nlogid IS NOT NULL) AND (dtcreatedate <
                '2006-12-18 09:10:00'::timestamp without time zone))
    ->  Limit  (cost=0.00..1.19 rows=1 width=12) (actual
    time=0.006..0.006 rows=0 loops=1)
          ->  Index Scan Backward using activity_log_facts_pkey on
          activity_log_facts  (cost=0.00..1831613.82 rows=1539298
          width=12) (actual time=0.004..0.004 rows=0 loops=1)
                Index Cond: (nlogid > $1)
                Filter: ((nlogid IS NOT NULL) AND (dtcreatedate <
                '2006-12-18 09:10:00'::timestamp without time zone))
    ->  Limit  (cost=0.00..3.51 rows=1 width=12) (actual
    time=100221.955..100221.955 rows=0 loops=1)
          ->  Index Scan using activity_log_facts_dtcreatedate_idx on
          activity_log_facts  (cost=0.00..5406927.50 rows=1539298
          width=12) (actual time=100221.953..100221.953 rows=0 loops=1)
                Index Cond: (dtcreatedate < '2006-12-18
                09:10:00'::timestamp without time zone)
                Filter: ((dtcreatedate IS NOT NULL) AND (nlogid > $1))
    ->  Limit  (cost=0.00..3.51 rows=1 width=12) (actual
    time=56367.367..56367.367 rows=0 loops=1)
          ->  Index Scan Backward using
          activity_log_facts_dtcreatedate_idx on activity_log_facts
          (cost=0.00..5406927.50 rows=1539298 width=12) (actual
          time=56367.364..56367.364 rows=0 loops=1)
                Index Cond: (dtcreatedate < '2006-12-18
                09:10:00'::timestamp without time zone)
                Filter: ((dtcreatedate IS NOT NULL) AND (nlogid > $1))
Total runtime: 156589.605 ms


On Tue, 19 Dec 2006 16:31:41 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > I have the following query which performs extremely slow:
> > select min(nlogid) as start_nlogid,
> >        max(nlogid) as end_nlogid,
> >        min(dtCreateDate) as start_transaction_timestamp,
> >        max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > ( select max(a.end_nlogid) from
> > activity_log_import_history a)
> > and dtCreateDate < '2006-12-18 9:10'
>
> Can you post the EXPLAIN ANALYSE for this one please? That'll show us
> exactly what it's doing.
>
> --
>    Richard Huxton
>    Archonet Ltd

Re: Inner join vs where-clause subquery

From: Richard Huxton
Date: Tue, 19 Dec 2006 18:23:06 +0000
Jeremy Haile wrote:
> Here is the explain analyze output:

Well, the row estimates are about as far out as you can get:

>           ->  Index Scan using activity_log_facts_pkey on
>           activity_log_facts  (cost=0.00..1831613.82 rows=1539298
>           width=12) (actual time=0.050..0.050 rows=0 loops=1)

>           ->  Index Scan Backward using activity_log_facts_pkey on
>           activity_log_facts  (cost=0.00..1831613.82 rows=1539298
>           width=12) (actual time=0.004..0.004 rows=0 loops=1)

>           ->  Index Scan using activity_log_facts_dtcreatedate_idx on
>           activity_log_facts  (cost=0.00..5406927.50 rows=1539298
>           width=12) (actual time=100221.953..100221.953 rows=0 loops=1)

>           ->  Index Scan Backward using
>           activity_log_facts_dtcreatedate_idx on activity_log_facts
>           (cost=0.00..5406927.50 rows=1539298 width=12) (actual
>           time=56367.364..56367.364 rows=0 loops=1)

Hmm - it's using the indexes on dtCreateDate and nlogid which seems
broadly sensible, and then plans to limit the results for min()/max().
However, it's clearly wrong about how many rows will satisfy
  nlogid > (select max(a.end_nlogid) from activity_log_import_history a)

>>> select min(nlogid) as start_nlogid,
>>>        max(nlogid) as end_nlogid,
>>>        min(dtCreateDate) as start_transaction_timestamp,
>>>        max(dtCreateDate) as end_transaction_timestamp
>>> from activity_log_facts
>>> where nlogid > ( select max(a.end_nlogid) from
>>> activity_log_import_history a)
>>> and dtCreateDate < '2006-12-18 9:10'

If you run explain on the other forms of your query, I'd guess it's much
more accurate. There's a simple way to see if that is the issue. Run the
sub-query and substitute the actual value returned into the query above.
Then, try the same but with a prepared query. If it's down to nlogid
estimates then the first should be fast and the second slow.
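
For example, something like this (an untested sketch - "log_window" is just a
name I made up, and the literal is the value from your second query):

-- 1. Literal value, visible to the planner at plan time:
explain analyze select min(nlogid), max(nlogid),
       min(dtCreateDate), max(dtCreateDate)
from activity_log_facts
where nlogid > 402123456
and dtCreateDate < '2006-12-18 9:10';

-- 2. Same query prepared, so the planner only sees a placeholder:
prepare log_window(int, timestamp) as
select min(nlogid), max(nlogid),
       min(dtCreateDate), max(dtCreateDate)
from activity_log_facts
where nlogid > $1
and dtCreateDate < $2;

explain analyze execute log_window(402123456, '2006-12-18 9:10');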

--
   Richard Huxton
   Archonet Ltd

Re: Inner join vs where-clause subquery

From: "Jeremy Haile"
Here's the query and explain analyze using the result of the sub-query
substituted:

QUERY
explain analyze select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > 478287801
and dtCreateDate < '2006-12-18 9:10'

EXPLAIN ANALYZE
Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual
time=0.018..0.019 rows=1 loops=1)
  ->  Index Scan using activity_log_facts_nlogid_idx on
  activity_log_facts  (cost=0.00..652.64 rows=472 width=12) (actual
  time=0.014..0.014 rows=0 loops=1)
        Index Cond: (nlogid > 478287801)
        Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
        time zone)
Total runtime: 0.076 ms


Sorry if the reason should be obvious, but I'm not the best at
interpreting the explains.  Why is this explain so much simpler than the
other query plan (with the subquery)?



On Tue, 19 Dec 2006 18:23:06 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here is the explain analyze output:
>
> Well, the row estimates are about as far out as you can get:
>
> >           ->  Index Scan using activity_log_facts_pkey on
> >           activity_log_facts  (cost=0.00..1831613.82 rows=1539298
> >           width=12) (actual time=0.050..0.050 rows=0 loops=1)
>
> >           ->  Index Scan Backward using activity_log_facts_pkey on
> >           activity_log_facts  (cost=0.00..1831613.82 rows=1539298
> >           width=12) (actual time=0.004..0.004 rows=0 loops=1)
>
> >           ->  Index Scan using activity_log_facts_dtcreatedate_idx on
> >           activity_log_facts  (cost=0.00..5406927.50 rows=1539298
> >           width=12) (actual time=100221.953..100221.953 rows=0 loops=1)
>
> >           ->  Index Scan Backward using
> >           activity_log_facts_dtcreatedate_idx on activity_log_facts
> >           (cost=0.00..5406927.50 rows=1539298 width=12) (actual
> >           time=56367.364..56367.364 rows=0 loops=1)
>
> Hmm - it's using the indexes on dtCreateDate and nlogid which seems
> broadly sensible, and then plans to limit the results for min()/max().
> However, it's clearly wrong about how many rows will satisfy
>   nlogid > (select max(a.end_nlogid) from activity_log_import_history a)
>
> >>> select min(nlogid) as start_nlogid,
> >>>        max(nlogid) as end_nlogid,
> >>>        min(dtCreateDate) as start_transaction_timestamp,
> >>>        max(dtCreateDate) as end_transaction_timestamp
> >>> from activity_log_facts
> >>> where nlogid > ( select max(a.end_nlogid) from
> >>> activity_log_import_history a)
> >>> and dtCreateDate < '2006-12-18 9:10'
>
> If you run explain on the other forms of your query, I'd guess it's much
> more accurate. There's a simple way to see if that is the issue. Run the
> sub-query and substitute the actual value returned into the query above.
> Then, try the same but with a prepared query. If it's down to nlogid
> estimates then the first should be fast and the second slow.
>
> --
>    Richard Huxton
>    Archonet Ltd

Re: Inner join vs where-clause subquery

From: Richard Huxton
Date: Tue, 19 Dec 2006 20:02:35 +0000
Jeremy Haile wrote:
> Here's the query and explain analyze using the result of the sub-query
> substituted:
>
> QUERY
> explain analyze select min(nlogid) as start_nlogid,
>        max(nlogid) as end_nlogid,
>        min(dtCreateDate) as start_transaction_timestamp,
>        max(dtCreateDate) as end_transaction_timestamp
> from activity_log_facts
> where nlogid > 478287801
> and dtCreateDate < '2006-12-18 9:10'
>
> EXPLAIN ANALYZE
> Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual
> time=0.018..0.019 rows=1 loops=1)
>   ->  Index Scan using activity_log_facts_nlogid_idx on
>   activity_log_facts  (cost=0.00..652.64 rows=472 width=12) (actual
>   time=0.014..0.014 rows=0 loops=1)
>         Index Cond: (nlogid > 478287801)
>         Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
>         time zone)
> Total runtime: 0.076 ms
>
>
> Sorry if the reason should be obvious, but I'm not the best at
> interpreting the explains.  Why is this explain so much simpler than the
> other query plan (with the subquery)?

Because it's planning it with knowledge of what "nlogid"s it's filtering
by. It knows it isn't going to get many rows back with nlogid >
478287801. In your previous explain it thought a large number of rows
would match and was trying not to sequentially scan the
activity_log_facts table.
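
As an aside, what the planner does have to work with when it can't see the
value is just the column statistics gathered by ANALYZE. You can peek at them
like this (purely informational, nothing you need to change):

select n_distinct, histogram_bounds
from pg_stats
where tablename = 'activity_log_facts'
  and attname = 'nlogid';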

Ideally, the planner would evaluate the subquery in your original form
(it should know it's only getting one row back from max()). Then it
could plan the query as above. I'm not sure how tricky that is to do though.
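
In the meantime you could do the two steps yourself, e.g. in a small plpgsql
function that looks up the boundary first and then builds the main query with
EXECUTE, so it gets planned with the real value. A rough, untested sketch
(the function and variable names are made up):

create or replace function import_window(p_cutoff timestamp without time zone,
    out o_start_nlogid integer, out o_end_nlogid integer,
    out o_start_ts timestamp without time zone,
    out o_end_ts timestamp without time zone) as $$
declare
    v_boundary integer;
begin
    -- Step 1: fetch the boundary value on its own.
    select max(end_nlogid) into v_boundary from activity_log_import_history;

    -- Step 2: EXECUTE plans the statement at run time, so the planner sees
    -- the actual boundary value, as in your hand-substituted query.
    execute 'select min(nlogid), max(nlogid), min(dtCreateDate), max(dtCreateDate)
             from activity_log_facts
             where nlogid > ' || v_boundary::text
         || ' and dtCreateDate < ' || quote_literal(p_cutoff::text)
    into o_start_nlogid, o_end_nlogid, o_start_ts, o_end_ts;
end;
$$ language plpgsql;

-- e.g.  select * from import_window('2006-12-18 9:10');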

--
   Richard Huxton
   Archonet Ltd

Re: Inner join vs where-clause subquery

From: "Jeremy Haile"
Makes sense.  It is NOT executing the subquery more than once, is it?

On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> >        max(nlogid) as end_nlogid,
> >        min(dtCreateDate) as start_transaction_timestamp,
> >        max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> >   ->  Index Scan using activity_log_facts_nlogid_idx on
> >   activity_log_facts  (cost=0.00..652.64 rows=472 width=12) (actual
> >   time=0.014..0.014 rows=0 loops=1)
> >         Index Cond: (nlogid > 478287801)
> >         Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> >         time zone)
> > Total runtime: 0.076 ms
> >
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains.  Why is this explain so much simpler than the
> > other query plan (with the subquery)?
>
> Because it's planning it with knowledge of what "nlogid"s it's filtering
> by. It knows it isn't going to get many rows back with nlogid >
> 478287801. In your previous explain it thought a large number of rows
> would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
>    Richard Huxton
>    Archonet Ltd

Re: Inner join vs where-clause subquery

From: "Jeremy Haile"
I'm still confused as to why the inner join version ran so much faster
than the where-clause version.

Here's the inner join query and explain output:
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid from
activity_log_import_history) as a
on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'

Aggregate  (cost=246226.95..246226.96 rows=1 width=12)
  ->  Nested Loop  (cost=49233.27..231209.72 rows=1501722 width=12)
        ->  Result  (cost=0.04..0.05 rows=1 width=0)
              InitPlan
                ->  Limit  (cost=0.00..0.04 rows=1 width=4)
                      ->  Index Scan Backward using
                      activity_log_import_history_end_nlogid_idx on
                      activity_log_import_history  (cost=0.00..114.97
                      rows=2913 width=4)
                            Filter: (end_nlogid IS NOT NULL)
        ->  Bitmap Heap Scan on activity_log_facts
        (cost=49233.23..210449.44 rows=1660817 width=12)
              Recheck Cond: (activity_log_facts.nlogid >
              a.previous_nlogid)
              Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp
              without time zone)
              ->  Bitmap Index Scan on activity_log_facts_nlogid_idx
              (cost=0.00..49233.23 rows=1660817 width=0)
                    Index Cond: (activity_log_facts.nlogid >
                    a.previous_nlogid)


Since the inner join is basically the same thing as the where-clause
subquery, why does it generate such a different plan?



On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> >        max(nlogid) as end_nlogid,
> >        min(dtCreateDate) as start_transaction_timestamp,
> >        max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> >   ->  Index Scan using activity_log_facts_nlogid_idx on
> >   activity_log_facts  (cost=0.00..652.64 rows=472 width=12) (actual
> >   time=0.014..0.014 rows=0 loops=1)
> >         Index Cond: (nlogid > 478287801)
> >         Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> >         time zone)
> > Total runtime: 0.076 ms
> >
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains.  Why is this explain so much simpler than the
> > other query plan (with the subquery)?
>
> Because it's planning it with knowledge of what "nlogid"s it's filtering
> by. It knows it isn't going to get many rows back with nlogid >
> 478287801. In your previous explain it thought a large number of rows
> would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
>    Richard Huxton
>    Archonet Ltd