Thread: Index use with left join

Index use with left join

From: "Julian Scarfe"
In a 7.4 database I have a table n, of which the relevant columns seem to be:

                    Table "public.n"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 code          | text                        | not null
 ref           | text                        | not null
 part          | integer                     | not null
...
 q_node        | point                       |
 bbox          | box                         |

with indexes

Indexes:
    "n_pkey" primary key, btree (code, ref, part)
    "n_bbox" rtree (bbox)

Performance on simple spatial queries seems sensible.

1) Using the index

explain analyze
select  n.ref, n.code
        from n
        where bbox && box (point (-0.032, 0.873), point (0.017, 0.908))
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Index Scan using n_bbox on n  (cost=0.00..88.33 rows=22 width=20) (actual time=0.090..7.309 rows=396 loops=1)
   Index Cond: (bbox && '(0.017,0.908),(-0.032,0.873)'::box)
 Total runtime: 7.713 ms

2) Filtering on a different criterion to force a sequential scan. (These 150
rows are a subset of those from query 1)

explain analyze
select  n.ref, n.code
        from n
        where box (q_node, q_node)
                @ box (point (-0.032, 0.873), point (0.017, 0.908))
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on n  (cost=0.00..9087.60 rows=5 width=20) (actual time=646.273..1135.774 rows=150 loops=1)
   Filter: (box(q_node, q_node) @ '(0.017,0.908),(-0.032,0.873)'::box)
 Total runtime: 1136.919 ms

3) Combining the two, the strategy seems sensible

explain analyze
select  n.ref, n.code
        from n
        where bbox && box (point (-0.032, 0.873), point (0.017, 0.908))
        and box (q_node, q_node)
                @ box (point (-0.032, 0.873), point (0.017, 0.908))
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Index Scan using n_bbox on n  (cost=0.00..88.44 rows=1 width=20) (actual time=0.360..11.482 rows=150 loops=1)
   Index Cond: (bbox && '(0.017,0.908),(-0.032,0.873)'::box)
   Filter: (box(q_node, q_node) @ '(0.017,0.908),(-0.032,0.873)'::box)
 Total runtime: 11.772 ms

So far so good.  Now I want to left join it with another table a (again,
just the columns that appear relevant)

                    Table "public.a"
       Column        |         Type          | Modifiers
---------------------+-----------------------+-----------
 ident               | character(4)          |
 name                | character varying(30) |
...
Indexes:
    "a_ident" unique, btree (ident)

4) First with a filter for which there's an index, like query 1

explain analyze
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where bbox && box (point (-0.032, 0.873), point (0.017, 0.908))
                                                         QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=1371.99..1449.68 rows=174 width=45) (actual time=182.082..211.555 rows=396 loops=1)
   Merge Cond: ("outer".code = "inner"."?column3?")
   ->  Sort  (cost=88.82..88.87 rows=22 width=20) (actual time=16.120..16.375 rows=396 loops=1)
         Sort Key: n.code
         ->  Index Scan using n_bbox on n  (cost=0.00..88.33 rows=22 width=20) (actual time=0.213..11.893 rows=396 loops=1)
               Index Cond: (bbox && '(0.017,0.908),(-0.032,0.873)'::box)
   ->  Sort  (cost=1283.17..1308.44 rows=10105 width=25) (actual time=164.296..170.774 rows=10054 loops=1)
         Sort Key: (a.ident)::text
         ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=0.066..68.968 rows=10105 loops=1)
 Total runtime: 214.777 ms

5) Now with a filter that forces a sequential scan

explain analyze
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where box (q_node, q_node)
                @ box (point (-0.032, 0.873), point (0.017, 0.908))
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=10370.84..10447.06 rows=40 width=45) (actual time=912.068..914.158 rows=150 loops=1)
   Merge Cond: ("outer".code = "inner"."?column3?")
   ->  Sort  (cost=9087.66..9087.68 rows=5 width=20) (actual time=728.757..728.847 rows=150 loops=1)
         Sort Key: n.code
         ->  Seq Scan on n  (cost=0.00..9087.60 rows=5 width=20) (actual time=314.466..726.479 rows=150 loops=1)
               Filter: (box(q_node, q_node) @ '(0.017,0.908),(-0.032,0.873)'::box)
   ->  Sort  (cost=1283.17..1308.44 rows=10105 width=25) (actual time=180.078..180.927 rows=1391 loops=1)
         Sort Key: (a.ident)::text
         ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=0.170..83.442 rows=10105 loops=1)
 Total runtime: 917.066 ms

Again, so far, nothing obviously unusual.

6) Now I combine the filters in 4 & 5 (as I did from 1 & 2 to get 3, which
performed in a similar time to 1)

explain analyze
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where bbox && box (point (-0.032, 0.873), point (0.017, 0.908))
        and box (q_node, q_node)
                @ box (point (-0.032, 0.873), point (0.017, 0.908))
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..851.06 rows=8 width=45) (actual time=11.662..7919.946 rows=150 loops=1)
   Join Filter: (("inner".ident)::text = "outer".code)
   ->  Index Scan using n_bbox on n  (cost=0.00..88.44 rows=1 width=20) (actual time=0.107..10.256 rows=150 loops=1)
         Index Cond: (bbox && '(0.017,0.908),(-0.032,0.873)'::box)
         Filter: (box(q_node, q_node) @ '(0.017,0.908),(-0.032,0.873)'::box)
   ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=0.006..18.044 rows=10105 loops=150)
 Total runtime: 7920.684 ms

Whoa!  Instead of a performance similar to query 4, it chooses a different
strategy, and takes 40 times as long. (Both tables just analyzed.)
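
The arithmetic of the nested loop accounts for the blow-up: the inner seq
scan on a is repeated for each of the 150 outer rows (loops=150), so the
join filter gets evaluated against roughly 150 x 10105, about 1.5 million
rows, and the repeated inner scans alone come to something like
150 x 18 ms, or about 2.7 s.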

By brute force:

set enable_nestloop to off;

explain analyze
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where bbox && box (point (-0.032, 0.873), point (0.017, 0.908))
        and box (q_node, q_node)
                @ box (point (-0.032, 0.873), point (0.017, 0.908))
                                                        QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=1371.62..1447.50 rows=8 width=45) (actual time=177.273..179.341 rows=150 loops=1)
   Merge Cond: ("outer".code = "inner"."?column3?")
   ->  Sort  (cost=88.45..88.45 rows=1 width=20) (actual time=8.452..8.538 rows=150 loops=1)
         Sort Key: n.code
         ->  Index Scan using n_bbox on n  (cost=0.00..88.44 rows=1 width=20) (actual time=0.109..7.031 rows=150 loops=1)
               Index Cond: (bbox && '(0.017,0.908),(-0.032,0.873)'::box)
               Filter: (box(q_node, q_node) @ '(0.017,0.908),(-0.032,0.873)'::box)
   ->  Sort  (cost=1283.17..1308.44 rows=10105 width=25) (actual time=165.520..166.348 rows=1391 loops=1)
         Sort Key: (a.ident)::text
         ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=0.042..69.560 rows=10105 loops=1)
 Total runtime: 182.275 ms
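
If I do end up relying on that, I assume the override can at least be
scoped to the problem statement rather than the whole session; an untested
sketch:

begin;
set local enable_nestloop to off;
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where bbox && box (point (-0.032, 0.873), point (0.017, 0.908))
        and box (q_node, q_node)
                @ box (point (-0.032, 0.873), point (0.017, 0.908));
commit;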

What's happening here, please?  How am I misleading the planner?  Is it
because the index is an rtree?

Yes, I should consider PostGIS for spatial stuff, but I've got what I've got
:-).

TIA

Julian Scarfe



Re: Index use with left join

From: Tom Lane
"Julian Scarfe" <julian@avbrief.com> writes:
> 6) Now I combine the filters in 4 & 5 (as I did from 1 & 2 to get 3, which
> performed in a similar time to 1)

> explain analyze
> select  n.ref, n.code, a.ident, a.name
>         from n left outer join a on (a.ident = n.code)
>         where bbox && box (point (-0.032, 0.873), point (0.017, 0.908))
>         and box (q_node, q_node)
>                 @ box (point (-0.032, 0.873), point (0.017, 0.908))
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Nested Loop Left Join  (cost=0.00..851.06 rows=8 width=45) (actual time=11.662..7919.946 rows=150 loops=1)
>    Join Filter: (("inner".ident)::text = "outer".code)
>    ->  Index Scan using n_bbox on n  (cost=0.00..88.44 rows=1 width=20) (actual time=0.107..10.256 rows=150 loops=1)
>          Index Cond: (bbox && '(0.017,0.908),(-0.032,0.873)'::box)
>          Filter: (box(q_node, q_node) @ '(0.017,0.908),(-0.032,0.873)'::box)
>    ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=0.006..18.044 rows=10105 loops=150)
>  Total runtime: 7920.684 ms

> Whoa!  Instead of a performance similar to query 4, it chooses a different
> strategy, and takes 40 times as long. (Both tables just analyzed.)

The problem is that it's underestimating the number of rows pulled from
the n table (1 vs actual 150), which makes a simple nestloop join look
like the way to go.  That error comes from the fact that we don't really
have any statistical estimation for geometric conditions :-(.  Some of
the PostGIS hackers have been working on such, I believe, but I'm not
sure how far they've gotten.

            regards, tom lane

Re: Index use with left join

From: "Julian Scarfe"
From: "Tom Lane" <tgl@sss.pgh.pa.us>

> The problem is that it's underestimating the number of rows pulled from
> the n table (1 vs actual 150), which makes a simple nestloop join look
> like the way to go.  That error comes from the fact that we don't really
> have any statistical estimation for geometric conditions :-(.  Some of
> the PostGIS hackers have been working on such, I believe, but I'm not
> sure how far they've gotten.

Thanks Tom.

I can see the poor estimation of the geometric filter is a significant
factor.

Reviewing it I wondered if the coarse/fine nature of the filters was also an
issue.  Query 4, with the filter on the index, selects 396 of the ~20,000
rows of the table (having estimated 22).  Query 5, with the filter requiring
a Seq scan, selects 150 of the ~20,000 rows of the table (having estimated
5), all of which were returned by Query 4.  Does the planner "realise" that
the intersection, Query 6, will still return 150 rows, or does it assume
independence of the filters in some way and estimate
20,000*(150/20,000)*(396/20,000)?
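
(As a back-of-envelope check, assuming it does the latter using its own
per-condition estimates rather than the actual counts: query 1 estimated
22 rows for the bbox filter and query 2 estimated 5 rows for the q_node
filter, so independence would give roughly
20,000 * (22/20,000) * (5/20,000), i.e. about 0.006 rows, which the
planner would clamp up to the rows=1 that shows in the query 6 plan.)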

I guess I can test that by trying it with a non-geometric column and a btree
index:

7) Filtered on a condition requiring a sequential scan

explain analyze
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where code ~~ 'EGT%';
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=10361.53..10441.90 rows=419 width=45) (actual time=731.175..732.615 rows=248 loops=1)
   Merge Cond: ("outer".code = "inner"."?column3?")
   ->  Sort  (cost=9078.36..9078.49 rows=53 width=20) (actual time=547.154..547.300 rows=248 loops=1)
         Sort Key: n.code
         ->  Seq Scan on n  (cost=0.00..9076.84 rows=53 width=20) (actual time=260.558..543.587 rows=248 loops=1)
               Filter: (code ~~ 'EGT%'::text)
   ->  Sort  (cost=1283.17..1308.44 rows=10105 width=25) (actual time=180.359..181.149 rows=1292 loops=1)
         Sort Key: (a.ident)::text
         ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=0.125..83.844 rows=10105 loops=1)
 Total runtime: 735.613 ms

8) Filtered using the index, but returning a subset of the 248 rows from
Query 7

explain analyze
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where code = 'EGTT';
                                                         QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=1283.17..1572.15 rows=411 width=45) (actual time=451.609..510.507 rows=226 loops=1)
   Merge Cond: ("outer".code = "inner"."?column3?")
   ->  Index Scan using n_pkey on n  (cost=0.00..208.82 rows=52 width=20) (actual time=17.301..73.840 rows=226 loops=1)
         Index Cond: (code = 'EGTT'::text)
   ->  Sort  (cost=1283.17..1308.44 rows=10105 width=25) (actual time=430.231..431.032 rows=1279 loops=1)
         Sort Key: (a.ident)::text
         ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=5.743..321.326 rows=10105 loops=1)
 Total runtime: 514.103 ms


9) Filtered on both

explain analyze
select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where code ~~ 'EGT%' and code = 'EGTT';
                                                        QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..971.58 rows=8 width=45) (actual time=53.634..12222.285 rows=226 loops=1)
   Join Filter: (("inner".ident)::text = "outer".code)
   ->  Index Scan using n_pkey on n  (cost=0.00..208.95 rows=1 width=20) (actual time=0.288..21.137 rows=226 loops=1)
         Index Cond: (code = 'EGTT'::text)
         Filter: (code ~~ 'EGT%'::text)
   ->  Seq Scan on a  (cost=0.00..611.05 rows=10105 width=25) (actual time=0.008..18.741 rows=10105 loops=226)
 Total runtime: 12223.328 ms


Similar problem.  Of course Query 9 is concocted and unrealistic, but it is
representative of the coarse/fine filter problem, where I select a set of
rows using an approximate, indexable filter (e.g. the bounding box in the
geometric case) and then apply a second, exact but computationally
expensive filter to keep only those rows I really want.

Any general suggestions for workarounds?

Julian



Re: Index use with left join

From: Tom Lane
"Julian Scarfe" <julian@avbrief.com> writes:
> Does the planner "realise" that
> the intersection, Query 6, will still return 150 rows, or does it assume
> independence of the filters in some way and estimate
> 20,000*(150/20,000)*(396/20,000)?

It assumes independence of the conditions --- which is why having two
of them reduced the rowcount estimate so much.  There are some limited
cases in which it can recognize redundant conditions, but offhand I
think that only works for scalar inequalities (like "x < 5 AND x < 6").
There just hasn't been a lot of work on geometric conditions ...

> Any general suggestions for workarounds?

Not much, other than trying to avoid redundant conditions.
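
(In the concocted query 9, for instance, the LIKE test adds nothing once
you have the equality, so the redundancy-free form is just your query 8:

select  n.ref, n.code, a.ident, a.name
        from n left outer join a on (a.ident = n.code)
        where code = 'EGTT';

which got the sane merge-join plan.)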

Did you look into the state of the PostGIS work on geometric statistics?
It's bad enough not knowing how to combine geometric conditions, but
when the individual estimates themselves are completely bogus, you've
really got a problem :-(

            regards, tom lane

Re: Index use with left join

From: "Julian Scarfe"
> "Julian Scarfe" <julian@avbrief.com> writes:
>> Does the planner "realise" that
>> the intersection, Query 6, will still return 150 rows, or does it assume
>> independence of the filters in some way and estimate
>> 20,000*(150/20,000)*(396/20,000)?

From: "Tom Lane" <tgl@sss.pgh.pa.us>

> It assumes independence of the conditions --- which is why having two
> of them reduced the rowcount estimate so much.  There are some limited
> cases in which it can recognize redundant conditions, but offhand I
> think that only works for scalar inequalities (like "x < 5 AND x < 6").

Even that's smarter than I dared hope for!

>> Any general suggestions for workarounds?
>
> Not much, other than trying to avoid redundant conditions.
>
> Did you look into the state of the PostGIS work on geometric statistics?

No, though PostGIS is clearly the way forward for my needs in the
medium/long term.

PostGIS stores bounding boxes for its geometric features.  Operators like
&& and @ act as intersection and containment tests on the bounding boxes,
while Intersects() and Contains() use exact but presumably more
computationally expensive functions.  I don't yet know how these, GiST
indexes and the planner get along together, but I imagine the issue I've
come across is one of the most important ones, if not the most important,
in spatially enabled databases.
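
If I've read the PostGIS documentation correctly, such a query would end up
with the same coarse/fine shape as mine; something like the following
sketch, where the table and column names are made up and the polygon is
just my earlier box written as WKT:

select  ref, code
        from features
        where geom && GeomFromText('POLYGON((-0.032 0.873, 0.017 0.873, 0.017 0.908, -0.032 0.908, -0.032 0.873))', -1)
        and Intersects(geom, GeomFromText('POLYGON((-0.032 0.873, 0.017 0.873, 0.017 0.908, -0.032 0.908, -0.032 0.873))', -1));

So presumably the same question about row estimates applies there too.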

Thanks again

Julian