Re: join selectivity - Mailing list pgsql-hackers

From Mark Cave-Ayland
Subject Re: join selectivity
Date
Msg-id 9EB50F1A91413F4FA63019487FCD251DAD1F@WEBBASEDDC.webbasedltd.local
Whole thread Raw
In response to Re: join selectivity  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: join selectivity
List pgsql-hackers
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 13 December 2004 17:16
> To: Mark Cave-Ayland
> Cc: strk@refractions.net; pgsql-hackers@postgresql.org;
> postgis-devel@postgis.refractions.net
> Subject: Re: [HACKERS] join selectivity
>
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > For a query like this:
> >
> > SELECT id FROM table1, table2
> > WHERE table1.geom && table2.geom;
> >
> > RESTRICT selectivity is invoked twice and
> > JOIN selectivity is invoked once.
>
> Hm, are you testing in a context where both tables have
> indexes that are relevant to the && operator?
>
> The estimated join result size is computed from the join
> selectivity estimate for the && operator.  I was about to say
> that restriction selectivity wouldn't be used at all, but on
> second thought I believe that it would be invoked while
> considering nestloop with inner indexscan plans.  That is,
> we'd consider
>
>     NestLoop
>         Seq Scan on table2
>         Indexscan on table1
>             IndexCond: table1.geom && outer.geom
>
> and to determine the estimated cost of each indexscan, we
> would invoke restriction selectivity for &&, with varRelid
> referencing table1. Given this call you are supposed to treat
> table2.geom as a constant of uncertain value, so the thing is
> semantically sensible as a restriction clause for table1
> (whether you can produce a really good estimate is another
> question :-().
>
> Similarly, we'd consider the reverse plan with table1 as
> outer, and that would give rise to another restriction
> selectivity check with varRelid = table2.

Just to clarify, here are the explain results from strk's query:


strk=# explain analyze select * from test1, test2 where test1.geom &&
test2.geom;
NOTICE:  LWGEOM_gist_joinsel called (returning 0.000005)                                                 QUERY PLAN

----------------------------------------------------------------------------
----------------------------------Nested Loop  (cost=3.27..105.84 rows=1 width=64) (actual time=0.217..39.305
rows=2700 loops=1)  Join Filter: ("inner".geom && "outer".geom)  ->  Seq Scan on test2  (cost=0.00..28.32 rows=132
width=32)(actual 
time=0.081..1.111 rows=108 loops=1)  ->  Materialize  (cost=3.27..3.52 rows=25 width=32) (actual
time=0.001..0.011 rows=25 loops=108)        ->  Seq Scan on test1  (cost=0.00..3.25 rows=25 width=32) (actual
time=0.043..0.129 rows=25 loops=1)  Total runtime: 40.471 ms (6 rows)


.... so with no indices the JOIN function is called once, RESTRICT never. I
can understand this :)


strk=# create index test2_gist on test2 using gist (geom gist_geometry_ops);
CREATE INDEX
strk=# explain analyze select * from test1, test2 where test1.geom &&
test2.geom;
NOTICE:  LWGEOM_gist_joinsel called (returning 0.000005)
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity                                                 QUERY
PLAN

----------------------------------------------------------------------------
----------------------------------Nested Loop  (cost=3.27..92.11 rows=1 width=64) (actual time=0.046..39.219
rows=2700 loops=1)  Join Filter: ("inner".geom && "outer".geom)  ->  Seq Scan on test2  (cost=0.00..28.08 rows=108
width=32)(actual 
time=0.009..0.198 rows=108 loops=1)  ->  Materialize  (cost=3.27..3.52 rows=25 width=32) (actual
time=0.000..0.013 rows=25 loops=108)        ->  Seq Scan on test1  (cost=0.00..3.25 rows=25 width=32) (actual
time=0.002..0.052 rows=25 loops=1)  Total runtime: 40.307 ms (6 rows)


...with one index RESTRICT is called twice.....


strk=# create index test1_gist on test1 using gist (geom gist_geometry_ops);
CREATE INDEX
strk=# explain analyze select * from test1, test2 where test1.geom &&
test2.geom;
NOTICE:  LWGEOM_gist_joinsel called (returning 0.000005)
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity                                                 QUERY
PLAN

----------------------------------------------------------------------------
----------------------------------Nested Loop  (cost=3.27..92.11 rows=1 width=64) (actual time=0.052..38.867
rows=2700 loops=1)  Join Filter: ("inner".geom && "outer".geom)  ->  Seq Scan on test2  (cost=0.00..28.08 rows=108
width=32)(actual 
time=0.012..0.181 rows=108 loops=1)  ->  Materialize  (cost=3.27..3.52 rows=25 width=32) (actual
time=0.000..0.010 rows=25 loops=108)        ->  Seq Scan on test1  (cost=0.00..3.25 rows=25 width=32) (actual
time=0.002..0.032 rows=25 loops=1)  Total runtime: 40.027 ms (6 rows)


...and with two indices RESTRICT is called four times. The part I find
confusing is why with one index that RESTRICT is called twice. Surely for a
nested loop plan of the form you gave before which was:
    NestLoop        Seq Scan on table2        Indexscan on table1            IndexCond: table1.geom && outer.geom

Then if we just have an index on table1.geom this is the only plan that can
be considered - surely we cannot begin to consider the reverse plan because
there is no index on table2.geom. Based upon what you have suggested, I
would have expected that with one index RESTRICT would be called once, and
with two indices RESTRICT would be called twice.

I was also thinking whether calling RESTRICT when comparing with an unknown
value is worth doing at all, however I did think that perhaps if you are
using a cast to perform an operation on two datatypes, then you may be able
to imply something from the index, such as its physical size, and hint that
the planner should use a particular index in preference for the other. Would
it be correct to assume that if returning the same value for RESTRICT for
both means that the planner will choose one at random?


Many thanks,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




pgsql-hackers by date:

Previous
From: "Jim Buttafuoco"
Date:
Subject: Re: [Fwd: Re: race condition for drop schema cascade?]
Next
From: Tom Lane
Date:
Subject: Re: join selectivity