Thread: Spatial join insists on sequential scan of larger table

Spatial join insists on sequential scan of larger table

From
Clive Page
Date:
I am trying to do a spatial join between two tables each of which has a
column of type BOX called ERRBOX, with R-TREE indices created on both.

The smaller table, xmm1, has      56,711 rows,
the larger one, twomass, has 177,757,299 rows.

The most efficient way to join these is to do a sequential scan of the
smaller table, and an R-tree lookup on the larger.  However for a simple
inner join the optimiser seems to want to do the reverse, for example:

EXPLAIN
SELECT x.ra AS xra, x.decl AS xdecl, t.ra AS tra, t.decl AS tdecl
FROM xmm1 AS x INNER JOIN twomass AS t
ON x.errbox && t.errbox;

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..196642756520.34 rows=49506496044 width=32)
   ->  Seq Scan on twomass t  (cost=0.00..9560002.72 rows=177023872 width=48)
   ->  Index Scan using xmm1box on xmm1 x  (cost=0.00..1107.28 rows=280 width=48)
         Index Cond: (x.errbox && "outer".errbox)


Reversing the join condition (i.e. t.errbox && x.errbox) and similar make
no difference, nor does using the old implicit join syntax.

If, however, I specify an outer join such as:

EXPLAIN
SELECT x.ra AS xra, x.decl AS xdecl, t.ra AS tra, t.decl AS tdecl
FROM xmm1 AS x LEFT OUTER JOIN twomass AS t
ON x.errbox && t.errbox;

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..198945259325.90 rows=49506496044
width=32)
   ->  Seq Scan on xmm1 x  (cost=0.00..8592.32 rows=55932 width=48)
   ->  Index Scan using tbox on twomass t  (cost=0.00..3545848.88 rows=885119 width=48)
         Index Cond: ("outer".errbox && t.errbox)


This executes, it need hardly be said, a whole lot faster.

I found that I can also force a sequential scan of the smaller table by
dropping its R-tree index, but I may need this in other operations, so
this isn't a very satisfactory solution.  It's odd that an outer join
should be faster than an inner one, or to put it another way, after
dropping an index there is more than an order of magnitude speed increase.

I'm using Postgres 7.4.1 on Red Hat Linux.  Has anyone had similar
problems with spatial joins?


--
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH,  U.K.



Re: Spatial join insists on sequential scan of larger table

From
Tom Lane
Date:
Clive Page <cgp@star.le.ac.uk> writes:
> This executes, it need hardly be said, a whole lot faster.

Could we see EXPLAIN ANALYZE output?

The estimated costs for the two cases are nearly the same, which says to
me that there's something wrong with the cost model for r-tree lookups,
but I don't know what it is.

            regards, tom lane

Re: Spatial join insists on sequential scan of larger

From
Clive Page
Date:
On Fri, 2 Apr 2004, Tom Lane wrote:

> Could we see EXPLAIN ANALYZE output?

Certainly, but that's going to take a little time (as the ANALYZE causes
it to run the actual query, which I only just discovered), so may have to
wait until Monday if I don't get time to finish it this afternoon.


--
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH,  U.K.


Re: Spatial join insists on sequential scan of larger

From
Clive Page
Date:
On Fri, 2 Apr 2004, Tom Lane wrote:

> Could we see EXPLAIN ANALYZE output?

The original EXPLAIN output was:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..196642756520.34 rows=49506496044 width=32)
   ->  Seq Scan on twomass t  (cost=0.00..9560002.72 rows=177023872 width=48)
   ->  Index Scan using xmm1box on xmm1 x  (cost=0.00..1107.28 rows=280 width=48)
         Index Cond: (x.errbox && "outer".errbox)

The EXPLAIN ANALYZE query was:

explain analyze
SELECT x.ra AS xra, x.decl AS xdecl, t.ra AS tra, t.decl AS tdecl
INTO tempjoin
FROM xmm1 AS x INNER JOIN twomass AS t
ON x.errbox && t.errbox;

And this produced:

\timing
Timing is on.
dw=# \i join1.sql
                                                              QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..196642756520.34 rows=49506496044 width=32) (actual time=701.919..7796111.624 rows=1513
loops=1)
   ->  Seq Scan on twomass t  (cost=0.00..9560002.72 rows=177023872 width=48) (actual time=22.064..617462.486
rows=177757299loops=1) 
   ->  Index Scan using xmmbox on xmm1 x  (cost=0.00..1107.28 rows=280 width=48) (actual time=0.036..0.036 rows=0
loops=177757299)
         Index Cond: (x.errbox && "outer".errbox)
 Total runtime: 7796410.533 ms
(5 rows)

Time: 7796996.093 ms


--
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH,  U.K.