Re: Index oddity (still) - Mailing list pgsql-performance

From ken
Subject Re: Index oddity (still)
Date
Msg-id 1087256828.28557.115.camel@pesky
Whole thread Raw
In response to Re: Index oddity  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Index oddity (still)
List pgsql-performance
Apologies in advance for the length of this post but I want to be as
thorough as possible in describing my problem to avoid too much banter
back and forth.

First off, thanks to all for your help with my index problem on my
multi-column index made up of 5 double precision columns.

Unfortunately, I was unable to make it work with any of the suggestions
provided.  However, Tom's suggestion of experimenting with alternative
data representations led me to explore using the built-in geometric data
object box to store this information since that is exactly what I am
storing.

By adding an rtree index on this box column I was able to make this new
method work beautifully for most cases!  The query that was taking over
a second went down under 20 milliseconds!  Wow!

However, for *some* cases the query now behaves very badly.  My new
table is now essentially the following (with the unnecessary bits
removed for your ease of viewing) ...

     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 fid1           | numeric(64,0)               | not null
 fid2           | numeric(64,0)               | not null
 boundingbox    | box                         |

Indexes:
    "nrgfeature_pkey" primary key, btree (fid2, fid1)
    "nrgfeature_bb_idx" rtree (boundingbox)
    "nrgfeature_bblength_idx" btree (length(lseg(boundingbox)))


... with 763,809 rows.

The query I run is of the form ...


SELECT * FROM nrgFeature WHERE
boundingbox && '((213854.57920364887, 91147.4541420119),
(212687.30997287965, 90434.4541420119))'
AND length(lseg(boundingbox)) > 12.916666666666668;


... If the bounding box is relatively small (like the one defined above)
then the query is very fast as relatively few rows are investigated by
the index.  The explain analyze for the above is ...


 Index Scan using nrgfeature_bb_idx on nrgfeature  (cost=0.00..14987.30
rows=1274 width=256) (actual time=0.046..0.730 rows=89 loops=1)
   Index Cond: (boundingbox &&
'(213854.579203649,91147.4541420119),(212687.30997288,90434.4541420119)'::box)
   Filter: (length(lseg(boundingbox)) > 12.9166666666667::double
precision)
 Total runtime: 0.830 ms


... Notice the statistics aren't great at guessing the number of rows,
however, since the number is sufficient to tell the optimizer to use the
index, it does and the query is blindingly fast.  However, now let's try
and retrieve all the rows that overlap a much, much bigger bounding box
but limit the results to rows with very large bounding boxes (we are
displaying these things on the screen and if they are too small to see
at this scale there is no reason to return them in the query)...


SELECT * FROM nrgFeature WHERE
boundingbox && '((793846.1538461539, 423000.0), (-109846.15384615387,
-129000.0))'
AND length(lseg(boundingbox)) > 10000.0;


... and its explain analyze is ...


 Index Scan using nrgfeature_bb_idx on nrgfeature  (cost=0.00..14987.30
rows=1274 width=256) (actual time=1.861..6427.876 rows=686 loops=1)
   Index Cond: (boundingbox &&
'(793846.153846154,423000),(-109846.153846154,-129000)'::box)
   Filter: (length(lseg(boundingbox)) > 10000::double precision)
 Total runtime: 6428.838 ms


... notice that the query now takes 6.4 seconds even though the
statistics look to be pretty close and only 686 rows are returned.  The
reason is due to the condition on the length(lseg()) functions.  Without
this condition the explain analyze is the following ...


 Index Scan using nrgfeature_bb_idx on nrgfeature  (cost=0.00..14958.66
rows=3820 width=256) (actual time=21.356..7750.360 rows=763768 loops=1)
   Index Cond: (boundingbox &&
'(793846.153846154,423000),(-109846.153846154,-129000)'::box)
 Total runtime: 8244.213 ms


... in which it can be seen that the statistics are way, way off.  It
thinks its only going to get back 3820 rows but instead almost every row
in the table is returned!  It should *not* be using this index in this
case.  Is there something wrong with rtree indexes on box data types?

So that is problem number one.

Problem number two, and this is possibly a bug, is that postgres doesn't
seem to use functional indexes on geometric data.  In the case
immediately above, the optimizer should choose the
nrgfeature_bblength_idx instead as it would immediately give the 686
rows that satisfy the length(lseg()) condition and voila it would be
done.  However, even if I *drop* the rtree index on the boundingbox
column, so that it can't use that index, the optimizer does not choose
the other index.  Instead it reverts to doing a sequential scan of the
entire table and its really slow.

Again, sorry for the long post.  Hope someone has experience with either
of these problems.

Ken




> I don't have any advice to magically solve this problem.  I would
> suggest experimenting with alternative data representations -- for
> example, maybe it would help to store "leftX" and "width" in place
> of "leftX" and "rightX", etc.  What you want to do is try to decorrelate
> the column values.  leftX and rightX are likely to have a strong
> correlation, but maybe leftX and width don't.
>
>             regards, tom lane
>


pgsql-performance by date:

Previous
From: "borajetta"
Date:
Subject: 50 000 000 Table entries and I have to do a keyword search HELP NEEDED
Next
From: Josh Berkus
Date:
Subject: Re: 50 000 000 Table entries and I have to do a keyword search HELP NEEDED