Re: Index oddity - Mailing list pgsql-performance

From Tom Lane
Subject Re: Index oddity
Date
Msg-id 19759.1086843718@sss.pgh.pa.us
Whole thread Raw
In response to Re: Index oddity  (ken <southerland@samsixedd.com>)
Responses Re: Index oddity (still)  (ken <southerland@samsixedd.com>)
List pgsql-performance
ken <southerland@samsixedd.com> writes:
> ... and here is the plan with statistics set to 1000 ...

>  Seq Scan on nrgfeature f  (cost=0.00..31675.57 rows=18608 width=218)
> (actual time=63.544..1002.701 rows=225 loops=1)
>    Filter: ((upperrightx > 321264.236977215::double precision) AND
> (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
> 123286.261898636::double precision) AND (lowerlefty <
> 124985.927450476::double precision) AND (diagonalsize > 49.999::double
> precision))

> ... so yeah, its obviously finding way, way less rows than it thinks it
> will.

Yup.  I think your problem here is that the conditions on the different
columns are highly correlated, but the planner doesn't know anything
about that, because it has no cross-column statistics.

You could check that the individual conditions are accurately estimated
by looking at the estimated and actual row counts in

explain analyze
SELECT * FROM nrgfeature f WHERE upperRightX > 321264.23697721504;

explain analyze
SELECT * FROM nrgfeature f WHERE lowerLeftX < 324046.79981208267;

etc --- but I'll bet lunch that they are right on the money with the
higher statistics targets, and probably not too far off even at the
default.  The trouble is that the planner is simply multiplying these
probabilities together to get its estimate for the combined query,
and when the columns are not independent that leads to a very bad
estimate.

In particular it sounds like you have a *whole lot* of rows that have
diagonalSize just under 50, and so changing the diagonalSize condition
to include those makes for a big change in the predicted number of rows,
even though for the specific values of upperRightX and friends in your
test query there isn't any change in the actual number of matching rows.

I don't have any advice to magically solve this problem.  I would
suggest experimenting with alternative data representations -- for
example, maybe it would help to store "leftX" and "width" in place
of "leftX" and "rightX", etc.  What you want to do is try to decorrelate
the column values.  leftX and rightX are likely to have a strong
correlation, but maybe leftX and width don't.

            regards, tom lane

pgsql-performance by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: Index oddity
Next
From: Frank van Vugt
Date:
Subject: *very* inefficient choice made by the planner (regarding IN(...))