Thread: Index oddity

From: ken

I'm having a performance issue that I just can't resolve, and it's very,
very curious.  Thought someone here might be able to shed some light on
the subject.

I'm using Postgres 7.4.2 on Red Hat 9.  I have a table with 763,809 rows
in it defined as follows ...

ksedb=# \d nrgfeature
                Table "public.nrgfeature"
     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 fid1           | numeric(64,0)               | not null
 fid2           | numeric(64,0)               | not null
 created        | timestamp without time zone | not null
 createdby      | character varying(30)       | not null
 modified       | timestamp without time zone |
 modifiedby     | character varying(30)       |
 geommodified   | timestamp without time zone |
 geommodifiedby | character varying(30)       |
 deleted        | timestamp without time zone |
 deletedby      | character varying(30)       |
 featuretypeid  | smallint                    | not null
 description    | text                        |
 datasourceid   | smallint                    | not null
 lowerleftx     | double precision            | not null
 lowerlefty     | double precision            | not null
 upperrightx    | double precision            | not null
 upperrighty    | double precision            | not null
 diagonalsize   | double precision            |
 login          | character varying(25)       |
Indexes:
    "nrgfeature_pkey" primary key, btree (fid1, fid2)
    "nrgfeature_ft_index" btree (featuretypeid)
    "nrgfeature_xys_index" btree (upperrightx, lowerleftx, upperrighty,
lowerlefty, diagonalsize)
Inherits: commonfidattrs,
          commonrevisionattrs


... If I write a query as follows ...

SELECT *
FROM nrgfeature f
WHERE
    upperRightX > 321264.23697721504
    AND lowerLeftX < 324046.79981208267
    AND upperRightY > 123286.26189863647
    AND lowerLeftY < 124985.92745047594
    AND diagonalSize > 50.000
;

... (or any value for diagonalSize over 50), then my query runs in 50-100
milliseconds.  However, if the diagonalSize value is changed to 49.999,
or any value below 50, then the query takes over a second, a factor-of-ten
degradation in speed, even though the exact same number of rows is
returned.

The query plan for diagonalSize > 50.000 is ...

Index Scan using nrgfeature_xys_index on nrgfeature f
(cost=0.00..17395.79 rows=4618 width=220)
   Index Cond: ((upperrightx > 321264.236977215::double precision) AND
(lowerleftx < 324046.799812083::double precision) AND (upperrighty >
123286.261898636::double precision) AND (lowerlefty <
124985.927450476::double precision) AND (diagonalsize > 50::double
precision))

... while for diagonalSize > 49.999 is ...

 Seq Scan on nrgfeature f  (cost=0.00..31954.70 rows=18732 width=220)
   Filter: ((upperrightx > 321264.236977215::double precision) AND
(lowerleftx < 324046.799812083::double precision) AND (upperrighty >
123286.261898636::double precision) AND (lowerlefty <
124985.927450476::double precision) AND (diagonalsize > 49.999::double
precision))

... and yes, if I set enable_seqscan=false then the index is forced to
be used.  However, besides being an undesirable solution for this simple
case, that doesn't solve the problem for the general case.  As soon as I
add in joins with a couple of tables to perform the actual query I want,
the seqscan setting no longer forces the index to be used.  Instead, the
primary key index is used at this same diagonalSize cutoff, the 5-part
double precision clause is applied as a filter to that index scan, and
the result is again a very slow query.
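
To be explicit, the forcing above just means issuing the session-level
setting before running the query:

SET enable_seqscan = false;
-- run the query, then restore the default
SET enable_seqscan = true;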

I can provide those queries and results but that would only complicate
this already lengthy email and the above seems to be the crux of the
problem anyway.

Any help or thoughts would be greatly appreciated of course.

Thanks,

Ken Southerland


--
------s----a----m----s----i----x----e----d----d------
--

Ken Southerland
Senior Consultant
Sam Six EDD
http://www.samsixedd.com

503-236-4288 (office)
503-358-6542 (cell)



Re: Index oddity

From: Rod Taylor

It seems to believe that the number of rows returned for the >49.999
case will be 4 times the number for the >50 case.  If that were true,
then the sequential scan would be correct.

ALTER TABLE <table> ALTER COLUMN diagonalsize SET STATISTICS 1000;
ANALYZE <table>;

Send back EXPLAIN ANALYZE output for the >49.999 case.
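
That is, something along these lines, using your original values:

EXPLAIN ANALYZE
SELECT *
FROM nrgfeature f
WHERE
    upperRightX > 321264.23697721504
    AND lowerLeftX < 324046.79981208267
    AND upperRightY > 123286.26189863647
    AND lowerLeftY < 124985.92745047594
    AND diagonalSize > 49.999
;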

> The query plan for diagonalSize > 50.000 is ...
>
> Index Scan using nrgfeature_xys_index on nrgfeature f
> (cost=0.00..17395.79 rows=4618 width=220)
>    Index Cond: ((upperrightx > 321264.236977215::double precision) AND
> (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
> 123286.261898636::double precision) AND (lowerlefty <
> 124985.927450476::double precision) AND (diagonalsize > 50::double
> precision))
>
> ... while for diagonalSize > 49.999 is ...
>
>  Seq Scan on nrgfeature f  (cost=0.00..31954.70 rows=18732 width=220)
>    Filter: ((upperrightx > 321264.236977215::double precision) AND
> (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
> 123286.261898636::double precision) AND (lowerlefty <
> 124985.927450476::double precision) AND (diagonalsize > 49.999::double
> precision))


Re: Index oddity

From: ken

Thanks Rod,

This setting has no effect, however.  If I set statistics to 1000, or
even 0 (and then re-analyze the table), I see no change in the behaviour
of the query plans; i.e., there is still the odd transition in the plans
at diagonalSize = 50.
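
In other words, I ran variations of ...

ALTER TABLE nrgfeature ALTER COLUMN diagonalsize SET STATISTICS 1000;
ANALYZE nrgfeature;

... with the different statistics values.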

Ken



On Wed, 2004-06-09 at 13:12, Rod Taylor wrote:
> It seems to believe that the number of rows returned for the >49.999
> case will be 4 times the number for the >50 case. If that were true, then
> the sequential scan would be correct.
>
> ALTER TABLE <table> ALTER COLUMN diagonalsize SET STATISTICS 1000;
> ANALYZE <table>;
>
> Send back EXPLAIN ANALYZE output for the >49.999 case.


Re: Index oddity

From: Rod Taylor

On Wed, 2004-06-09 at 16:50, ken wrote:
> Thanks Rod,
>
> This setting has no effect however.  If I set statistics to 1000, or

Okay, but you never did send EXPLAIN ANALYZE output.  I want to know
what it is really finding.



Re: Index oddity

From: ken

On Wed, 2004-06-09 at 13:56, Rod Taylor wrote:
> On Wed, 2004-06-09 at 16:50, ken wrote:
> > Thanks Rod,
> >
> > This setting has no effect however.  If I set statistics to 1000, or
>
> Okay, but you never did send EXPLAIN ANALYZE output.  I want to know
> what it is really finding.

Ah, sorry, missed the ANALYZE portion of your request (thought you only
wanted the EXPLAIN output if it changed due to the setting).

Here is the query plan with statistics on diagonalsize set to the
default (-1) ...

 Seq Scan on nrgfeature f  (cost=0.00..32176.98 rows=19134 width=218)
(actual time=61.640..1009.414 rows=225 loops=1)
   Filter: ((upperrightx > 321264.236977215::double precision) AND
(lowerleftx < 324046.799812083::double precision) AND (upperrighty >
123286.261898636::double precision) AND (lowerlefty <
124985.927450476::double precision) AND (diagonalsize > 49.999::double
precision))

... and here is the plan with statistics set to 1000 ...

 Seq Scan on nrgfeature f  (cost=0.00..31675.57 rows=18608 width=218)
(actual time=63.544..1002.701 rows=225 loops=1)
   Filter: ((upperrightx > 321264.236977215::double precision) AND
(lowerleftx < 324046.799812083::double precision) AND (upperrighty >
123286.261898636::double precision) AND (lowerlefty <
124985.927450476::double precision) AND (diagonalsize > 49.999::double
precision))

... so yeah, it's obviously finding way, way fewer rows than it thinks
it will.

thanks,

ken





Re: Index oddity

From: Rod Taylor

> ... and here is the plan with statistics set to 1000 ...
>
>  Seq Scan on nrgfeature f  (cost=0.00..31675.57 rows=18608 width=218)
> (actual time=63.544..1002.701 rows=225 loops=1)
>    Filter: ((upperrightx > 321264.236977215::double precision) AND
> (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
> 123286.261898636::double precision) AND (lowerlefty <
> 124985.927450476::double precision) AND (diagonalsize > 49.999::double
> precision))

It's better like this, but still way off the mark. Even your good query
which uses the index was out by more than an order of magnitude.

Try raising the statistics levels for upperrightx, lowerleftx,
upperrighty and lowerlefty.
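
For example:

ALTER TABLE nrgfeature ALTER COLUMN upperrightx SET STATISTICS 1000;
ALTER TABLE nrgfeature ALTER COLUMN lowerleftx SET STATISTICS 1000;
ALTER TABLE nrgfeature ALTER COLUMN upperrighty SET STATISTICS 1000;
ALTER TABLE nrgfeature ALTER COLUMN lowerlefty SET STATISTICS 1000;
ANALYZE nrgfeature;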

Failing that, you might be able to push the row estimate back down again
by giving diagonalsize an upper limit.  Perhaps 500 is a value that
would never occur.

    AND (diagonalsize BETWEEN 49.999::double precision AND 500)



Re: Index oddity

From: ken

I had already tried setting the statistics to 1000 for all five of these
double precision fields, with effectively no improvement.  Should have
mentioned that.

Also, the BETWEEN makes all values for diagonalSize slow, since it is
effectively doing diagonalSize > X AND diagonalSize < Y.  If I do a
query with the same values for the four x,y conditions and
diagonalSize < Y, then for Y = 49.999 the query is fast, but for
anything 50.000 and greater the query is slow.  The exact opposite of
the greater-than queries, not surprisingly.

I also think I originally reported that the two queries returned the
same number of rows.  That is not true.  It was true when I had other
joins in, but when I stripped the query down to this core problem I
should have noticed that the number of results differs between the two,
which I didn't at first.

If I take away the diagonalSize condition in my query I find that there
are 225 rows that satisfy the other conditions.  155 of these have a
diagonalSize value of exactly 50.000, while the remaining 70 rows all
have values larger than 50.  Thus there is a big discrete jump in the
number of rows at a diagonalSize of 50.  However, the statistics are off
by two orders of magnitude in guessing how many rows there are going to
be in this case, and thus the planner is not using my index.  How can I
fix that?
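
(Counts like those can be seen directly with a query along the lines
of ...

SELECT diagonalsize, count(*)
FROM nrgfeature
WHERE upperrightx > 321264.23697721504
  AND lowerleftx < 324046.79981208267
  AND upperrighty > 123286.26189863647
  AND lowerlefty < 124985.92745047594
GROUP BY diagonalsize
ORDER BY diagonalsize;

... which shows the big spike at exactly 50.000.)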

Ken





Re: Index oddity

From: Christopher Kings-Lynne

> If I take away the diagonalSize condition in my query I find that there
> are 225 rows that satisfy the other conditions.  155 of these have a
> diagonalSize value of exactly 50.000, while the remaining 70 rows all
> have values larger than 50.  Thus there is a big discrete jump in the
> number of rows at a diagonalSize of 50.  However, the statistics are off
> by two orders of magnitude in guessing how many rows there are going to
> be in this case and thus is not using my index.  How can I fix that?

Maybe you should drop your random_page_cost to something less than 4,
e.g. 3 or even 2 ...
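
i.e.:

SET random_page_cost = 2;
-- or permanently, in postgresql.conf:
--   random_page_cost = 2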

Chris

Re: Index oddity

From: Rod Taylor

On Wed, 2004-06-09 at 21:45, Christopher Kings-Lynne wrote:
> > If I take away the diagonalSize condition in my query I find that there
> > are 225 rows that satisfy the other conditions.  155 of these have a

> Maybe you should drop your random_page_cost to something less than 4,
> eg. 3 or even 2...

The big problem is a very poor estimate (off by a couple orders of
magnitude). I was hoping someone with more knowledge in fixing those
would jump in.



Re: Index oddity

From: Mark Kirkwood

Rod Taylor wrote:

> The big problem is a very poor estimate (off by a couple orders of
> magnitude). I was hoping someone with more knowledge in fixing those
> would jump in.

ANALYZE might be producing poor stats due to:

i) many dead tuples, or
ii) a high proportion of dead tuples in the first few pages of the table

Does a VACUUM FULL followed by ANALYZE change the estimates (or have you
tried this already)?
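
i.e. something like:

VACUUM FULL nrgfeature;
ANALYZE nrgfeature;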

(p.s. - I probably don't qualify for the 'more knowledge' bit either...)

regards

Mark

Re: Index oddity

From: "Joshua D. Drake"

> ANALYZE might be producing poor stats due to :
>
> i) many dead tuples or
> ii) high proportion of dead tuples in the first few pages of the table
>
> Does a VACUUM FULL followed by ANALYZE change the estimates (or have
> you tried this already)?
>
> (p.s. - I probably don't qualify for the 'more knowledge' bit either...)

You can also increase your statistics target, which will make ANALYZE
take longer but can help a great deal with larger data sets.
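
For example, either per column as Rod showed earlier, or globally for
every column that has no per-column setting:

SET default_statistics_target = 100;
ANALYZE nrgfeature;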

Sincerely,

Joshua D. Drake






--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL


Re: Index oddity

From: Tom Lane

ken <southerland@samsixedd.com> writes:
> ... and here is the plan with statistics set to 1000 ...

>  Seq Scan on nrgfeature f  (cost=0.00..31675.57 rows=18608 width=218)
> (actual time=63.544..1002.701 rows=225 loops=1)
>    Filter: ((upperrightx > 321264.236977215::double precision) AND
> (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
> 123286.261898636::double precision) AND (lowerlefty <
> 124985.927450476::double precision) AND (diagonalsize > 49.999::double
> precision))

> ... so yeah, it's obviously finding way, way fewer rows than it thinks
> it will.

Yup.  I think your problem here is that the conditions on the different
columns are highly correlated, but the planner doesn't know anything
about that, because it has no cross-column statistics.

You could check that the individual conditions are accurately estimated
by looking at the estimated and actual row counts in

explain analyze
SELECT * FROM nrgfeature f WHERE upperRightX > 321264.23697721504;

explain analyze
SELECT * FROM nrgfeature f WHERE lowerLeftX < 324046.79981208267;

etc --- but I'll bet lunch that they are right on the money with the
higher statistics targets, and probably not too far off even at the
default.  The trouble is that the planner is simply multiplying these
probabilities together to get its estimate for the combined query,
and when the columns are not independent that leads to a very bad
estimate.
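
To make that concrete: with 763,809 rows, if each of the five conditions
individually matched (say) 30% of the table, the independence assumption
gives an estimate of roughly 763809 * 0.3^5, or about 1,900 rows,
regardless of how strongly the columns are actually correlated, and
regardless of the fact that the true count for this rectangle is 225.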

In particular it sounds like you have a *whole lot* of rows that have
diagonalSize just under 50, and so changing the diagonalSize condition
to include those makes for a big change in the predicted number of rows,
even though for the specific values of upperRightX and friends in your
test query there isn't any change in the actual number of matching rows.

I don't have any advice to magically solve this problem.  I would
suggest experimenting with alternative data representations -- for
example, maybe it would help to store "leftX" and "width" in place
of "leftX" and "rightX", etc.  What you want to do is try to decorrelate
the column values.  leftX and rightX are likely to have a strong
correlation, but maybe leftX and width don't.
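
A rough sketch of that transformation (the new column names are just
for illustration):

ALTER TABLE nrgfeature ADD COLUMN width double precision;
ALTER TABLE nrgfeature ADD COLUMN height double precision;

UPDATE nrgfeature
   SET width  = upperrightx - lowerleftx,
       height = upperrighty - lowerlefty;

-- then constrain width/height in queries instead of the second pair
-- of coordinates, and re-run ANALYZE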

            regards, tom lane

Re: Index oddity (still)

From: ken

Apologies in advance for the length of this post but I want to be as
thorough as possible in describing my problem to avoid too much banter
back and forth.

First off, thanks to all for your help with my index problem on my
multi-column index made up of 5 double precision columns.

Unfortunately, I was unable to make it work with any of the suggestions
provided.  However, Tom's suggestion of experimenting with alternative
data representations led me to explore using the built-in geometric type
box to store this information, since that is exactly what I am storing.

By adding an rtree index on this box column I was able to make this new
method work beautifully for most cases!  The query that was taking over
a second went down to under 20 milliseconds!  Wow!

However, for *some* cases the query now behaves very badly.  My new
table is now essentially the following (with the unnecessary bits
removed for your ease of viewing) ...

     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 fid1           | numeric(64,0)               | not null
 fid2           | numeric(64,0)               | not null
 boundingbox    | box                         |

Indexes:
    "nrgfeature_pkey" primary key, btree (fid2, fid1)
    "nrgfeature_bb_idx" rtree (boundingbox)
    "nrgfeature_bblength_idx" btree (length(lseg(boundingbox)))


... with 763,809 rows.

The query I run is of the form ...


SELECT * FROM nrgFeature WHERE
boundingbox && '((213854.57920364887, 91147.4541420119),
(212687.30997287965, 90434.4541420119))'
AND length(lseg(boundingbox)) > 12.916666666666668;


... If the bounding box is relatively small (like the one defined above)
then the query is very fast as relatively few rows are investigated by
the index.  The explain analyze for the above is ...


 Index Scan using nrgfeature_bb_idx on nrgfeature  (cost=0.00..14987.30
rows=1274 width=256) (actual time=0.046..0.730 rows=89 loops=1)
   Index Cond: (boundingbox &&
'(213854.579203649,91147.4541420119),(212687.30997288,90434.4541420119)'::box)
   Filter: (length(lseg(boundingbox)) > 12.9166666666667::double
precision)
 Total runtime: 0.830 ms


... Notice the statistics aren't great at guessing the number of rows;
however, since the estimate is low enough to tell the optimizer to use
the index, it does, and the query is blindingly fast.  Now, however,
let's try to retrieve all the rows that overlap a much, much bigger
bounding box, but limit the results to rows with very large bounding
boxes (we are displaying these things on the screen, and if they are too
small to see at this scale there is no reason to return them in the
query) ...


SELECT * FROM nrgFeature WHERE
boundingbox && '((793846.1538461539, 423000.0), (-109846.15384615387,
-129000.0))'
AND length(lseg(boundingbox)) > 10000.0;


... and its explain analyze is ...


 Index Scan using nrgfeature_bb_idx on nrgfeature  (cost=0.00..14987.30
rows=1274 width=256) (actual time=1.861..6427.876 rows=686 loops=1)
   Index Cond: (boundingbox &&
'(793846.153846154,423000),(-109846.153846154,-129000)'::box)
   Filter: (length(lseg(boundingbox)) > 10000::double precision)
 Total runtime: 6428.838 ms


... notice that the query now takes 6.4 seconds even though the
statistics look to be pretty close and only 686 rows are returned.  The
reason is the condition on the length(lseg()) function.  Without this
condition the explain analyze is the following ...


 Index Scan using nrgfeature_bb_idx on nrgfeature  (cost=0.00..14958.66
rows=3820 width=256) (actual time=21.356..7750.360 rows=763768 loops=1)
   Index Cond: (boundingbox &&
'(793846.153846154,423000),(-109846.153846154,-129000)'::box)
 Total runtime: 8244.213 ms


... in which it can be seen that the statistics are way, way off.  It
thinks it's only going to get back 3820 rows, but instead almost every
row in the table is returned!  It should *not* be using this index in
this case.  Is there something wrong with rtree indexes on box data
types?

So that is problem number one.

Problem number two, and this is possibly a bug, is that postgres doesn't
seem to use functional indexes on geometric data.  In the case
immediately above, the optimizer should choose the
nrgfeature_bblength_idx instead, as it would immediately give the 686
rows that satisfy the length(lseg()) condition, and voila, it would be
done.  However, even if I *drop* the rtree index on the boundingbox
column, so that it can't use that index, the optimizer does not choose
the other index.  Instead it reverts to doing a sequential scan of the
entire table, and it's really slow.

Again, sorry for the long post.  Hope someone has experience with either
of these problems.

Ken




> I don't have any advice to magically solve this problem.  I would
> suggest experimenting with alternative data representations -- for
> example, maybe it would help to store "leftX" and "width" in place
> of "leftX" and "rightX", etc.  What you want to do is try to decorrelate
> the column values.  leftX and rightX are likely to have a strong
> correlation, but maybe leftX and width don't.


Re: Index oddity (still)

From: Tom Lane

ken <southerland@samsixedd.com> writes:
> Is there something wrong with rtree indexes on box data types?

Not per se, but the selectivity estimator for && is just a stub :-(.
Feel free to take a stab at improving this, or take a look at PostGIS
--- I think those guys are working on the same problem even now.

> However, even if I *drop* the rtree index on the boundingbox
> column, so that it can't use that index, the optimizer does not choose
> the other index.  Instead it reverts to doing a sequential scan of the
> entire table and its really slow.

This is also a lack-of-statistical-knowledge problem.  The optimizer has
no stats about the distribution of length(lseg(boundingbox)) and so it
does not realize that it'd be reasonable to use that index for a query
like "length(lseg(boundingbox)) > largevalue".  As of 7.5 this issue
will be addressed, but in existing releases I think you could only get
that index to be used by forcing it with enable_seqscan = off :-(

As a short-term workaround it might be reasonable to store that length
as an independent table column, which you could put an index on.  You
could use triggers to ensure that it always matches the boundingbox.
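
An untested sketch of that (assumes plpgsql is installed, and the names
are just for illustration):

ALTER TABLE nrgfeature ADD COLUMN bblength double precision;
UPDATE nrgfeature SET bblength = length(lseg(boundingbox));
CREATE INDEX nrgfeature_bblength2_idx ON nrgfeature (bblength);

-- keep bblength in sync with boundingbox on every insert/update
CREATE FUNCTION set_bblength() RETURNS trigger AS '
BEGIN
    NEW.bblength := length(lseg(NEW.boundingbox));
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER nrgfeature_bblength_trg
    BEFORE INSERT OR UPDATE ON nrgfeature
    FOR EACH ROW EXECUTE PROCEDURE set_bblength();

-- queries would then test bblength > X instead of
-- length(lseg(boundingbox)) > X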

            regards, tom lane