Thread: Index oddity
I'm having a performance issue that I just can't resolve, and it's very, very curious. Thought someone here might be able to shed some light on the subject.

I'm using Postgres 7.4.2 on Red Hat 9. I have a table with 763,809 rows in it, defined as follows ...

ksedb=# \d nrgfeature
              Table "public.nrgfeature"
     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 fid1           | numeric(64,0)               | not null
 fid2           | numeric(64,0)               | not null
 created        | timestamp without time zone | not null
 createdby      | character varying(30)       | not null
 modified       | timestamp without time zone |
 modifiedby     | character varying(30)       |
 geommodified   | timestamp without time zone |
 geommodifiedby | character varying(30)       |
 deleted        | timestamp without time zone |
 deletedby      | character varying(30)       |
 featuretypeid  | smallint                    | not null
 description    | text                        |
 datasourceid   | smallint                    | not null
 lowerleftx     | double precision            | not null
 lowerlefty     | double precision            | not null
 upperrightx    | double precision            | not null
 upperrighty    | double precision            | not null
 diagonalsize   | double precision            |
 login          | character varying(25)       |
Indexes:
    "nrgfeature_pkey" primary key, btree (fid1, fid2)
    "nrgfeature_ft_index" btree (featuretypeid)
    "nrgfeature_xys_index" btree (upperrightx, lowerleftx, upperrighty, lowerlefty, diagonalsize)
Inherits: commonfidattrs, commonrevisionattrs

... If I write a query as follows ...

SELECT * FROM nrgfeature f
WHERE upperRightX > 321264.23697721504
  AND lowerLeftX < 324046.79981208267
  AND upperRightY > 123286.26189863647
  AND lowerLeftY < 124985.92745047594
  AND diagonalSize > 50.000;

... (or any value for diagonalSize over 50) then my query runs in 50-100 milliseconds. However, if the diagonalSize value is changed to 49.999, or any value below 50, then the query takes over a second, a factor of 10 degradation in speed, even though the exact same number of rows is returned.

The query plan for diagonalSize > 50.000 is ...

Index Scan using nrgfeature_xys_index on nrgfeature f
  (cost=0.00..17395.79 rows=4618 width=220)
  Index Cond: ((upperrightx > 321264.236977215::double precision) AND
  (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
  123286.261898636::double precision) AND (lowerlefty <
  124985.927450476::double precision) AND (diagonalsize > 50::double precision))

... while for diagonalSize > 49.999 it is ...

Seq Scan on nrgfeature f  (cost=0.00..31954.70 rows=18732 width=220)
  Filter: ((upperrightx > 321264.236977215::double precision) AND
  (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
  123286.261898636::double precision) AND (lowerlefty <
  124985.927450476::double precision) AND (diagonalsize > 49.999::double precision))

... and yes, if I set enable_seqscan=false then the index is forced to be used. However, besides being an undesirable solution for this simple case, it doesn't solve the problem for the general case. As soon as I add in joins with a couple of tables to perform the actual query I want to perform, the enable_seqscan setting doesn't force the index to be used anymore. Instead, the primary key index is used at this same diagonalSize cutoff, the 5-part double precision clause is applied as a filter to that index scan, and the result is again a very slow query.

I can provide those queries and results, but that would only complicate this already lengthy email, and the above seems to be the crux of the problem anyway. Any help or thoughts would be greatly appreciated, of course.
Thanks,

Ken Southerland
Senior Consultant
Sam Six EDD
http://www.samsixedd.com
503-236-4288 (office)
503-358-6542 (cell)
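For reference, a minimal way to reproduce the comparison in a single session, using the query above; enable_seqscan is session-local, so forcing the index here affects only the current connection:

EXPLAIN ANALYZE
SELECT * FROM nrgfeature f
WHERE upperRightX > 321264.23697721504
  AND lowerLeftX < 324046.79981208267
  AND upperRightY > 123286.26189863647
  AND lowerLeftY < 124985.92745047594
  AND diagonalSize > 49.999;

SET enable_seqscan = off;   -- force the index, for comparison only
EXPLAIN ANALYZE
SELECT * FROM nrgfeature f
WHERE upperRightX > 321264.23697721504
  AND lowerLeftX < 324046.79981208267
  AND upperRightY > 123286.26189863647
  AND lowerLeftY < 124985.92745047594
  AND diagonalSize > 49.999;
SET enable_seqscan = on;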
It seems to believe that the number of rows returned for the >49.999 case will be 4 times the number for the >50 case. If that was true, then the sequential scan would be correct.

ALTER TABLE <table> ALTER COLUMN diagonalsize SET STATISTICS 1000;
ANALYZE <table>;

Send back EXPLAIN ANALYZE output for the >49.999 case.

> The query plan for diagonalSize > 50.000 is ...
>
> Index Scan using nrgfeature_xys_index on nrgfeature f
>   (cost=0.00..17395.79 rows=4618 width=220)
>   Index Cond: ((upperrightx > 321264.236977215::double precision) AND
>   (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
>   123286.261898636::double precision) AND (lowerlefty <
>   124985.927450476::double precision) AND (diagonalsize > 50::double precision))
>
> ... while for diagonalSize > 49.999 is ...
>
> Seq Scan on nrgfeature f  (cost=0.00..31954.70 rows=18732 width=220)
>   Filter: ((upperrightx > 321264.236977215::double precision) AND
>   (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
>   123286.261898636::double precision) AND (lowerlefty <
>   124985.927450476::double precision) AND (diagonalsize > 49.999::double precision))
Thanks Rod,

This setting has no effect, however. If I set statistics to 1000, or even 0 (and then re-analyze the table), I see no change in the behaviour of the query plans, i.e. there is still the odd transition in the plans at diagonalSize = 50.

Ken

On Wed, 2004-06-09 at 13:12, Rod Taylor wrote:
> It seems to believe that the number of rows returned for the >49.999
> case will be 4 times the number for the >50 case. If that was true, then
> the sequential scan would be correct.
>
> ALTER TABLE <table> ALTER COLUMN diagonalsize SET STATISTICS 1000;
> ANALYZE <table>;
>
> Send back EXPLAIN ANALYZE output for the >49.999 case.
>
> > The query plan for diagonalSize > 50.000 is ...
> > [...]
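As an aside, it can be worth looking at what ANALYZE actually stored for the column; in 7.4 the most-common-values list and histogram are visible in the pg_stats view, e.g.:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'nrgfeature' AND attname = 'diagonalsize';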
On Wed, 2004-06-09 at 16:50, ken wrote:
> Thanks Rod,
>
> This setting has no effect, however. If I set statistics to 1000, or

Okay.. but you never did send EXPLAIN ANALYZE output. I want to know what it is really finding.

> On Wed, 2004-06-09 at 13:12, Rod Taylor wrote:
> > [...]
On Wed, 2004-06-09 at 13:56, Rod Taylor wrote:
> On Wed, 2004-06-09 at 16:50, ken wrote:
> > Thanks Rod,
> >
> > This setting has no effect, however. If I set statistics to 1000, or
>
> Okay.. but you never did send EXPLAIN ANALYZE output. I want to know
> what it is really finding.

Ah, sorry, missed the ANALYZE portion of your request (I thought you only wanted the result of EXPLAIN if it changed due to the setting).

Here is the query plan with statistics on diagonalsize set to the default (-1) ...

Seq Scan on nrgfeature f  (cost=0.00..32176.98 rows=19134 width=218)
  (actual time=61.640..1009.414 rows=225 loops=1)
  Filter: ((upperrightx > 321264.236977215::double precision) AND
  (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
  123286.261898636::double precision) AND (lowerlefty <
  124985.927450476::double precision) AND (diagonalsize > 49.999::double precision))

... and here is the plan with statistics set to 1000 ...

Seq Scan on nrgfeature f  (cost=0.00..31675.57 rows=18608 width=218)
  (actual time=63.544..1002.701 rows=225 loops=1)
  Filter: ((upperrightx > 321264.236977215::double precision) AND
  (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
  123286.261898636::double precision) AND (lowerlefty <
  124985.927450476::double precision) AND (diagonalsize > 49.999::double precision))

... so yeah, it's obviously finding way, way fewer rows than it thinks it will.

thanks,

ken
> ... and here is the plan with statistics set to 1000 ...
>
> Seq Scan on nrgfeature f  (cost=0.00..31675.57 rows=18608 width=218)
>   (actual time=63.544..1002.701 rows=225 loops=1)
>   Filter: ((upperrightx > 321264.236977215::double precision) AND
>   (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
>   123286.261898636::double precision) AND (lowerlefty <
>   124985.927450476::double precision) AND (diagonalsize > 49.999::double precision))

It's better like this, but still way off the mark. Even your good query which uses the index was out by more than an order of magnitude.

Try raising the statistics levels for upperrightx, lowerleftx, upperrighty and lowerlefty.

Failing that, you might be able to push it back down again by giving diagonalsize an upper limit. Perhaps 500 is a value that would never occur.

AND (diagonalsize BETWEEN 49.999::double precision AND 500)
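Spelled out for the table in question, that would be roughly the following, followed by a fresh ANALYZE (1000 being the largest statistics target 7.4 accepts):

ALTER TABLE nrgfeature ALTER COLUMN upperrightx SET STATISTICS 1000;
ALTER TABLE nrgfeature ALTER COLUMN lowerleftx  SET STATISTICS 1000;
ALTER TABLE nrgfeature ALTER COLUMN upperrighty SET STATISTICS 1000;
ALTER TABLE nrgfeature ALTER COLUMN lowerlefty  SET STATISTICS 1000;
ANALYZE nrgfeature;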
I had already tried setting the statistics to 1000 for all five of these double precision fields, with effectively no improvement. Should have mentioned that.

Also, the BETWEEN makes all values for diagonalSize bad, since it is effectively doing diagonalSize > X and diagonalSize < Y. If I do a query with the same values for the four x,y values and diagonalSize < Y, then for Y=49.999 the query is fast, but for anything 50.000 and greater the query is slow. The exact opposite of the greater-than queries, not surprisingly.

I also think I originally reported that the two queries gave the same number of rows. That is not true. It was true when I had other joins in, but when I stripped the query down to this core problem I should have noticed that the number of results now differs between the two, which I didn't at first.

If I take away the diagonalSize condition in my query I find that there are 225 rows that satisfy the other conditions. 155 of these have a diagonalSize value of exactly 50.000, while the remaining 70 rows all have values larger than 50. Thus there is a big discrete jump in the number of rows at a diagonalSize of 50. However, the statistics are off by two orders of magnitude in guessing how many rows there are going to be in this case, and thus it is not using my index. How can I fix that?

Ken

On Wed, 2004-06-09 at 14:29, Rod Taylor wrote:
> It's better like this, but still way off the mark. Even your good query
> which uses the index was out by more than an order of magnitude.
>
> Try raising the statistics levels for upperrightx, lowerleftx,
> upperrighty and lowerlefty.
>
> Failing that, you might be able to push it back down again by giving
> diagonalsize an upper limit. Perhaps 500 is a value that would never
> occur.
>
> AND (diagonalsize BETWEEN 49.999::double precision AND 500)
> If I take away the diagonalSize condition in my query I find that there
> are 225 rows that satisfy the other conditions. 155 of these have a
> diagonalSize value of exactly 50.000, while the remaining 70 rows all
> have values larger than 50. Thus there is a big discrete jump in the
> number of rows at a diagonalSize of 50. However, the statistics are off
> by two orders of magnitude in guessing how many rows there are going to
> be in this case, and thus it is not using my index. How can I fix that?

Maybe you should drop your random_page_cost to something less than 4, e.g. 3 or even 2...

Chris
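That can be tried per session, without editing postgresql.conf, before re-running the EXPLAIN ANALYZE; a quick experiment rather than a recommended permanent value:

SET random_page_cost = 2;
EXPLAIN ANALYZE
SELECT * FROM nrgfeature f
WHERE upperRightX > 321264.23697721504
  AND lowerLeftX < 324046.79981208267
  AND upperRightY > 123286.26189863647
  AND lowerLeftY < 124985.92745047594
  AND diagonalSize > 49.999;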
On Wed, 2004-06-09 at 21:45, Christopher Kings-Lynne wrote:
> > If I take away the diagonalSize condition in my query I find that there
> > are 225 rows that satisfy the other conditions. 155 of these have a
>
> Maybe you should drop your random_page_cost to something less than 4,
> e.g. 3 or even 2...

The big problem is a very poor estimate (off by a couple orders of magnitude). I was hoping someone with more knowledge in fixing those would jump in.
Rod Taylor wrote:
> The big problem is a very poor estimate (off by a couple orders of
> magnitude). I was hoping someone with more knowledge in fixing those
> would jump in.

ANALYZE might be producing poor stats due to:

i) many dead tuples, or
ii) a high proportion of dead tuples in the first few pages of the table

Does a VACUUM FULL followed by ANALYZE change the estimates (or have you tried this already)?

(p.s. - I probably don't qualify for the 'more knowledge' bit either...)

regards

Mark
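A sketch of that check; VACUUM VERBOSE reports how many removable (dead) tuples it finds, and note that VACUUM FULL takes an exclusive lock on the table while it runs:

VACUUM VERBOSE nrgfeature;        -- reports removable/nonremovable tuple counts
VACUUM FULL ANALYZE nrgfeature;   -- compacts the table, then refreshes statistics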
> ANALYZE might be producing poor stats due to:
>
> i) many dead tuples, or
> ii) a high proportion of dead tuples in the first few pages of the table
>
> Does a VACUUM FULL followed by ANALYZE change the estimates (or have
> you tried this already)?
>
> (p.s. - I probably don't qualify for the 'more knowledge' bit either...)

You can also increase your statistics_target, which will make ANALYZE take longer but can help a great deal with larger data sets.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
PostgreSQL support, programming, shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
ken <southerland@samsixedd.com> writes:
> ... and here is the plan with statistics set to 1000 ...

> Seq Scan on nrgfeature f  (cost=0.00..31675.57 rows=18608 width=218)
>   (actual time=63.544..1002.701 rows=225 loops=1)
>   Filter: ((upperrightx > 321264.236977215::double precision) AND
>   (lowerleftx < 324046.799812083::double precision) AND (upperrighty >
>   123286.261898636::double precision) AND (lowerlefty <
>   124985.927450476::double precision) AND (diagonalsize > 49.999::double precision))

> ... so yeah, it's obviously finding way, way fewer rows than it thinks it
> will.

Yup. I think your problem here is that the conditions on the different columns are highly correlated, but the planner doesn't know anything about that, because it has no cross-column statistics.

You could check that the individual conditions are accurately estimated by looking at the estimated and actual row counts in

explain analyze SELECT * FROM nrgfeature f WHERE upperRightX > 321264.23697721504;
explain analyze SELECT * FROM nrgfeature f WHERE lowerLeftX < 324046.79981208267;
etc

--- but I'll bet lunch that they are right on the money with the higher statistics targets, and probably not too far off even at the default. The trouble is that the planner is simply multiplying these probabilities together to get its estimate for the combined query, and when the columns are not independent that leads to a very bad estimate.

In particular it sounds like you have a *whole lot* of rows that have diagonalSize just under 50, and so changing the diagonalSize condition to include those makes for a big change in the predicted number of rows, even though for the specific values of upperRightX and friends in your test query there isn't any change in the actual number of matching rows.

I don't have any advice to magically solve this problem. I would suggest experimenting with alternative data representations -- for example, maybe it would help to store "leftX" and "width" in place of "leftX" and "rightX", etc. What you want to do is try to decorrelate the column values. leftX and rightX are likely to have a strong correlation, but maybe leftX and width don't.

			regards, tom lane
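One way to prototype that suggestion; the bbwidth/bbheight column names are made up here purely for illustration, and existing queries and indexes would need to be reworked around them:

ALTER TABLE nrgfeature ADD COLUMN bbwidth  double precision;
ALTER TABLE nrgfeature ADD COLUMN bbheight double precision;
UPDATE nrgfeature SET bbwidth  = upperrightx - lowerleftx,
                      bbheight = upperrighty - lowerlefty;
ANALYZE nrgfeature;

-- the range conditions then become, e.g.:
SELECT * FROM nrgfeature f
WHERE lowerLeftX < 324046.79981208267
  AND lowerLeftX + bbwidth  > 321264.23697721504
  AND lowerLeftY < 124985.92745047594
  AND lowerLeftY + bbheight > 123286.26189863647
  AND diagonalSize > 49.999;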
Apologies in advance for the length of this post, but I want to be as thorough as possible in describing my problem, to avoid too much banter back and forth.

First off, thanks to all for your help with my index problem on my multi-column index made up of 5 double precision columns. Unfortunately, I was unable to make it work with any of the suggestions provided. However, Tom's suggestion of experimenting with alternative data representations led me to explore using the built-in geometric data type box to store this information, since that is exactly what I am storing. By adding an rtree index on this box column I was able to make this new method work beautifully for most cases! The query that was taking over a second went down to under 20 milliseconds! Wow! However, for *some* cases the query now behaves very badly.

My new table is now essentially the following (with the unnecessary bits removed for your ease of viewing) ...

     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 fid1           | numeric(64,0)               | not null
 fid2           | numeric(64,0)               | not null
 boundingbox    | box                         |
Indexes:
    "nrgfeature_pkey" primary key, btree (fid2, fid1)
    "nrgfeature_bb_idx" rtree (boundingbox)
    "nrgfeature_bblength_idx" btree (length(lseg(boundingbox)))

... with 763,809 rows.

The query I run is of the form ...

SELECT * FROM nrgFeature
WHERE boundingbox && '((213854.57920364887, 91147.4541420119), (212687.30997287965, 90434.4541420119))'
  AND length(lseg(boundingbox)) > 12.916666666666668;

... If the bounding box is relatively small (like the one defined above) then the query is very fast, as relatively few rows are investigated by the index. The explain analyze for the above is ...

Index Scan using nrgfeature_bb_idx on nrgfeature
  (cost=0.00..14987.30 rows=1274 width=256)
  (actual time=0.046..0.730 rows=89 loops=1)
  Index Cond: (boundingbox && '(213854.579203649,91147.4541420119),(212687.30997288,90434.4541420119)'::box)
  Filter: (length(lseg(boundingbox)) > 12.9166666666667::double precision)
Total runtime: 0.830 ms

... Notice the statistics aren't great at guessing the number of rows; however, since the number is sufficient to tell the optimizer to use the index, it does, and the query is blindingly fast.

However, now let's try to retrieve all the rows that overlap a much, much bigger bounding box, but limit the results to rows with very large bounding boxes (we are displaying these things on the screen, and if they are too small to see at this scale there is no reason to return them in the query) ...

SELECT * FROM nrgFeature
WHERE boundingbox && '((793846.1538461539, 423000.0), (-109846.15384615387, -129000.0))'
  AND length(lseg(boundingbox)) > 10000.0;

... and its explain analyze is ...

Index Scan using nrgfeature_bb_idx on nrgfeature
  (cost=0.00..14987.30 rows=1274 width=256)
  (actual time=1.861..6427.876 rows=686 loops=1)
  Index Cond: (boundingbox && '(793846.153846154,423000),(-109846.153846154,-129000)'::box)
  Filter: (length(lseg(boundingbox)) > 10000::double precision)
Total runtime: 6428.838 ms

... notice that the query now takes 6.4 seconds, even though the statistics look to be pretty close and only 686 rows are returned. The reason is the condition on the length(lseg()) function. Without this condition the explain analyze is the following ...
Index Scan using nrgfeature_bb_idx on nrgfeature
  (cost=0.00..14958.66 rows=3820 width=256)
  (actual time=21.356..7750.360 rows=763768 loops=1)
  Index Cond: (boundingbox && '(793846.153846154,423000),(-109846.153846154,-129000)'::box)
Total runtime: 8244.213 ms

... in which it can be seen that the statistics are way, way off. It thinks it's only going to get back 3820 rows, but instead almost every row in the table is returned! It should *not* be using this index in this case. Is there something wrong with rtree indexes on box data types?

So that is problem number one. Problem number two, and this is possibly a bug, is that Postgres doesn't seem to use functional indexes on geometric data. In the case immediately above, the optimizer should choose nrgfeature_bblength_idx instead, as it would immediately give the 686 rows that satisfy the length(lseg()) condition and voila, it would be done. However, even if I *drop* the rtree index on the boundingbox column, so that it can't use that index, the optimizer does not choose the other index. Instead it reverts to doing a sequential scan of the entire table and it's really slow.

Again, sorry for the long post. Hope someone has experience with either of these problems.

Ken

> I don't have any advice to magically solve this problem. I would
> suggest experimenting with alternative data representations -- for
> example, maybe it would help to store "leftX" and "width" in place
> of "leftX" and "rightX", etc. What you want to do is try to decorrelate
> the column values. leftX and rightX are likely to have a strong
> correlation, but maybe leftX and width don't.
>
> 			regards, tom lane
ken <southerland@samsixedd.com> writes:
> Is there something wrong with rtree indexes on box data types?

Not per se, but the selectivity estimator for && is just a stub :-(. Feel free to take a stab at improving this, or take a look at PostGIS --- I think those guys are working the same problem even now.

> However, even if I *drop* the rtree index on the boundingbox
> column, so that it can't use that index, the optimizer does not choose
> the other index. Instead it reverts to doing a sequential scan of the
> entire table and it's really slow.

This is also a lack-of-statistical-knowledge problem. The optimizer has no stats about the distribution of length(lseg(boundingbox)), and so it does not realize that it'd be reasonable to use that index for a query like "length(lseg(boundingbox)) > largevalue". As of 7.5 this issue will be addressed, but in existing releases I think you could only get that index to be used by forcing it with enable_seqscan = off :-(

As a short-term workaround it might be reasonable to store that length as an independent table column, which you could put an index on. You could use triggers to ensure that it always matches the boundingbox.

			regards, tom lane
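A sketch of that workaround, with made-up names for the column, index, function, and trigger; this assumes plpgsql is installed in the database and uses 7.4-style single-quoted function bodies:

ALTER TABLE nrgfeature ADD COLUMN bblength double precision;
UPDATE nrgfeature SET bblength = length(lseg(boundingbox));
CREATE INDEX nrgfeature_bblen_idx ON nrgfeature (bblength);
ANALYZE nrgfeature;

CREATE FUNCTION nrgfeature_set_bblength() RETURNS trigger AS '
BEGIN
    -- keep the stored length in sync with the bounding box
    NEW.bblength := length(lseg(NEW.boundingbox));
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER nrgfeature_bblength_trg
    BEFORE INSERT OR UPDATE ON nrgfeature
    FOR EACH ROW EXECUTE PROCEDURE nrgfeature_set_bblength();

Queries would then filter on bblength > 10000.0 rather than on the expression, and a routine ANALYZE gives the planner ordinary per-column statistics for that column.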