Re: Index oddity (still) - Mailing list pgsql-performance
From | ken |
---|---|
Subject | Re: Index oddity (still) |
Date | |
Msg-id | 1087256828.28557.115.camel@pesky Whole thread Raw |
In response to | Re: Index oddity (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Index oddity (still)
|
List | pgsql-performance |
Apologies in advance for the length of this post but I want to be as thorough as possible in describing my problem to avoid too much banter back and forth. First off, thanks to all for your help with my index problem on my multi-column index made up of 5 double precision columns. Unfortunately, I was unable to make it work with any of the suggestions provided. However, Tom's suggestion of experimenting with alternative data representations led me to explore using the built-in geometric data object box to store this information since that is exactly what I am storing. By adding an rtree index on this box column I was able to make this new method work beautifully for most cases! The query that was taking over a second went down under 20 milliseconds! Wow! However, for *some* cases the query now behaves very badly. My new table is now essentially the following (with the unnecessary bits removed for your ease of viewing) ... Column | Type | Modifiers ----------------+-----------------------------+----------- fid1 | numeric(64,0) | not null fid2 | numeric(64,0) | not null boundingbox | box | Indexes: "nrgfeature_pkey" primary key, btree (fid2, fid1) "nrgfeature_bb_idx" rtree (boundingbox) "nrgfeature_bblength_idx" btree (length(lseg(boundingbox))) ... with 763,809 rows. The query I run is of the form ... SELECT * FROM nrgFeature WHERE boundingbox && '((213854.57920364887, 91147.4541420119), (212687.30997287965, 90434.4541420119))' AND length(lseg(boundingbox)) > 12.916666666666668; ... If the bounding box is relatively small (like the one defined above) then the query is very fast as relatively few rows are investigated by the index. The explain analyze for the above is ... Index Scan using nrgfeature_bb_idx on nrgfeature (cost=0.00..14987.30 rows=1274 width=256) (actual time=0.046..0.730 rows=89 loops=1) Index Cond: (boundingbox && '(213854.579203649,91147.4541420119),(212687.30997288,90434.4541420119)'::box) Filter: (length(lseg(boundingbox)) > 12.9166666666667::double precision) Total runtime: 0.830 ms ... Notice the statistics aren't great at guessing the number of rows, however, since the number is sufficient to tell the optimizer to use the index, it does and the query is blindingly fast. However, now let's try and retrieve all the rows that overlap a much, much bigger bounding box but limit the results to rows with very large bounding boxes (we are displaying these things on the screen and if they are too small to see at this scale there is no reason to return them in the query)... SELECT * FROM nrgFeature WHERE boundingbox && '((793846.1538461539, 423000.0), (-109846.15384615387, -129000.0))' AND length(lseg(boundingbox)) > 10000.0; ... and its explain analyze is ... Index Scan using nrgfeature_bb_idx on nrgfeature (cost=0.00..14987.30 rows=1274 width=256) (actual time=1.861..6427.876 rows=686 loops=1) Index Cond: (boundingbox && '(793846.153846154,423000),(-109846.153846154,-129000)'::box) Filter: (length(lseg(boundingbox)) > 10000::double precision) Total runtime: 6428.838 ms ... notice that the query now takes 6.4 seconds even though the statistics look to be pretty close and only 686 rows are returned. The reason is due to the condition on the length(lseg()) functions. Without this condition the explain analyze is the following ... Index Scan using nrgfeature_bb_idx on nrgfeature (cost=0.00..14958.66 rows=3820 width=256) (actual time=21.356..7750.360 rows=763768 loops=1) Index Cond: (boundingbox && '(793846.153846154,423000),(-109846.153846154,-129000)'::box) Total runtime: 8244.213 ms ... in which it can be seen that the statistics are way, way off. It thinks its only going to get back 3820 rows but instead almost every row in the table is returned! It should *not* be using this index in this case. Is there something wrong with rtree indexes on box data types? So that is problem number one. Problem number two, and this is possibly a bug, is that postgres doesn't seem to use functional indexes on geometric data. In the case immediately above, the optimizer should choose the nrgfeature_bblength_idx instead as it would immediately give the 686 rows that satisfy the length(lseg()) condition and voila it would be done. However, even if I *drop* the rtree index on the boundingbox column, so that it can't use that index, the optimizer does not choose the other index. Instead it reverts to doing a sequential scan of the entire table and its really slow. Again, sorry for the long post. Hope someone has experience with either of these problems. Ken > I don't have any advice to magically solve this problem. I would > suggest experimenting with alternative data representations -- for > example, maybe it would help to store "leftX" and "width" in place > of "leftX" and "rightX", etc. What you want to do is try to decorrelate > the column values. leftX and rightX are likely to have a strong > correlation, but maybe leftX and width don't. > > regards, tom lane >
pgsql-performance by date: