Statistical Analysis, Vacuum, and Selectivity Restriction (PostGIS) (fwd) - Mailing list pgsql-hackers
From:    Bruce Momjian
Subject: Statistical Analysis, Vacuum, and Selectivity Restriction (PostGIS) (fwd)
Date:
Msg-id:  200210072022.g97KMFl05757@candle.pha.pa.us
List:    pgsql-hackers
Request from sender:

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073

Bruce,

I've been trying to send this message to the hackers mail list, but so far
every attempt has failed. Perhaps you could forward this message there for
me?

Thanks a lot,
dave

----

Hackers,

I'm trying to improve the performance of PostGIS
(http://postgis.refractions.net) GEOMETRY indexing and have a few questions
for everyone (or just tell me I'm doing it all wrong).

Basically, the problem is that I'm currently using a constant number for the
restriction selectivity of the overlap (&&) operator (CREATE OPERATOR ...
RESTRICT=). This has to be quite low to force the index to be used (with the
default value the index was rarely chosen), but now the index is used too
often. I'm currently implementing a much smarter method (see the proposal
below) - basically building a 2D histogram of the data and using it to make
a better guess at how many rows a query will return.

So far I've looked at how the RESTRICT function is called (in 7.2), but I've
come up against a few problems that I'm not sure how to handle properly:

1. Getting the histogram generated, and deciding where to store it. Ideally
   I'd store the statistics in pg_statistic, and 'VACUUM ANALYSE' would have
   hooks into my datatype's code. Is this even possible? Can I put a custom
   histogram in pg_statistic? Can I have VACUUM call my statistical analysis
   functions? Alternatively, would I have to store the 2D histogram in a
   separate [user] table and have the user explicitly call a function to
   update it?

2. How do I access the histogram from my RESTRICT function? I guess this is
   really asking how to run a query efficiently from within a function. The
   built-in RESTRICT functions seem to be able to access the system catalogs
   directly ("SearchSysCache"), but how does one access ordinary tables
   (SPI?)? Is this going to slow things down significantly?

Am I doing the right thing? Here's a summary of the problem and the proposed
solution,

dave

-------------

Several people have reported index selection problems with PostGIS:
sometimes it opts to use the spatial (GiST) index when it's inappropriate.
I suggest you also read
http://www.postgresql.org/idocs/index.php?planner-stats.html since it gives
a good introduction to the statistics collected and used by the query
planner.

PostgreSQL query planning
-------------------------

I'll start with a quick example, then move on to spatial indexing. Let's
start with a simple table:

  name    location       age
  -------------------------------
  dave    james bay      31
  paul    james bay      30
  oscar   central park   23
  chris   downtown       22

With some indexes:

  CREATE index people_name_idx on people (name);
  CREATE index people_age_idx  on people (age);

We then start to execute some queries:

#1) SELECT * FROM people WHERE location = 'james bay';

PostgreSQL's only possible query plan is to do a sequential scan of the
table and check each row to see if location = 'james bay' - there isn't an
index to consult. The sequential scan is simple - it loads each row from the
table and checks whether location = 'james bay'. Here's what PostgreSQL says
about this query:

dblasby=# explain analyse SELECT * FROM people WHERE location = 'james bay';
NOTICE:  QUERY PLAN:
Seq Scan on people  (cost=0.00..1.05 rows=2 width=25) (actual time=0.02..0.03 rows=2 loops=1)
Total runtime: 0.07 msec

Note that PostgreSQL is using a "Seq Scan" and predicts 2 result rows. The
planner is very accurate in this estimate because the statistics collected
explicitly say that 'james bay' is the most common value in this column
(cf. the pg_stats view and the pg_statistic catalog).
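If you want to see what the planner actually knows about the table, the
pg_stats view shows the collected statistics directly. For example (output
omitted here; it depends on when the table was last ANALYSEd):

  SELECT attname, most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE tablename = 'people';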
#2) SELECT * FROM people WHERE name = 'dave';

Here the query planner has two options - it can do a sequential scan or it
can use people_name_idx. Here's what the query planner says (I explicitly
tell it to use the index in the 2nd run):

dblasby=# explain analyse SELECT * FROM people WHERE name ='dave';
NOTICE:  QUERY PLAN:
Seq Scan on people  (cost=0.00..1.05 rows=1 width=25) (actual time=0.02..0.03 rows=1 loops=1)
Total runtime: 0.07 msec

dblasby=# set enable_seqscan=off;
dblasby=# explain analyse SELECT * FROM people WHERE name ='dave';
NOTICE:  QUERY PLAN:
Index Scan using people_name_idx on people  (cost=0.00..4.41 rows=1 width=25) (actual time=33.72..33.73 rows=1 loops=1)
Total runtime: 33.82 msec

In this case the sequential scan is faster because it only has to load one
page from disk and check 4 rows. The index scan has to load in the index,
perform the scan, then load in the page with the data in it. On a larger
table, the index scan would probably be significantly faster.

#3) SELECT * FROM people WHERE age < 30;

Again, we have a choice of the index or a sequential scan. The estimate of
the number of rows is calculated from the "<" operator for integers and the
table statistics. We'll talk about row-count estimates later.

dblasby=# explain analyse SELECT * FROM people WHERE age < 30;
NOTICE:  QUERY PLAN:
Seq Scan on people  (cost=0.00..1.05 rows=3 width=25) (actual time=0.03..0.03 rows=2 loops=1)
Total runtime: 0.10 msec
EXPLAIN
dblasby=# set enable_seqscan=off;
SET VARIABLE
dblasby=# explain analyse SELECT * FROM people WHERE age < 30;
NOTICE:  QUERY PLAN:
Index Scan using people_age_idx on people  (cost=0.00..3.04 rows=3 width=25) (actual time=0.06..0.07 rows=2 loops=1)
Total runtime: 0.15 msec

#4) SELECT * FROM people WHERE age < 30 AND name = 'dave';

Here we have 3 plans - a sequential scan, an index scan (age), and an index
scan (name). The actual plan chosen is determined by looking at all the
plans and estimating which is fastest. If the 'people' table had a lot of
rows, the "name='dave'" clause would probably be much more selective (return
fewer results) than the "age<30" clause, so the planner would probably use
the name index.

Spatial Indexing
----------------

For spatial indexing, queries look like:

  SELECT * FROM mygeomtable WHERE the_geom && 'BOX3D(...)'::box3d;

The planner has two options - do a sequential scan of the table (and check
each geometry against the BOX3D) or do an index scan using the GiST index.
The choice is further complicated because TOASTed (large) geometries are
expensive to test in the sequential scan, since the entire geometry must be
read.

The planner makes its choice by estimating how many rows the
"the_geom && 'BOX3D(...)'::box3d" clause will return - if it's just a few
rows, the index will be extremely efficient. If it returns a large number of
rows, the cost of consulting the index AND then actually reading the data
from the table will be high.

When PostGIS was first created, it was very difficult to get PostgreSQL to
actually use the index because the estimate of the number of rows returned
was always high, so the planner thought the sequential scan was faster. A
later release of PostGIS tried to fix this by lowering the estimate, which
made PostgreSQL use the spatial index almost always. Since the GiST index
has low overhead, this worked well even when the index was used where it
wasn't the best plan. The estimate was very simple - a fixed % of the total
data.
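For reference, that fixed estimate is wired in through the operator
definition - something along these lines (a sketch only; 'geometry_overlap'
and 'postgis_box_sel' are stand-in names here, not necessarily the real
PostGIS function names):

  CREATE OPERATOR && (
      LEFTARG    = geometry,
      RIGHTARG   = geometry,
      PROCEDURE  = geometry_overlap,  -- the C function implementing &&
      COMMUTATOR = &&,
      RESTRICT   = postgis_box_sel    -- currently just returns a constant fraction
  );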
Everything was great until people started doing queries like:

  SELECT * FROM mygeomtable
  WHERE the_geom && 'BOX3D(...)'::box3d AND road_type = 27;

PostgreSQL now has to choose between the spatial index and the attribute
index (on 'road_type'). Since we were using a very simple estimate for the
selectivity of the && operator, the spatial index is often used when another
index (i.e. road_type) would be more appropriate. When the BOX3D is "large"
the problem is amplified because the estimate is very wrong. A second
problem shows up with joins and other queries: because the estimated number
of rows returned was much lower than the reality, PostgreSQL chose poor
table-joining strategies.

New Proposal
------------

The current selectivity of the "&&" operator is fixed - it always reports
the same fraction of rows (I think 5%) no matter where the query BOX3D is or
how big it is. This is as simple as it gets.

The next simplest method is to look at the extents of the data and the size
of the query window. If the query window is 1/4 the size of the data's
extent, we estimate that 1/4 of the rows will be returned. This is quite
workable, but has problems because spatial data is not uniformly
distributed.

The proposed method extends the simple method - we use another index to
estimate the number of results for the "&&" operator. WHEW - indexing our
indexes!

In a nutshell, the plan is to use a modified quad tree (i.e. one with fixed
cell sizes and locations) to store a 2D histogram of the spatial data. This
is best explained with the two attached diagrams. In diagram one we see
typical spatial data - roads in the province of BC. In places like
Vancouver, Victoria, Kelowna, and Prince George there are LOTS of roads;
elsewhere there are hardly any. There are a total of 200,000 rows in the
roads table.

The proposed method does a 3-stage pre-process (see figure 2):

1. Compute the extents of the data. Basically:
     SELECT extent(the_geom) FROM mygeometrytable;
2. Make a fixed-size grid from the extents - something like a 15*15 matrix.
3. Scan through each row of the geometry table and determine which of the
   grid's cells overlap the bounding box of the geometry. Increment those
   grid cells by 1.

The result will be a 2D histogram of the spatial data. Cells with lots of
geometries in them will have a high value, and cells with few will have a
low value. This is stored somewhere (i.e. the geometry_columns metatable,
though it would be nice to have it sitting in shared memory) - the histogram
will be small enough to sit on one disk page.

An incoming query's (i.e. the_geom && 'BOX3D(..)') selectivity is calculated
this way (a rough SQL sketch of the calculation is given below):

1. Load the appropriate 2D histogram (one disk page).
2. Find which cells the BOX3D overlaps, and the % of each cell's area that
   the BOX3D overlaps.
3. Sum ( % overlap * <number of geometries in that cell> ).

This should give a 'reasonable' estimate of the particular "&&" selectivity.
The estimate will be poor if the data is extremely skewed.

An extension of this would be to make a proper quad tree (instead of fixed
cells), then simplify it. Unfortunately this could be very difficult and
isn't proposed (yet).
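To make step 3 concrete, here is a rough sketch of the calculation in SQL,
assuming the histogram were kept in a hypothetical table
spatial_histogram(cell_xmin, cell_ymin, cell_xmax, cell_ymax, cnt) with
float8 cell bounds, and the query box covered x 1000-2000 and y 5000-6000.
(In the real implementation this arithmetic would happen in C inside the
RESTRICT function, not in SQL; least()/greatest() are used here only for
brevity.)

  SELECT sum( cnt
              * (least(cell_xmax, 2000) - greatest(cell_xmin, 1000))
              * (least(cell_ymax, 6000) - greatest(cell_ymin, 5000))
              / ((cell_xmax - cell_xmin) * (cell_ymax - cell_ymin)) )
  FROM spatial_histogram
  WHERE cell_xmin < 2000 AND cell_xmax > 1000   -- only cells the box overlaps
    AND cell_ymin < 6000 AND cell_ymax > 5000;

Dividing the result by the total number of rows in the geometry table gives
the selectivity fraction the planner actually wants.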
Thoughts?

dave