Request from sender:
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce,
I've been trying to send this message to the hackers mailing list, but so
far every attempt has failed. Perhaps you could forward this message
there for me?
Thanks a lot,
dave
----
Hackers,
I'm trying to improve the performance of the PostGIS
(http://postgis.refractions.net) GEOMETRY indexing and have a few
questions for everyone (or just tell me I'm doing it all wrong).
Basically, the problem is that I'm currently using a constant number for
the restriction selectivity of the overlap (&&) operator (CREATE
OPERATOR ... RESTRICT=). This has to be quite low to force the index to
be used (with the default value the planner rarely chose the index), but
now it uses the index too often.
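For reference, the current approach boils down to a RESTRICT function
that ignores its arguments and returns a constant - roughly like this in
C (the function name and the constant here are made up for illustration,
not the real PostGIS values):

    #include "postgres.h"
    #include "fmgr.h"

    /* A constant-selectivity restriction function: the planner arguments
     * (Query *, operator Oid, argument list, varRelid) are all ignored. */
    PG_FUNCTION_INFO_V1(geom_overlap_sel_fixed);

    Datum
    geom_overlap_sel_fixed(PG_FUNCTION_ARGS)
    {
        PG_RETURN_FLOAT8(0.05);   /* always claim 5% of the rows match */
    }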
I'm currently implementing a much smarter method (see the proposal
below) - basically creating a 2D histogram and using that to make a
better guess at how many rows will be returned.
So far, I've looked at how the RESTRICT function is called (in 7.2), but
I've come up against a few problems that I'm not sure how to handle
properly:
1. Getting the histogram generated, and deciding where to store it.
   Ideally, I'd be able to store the statistics in pg_statistic and
   'VACUUM ANALYSE' would have hooks into my datatype's code.
   Is this even possible? Can I put a custom histogram in pg_statistic?
   Can I have VACUUM call my statistical analysis functions?
   Alternatively, would I have to store the 2D histogram in a separate
   [user] table and have the user explicitly call a function to update
   it?
2. How do I access the histogram from my RESTRICT function?
   I guess this is asking how to efficiently run a query from within a
   function. The built-in RESTRICT functions seem to be able to access
   the system catalogs directly ("SearchSysCache"), but how does one
   access ordinary tables - SPI? Is this going to slow things down
   significantly?
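To make the question concrete, something like this is what I have in
mind (the geometry_stats table and its columns are invented purely for
illustration):

    #include "postgres.h"
    #include "executor/spi.h"

    /* Fetch a pre-computed histogram row for a table via SPI. */
    static bool
    fetch_histogram(const char *tablename)
    {
        char  query[256];
        int   ret;
        bool  found = false;

        if (SPI_connect() != SPI_OK_CONNECT)
            return false;

        snprintf(query, sizeof(query),
                 "SELECT histogram FROM geometry_stats WHERE tablename = '%s'",
                 tablename);

        ret = SPI_exec(query, 1);
        if (ret == SPI_OK_SELECT && SPI_processed > 0)
        {
            /* SPI_tuptable->vals[0] would hold the histogram datum here */
            found = true;
        }

        SPI_finish();
        return found;
    }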
Am I doing the right thing?
Here's a summary of the problem and the proposed solution,
dave
-------------
Several people have reported index selection problems with PostGIS:
sometimes it opts to use the spatial (GiST) index when it's
inappropriate. I suggest you also read
http://www.postgresql.org/idocs/index.php?planner-stats.html
since it gives a good introduction to the statistics collected and used
by the query planner.
Postgresql query planning
-------------------------
I'll start with a quick example, then move on to spatial indexing.
Let's start with a simple table:
   name    location       age
   ------------------------------
   dave    james bay      31
   paul    james bay      30
   oscar   central park   23
   chris   downtown       22
With some indexes:
CREATE index people_name_idx on people (name);
CREATE index people_age_idx on people (age);
We then start to execute some queries:
#1) SELECT * FROM people WHERE location = 'james bay';
Postgresql's only possible query plan is to do a sequential scan of the
table and check to see if location = 'james bay' - there isn't an index
to consult. The sequential scan is simple - it loads each row from the
table and checks whether location = 'james bay'.
Here's what postgresql says about this query:
dblasby=# explain analyse SELECT * FROM people WHERE location = 'james bay';
NOTICE:  QUERY PLAN:
Seq Scan on people  (cost=0.00..1.05 rows=2 width=25) (actual time=0.02..0.03 rows=2 loops=1)
Total runtime: 0.07 msec
Note that postgresql is using a "Seq Scan" and predicts 2 result rows.
The planner is very accurate in this estimate because the statistics
collected explicitly say that 'james bay' is the most common value in
this column (cf. the pg_stats view and the pg_statistic table).
#2) SELECT * FROM people WHERE name ='dave';
Here the query planner has two options - it can do a sequential scan or
it can use people_name_idx. Here's what the query planner says (I
explicitly tell it to use the index in the 2nd run):
dblasby=# explain analyse SELECT * FROM people WHERE name ='dave';
NOTICE:  QUERY PLAN:
Seq Scan on people  (cost=0.00..1.05 rows=1 width=25) (actual time=0.02..0.03 rows=1 loops=1)
Total runtime: 0.07 msec

dblasby=# set enable_seqscan=off;
dblasby=# explain analyse SELECT * FROM people WHERE name ='dave';
NOTICE:  QUERY PLAN:
Index Scan using people_name_idx on people  (cost=0.00..4.41 rows=1 width=25) (actual time=33.72..33.73 rows=1 loops=1)
Total runtime: 33.82 msec
In this case the sequential scan is faster because it only has to load
one page from the disk and check 4 rows. The index scan will have to
load in the index, perform the scan, then load in the page with the data
in it. On a larger table, the index scan would probably be significantly
faster.
#3) SELECT * FROM people WHERE age < 30;
Again, we have a choice of the index or sequential scan. The estimate of
the number of rows is calculated from the "<" operator's selectivity
function for integers and the table statistics. We'll talk about row
count estimates later.
dblasby=# explain analyse SELECT * FROM people WHERE age < 30;
NOTICE:  QUERY PLAN:
Seq Scan on people  (cost=0.00..1.05 rows=3 width=25) (actual time=0.03..0.03 rows=2 loops=1)
Total runtime: 0.10 msec
EXPLAIN

dblasby=# set enable_seqscan=off;
SET VARIABLE
dblasby=# explain analyse SELECT * FROM people WHERE age < 30;
NOTICE:  QUERY PLAN:
Index Scan using people_age_idx on people  (cost=0.00..3.04 rows=3 width=25) (actual time=0.06..0.07 rows=2 loops=1)
Total runtime: 0.15 msec
#4) SELECT * FROM people WHERE age < 30 AND name='dave';
Here we have three plans - a sequential scan, an index scan on age, and
an index scan on name. The actual plan chosen will be determined by
looking at all the plans and estimating which is fastest.
If the 'people' table had a lot of rows, the "name='dave'" clause would
probably be much more selective (return fewer rows) than the "age<30"
clause. The planner would probably use the name index.
Spatial Indexing
----------------
For spatial indexing, queries look like:
SELECT * FROM mygeomtable WHERE the_geom && 'BOX3D(...)'::box3d;
The planner has two options - do a sequential scan of the table (and
check each geometry against the BOX3D) or do an index scan using the
GiST index. The choice is further complicated because TOASTed (large)
geometries are expensive to test in the sequential scan, since the
entire geometry must be read.
The planner makes its choice by estimating how many rows the "the_geom
&& 'BOX3D(...)'::box3d" clause will return - if it's just a few rows,
the index will be extremely efficient. If it returns a large number of
rows, the cost involved in consulting the index AND then actually
reading the data from the table will be high.
When PostGIS was first created, it was very difficult to get postgresql
to actually use the index because the estimate of the number of rows
returned was always high. The planner thought the sequential scan was
faster.
A later release of PostGIS tried to fix this by lowering the estimate.
This made postgresql use the spatial index almost always. Since the GiST
index has low overhead, this worked well even when the index was used
where it wasn't the best plan. The estimate was very simple - it was a
fixed % of the total data.
Everything was great until people started doing queries like:
SELECT * FROM mygeomtable WHERE
the_geom && 'BOX3D(...)'::box3d AND
road_type =27;
Postgresql now has to make a decision between the spatial index and the
attribute index (on 'road_type'). Since we were doing a very simple
estimate for the selectivity of the && operator, the spatial index is
often used when another index (i.e. road_type) would be more
appropriate. When the BOX3D is "large" the problem is amplified because
the estimate is very wrong.
Second, problems with joins and other queries occur because the
estimated number of rows returned is much lower than the actual number,
and postgresql chooses poor table joining strategies.
New Proposal
------------
The current selectivity of the "&&" operator is fixed - it always
reports the same fraction of rows (I think 5%) no matter where the query
BOX3D is or how large it is. This is as simple as it gets.
The next simplest method is to look at the extents of the data and the
size of the query window. If the query window is 1/4 the size of the
data, then we can estimate that 1/4 of the rows will be returned. This
is quite workable, but has problems because spatial data is not
uniformly distributed.
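In C this simple method is just an area ratio (the Box2D type here is
invented for illustration):

    #include <math.h>

    typedef struct { double xmin, ymin, xmax, ymax; } Box2D;

    /* selectivity = area of the query window clipped to the data extent,
     * divided by the area of the data extent */
    static double
    simple_selectivity(Box2D extent, Box2D query)
    {
        double w = fmin(query.xmax, extent.xmax) - fmax(query.xmin, extent.xmin);
        double h = fmin(query.ymax, extent.ymax) - fmax(query.ymin, extent.ymin);
        double extent_area = (extent.xmax - extent.xmin) *
                             (extent.ymax - extent.ymin);

        if (w <= 0.0 || h <= 0.0 || extent_area <= 0.0)
            return 0.0;
        return (w * h) / extent_area;
    }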
The proposed method extends the simple method - we use another index to
estimate the number of results for the "&&" operator. WHEW - indexing
our indexes! In a nutshell, the plan is to use a modified quad tree
(i.e. one with fixed cell sizes and locations) to store a 2D histogram
of the spatial data.
This is best explained in a few diagrams. See the attached 2 diagrams.
In diagram one we see typical spatial data - roads in the province of
BC. In places like Vancouver, Victoria, Kelowna, and Prince George there
are LOTS of roads. Elsewhere there are hardly any. There are a total of
200,000 rows in the roads table. The proposed method does a 3 stage
pre-process (see figure 2):
1. Compute the extents of the data. Basically:
      SELECT extent(the_geom) FROM mygeometrytable;
2. Make a fixed-size grid from the extents - something like a 15*15
   matrix.
3. Scan through each row of the geometry table, determine which of the
   grid's cells overlap the bounding box of the geometry, and increment
   those grid cells by 1.
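Here is roughly what step 3 looks like in C (the structures are invented
for illustration; the real code would live inside PostGIS):

    #include <math.h>

    #define GRID_SIZE 15                      /* fixed 15*15 grid */

    typedef struct { double xmin, ymin, xmax, ymax; } Box2D;

    typedef struct
    {
        Box2D extent;                         /* SELECT extent(the_geom) ... */
        int   count[GRID_SIZE][GRID_SIZE];    /* bboxes touching each cell */
        int   total_rows;
    } GeomHistogram;

    /* Bump every grid cell that one row's bounding box overlaps. */
    static void
    histogram_add_box(GeomHistogram *hist, Box2D bbox)
    {
        double cw = (hist->extent.xmax - hist->extent.xmin) / GRID_SIZE;
        double ch = (hist->extent.ymax - hist->extent.ymin) / GRID_SIZE;
        int    x0 = (int) floor((bbox.xmin - hist->extent.xmin) / cw);
        int    x1 = (int) floor((bbox.xmax - hist->extent.xmin) / cw);
        int    y0 = (int) floor((bbox.ymin - hist->extent.ymin) / ch);
        int    y1 = (int) floor((bbox.ymax - hist->extent.ymin) / ch);
        int    x, y;

        /* clamp, in case a box pokes outside the computed extent */
        if (x0 < 0) x0 = 0;
        if (y0 < 0) y0 = 0;
        if (x1 >= GRID_SIZE) x1 = GRID_SIZE - 1;
        if (y1 >= GRID_SIZE) y1 = GRID_SIZE - 1;

        for (y = y0; y <= y1; y++)
            for (x = x0; x <= x1; x++)
                hist->count[y][x]++;

        hist->total_rows++;
    }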
The result will be a 2D histogram of the spatial data. Cells with lots
of geometries in them will have a high value, and cells with few will
have a low value. This is stored somewhere (i.e. the geometry_columns
metatable, though it would be nice to have it sitting in shared memory
somewhere) - the histogram will be small enough to sit on one disk page.
An incoming query's (i.e. the_geom && 'BOX3D(..)') selectivity is
calculated this way:
1. Load the appropriate 2D histogram (one disk page).
2. Find which cells the BOX3D overlaps, and the % of each cell's area
   that the BOX3D overlaps.
3. sum ( % overlap * <number of geometries in that cell> )
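In C the estimate might look roughly like this (again, the structures
are invented for illustration; dividing by the total row count turns the
sum into the selectivity fraction the planner expects):

    #include <math.h>

    #define GRID_SIZE 15

    typedef struct { double xmin, ymin, xmax, ymax; } Box2D;

    typedef struct
    {
        Box2D extent;
        int   count[GRID_SIZE][GRID_SIZE];
        int   total_rows;
    } GeomHistogram;

    /* For every cell the query box touches, add (% of the cell covered) *
     * (geometries counted in that cell), then divide by the row count. */
    static double
    histogram_selectivity(const GeomHistogram *hist, Box2D query)
    {
        double cw = (hist->extent.xmax - hist->extent.xmin) / GRID_SIZE;
        double ch = (hist->extent.ymax - hist->extent.ymin) / GRID_SIZE;
        double expected_rows = 0.0;
        int    x, y;

        for (y = 0; y < GRID_SIZE; y++)
        {
            for (x = 0; x < GRID_SIZE; x++)
            {
                double cxmin = hist->extent.xmin + x * cw;
                double cymin = hist->extent.ymin + y * ch;
                double w = fmin(query.xmax, cxmin + cw) - fmax(query.xmin, cxmin);
                double h = fmin(query.ymax, cymin + ch) - fmax(query.ymin, cymin);

                if (w > 0.0 && h > 0.0)
                    expected_rows += (w * h) / (cw * ch) * hist->count[y][x];
            }
        }

        if (hist->total_rows <= 0)
            return 0.0;
        return expected_rows / hist->total_rows;  /* clamp to [0,1] in real code */
    }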
This calculation should give a 'reasonable' estimate of the particular
"&&" selectivity. The estimate will be poor if the data is extremely
skewed.
An extension of this would be to make a proper quad tree (instead of
fixed cells), then simplify it. Unfortunately this could be very
difficult and isn't proposed (yet).
Thoughts?
dave