Re: KNN-GiST with recheck - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: KNN-GiST with recheck
Date
Msg-id CAPpHfdvdL0dtoNBmbCsJ=W0bFE7A87hLOtoBN7Pe4hnXkQ9LxA@mail.gmail.com
Whole thread Raw
In response to Re: KNN-GiST with recheck  (Bruce Momjian <bruce@momjian.us>)
Responses Re: KNN-GiST with recheck
List pgsql-hackers
On Mon, Sep 29, 2014 at 6:16 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
>     Does this also fix the identical PostGIS problem or is there something
>     PostGIS needs to do?
>
>
> This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
> need corresponding change in its GiST opclass. Since PostGIS already define <->
> and <#> operators as distance to bounding box border and bounding box center,
> it can't change their behaviour.
> it has to support new operator "exact distance" in opclass. 

Ah, OK, so they just need something that can be used for the recheck.  I
think they currently use ST_Distance() for that.  Does it have to be an
operator?  If they defined an operator for ST_Distance(), would
ST_Distance() work too for KNN-GiST?

Currently, ST_Distance is a function, but it's no problem to make it an operator with KNN-GiST support.
 
In summary, you still create a normal GiST index on the column:

        http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

        CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);

which indexes by the bounding box.  The new code will allow ordered
index hits to be filtered by something like ST_Distance(), rather than
having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:

        EXPLAIN ANALYZE WITH distance AS (
            SELECT way AS road, ref AS route
                FROM planet_osm_line
                WHERE highway = 'secondary'
                ORDER BY ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913) <#> way
                LIMIT 50
                )
        SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
            FROM distance
            ORDER BY true_distance
            LIMIT 1;

Yeah. It this query 50 is pure empirical value. It could be both too low or too high. Too low value can cause wrong query answers. Too high value can cause lower performance. With patch simple KNN query will work like this query with always right value in LIMIT clause.
 
Notice the CTE uses <#> (bounding box center), and then the outer query
uses ST_Distance and LIMIT 1 to find the closest item.

Excellent!

Thanks. The main question now is design of this patch. Currently, it does all the work inside access method. We already have some discussion of pro and cons of this method. I would like to clarify alternatives now. I can see following way:
  1. Implement new executor node which performs sorting by priority queue. Let's call it "Priority queue". I think it should be separate node from "Sort" node. Despite "Priority queue" and "Sort" are essentially similar from user view, they would be completely different in implementation.
  2. Implement some interface to transfer distance values from access method to "Priority queue" node.
  3. Somehow tell the planner that it could use "Priority queue" in corresponding cases. I see two ways of doing this:
    • Add flag to operator in opclass indicating that index can only order by lower bound of "col op value", not by "col op value" itself.
    • Define new relation between operators. Value of one operator could be lower bound for value of another operator. So, planner can put "Priority queue" node when lower bound ordering is possible from index. Also "ALTER OPERATOR" command would be reasonable, so extensions could upgrade.
Besides overhead, this way makes significant infrastructural changes. So, it may be over-engineering. However, it's probably more clean and beautiful solution.
I would like to get some feedback from people familiar with KNN-GiST like Heikki or Tom. What do you think about this? Any other ideas?

------
With best regards,
Alexander Korotkov.  

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Next
From: Robert Haas
Date:
Subject: Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}