Re: KNN-GiST with recheck - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: KNN-GiST with recheck
Date
Msg-id CAPpHfdu3JmBrsgsrtaAQJMpXyMgrW+T2P7Gmf9oP8L1V7zLvOg@mail.gmail.com
Whole thread Raw
In response to Re: KNN-GiST with recheck  (Emre Hasegeli <emre@hasegeli.com>)
Responses Re: KNN-GiST with recheck
List pgsql-hackers
On Sun, Aug 3, 2014 at 5:48 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
> > 1. This patch introduces a new "polygon <-> point" operator. That seems
> > useful on its own, with or without this patch.
>
> Yeah, but exact-knn cant come with no one implementation. But it would
> better come in a separate patch.

I tried to split them.  Separated patches are attached.  I changed
the order of the arguments as point <-> polygon, because point was
the first one on all the others.  Its commutator was required for
the index, so I added it on the second patch.  I also added tests
for the operator.  I think it is ready for committer as a separate
patch.  We can add it to the open CommitFest.

I have made some cosmetic changes on the patches.  I hope they are
useful.

I added support to point <-> circle operator with the same GiST
distance function you added for polygon.  I can change it, if it is not
the right way.

Great, thanks! 

 > > 2. I wonder how useful it really is to allow mixing exact and non-exact
> > return values from the distance function. The distance function included in
> > the patch always returns recheck=true. I have a feeling that all other
> > distance function will also always return either true or false.
>
> For geometrical datatypes recheck variations in consistent methods are also
> very rare (I can't remember any). But imagine opclass for arrays where keys
> have different representation depending on array length. For such opclass
> and knn on similarity recheck flag could be useful.

I also wonder how useful it is.  Your example is convincing, but maybe
setting it index-wide will make the decisions on the framework easier.
For example, how hard would it be to decide if further sorting is
required or not on the planner?

I think that for most use cases just some operators require further sorting and some of them not. But it could appear one day that some index gives part of its knn answers exact and part of them inexact. Same happen to recheck of regular operators. Initially recheck flag was defined in opclass. But later recheck became runtime flag.

> > 4. (as you mentioned in the other thread: ) It's a modularity violation
> > that you peek into the heap tuple from gist. I think the proper way to do
> > this would be to extend the IndexScan executor node to perform the
> > re-shuffling of tuples that come from the index in wrong order, or perhaps
> > add a new node type for it.
> >
> > Of course that's exactly what your partial sort patch does :-). I haven't
> > looked at that in detail, but I don't think the approach the partial sort
> > patch takes will work here as is. In the KNN-GiST case, the index is
> > returning tuples roughly in the right order, but a tuple that it returns
> > might in reality belong somewhere later in the ordering. In the partial
> > sort patch, the "input stream" of tuples is divided into non-overlapping
> > groups, so that the tuples within the group are not sorted, but the groups
> > are. I think the partial sort case is a special case of the KNN-GiST case,
> > if you consider the lower bound of each tuple to be the leading keys that
> > you don't need to sort.
>
> Yes. But, for instance btree accesses heap for unique checking. Is really
> it so crimilal? :-)
> This is not only question of a new node or extending existing node. We need
> to teach planner/executor access method can return value of some expression
> which is lower bound of another expression. AFICS now access method can
> return only original indexed datums and TIDs. So, I afraid that enormous
> infrastructure changes are required. And I can hardly imagine what they
> should look like.

Unfortunately, I am not experienced enough to judge your implementation.
As far as I understand the problem is partially sorting rows on
the index scan node.  It can lead the planner to choose non-optimal
plans, because of not taking into account the cost of sorting.

Cost estimation of GiST is a big problem anyway. It doesn't care (and can't) about amount of recheck for regular operators. In this patch, same would be for knn recheck. The problem is that touching heap from access method breaks incapsulation. One idea about this is to do sorting in another nodes. However, I wonder if it would be an overengineering and overhead. In attached patch I propose a different approach: put code touching heap into separate index_get_heap_values function. Also new version of patch includes regression tests and some cleanup.

------
With best regards,
Alexander Korotkov.

Attachment

pgsql-hackers by date:

Previous
From: Claudio Freire
Date:
Subject: Re: Proposal: Incremental Backup
Next
From: Alvaro Herrera
Date:
Subject: Re: pg_ctl non-idempotent behavior change