Re: [HACKERS] How to implement a SP-GiST index as a extension module? - Mailing list pgsql-hackers

From Connor Wolf
Subject Re: [HACKERS] How to implement a SP-GiST index as a extension module?
Date
Msg-id CAAVqP=r2ADz49eKpc5uPwfDftAwVKKhra=ty1Vtr0xjung-TgQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] How to implement a SP-GiST index as a extension module?  (Connor Wolf <connorw@imaginaryindustries.com>)
Responses Re: [HACKERS] How to implement a SP-GiST index as a extension module?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Ok, more questions.

I've been studying the implementation Alexander Korotkov sent, and I'm not seeing how to map 
some of the components onto the changes in the SP-GiST system that occured between Postgresql 9.2 and 9.3.

The changes at that point seem to have been to change xxx_inner_consistent and xxx_leaf_consistent from taking 
an input containing a Datum pointing to the query conditional to a list of ScanKeys which each encapsulate one filter condition.

The issue here is that for VP trees (or the BK tree I want to eventually implement), the filter condition requires two parameters:
 - The maximum allowed distance from the target value
 - The actual target value.

The ScanKeys struct appears to only be able to contain a conditional type, and a single parameter, such as "less then x", "above y", 
and so forth. Looking at the existing spgkdtreeproc.c and spgquadtreeproc.c, their mechanism for passing more complex conditions
through to the xxx_consistent functions appears to be to encapsulate the entire condition in a custom type. For example,
their mechanism for querying if something is within a certain envelope is done by having the envelope be described 
by the "BOX" type.

As such:
Will compound queries as I describe above basically require a custom type to make it possible? My (admittedly naive) expectation
is that the eventual query for this index will look something like "SELECT * FROM example_table WHERE indexed_column <=> target_value < 4;", 
with "<=>" being the operator for the relevant distance calculation (hamming, for the BK tree, numeric for the VP-tree).

The existing VP-tree code appears to not support multiple operators whatsoever, probably because it was very preliminary.

Thanks!
Connor

On Mon, Oct 30, 2017 at 7:04 PM, Connor Wolf <connorw@imaginaryindustries.com> wrote:
I was mostly unclear on how I'd go about attaching the extension functions to the relevant indexing mechanism. From the looks of the vptree.tar.gz file (which is really, *really* helpful, incidentally!), a it's done via a custom operator class, which then gets passed to the actual index creation mechanism when you're declaring the index.

I think I had looked at that at one point, but it's been a while. In my case, I'm using discrete-cosine-transform based perceptual hashes for searching. They are nice and compact (64-bits per hash), while still producing good search results. I have a dataset of ~36 million images, and it does searches in < 50 milliseconds with a hamming distance of 4, while touching ~0.25% of the tree (And occupying ~18 GB of ram). 

My BK tree is up on github here, if anyone needs something like that (BSD licensed, pretty well tested). There's also a python wrapper for it.

I'll probably not have time to poke about until this weekend, but thanks!
Connor



On Mon, Oct 30, 2017 at 4:50 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hi!

On Sun, Oct 29, 2017 at 12:07 PM, Connor Wolf <wolf@imaginaryindustries.com> wrote:
I'm looking at implementing a custom indexing scheme, and I've been having trouble understanding the proper approach.

Basically, I need a BK tree, which is a tree-structure useful for indexing arbitrary discrete metric-spaces (in my case, I'm interested in indexing across the hamming edit-distance of perceptual hashes, for fuzzy image searching). I'm pretty sure a SP-GiST index is the correct index type, as my tree is intrinsically unbalanced.
 
Yes, SP-GiST is appropriate index type for BK tree.  I'm pretty sure BK tree could be implemented as SP-GiST opclass.
The only thing worrying me is selection pivot values for nodes.  SP-GiST builds by insertion of index tuples on by one.  First pivot value for root node in SP-GIST would be created once first leaf page overflows.  Thus, you would have to select this pivot value basing on very small fraction in the beginning of dataset.
As I know, BK tree is most efficient when root pivot value is selected after looking in whole dataset and then hierarchically to subtrees.

BTW, did you try my extension for searching similar images.  It's quite primitive, but works for some cases.

I have a functional stand-alone implementation of a BK-Tree, and it works very well, but the complexity of managing what is basically a external index for my database has reached the point where it's significantly problematic, and it seems to be it should be moved into the database.
 
Sure, moving this index to the database is right decision.

Anyways, looking at the contents of postgres/src/backend/access/spgist, it looks pretty straightforward in terms of the actual C implementation, but I'm stuck understanding how to "install" a custom SP-GiST implementation. There are several GiST indexing implementations in the contrib directory, but no examples for how I'd go about implementing a loadable SP-GiST index.

Basically, my questions are:
  • Is it possible to implement a SP-GiST indexing scheme as a loadable module?
 Yes, it's possible to define SP-GiST.
    • If so, how?
The pretty same way as GiST opclass extension.  You have to define supporting functions and operators and then define operator class over them.
    • And is there an example I can base my implementation off of?

    I'm relatively comfortable with C (much moreso with C++), but I haven't spent a lot of time looking at the postgresql codebase.  I don't think I could start from a empty folder and make a properly-implemented module in any reasonable period of time, so if I have a working example for some sort of index that uses the same interfaces that would really help a lot

    I don't think there is an example in PostgreSQL source code tree or on github.  But I've attached by early experiment with VP-tree (seems to be pretty same as BK tree) using SP-GiST (see vptree.tar.gz).  Basing on this experiment I realized that it's important to select root pivot value basing on the whole dataset.  However, for your metric/dataset/queries it might appear to be different.

    It also would be nice to someday improve SP-GiST to support some global strategies on index creation.  In particular, it would allow to resolve selection of pivot values problem that I mention above.  Right now my colleagues and me don't have time for that.  But I can assist you with advises if you will decide to implement that.

    ------
    Alexander Korotkov
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company 


    --
    Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
    To make changes to your subscription:
    http://www.postgresql.org/mailpref/pgsql-hackers



    pgsql-hackers by date:

    Previous
    From: Masahiko Sawada
    Date:
    Subject: Re: [HACKERS] Explicit relation name in VACUUM VERBOSE log
    Next
    From: Ashutosh Bapat
    Date:
    Subject: Re: [HACKERS] Account for cost and selectivity of HAVING quals