Re: [HACKERS] How to implement a SP-GiST index as a extension module? - Mailing list pgsql-hackers

From Connor Wolf
Subject Re: [HACKERS] How to implement a SP-GiST index as a extension module?
Date
Msg-id CAAVqP=q=xuC+SsDc9fwMUFPzJ39obsZURriFjTnKTLxb-N5AWA@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] How to implement a SP-GiST index as a extension module?  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List pgsql-hackers
I initially assumed that as well, but I found some somewhat confusing documentation about implementing this as a plain GiST tree. Mostly, the BK-Tree is a inherently unbalanced tree, and some of the documentation for plain GiST indexes claims that GiST indexes can only be created on balanced tree structures.

GiST stands for Generalized Search Tree. It is a balanced, tree-structured access method,

Sidenote: What is a "signature tree?"

I also had a fair bit of trouble getting going due to the strange terminology. The use of "consistent" for "matches condition" is particularly odd. I'm not sure if this is a translation oddment, but it makes following things a lot more confusing. I suspect this is set match bleeding into the implementation, but I don't see any reason the "consistent" functions couldn't be simply named "matches_query_conditions", with the benefit of no longer requiring familiarity with mathematics terminology to easily understand how things are supposed to work.

Additionally, I don't see how to map some of the calls to a BK-tree. In particular, there's no way to implement the GiST "union" call in the context of a b-tree, since a single tree node can only represent a single value. 

Also, the lack of decent examples for the GiST index is challenging. Really, it'd be really helpful if there was a e a GiST and SP-GiST index example that only operates on one simple, one-dimensional datatype. I'm not sure about other people, but I basically always find it much more effective to start with a working project, and modify it rather then try to start something from scratch.

As it is, there is no example that is a good starting basis for a custom GiST index, and there was no example that was a good starting basis for a SP-GiST index. This is particularly true for people (like me) who haven't worked with postgresql's source or extensions at all before this.

I actually spent some time trying to simplify the GiST b-tree example to the point where it was possible to follow it, but all the complexity introduced by the fact that the b-tree example works for basically every data type (and the dynamic dispatch that entails) makes it very hard to follow for someone new to the codebase.

I'd still like to look at converting to a pure GiST based index, but the first step of that will probably be implementing a very basic, boring 2-ary B-tree index across a integral data-type. That way, I can separate implementation issues with logic issues. 

Connor



On Tue, Nov 14, 2017 at 1:53 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Tue, Nov 14, 2017 at 6:08 AM, Connor Wolf <connorw@imaginaryindustries.com> wrote:

On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hi!

On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <connorw@imaginaryindustries.com> wrote:
Ok, I've managed to get my custom index working. 
 
Good!

It's all on github here: https://github.com/fake-name/pg-spgist_hamming, if anyone else needs a fuzzy-image searching system 
that can integrate into postgresql..

It should be a pretty good basis for anyone else to use if they want to implement a SP-GiST index too. 

I took a look at the code, and I feel myself a bit confused :)
It appears that you're indexing int8 values.  That seems like unrealistic short representation for image signature.

It is a int8, and nope, it's a surprisingly robust and functional signature. There's a document describing the hashing mechanism here:

Functionally, the procedure is relatively simple:
  • Convert to greyscale
  • Resize to intermediate resolution (32x32 is common)
  • Perform DCT on 32x32 image.
  • Crop 32x32 image to 8x8 by throwing away the high-frequency components
  • Threshold the 8x8 image by it's average
  • Serialize the 64 binary values into a int8
 
Also, name of repository make me think that hamming distance would be used to compare signatures.  But after look at the code, I see that plain absolute value of difference is used for that purpose.

static double
getDistance(Datum v1, Datum v2)
{
    int64_t a1 = DatumGetInt64(v1);
    int64_t a2 = DatumGetInt64(v2);
    int64_t diff = Abs(a1 - a2);
    fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
    return diff;
}

For such notion of distance, you don't need a VP-tree or another complex indexing.  B-tree is quite enough in this case.  Alternatively, distance function is not what it meant to be.


You're looking in the wrong place. 

https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the code you sent me, with some simplification to make it only work on single integers. Basically,
before I started on my own stuff, I wanted to make sure I could at least implement a functional index using a much more basic structure.

https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the actual BK-tree index, and it does indeed use hamming distance for the search metric:

static int64_t
f_hamming(int64_t a_int, int64_t b_int)
{
/*
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
*/
uint64_t x = (a_int ^ b_int);
uint64_t ret = __builtin_popcountll (x);

return ret;
}
 
It would be useful if you provide complete usage example of this extension: from image to signature conversion to search queries. 



Actual usage is done with this project: https://github.com/fake-name/IntraArchiveDeduplicator, which
also has the older in-memory BK tree I've implemented, and it's actually used here
I also have unit tests that sit on top of this here (see all the files that are named "Test_db_BKTree...".

OK.  That explains the things, thank you.
For such kind of index, it's probably not even necessary to use SP-GiST.  GiST signature tree could work in this case as well (it would be probably even better).
It would be nice if you write about it some blog post to planet PostgreSQL.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: [HACKERS] pgbench regression test failure
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath