Re: [HACKERS] How to implement a SP-GiST index as a extension module? - Mailing list pgsql-hackers

From Connor Wolf
Subject Re: [HACKERS] How to implement a SP-GiST index as a extension module?
Date
Msg-id CAAVqP=ra29L6bUc5n3mjzG5+Mv37i+uaGCvh_kzL6x1FSo1hXw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] How to implement a SP-GiST index as a extension module?  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: [HACKERS] How to implement a SP-GiST index as a extension module?  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List pgsql-hackers

On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hi!

On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <connorw@imaginaryindustries.com> wrote:
Ok, I've managed to get my custom index working. 
 
Good!

It's all on github here: https://github.com/fake-name/pg-spgist_hamming, if anyone else needs a fuzzy-image searching system 
that can integrate into postgresql..

It should be a pretty good basis for anyone else to use if they want to implement a SP-GiST index too. 

I took a look at the code, and I feel myself a bit confused :)
It appears that you're indexing int8 values.  That seems like unrealistic short representation for image signature.

It is a int8, and nope, it's a surprisingly robust and functional signature. There's a document describing the hashing mechanism here:

Functionally, the procedure is relatively simple:
  • Convert to greyscale
  • Resize to intermediate resolution (32x32 is common)
  • Perform DCT on 32x32 image.
  • Crop 32x32 image to 8x8 by throwing away the high-frequency components
  • Threshold the 8x8 image by it's average
  • Serialize the 64 binary values into a int8
 
Also, name of repository make me think that hamming distance would be used to compare signatures.  But after look at the code, I see that plain absolute value of difference is used for that purpose.

static double
getDistance(Datum v1, Datum v2)
{
    int64_t a1 = DatumGetInt64(v1);
    int64_t a2 = DatumGetInt64(v2);
    int64_t diff = Abs(a1 - a2);
    fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
    return diff;
}

For such notion of distance, you don't need a VP-tree or another complex indexing.  B-tree is quite enough in this case.  Alternatively, distance function is not what it meant to be.


You're looking in the wrong place. 

https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the code you sent me, with some simplification to make it only work on single integers. Basically,
before I started on my own stuff, I wanted to make sure I could at least implement a functional index using a much more basic structure.

https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the actual BK-tree index, and it does indeed use hamming distance for the search metric:

static int64_t
f_hamming(int64_t a_int, int64_t b_int)
{
/*
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
*/
uint64_t x = (a_int ^ b_int);
uint64_t ret = __builtin_popcountll (x);

return ret;
}
 
It would be useful if you provide complete usage example of this extension: from image to signature conversion to search queries. 



Actual usage is done with this project: https://github.com/fake-name/IntraArchiveDeduplicator, which
also has the older in-memory BK tree I've implemented, and it's actually used here
I also have unit tests that sit on top of this here (see all the files that are named "Test_db_BKTree...".

Connor

pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: [HACKERS] ginInsertCleanup called from vacuum could still misstuples to be deleted
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] path toward faster partition pruning