Re: [HACKERS] How to implement a SP-GiST index as a extension module? - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] How to implement a SP-GiST index as a extension module?
Date
Msg-id CAPpHfduY_JFqe35dz_T9L=WhovUxJYMYPw7w2moYSxO-eQuUOw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] How to implement a SP-GiST index as a extension module?  (Connor Wolf <connorw@imaginaryindustries.com>)
Responses Re: [HACKERS] How to implement a SP-GiST index as a extension module?  (Connor Wolf <connorw@imaginaryindustries.com>)
List pgsql-hackers
On Tue, Nov 14, 2017 at 6:08 AM, Connor Wolf <connorw@imaginaryindustries.com> wrote:

On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hi!

On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <connorw@imaginaryindustries.com> wrote:
Ok, I've managed to get my custom index working. 
 
Good!

It's all on github here: https://github.com/fake-name/pg-spgist_hamming, if anyone else needs a fuzzy-image searching system 
that can integrate into postgresql..

It should be a pretty good basis for anyone else to use if they want to implement a SP-GiST index too. 

I took a look at the code, and I feel myself a bit confused :)
It appears that you're indexing int8 values.  That seems like unrealistic short representation for image signature.

It is a int8, and nope, it's a surprisingly robust and functional signature. There's a document describing the hashing mechanism here:

Functionally, the procedure is relatively simple:
  • Convert to greyscale
  • Resize to intermediate resolution (32x32 is common)
  • Perform DCT on 32x32 image.
  • Crop 32x32 image to 8x8 by throwing away the high-frequency components
  • Threshold the 8x8 image by it's average
  • Serialize the 64 binary values into a int8
 
Also, name of repository make me think that hamming distance would be used to compare signatures.  But after look at the code, I see that plain absolute value of difference is used for that purpose.

static double
getDistance(Datum v1, Datum v2)
{
    int64_t a1 = DatumGetInt64(v1);
    int64_t a2 = DatumGetInt64(v2);
    int64_t diff = Abs(a1 - a2);
    fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
    return diff;
}

For such notion of distance, you don't need a VP-tree or another complex indexing.  B-tree is quite enough in this case.  Alternatively, distance function is not what it meant to be.


You're looking in the wrong place. 

https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the code you sent me, with some simplification to make it only work on single integers. Basically,
before I started on my own stuff, I wanted to make sure I could at least implement a functional index using a much more basic structure.

https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the actual BK-tree index, and it does indeed use hamming distance for the search metric:

static int64_t
f_hamming(int64_t a_int, int64_t b_int)
{
/*
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
*/
uint64_t x = (a_int ^ b_int);
uint64_t ret = __builtin_popcountll (x);

return ret;
}
 
It would be useful if you provide complete usage example of this extension: from image to signature conversion to search queries. 



Actual usage is done with this project: https://github.com/fake-name/IntraArchiveDeduplicator, which
also has the older in-memory BK tree I've implemented, and it's actually used here
I also have unit tests that sit on top of this here (see all the files that are named "Test_db_BKTree...".

OK.  That explains the things, thank you.
For such kind of index, it's probably not even necessary to use SP-GiST.  GiST signature tree could work in this case as well (it would be probably even better).
It would be nice if you write about it some blog post to planet PostgreSQL.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: [HACKERS] Proposal: Local indexes for partitioned table
Next
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Pluggable storage