Home > mailing lists

Re: [GENERAL] Understanding and implementing a GiST Index - Mailing list pgsql-hackers

From	Connor Wolf
Subject	Re: [GENERAL] Understanding and implementing a GiST Index
Date	October 10, 2014 17:48:54
Msg-id	543784DC.9020107@imaginaryindustries.com Whole thread Raw
List	pgsql-hackers

Tree view

I'd be ok with either SP-GiST or GiST, though there are tree structures (BK tree) that are particularly suited to this application that I think map to GiST better then SP-GiST.

Re: hamming in the RD-tree implementation: Where? I cloned the smlar git repo, and poked around, and it looks like it only can operate on set intersections, which is a non-starter for what I need to do. I spent a while looking through the source, and I didn't see anything that looked like hamming distance calculation (through I probably missed some stuff. The indirection through `FCall2()` makes things very hard to follow).

Anyways, right now, smlar is not useful to me, because it completely ignores the structure of incoming arrays:

postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
 smlar
-------
     1
(1 row)

For two radically different hashes (shortened for example purposes) which compare as identical.

I spent some time trying to see if I could easily disable the array uniquefication, and by disabling the calls to `qsort_arg`, and modifying `numOfIntersect` so it just counts the number of times the array elements do not match, and I'm getting the behaviour I want out of `smlar()` at this point:

postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# ); smlar
-------- 0.4375
(1 row)

postgres=# SELECT smlar(
'{0,1,1,0}'::int[],
'{1,0,1,0}'
); smlar
-------   0.5
(1 row)

But the index does not seem to work after the changes I made, and the debugging print statements (well, `elog()` statements) I inserted into `cmpArrayElem()` and `numOfIntersect()` are not being hit, so I don't understand how the index is even being built.

Anyways, I'm rambling a bit. Any input would be great.
Thanks,
Connor

On 10/9/2014 8:11 AM, Oleg Bartunov wrote:

Just quick recommendation.
You need to ask -hackers mailing list for that. Also, do you want GiST or SP-GiST ?
We already use hamming distance in rd-tree, which implemented in GiST, see our presentation http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf, for example.

Oleg

On Thu, Oct 9, 2014 at 11:09 AM, Connor Wolf <wolf@imaginaryindustries.com> wrote:
Hi there!

I'm trying to implement a custom indexing system using the GiST index framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across a set of 64 bit integers by hamming distance. This is intended to be used in searching for similar images, where the 64 bit integer is actually a phash of an image, and similarity between two images is reflected in the hamming distance between two integers.

Anyways, The appropriate approach here is to use something called a BK tree, for which I've written some test code and I think I have a decent grip of (my test code seems to work, in any event).

That said:
Is there a simple piece of example-code for implementing a GiST index for a single index? I've been looking through the /contrib/ directory, particularly the /contrib/btree_gist/ files, but it looks like 90% of the complexity there is related to supporting all the column types Postgres has, rather then actually tied to producing a functional index.
Once I have something compiling, how can I check to be sure that I'm actually using the indexing module I created? I can enable my compiled extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX test_index ON testing USING gist (val);` that lets me specify *which* GiST index is actually employed? Is this even a valid question?
This is probably something that's available in one of the system tables somewhere, but I haven't had much luck with google finding out where.
Testing: What's the appropriate way to examine the generated tree structure of the index? I certainly went through a few bugs with my test tree system (and that was in python!). Is there any way to examine the index structure for debugging purposes?
Also, it looks like `ereport()` is the proper way to emit debugging information. Is this correct?
In that vein, is there any way to have information that is on the operations of an entire query? Looking at the number of tree nodes touched for a scan would be nice (and I would not be surprised if there is already a facility for it).
Project code is here if anyone is interested, any help would be great. I have very little idea what I'm doing.
Thanks,
Connor

pgsql-hackers by date:

From: Fabrízio de Royes Mello
Date: 10 October 2014, 17:43:47
Subject: Re: schema-only -n option in pg_restore fails

From: Robert Haas
Date: 10 October 2014, 18:05:34
Subject: Re: UPSERT wiki page, and SQL MERGE syntax

Re: [GENERAL] Understanding and implementing a GiST Index - Mailing list pgsql-hackers

Previous

Next