Designing an extension for feature-space similarity search - Mailing list pgsql-hackers

From Jay Levitt
Subject Designing an extension for feature-space similarity search
Date
Msg-id 4F3C16E9.90808@gmail.com
Whole thread Raw
Responses Re: Designing an extension for feature-space similarity search  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Designing an extension for feature-space similarity search  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
[Preamble: I've been told that the hackers list is appropriate for 
extension-related topics like this, even if it's not about contributing to 
core. If I'm misappropriating, please let me know.]

Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product 
recommendations" or "Pinterest for people with your tastes". We use psych 
research to measure and predict your personality and traits along a number 
of scales (dimensions), and then we connect you with people, products and 
content we think you'll like.

I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the 
distances of each dimension, along with arbitrary other arguments - for 
instances, it might take the weighted average of the dimensions

That aggregation (which may not literally be an aggregate; I'm not sure yet) 
needs to happen in a SELECT list, which means it needs to be fast, which 
means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension, 
declare that everything (except inner pages) must be a zero-volume cube 
(cube_is_point()), map our non-float features onto floats somehow, and 
hard-code all the distance functions and the aggregation function.

But I think this sort of similarity-search engine has general utility, and I 
also want to make it easy for us to add and subtract dimensions without too 
much pain; that should be DDL, not code. So thinking about how this might 
evolve...

- I'm not sure how to represent arbitrary column-like features without 
reinventing the wheel and putting a database in the database.  hstore only 
stores text, probably for this reason; I took a look at the earlier json 
patch and saw that it handled only a few core data types.  Have there been 
any other PoCs that involved collections of hetereogenous data? I almost 
want an actual instance of an "anyarray".

- Alternatively, is there a way to index an entire, arbitrary row, rather 
than on a column on that row? I'm fine with this extension requiring its own 
table, so I leave the data where it is in the row, and only worry about 
indexing it. I can't just use functional indexes, because I'll need to 
provide operators and support functions to GiST. Maybe I have a fake 
sentinel column, where all the operators use SPI to introspect the row, 
treat each column as a feature dimension, call the underlying operators on 
each column's data type, etc.

- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when <-> returns values that don't "make 
sense" in the physical world? For instance, let's say NULL <-> NULL returns 
a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> 
NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that 
could even describe a triangle - but my spidey sense is tingling on this.

- Are there previous discussions, patches, abandoned projects, etc. that 
this reminds you of and that I should go research?

Thanks for any thoughts, and I'd love collaborators or even mentors - we 
plan to open source whatever we produce here, and I don't have quite the 
theoretical background it takes to do this properly.

Jay Levitt


pgsql-hackers by date:

Previous
From: Dimitri Fontaine
Date:
Subject: Re: Command Triggers
Next
From: Robert Haas
Date:
Subject: Re: [COMMITTERS] pgsql: Speed up in-memory tuplesorting.