Home > mailing lists

Designing an extension for feature-space similarity search - Mailing list pgsql-hackers

From	Jay Levitt
Subject	Designing an extension for feature-space similarity search
Date	February 15, 2012 19:35:10
Msg-id	4F3C16E9.90808@gmail.com Whole thread Raw
Responses	Re: Designing an extension for feature-space similarity search (Tom Lane <tgl@sss.pgh.pa.us>) Re: Designing an extension for feature-space similarity search (Alexander Korotkov <aekorotkov@gmail.com>)
List	pgsql-hackers

Tree view

[Preamble: I've been told that the hackers list is appropriate for 
extension-related topics like this, even if it's not about contributing to 
core. If I'm misappropriating, please let me know.]

Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product 
recommendations" or "Pinterest for people with your tastes". We use psych 
research to measure and predict your personality and traits along a number 
of scales (dimensions), and then we connect you with people, products and 
content we think you'll like.

I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the 
distances of each dimension, along with arbitrary other arguments - for 
instances, it might take the weighted average of the dimensions

That aggregation (which may not literally be an aggregate; I'm not sure yet) 
needs to happen in a SELECT list, which means it needs to be fast, which 
means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension, 
declare that everything (except inner pages) must be a zero-volume cube 
(cube_is_point()), map our non-float features onto floats somehow, and 
hard-code all the distance functions and the aggregation function.

But I think this sort of similarity-search engine has general utility, and I 
also want to make it easy for us to add and subtract dimensions without too 
much pain; that should be DDL, not code. So thinking about how this might 
evolve...

- I'm not sure how to represent arbitrary column-like features without 
reinventing the wheel and putting a database in the database.  hstore only 
stores text, probably for this reason; I took a look at the earlier json 
patch and saw that it handled only a few core data types.  Have there been 
any other PoCs that involved collections of hetereogenous data? I almost 
want an actual instance of an "anyarray".

- Alternatively, is there a way to index an entire, arbitrary row, rather 
than on a column on that row? I'm fine with this extension requiring its own 
table, so I leave the data where it is in the row, and only worry about 
indexing it. I can't just use functional indexes, because I'll need to 
provide operators and support functions to GiST. Maybe I have a fake 
sentinel column, where all the operators use SPI to introspect the row, 
treat each column as a feature dimension, call the underlying operators on 
each column's data type, etc.

- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when <-> returns values that don't "make 
sense" in the physical world? For instance, let's say NULL <-> NULL returns 
a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> 
NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that 
could even describe a triangle - but my spidey sense is tingling on this.

- Are there previous discussions, patches, abandoned projects, etc. that 
this reminds you of and that I should go research?

Thanks for any thoughts, and I'd love collaborators or even mentors - we 
plan to open source whatever we produce here, and I don't have quite the 
theoretical background it takes to do this properly.

Jay Levitt

pgsql-hackers by date:

From: Dimitri Fontaine
Date: 15 February 2012, 19:25:42
Subject: Re: Command Triggers

From: Robert Haas
Date: 15 February 2012, 19:54:30
Subject: Re: [COMMITTERS] pgsql: Speed up in-memory tuplesorting.

Designing an extension for feature-space similarity search - Mailing list pgsql-hackers

Previous

Next