Thread: Designing an extension for feature-space similarity search

Designing an extension for feature-space similarity search

From

Jay Levitt

Date:

15 February 2012, 16:35:10

[Preamble: I've been told that the hackers list is appropriate for 
extension-related topics like this, even if it's not about contributing to 
core. If I'm misappropriating, please let me know.]

Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product 
recommendations" or "Pinterest for people with your tastes". We use psych 
research to measure and predict your personality and traits along a number 
of scales (dimensions), and then we connect you with people, products and 
content we think you'll like.

I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the 
distances of each dimension, along with arbitrary other arguments - for 
instances, it might take the weighted average of the dimensions

That aggregation (which may not literally be an aggregate; I'm not sure yet) 
needs to happen in a SELECT list, which means it needs to be fast, which 
means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension, 
declare that everything (except inner pages) must be a zero-volume cube 
(cube_is_point()), map our non-float features onto floats somehow, and 
hard-code all the distance functions and the aggregation function.

But I think this sort of similarity-search engine has general utility, and I 
also want to make it easy for us to add and subtract dimensions without too 
much pain; that should be DDL, not code. So thinking about how this might 
evolve...

- I'm not sure how to represent arbitrary column-like features without 
reinventing the wheel and putting a database in the database.  hstore only 
stores text, probably for this reason; I took a look at the earlier json 
patch and saw that it handled only a few core data types.  Have there been 
any other PoCs that involved collections of hetereogenous data? I almost 
want an actual instance of an "anyarray".

- Alternatively, is there a way to index an entire, arbitrary row, rather 
than on a column on that row? I'm fine with this extension requiring its own 
table, so I leave the data where it is in the row, and only worry about 
indexing it. I can't just use functional indexes, because I'll need to 
provide operators and support functions to GiST. Maybe I have a fake 
sentinel column, where all the operators use SPI to introspect the row, 
treat each column as a feature dimension, call the underlying operators on 
each column's data type, etc.

- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when <-> returns values that don't "make 
sense" in the physical world? For instance, let's say NULL <-> NULL returns 
a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> 
NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that 
could even describe a triangle - but my spidey sense is tingling on this.

- Are there previous discussions, patches, abandoned projects, etc. that 
this reminds you of and that I should go research?

Thanks for any thoughts, and I'd love collaborators or even mentors - we 
plan to open source whatever we produce here, and I don't have quite the 
theoretical background it takes to do this properly.

Jay Levitt

Re: Designing an extension for feature-space similarity search

From

Tom Lane

Date:

15 February 2012, 18:29:57

Jay Levitt <jay.levitt@gmail.com> writes:
> - I'm not sure how to represent arbitrary column-like features without 
> reinventing the wheel and putting a database in the database.

ISTM you could define a composite type and then create operators and an
operator class over that type.  If you were trying to make a btree
opclass there might be a conflict with the built-in record_ops opclass,
but since you're only interested in GIST I don't see any real
roadblocks.  The main potential disadvantage of this is that you'd have
the standard tuple header as overhead in index entries --- but maybe the
entries are large enough that that doesn't matter, and in any case you
could probably make use of the GIST "compress" method to get rid of most
of the header.  Maybe convert to MinimalTuple, for instance, if you want
to still be able to leverage existing support code for field extraction.

> - Can domains have operators, or are operators defined on types?

I think the current state of play is that you can have such things but
the system will only consider them for exact type matches, so you might
need more explicit casts than you ordinarily would.  However, we only
support domains over base types not composites, so this isn't really
going to be a profitable direction for you anyway.

> - Does KNN-GiST run into problems when <-> returns values that don't "make 
> sense" in the physical world?

Wouldn't surprise me.  In general, non-strict index operators are a bad
idea.  However, if the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.
        regards, tom lane

Re: Designing an extension for feature-space similarity search

From

Alexander Korotkov

Date:

16 February 2012, 07:37:22

On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt <jay.levitt@gmail.com> wrote:

- But a dimension might be in any domain, not just floats
- The distance along each dimension is a domain-specific function

What exact domains do you expect? Some domains could appear to be quite hard for index-based similarity search using GiST (for example, sets, strings etc.).

------
With best regards,
Alexander Korotkov.

Re: Designing an extension for feature-space similarity search

From

Jay Levitt

Date:

16 February 2012, 13:12:43

Alexander Korotkov wrote:
> On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt <jay.levitt@gmail.com
> <mailto:jay.levitt@gmail.com>> wrote:
>
>     - But a dimension might be in any domain, not just floats
>     - The distance along each dimension is a domain-specific function
>
> What exact domains do you expect? Some domains could appear to be quite hard
> for index-based similarity search using GiST (for example, sets, strings etc.).

Oh, nothing nearly so complex, and (to Tom's point) no composite types, 
either. Right now we have demographics like gender, geolocation, and 
birthdate; I think any domain will be a type that's easily expressible in 
linear terms.  I was thinking in domains rather than types because there 
isn't one distance function for "date" or "float"; me.birthdate <-> 
you.birthdate "birthdate" is normalized to a different curve than now() <-> 
posting_date, and raw_score <-> raw_score would differ from z_score <-> z_score.

It would have been elegant to express that distance with <->, but since 
domains can't have operators, I can create distance(this, other) functions 
for each domain. It just won't look as pretty!

Jay

Re: Designing an extension for feature-space similarity search

From

Jay Levitt

Date:

16 February 2012, 14:18:45

Tom Lane wrote:
> Jay Levitt<jay.levitt@gmail.com>  writes:
>> - I'm not sure how to represent arbitrary column-like features without
>> reinventing the wheel and putting a database in the database.
>
> ISTM you could define a composite type and then create operators and an
> operator class over that type.  If you were trying to make a btree
> opclass there might be a conflict with the built-in record_ops opclass,
> but since you're only interested in GIST I don't see any real
> roadblocks.

Perfect. Composite types are exactly what I need here; the application can 
declare its composite type and provide distance functions for each member, 
and the extension can use those to calculate similarity. How do I introspect 
the composite type's pg_class to see what it contains? I assume there's a 
better way than SPI on system catalogs :) Should I be using systable_* 
functions from genam, or is there an in-memory tree? I feel like funcapi 
gets me partway there but there's magic in the middle.

Can you think of any code that would serve as a sample, maybe whatever 
creates the output for psql's \d?

> The main potential disadvantage of this is that you'd have
> the standard tuple header as overhead in index entries --- but maybe the
> entries are large enough that that doesn't matter, and in any case you
> could probably make use of the GIST "compress" method to get rid of most
> of the header.  Maybe convert to MinimalTuple, for instance, if you want
> to still be able to leverage existing support code for field extraction.

Probably not worth it to save the 8 bytes; we're starting out at about 20 
floats per row. But good to know for later optimization...

>
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but
> the system will only consider them for exact type matches, so you might
> need more explicit casts than you ordinarily would.  However, we only
> support domains over base types not composites, so this isn't really
> going to be a profitable direction for you anyway.

Actually, as mentioned to Alexander, I'm thinking of domains per feature, 
not for the overall tuple, so birthdate<->birthdate differs from 
now()<->posting_date.  Sounds like that might work - I'll play.
>
>> - Does KNN-GiST run into problems when<->  returns values that don't "make
>> sense" in the physical world?
>
> Wouldn't surprise me.  In general, non-strict index operators are a bad
> idea.  However, if the indexed entities are records, it would be
> entirely your own business how you handled individual fields being NULL.

Yeah, that example conflated NULLs in the feature fields (we don't know your 
birthdate) with <-> on the whole tuple.  Oops.

I guess I can just test this by verifying that KNN-GiST ordered by distance 
returns the same results as without the index.

Thanks for your help here.

Jay

Re: Designing an extension for feature-space similarity search

From

Tom Lane

Date:

16 February 2012, 15:34:55

Jay Levitt <jay.levitt@gmail.com> writes:
> Perfect. Composite types are exactly what I need here; the application can 
> declare its composite type and provide distance functions for each member, 
> and the extension can use those to calculate similarity. How do I introspect 
> the composite type's pg_class to see what it contains? I assume there's a 
> better way than SPI on system catalogs :)

Definitely.  Take a look at record_out, record_cmp, and sibling
functions on generic composites (src/backend/utils/adt/rowtypes.c).
You might or might not feel like wiring in more assumptions than those
have about the possible contents of the record type.
        regards, tom lane

Re: Designing an extension for feature-space similarity search

From

Jay Levitt

Date:

16 February 2012, 18:56:27

Tom Lane wrote:
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but
> the system will only consider them for exact type matches, so you might
> need more explicit casts than you ordinarily would.

Turns out it's even smarter than that; it seems to coerce when it's unambiguous:

create domain birthdate as date;
create function date_dist(birthdate, birthdate) returns integer as $$     select 123;
$$ language sql;
create operator <-> (procedure = date_dist,leftarg = birthdate,rightarg = birthdate);

select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date <-> '2012-01-01'::date ;
-- 123


create domain activity_date as date;
create function date_dist(activity_date, activity_date)
returns integer as $$  select 432;
$$ language sql;
create operator <-> (procedure = date_dist,leftarg = activity_date,rightarg = activity_date);

select '2012-01-01'::activity_date <-> '2012-01-01'::activity_date;
-- 432

select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date <-> '2012-01-01'::date ;
-- ERROR:  operator is not unique: date <-> date

Re: Designing an extension for feature-space similarity search

From

Tom Lane

Date:

16 February 2012, 19:42:34

Jay Levitt <jay.levitt@gmail.com> writes:
> Tom Lane wrote:
>>> - Can domains have operators, or are operators defined on types?
>> 
>> I think the current state of play is that you can have such things but
>> the system will only consider them for exact type matches, so you might
>> need more explicit casts than you ordinarily would.

> Turns out it's even smarter than that; it seems to coerce when it's unambiguous:

Yeah, that sort of case will probably work all right.  The thing to keep
in mind is that operators/functions declared to take domains are
definitely second-class citizens, and will lose out in many
somewhat-ambiguous cases where an operator on a base type could get
selected due to the ambiguity resolution rules.  For your application it
will probably help if you can pick an operator name that's not in use
for any operation on the base type(s).  Also, it's conceivable that it
won't matter much to you, since I gather these operators will mostly get
invoked "behind the scenes" and not directly written in queries; it
should not be too hard to avoid ambiguity in mechanically-generated
references.

(There was a good deal of chatter some years ago about trying to improve
support of functions taking domains, but nothing's gotten done yet.)
        regards, tom lane

Re: Designing an extension for feature-space similarity search

From

Jay Levitt

Date:

17 February 2012, 15:00:46

Tom Lane wrote:
> Jay Levitt<jay.levitt@gmail.com>  writes:
>> - Does KNN-GiST run into problems when<->  returns values that don't "make
>> sense" in the physical world?
>
> If the indexed entities are records, it would be
> entirely your own business how you handled individual fields being NULL.

This turns out to be a bit challenging. Let's say I'm building a 
nullable_point type that allows the Y axis to be NULL (or any sentinel value 
for "missing data"), where the semantics are "NULL is infinitely far from 
the query".   I'll need my GiST functions to return useful results with NULL 
- not just correct results, but results that help partition the tree nicely.

At first I thought this posed a challenge for union; if I have these points:

(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates like 
LL = -infinity, UR = infinity, or (equivalently, I think) to store a 
saw_nulls bit in addition to LL and UR.

The real challenge is probably in picksplit and penalty - where in the tree 
should I stick (1,NULL)? - at which point you say "Yes, algorithms for 
efficient indexes are hard work and computer-science-y" and point me at 
surrogate splitters.

Just thinking out loud, I guess; if other GiST types have addressed this 
problem, I'd love to hear about it.

Jay

Re: Designing an extension for feature-space similarity search

From

Alexander Korotkov

Date:

17 February 2012, 15:14:07

On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay.levitt@gmail.com> wrote:

Tom Lane wrote:
Jay Levitt<jay.levitt@gmail.com> writes:
- Does KNN-GiST run into problems when<-> returns values that don't "make

sense" in the physical world?

If the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.

This turns out to be a bit challenging. Let's say I'm building a nullable_point type that allows the Y axis to be NULL (or any sentinel value for "missing data"), where the semantics are "NULL is infinitely far from the query". I'll need my GiST functions to return useful results with NULL - not just correct results, but results that help partition the tree nicely.

At first I thought this posed a challenge for union; if I have these points:

(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates like LL = -infinity, UR = infinity, or (equivalently, I think) to store a saw_nulls bit in addition to LL and UR.

The real challenge is probably in picksplit and penalty - where in the tree should I stick (1,NULL)? - at which point you say "Yes, algorithms for efficient indexes are hard work and computer-science-y" and point me at surrogate splitters.

Just thinking out loud, I guess; if other GiST types have addressed this problem, I'd love to hear about it.

Similar problem appears at GiST indexing of ranges, because range can be empty. There additional "contain empty" flag was introduced. This "contain empty" flag indicates that underlying value can be empty. So, this flag is set when union with empty range or other range with this flag set. It's likely you need similar flag for each dimension.

------
With best regards,
Alexander Korotkov.

Re: Designing an extension for feature-space similarity search

From

Jay Levitt

Date:

17 February 2012, 15:32:32

Alexander Korotkov wrote:
> On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay.levitt@gmail.com
> <mailto:jay.levitt@gmail.com>> wrote:
>    At first I thought this posed a challenge for union; if I have these 
points:
>
>     (1,2)
>     (2,1)
>     (1,NULL)
>
>     what's the union? I think the answer is to treat NULL box coordinates
>     like LL = -infinity, UR = infinity, or (equivalently, I think) to store
>     a saw_nulls bit in addition to LL and UR.
>
> Similar problem appears at GiST indexing of ranges, because range can be
> empty. There additional "contain empty" flag was introduced. This "contain
> empty" flag indicates that underlying value can be empty. So, this flag is
> set when union with empty range or other range with this flag set. It's
> likely you need similar flag for each dimension.

Ah, yes, exactly the same problem. So what led you to add a flag instead of 
using the range NULL..NULL? I'm on the fence about choosing.

Jay

Re: Designing an extension for feature-space similarity search

From

Alexander Korotkov

Date:

17 February 2012, 15:40:50

On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt <jay.levitt@gmail.com> wrote:

Alexander Korotkov wrote:
On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay.levitt@gmail.com
<mailto:jay.levitt@gmail.com>> wrote:

At first I thought this posed a challenge for union; if I have these points:

(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates
like LL = -infinity, UR = infinity, or (equivalently, I think) to store
a saw_nulls bit in addition to LL and UR.

Similar problem appears at GiST indexing of ranges, because range can be
empty. There additional "contain empty" flag was introduced. This "contain
empty" flag indicates that underlying value can be empty. So, this flag is
set when union with empty range or other range with this flag set. It's
likely you need similar flag for each dimension.

Ah, yes, exactly the same problem. So what led you to add a flag instead of using the range NULL..NULL? I'm on the fence about choosing.

At first, range bounds can't be NULL :) At second, if we have range (a;b)+"contain empty" in internal page, both facts:

1) All normal underlying ranges are contained in (a;b).

2) There can be empty underlying ranges.

are useful for search.

------
With best regards,
Alexander Korotkov.

Re: Designing an extension for feature-space similarity search

From

Jay Levitt

Date:

17 February 2012, 16:47:52

Alexander Korotkov wrote:
> On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt <jay.levitt@gmail.com

>     Ah, yes, exactly the same problem. So what led you to add a flag instead
>     of using the range NULL..NULL? I'm on the fence about choosing.
>
>
> At first, range bounds can't be NULL :) At second, if we have range
> (a;b)+"contain empty" in internal page, both facts:
> 1) All normal underlying ranges are contained in (a;b).
> 2) There can be empty underlying ranges.
> are useful for search.

That makes sense; you're essentially keeping one bit of stats about the 
values present in the range.

I wonder: if I'm indexing a rowtype, then for each column in the row I need 
to store a lower-left and an upper-right bound, plus a might-have-nulls 
flag.  Sounds a lot like a range. Should I just use ranges for that? See a 
downside (overhead)? See an upside (seems less duplicative somehow)? I'm 
fine depending on 9.2.

Jay