Thread: Designing an extension for feature-space similarity search
[Preamble: I've been told that the hackers list is appropriate for extension-related topics like this, even if it's not about contributing to core. If I'm misappropriating, please let me know.] Goal: Personalized, context-relevant query results We are building a deeply personalized site; think "OKCupid for product recommendations" or "Pinterest for people with your tastes". We use psych research to measure and predict your personality and traits along a number of scales (dimensions), and then we connect you with people, products and content we think you'll like. I won't go into the design history, but you can read a little here: http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/ Suffice to say, this ends up needing something like KNN-GiST cubes, only: - The overall concept is more like N-dimensional vectors than cubes - But a dimension might be in any domain, not just floats - All vectors have the same number of dimensions with the same meanings - The distance along each dimension is a domain-specific function - NULLs are allowed (the distance function will handle the semantics) - The distance between two vectors is a function that aggregates the distances of each dimension, along with arbitrary other arguments - for instances, it might take the weighted average of the dimensions That aggregation (which may not literally be an aggregate; I'm not sure yet) needs to happen in a SELECT list, which means it needs to be fast, which means all this (or at least much of it) has to be C. The "simplest thing that works" is probably to hack up the cube extension, declare that everything (except inner pages) must be a zero-volume cube (cube_is_point()), map our non-float features onto floats somehow, and hard-code all the distance functions and the aggregation function. But I think this sort of similarity-search engine has general utility, and I also want to make it easy for us to add and subtract dimensions without too much pain; that should be DDL, not code. So thinking about how this might evolve... - I'm not sure how to represent arbitrary column-like features without reinventing the wheel and putting a database in the database. hstore only stores text, probably for this reason; I took a look at the earlier json patch and saw that it handled only a few core data types. Have there been any other PoCs that involved collections of hetereogenous data? I almost want an actual instance of an "anyarray". - Alternatively, is there a way to index an entire, arbitrary row, rather than on a column on that row? I'm fine with this extension requiring its own table, so I leave the data where it is in the row, and only worry about indexing it. I can't just use functional indexes, because I'll need to provide operators and support functions to GiST. Maybe I have a fake sentinel column, where all the operators use SPI to introspect the row, treat each column as a feature dimension, call the underlying operators on each column's data type, etc. - Can domains have operators, or are operators defined on types? - Does KNN-GiST run into problems when <-> returns values that don't "make sense" in the physical world? For instance, let's say NULL <-> NULL returns a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that could even describe a triangle - but my spidey sense is tingling on this. - Are there previous discussions, patches, abandoned projects, etc. that this reminds you of and that I should go research? Thanks for any thoughts, and I'd love collaborators or even mentors - we plan to open source whatever we produce here, and I don't have quite the theoretical background it takes to do this properly. Jay Levitt
Jay Levitt <jay.levitt@gmail.com> writes: > - I'm not sure how to represent arbitrary column-like features without > reinventing the wheel and putting a database in the database. ISTM you could define a composite type and then create operators and an operator class over that type. If you were trying to make a btree opclass there might be a conflict with the built-in record_ops opclass, but since you're only interested in GIST I don't see any real roadblocks. The main potential disadvantage of this is that you'd have the standard tuple header as overhead in index entries --- but maybe the entries are large enough that that doesn't matter, and in any case you could probably make use of the GIST "compress" method to get rid of most of the header. Maybe convert to MinimalTuple, for instance, if you want to still be able to leverage existing support code for field extraction. > - Can domains have operators, or are operators defined on types? I think the current state of play is that you can have such things but the system will only consider them for exact type matches, so you might need more explicit casts than you ordinarily would. However, we only support domains over base types not composites, so this isn't really going to be a profitable direction for you anyway. > - Does KNN-GiST run into problems when <-> returns values that don't "make > sense" in the physical world? Wouldn't surprise me. In general, non-strict index operators are a bad idea. However, if the indexed entities are records, it would be entirely your own business how you handled individual fields being NULL. regards, tom lane
On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt <jay.levitt@gmail.com> wrote:
------- But a dimension might be in any domain, not just floats
- The distance along each dimension is a domain-specific function
What exact domains do you expect? Some domains could appear to be quite hard for index-based similarity search using GiST (for example, sets, strings etc.).
With best regards,
Alexander Korotkov.
Alexander Korotkov wrote: > On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt <jay.levitt@gmail.com > <mailto:jay.levitt@gmail.com>> wrote: > > - But a dimension might be in any domain, not just floats > - The distance along each dimension is a domain-specific function > > What exact domains do you expect? Some domains could appear to be quite hard > for index-based similarity search using GiST (for example, sets, strings etc.). Oh, nothing nearly so complex, and (to Tom's point) no composite types, either. Right now we have demographics like gender, geolocation, and birthdate; I think any domain will be a type that's easily expressible in linear terms. I was thinking in domains rather than types because there isn't one distance function for "date" or "float"; me.birthdate <-> you.birthdate "birthdate" is normalized to a different curve than now() <-> posting_date, and raw_score <-> raw_score would differ from z_score <-> z_score. It would have been elegant to express that distance with <->, but since domains can't have operators, I can create distance(this, other) functions for each domain. It just won't look as pretty! Jay
Tom Lane wrote: > Jay Levitt<jay.levitt@gmail.com> writes: >> - I'm not sure how to represent arbitrary column-like features without >> reinventing the wheel and putting a database in the database. > > ISTM you could define a composite type and then create operators and an > operator class over that type. If you were trying to make a btree > opclass there might be a conflict with the built-in record_ops opclass, > but since you're only interested in GIST I don't see any real > roadblocks. Perfect. Composite types are exactly what I need here; the application can declare its composite type and provide distance functions for each member, and the extension can use those to calculate similarity. How do I introspect the composite type's pg_class to see what it contains? I assume there's a better way than SPI on system catalogs :) Should I be using systable_* functions from genam, or is there an in-memory tree? I feel like funcapi gets me partway there but there's magic in the middle. Can you think of any code that would serve as a sample, maybe whatever creates the output for psql's \d? > The main potential disadvantage of this is that you'd have > the standard tuple header as overhead in index entries --- but maybe the > entries are large enough that that doesn't matter, and in any case you > could probably make use of the GIST "compress" method to get rid of most > of the header. Maybe convert to MinimalTuple, for instance, if you want > to still be able to leverage existing support code for field extraction. Probably not worth it to save the 8 bytes; we're starting out at about 20 floats per row. But good to know for later optimization... > >> - Can domains have operators, or are operators defined on types? > > I think the current state of play is that you can have such things but > the system will only consider them for exact type matches, so you might > need more explicit casts than you ordinarily would. However, we only > support domains over base types not composites, so this isn't really > going to be a profitable direction for you anyway. Actually, as mentioned to Alexander, I'm thinking of domains per feature, not for the overall tuple, so birthdate<->birthdate differs from now()<->posting_date. Sounds like that might work - I'll play. > >> - Does KNN-GiST run into problems when<-> returns values that don't "make >> sense" in the physical world? > > Wouldn't surprise me. In general, non-strict index operators are a bad > idea. However, if the indexed entities are records, it would be > entirely your own business how you handled individual fields being NULL. Yeah, that example conflated NULLs in the feature fields (we don't know your birthdate) with <-> on the whole tuple. Oops. I guess I can just test this by verifying that KNN-GiST ordered by distance returns the same results as without the index. Thanks for your help here. Jay
Jay Levitt <jay.levitt@gmail.com> writes: > Perfect. Composite types are exactly what I need here; the application can > declare its composite type and provide distance functions for each member, > and the extension can use those to calculate similarity. How do I introspect > the composite type's pg_class to see what it contains? I assume there's a > better way than SPI on system catalogs :) Definitely. Take a look at record_out, record_cmp, and sibling functions on generic composites (src/backend/utils/adt/rowtypes.c). You might or might not feel like wiring in more assumptions than those have about the possible contents of the record type. regards, tom lane
Tom Lane wrote: >> - Can domains have operators, or are operators defined on types? > > I think the current state of play is that you can have such things but > the system will only consider them for exact type matches, so you might > need more explicit casts than you ordinarily would. Turns out it's even smarter than that; it seems to coerce when it's unambiguous: create domain birthdate as date; create function date_dist(birthdate, birthdate) returns integer as $$ select 123; $$ language sql; create operator <-> (procedure = date_dist,leftarg = birthdate,rightarg = birthdate); select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate; -- 123 select '2012-01-01'::date <-> '2012-01-01'::date ; -- 123 create domain activity_date as date; create function date_dist(activity_date, activity_date) returns integer as $$ select 432; $$ language sql; create operator <-> (procedure = date_dist,leftarg = activity_date,rightarg = activity_date); select '2012-01-01'::activity_date <-> '2012-01-01'::activity_date; -- 432 select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate; -- 123 select '2012-01-01'::date <-> '2012-01-01'::date ; -- ERROR: operator is not unique: date <-> date
Jay Levitt <jay.levitt@gmail.com> writes: > Tom Lane wrote: >>> - Can domains have operators, or are operators defined on types? >> >> I think the current state of play is that you can have such things but >> the system will only consider them for exact type matches, so you might >> need more explicit casts than you ordinarily would. > Turns out it's even smarter than that; it seems to coerce when it's unambiguous: Yeah, that sort of case will probably work all right. The thing to keep in mind is that operators/functions declared to take domains are definitely second-class citizens, and will lose out in many somewhat-ambiguous cases where an operator on a base type could get selected due to the ambiguity resolution rules. For your application it will probably help if you can pick an operator name that's not in use for any operation on the base type(s). Also, it's conceivable that it won't matter much to you, since I gather these operators will mostly get invoked "behind the scenes" and not directly written in queries; it should not be too hard to avoid ambiguity in mechanically-generated references. (There was a good deal of chatter some years ago about trying to improve support of functions taking domains, but nothing's gotten done yet.) regards, tom lane
Tom Lane wrote: > Jay Levitt<jay.levitt@gmail.com> writes: >> - Does KNN-GiST run into problems when<-> returns values that don't "make >> sense" in the physical world? > > If the indexed entities are records, it would be > entirely your own business how you handled individual fields being NULL. This turns out to be a bit challenging. Let's say I'm building a nullable_point type that allows the Y axis to be NULL (or any sentinel value for "missing data"), where the semantics are "NULL is infinitely far from the query". I'll need my GiST functions to return useful results with NULL - not just correct results, but results that help partition the tree nicely. At first I thought this posed a challenge for union; if I have these points: (1,2) (2,1) (1,NULL) what's the union? I think the answer is to treat NULL box coordinates like LL = -infinity, UR = infinity, or (equivalently, I think) to store a saw_nulls bit in addition to LL and UR. The real challenge is probably in picksplit and penalty - where in the tree should I stick (1,NULL)? - at which point you say "Yes, algorithms for efficient indexes are hard work and computer-science-y" and point me at surrogate splitters. Just thinking out loud, I guess; if other GiST types have addressed this problem, I'd love to hear about it. Jay
On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay.levitt@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
Tom Lane wrote:Jay Levitt<jay.levitt@gmail.com> writes:- Does KNN-GiST run into problems when<-> returns values that don't "make
sense" in the physical world?If the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.
This turns out to be a bit challenging. Let's say I'm building a nullable_point type that allows the Y axis to be NULL (or any sentinel value for "missing data"), where the semantics are "NULL is infinitely far from the query". I'll need my GiST functions to return useful results with NULL - not just correct results, but results that help partition the tree nicely.
At first I thought this posed a challenge for union; if I have these points:
(1,2)
(2,1)
(1,NULL)
what's the union? I think the answer is to treat NULL box coordinates like LL = -infinity, UR = infinity, or (equivalently, I think) to store a saw_nulls bit in addition to LL and UR.
The real challenge is probably in picksplit and penalty - where in the tree should I stick (1,NULL)? - at which point you say "Yes, algorithms for efficient indexes are hard work and computer-science-y" and point me at surrogate splitters.
Just thinking out loud, I guess; if other GiST types have addressed this problem, I'd love to hear about it.
Similar problem appears at GiST indexing of ranges, because range can be empty. There additional "contain empty" flag was introduced. This "contain empty" flag indicates that underlying value can be empty. So, this flag is set when union with empty range or other range with this flag set. It's likely you need similar flag for each dimension.
With best regards,
Alexander Korotkov.
Alexander Korotkov wrote: > On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay.levitt@gmail.com > <mailto:jay.levitt@gmail.com>> wrote: > At first I thought this posed a challenge for union; if I have these points: > > (1,2) > (2,1) > (1,NULL) > > what's the union? I think the answer is to treat NULL box coordinates > like LL = -infinity, UR = infinity, or (equivalently, I think) to store > a saw_nulls bit in addition to LL and UR. > > Similar problem appears at GiST indexing of ranges, because range can be > empty. There additional "contain empty" flag was introduced. This "contain > empty" flag indicates that underlying value can be empty. So, this flag is > set when union with empty range or other range with this flag set. It's > likely you need similar flag for each dimension. Ah, yes, exactly the same problem. So what led you to add a flag instead of using the range NULL..NULL? I'm on the fence about choosing. Jay
On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt <jay.levitt@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
Alexander Korotkov wrote:On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay.levitt@gmail.com<mailto:jay.levitt@gmail.com>> wrote:At first I thought this posed a challenge for union; if I have these points:
(1,2)
(2,1)
(1,NULL)
what's the union? I think the answer is to treat NULL box coordinates
like LL = -infinity, UR = infinity, or (equivalently, I think) to store
a saw_nulls bit in addition to LL and UR.Similar problem appears at GiST indexing of ranges, because range can be
empty. There additional "contain empty" flag was introduced. This "contain
empty" flag indicates that underlying value can be empty. So, this flag is
set when union with empty range or other range with this flag set. It's
likely you need similar flag for each dimension.
Ah, yes, exactly the same problem. So what led you to add a flag instead of using the range NULL..NULL? I'm on the fence about choosing.
At first, range bounds can't be NULL :) At second, if we have range (a;b)+"contain empty" in internal page, both facts:
1) All normal underlying ranges are contained in (a;b).
2) There can be empty underlying ranges.
are useful for search.
With best regards,
Alexander Korotkov.
Alexander Korotkov wrote: > On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt <jay.levitt@gmail.com > Ah, yes, exactly the same problem. So what led you to add a flag instead > of using the range NULL..NULL? I'm on the fence about choosing. > > > At first, range bounds can't be NULL :) At second, if we have range > (a;b)+"contain empty" in internal page, both facts: > 1) All normal underlying ranges are contained in (a;b). > 2) There can be empty underlying ranges. > are useful for search. That makes sense; you're essentially keeping one bit of stats about the values present in the range. I wonder: if I'm indexing a rowtype, then for each column in the row I need to store a lower-left and an upper-right bound, plus a might-have-nulls flag. Sounds a lot like a range. Should I just use ranges for that? See a downside (overhead)? See an upside (seems less duplicative somehow)? I'm fine depending on 9.2. Jay