Re: efficient way to do "fuzzy" join - Mailing list pgsql-general
From | Andy Colson |
---|---|
Subject | Re: efficient way to do "fuzzy" join |
Date | |
Msg-id | 5348140C.4030409@squeakycode.net Whole thread Raw |
In response to | Re: efficient way to do "fuzzy" join (Rémi Cura <remi.cura@gmail.com>) |
List | pgsql-general |
> 2014-04-11 17:09 GMT+02:00 Andy Colson <andy@squeakycode.net > <mailto:andy@squeakycode.net>>: > > On 4/11/2014 7:50 AM, Rémi Cura wrote: > > Hey dear List, > > I'm looking for some advice about the best way to perform a "fuzzy" > join, that is joining two table based on approximate matching. > > It is about temporal matching > given a table A with rows containing data and a control_time (for > instance 1 ; 5; 6; .. sec, not necessarly rounded of evenly-spaced) > > given another table B with lines on no precise timing (eg > control_time = > 2.3 ; 5.8 ; 6.2 for example) > > How to join every row of B to A based on > min(@(A.control_time-B.__control_time)) > (that is, for every row of B, get the row of A that is temporaly the > closest), > in an efficient way? > (to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6) > > Optionnaly, how to get interpolation efficiently (meaning one > has to get > the previous time and next time for 1 st order interpolation, 2 > before > and 2 after for 2nd order interpolation, and so on)? > (to be explicit 5.8 would match to 5 and 6, the weight being 0.2 > and 0.8 > respectively) > > > Currently my data is spatial so I use Postgis function to > interpolate a > point on a line, but is is far from efficient or general, and I > don't > have control on interpolation (only the spatial values are > interpolated). > > > Cheers, > Rémi-C > > > > Have you seen the range type? > > http://www.postgresql.org/__docs/9.3/static/rangetypes.__html > <http://www.postgresql.org/docs/9.3/static/rangetypes.html> > > Not fuzzy, but is indexable. > > -Andy > > On 4/11/2014 10:57 AM, Rémi Cura wrote:> Hey, > thanks for your answer. > > I think you are right, range type with index could at least provide a > fast matching, > thus avoiding the numrow(A) * numrow(B) complexity . > > Though I don't see how to use it to interpolate for more than 1st order. > > Cheers, > Rémi-C > > Hum.. Would you like to set an upper bound on the number of seconds the join would match? Maybe range types could give you an indexed upper bound ("match within +/- 2 seconds only"), then use another match to find the actual min. (I do something like this in PostGis, use bounding box to do quick index lookup, then st_distance to find the nearest) I can see two row's in A matching the same row in B. Would that be ok? TableA ------ 1 5 6 TableB ------ 0.9 1.1 6.6 7.7 How should those two tables join? -Andy
pgsql-general by date: