Re: efficient way to do "fuzzy" join - Mailing list pgsql-general
From | Andy Colson |
---|---|
Subject | Re: efficient way to do "fuzzy" join |
Date | |
Msg-id | 5348307F.1080908@squeakycode.net |
In response to | Re: efficient way to do "fuzzy" join (Andy Colson <andy@squeakycode.net>) |
List | pgsql-general |
On 4/11/2014 12:16 PM, Andy Colson wrote:
> On 4/11/2014 7:50 AM, Rémi Cura wrote:
>> Hey dear List,
>>
>> I'm looking for some advice about the best way to perform a "fuzzy"
>> join, that is, joining two tables based on approximate matching.
>>
>> It is about temporal matching:
>> given a table A with rows containing data and a control_time (for
>> instance 1; 5; 6; ... sec, not necessarily rounded or evenly spaced),
>>
>> and given another table B with rows at no precise timing (e.g.
>> control_time = 2.3; 5.8; 6.2),
>>
>> how do I join every row of B to A based on
>> min(@(A.control_time - B.control_time))
>> (that is, for every row of B, get the row of A that is temporally the
>> closest), in an efficient way?
>> (To be explicit, 2.3 would match to 1, 5.8 to 6, and 6.2 to 6.)
>>
>> Optionally, how do I get interpolation efficiently (meaning one has to
>> get the previous time and the next time for 1st-order interpolation,
>> 2 before and 2 after for 2nd-order interpolation, and so on)?
>> (To be explicit, 5.8 would match to 5 and 6, the weights being 0.2 and
>> 0.8 respectively.)
>>
>>
>> Currently my data is spatial, so I use a PostGIS function to interpolate
>> a point on a line, but it is far from efficient or general, and I don't
>> have control over the interpolation (only the spatial values are
>> interpolated).
>>
>>
>> Cheers,
>> Rémi-C
>
>
> Ok, here is a plain-SQL way. No ranges. No idea if it's right. A first
> pass, so to speak.
>
>
> create table a(t float, data text);
> create table b(t float, data text);
>
> insert into a values (1), (5), (6);
> insert into b values (2.3), (5.8), (6.2);
>
>
> select a.t, b.t
> from (
>   select t, least(least(t, mint), least(t, maxt)) as t2 from (
>     select t,
>       (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>       (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
>     from b
>   ) as tmp
> ) as tmp2
> inner join a on (tmp2.t2 = a.t)
> inner join b on (tmp2.t = b.t);
>
>
> The middle part is the magic:
>
> select t,
>   (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>   (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
> from b
>
> The rest is just to make it usable. If t is indexed, it'll probably be
> fast too.
>
> -Andy
>

Here is a guess with ranges:

select a.t,
  (select t from b
   where b.t <@ numrange(a.t - 2, a.t + 2, '[]')
   order by abs(a.t - b.t)
   limit 1)
from a;

It returns:

 t          t
 ---------- ----------
 1          2.3
 5          5.8
 6          5.8

which is different from the previous SQL, but it's not wrong: 6 is the same
distance from 5.8 and 6.2, so either is a correct choice.

I had to change my tables (or type-cast a lot):

create table a(t numeric);
create table b(t numeric);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

-Andy
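A non-authoritative sketch building on the last message, assuming PostgreSQL
9.3 or later (for LATERAL), the numeric tables above, and illustrative aliases
cand/nearest/lo/hi: the first query picks, for every row of b, the nearest a.t
by ordering the single candidate on each side by absolute distance, so an
index on a(t) keeps both inner probes cheap; the second query sketches the
first-order interpolation Rémi asked about, returning the bracketing times and
their linear weights.

-- Sketch: for every row of b, take the single nearest a.t on each
-- side and keep whichever is closer (ordering by absolute distance).
select b.t as b_t, nearest.t as a_t
from b
cross join lateral (
    select cand.t
    from (
        (select t from a where a.t >= b.t order by a.t asc  limit 1)
        union all
        (select t from a where a.t <  b.t order by a.t desc limit 1)
    ) as cand
    order by abs(cand.t - b.t)
    limit 1
) as nearest;

-- Sketch: bracketing rows of a plus linear weights for 1st-order
-- interpolation.
select b.t as b_t,
       lo.t as a_prev,
       hi.t as a_next,
       (hi.t - b.t) / nullif(hi.t - lo.t, 0) as weight_prev,
       (b.t - lo.t) / nullif(hi.t - lo.t, 0) as weight_next
from b
cross join lateral
    (select t from a where a.t <= b.t order by a.t desc limit 1) as lo
cross join lateral
    (select t from a where a.t >  b.t order by a.t asc  limit 1) as hi;

For 5.8 the second query yields a_prev = 5 and a_next = 6 with weights 0.2 and
0.8, matching the example in the original question. Rows of b outside the range
of a are dropped by the cross joins; a left join lateral ... on true would keep
them with nulls instead.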