Re: efficient way to do "fuzzy" join - Mailing list pgsql-general
From | Andy Colson |
---|---|
Subject | Re: efficient way to do "fuzzy" join |
Date | |
Msg-id | 5348307F.1080908@squeakycode.net |
In response to | Re: efficient way to do "fuzzy" join (Andy Colson <andy@squeakycode.net>) |
List | pgsql-general |
On 4/11/2014 12:16 PM, Andy Colson wrote:
> On 4/11/2014 7:50 AM, Rémi Cura wrote:
>> Hey dear List,
>>
>> I'm looking for some advice about the best way to perform a "fuzzy"
>> join, that is, joining two tables based on approximate matching.
>>
>> It is about temporal matching:
>> given a table A with rows containing data and a control_time (for
>> instance 1; 5; 6; ... sec, not necessarily rounded or evenly spaced),
>>
>> and given another table B with rows at no precise timing (e.g.
>> control_time = 2.3; 5.8; 6.2),
>>
>> how do I join every row of B to A based on
>> min(@(A.control_time - B.control_time))
>> (that is, for every row of B, get the row of A that is temporally the
>> closest), in an efficient way?
>> (To be explicit, 2.3 would match to 1, 5.8 to 6, and 6.2 to 6.)
>>
>> Optionally, how do I get interpolation efficiently (meaning one has to
>> get the previous time and the next time for 1st-order interpolation,
>> 2 before and 2 after for 2nd-order interpolation, and so on)?
>> (To be explicit, 5.8 would match to 5 and 6, the weights being 0.2 and
>> 0.8 respectively.)
>>
>>
>> Currently my data is spatial, so I use a PostGIS function to interpolate
>> a point on a line, but it is far from efficient or general, and I don't
>> have control over the interpolation (only the spatial values are
>> interpolated).
>>
>>
>> Cheers,
>> Rémi-C
>
>
> Ok, here is a plain-SQL way. No ranges. No idea if it's right. A first
> pass, so to speak.
>
>
> create table a(t float, data text);
> create table b(t float, data text);
>
> insert into a values (1), (5), (6);
> insert into b values (2.3), (5.8), (6.2);
>
>
> select a.t, b.t
> from (
>   select t, least(least(t, mint), least(t, maxt)) as t2 from (
>     select t,
>       (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>       (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
>     from b
>   ) as tmp
> ) as tmp2
> inner join a on (tmp2.t2 = a.t)
> inner join b on (tmp2.t = b.t);
>
>
> The middle part is the magic:
>
> select t,
>   (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>   (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
> from b
>
> The rest is just to make it usable. If t is indexed, it'll probably be
> fast too.
>
> -Andy
>

Here is a guess with ranges:

select a.t,
  (select t from b
   where b.t <@ numrange(a.t - 2, a.t + 2, '[]')
   order by abs(a.t - b.t)
   limit 1)
from a;

It returns:

 t          t
 ---------- ----------
 1          2.3
 5          5.8
 6          5.8

which is different from the previous SQL, but it's not wrong: 6 is the same
distance from 5.8 and 6.2, so either is a correct choice.

I had to change my tables (or type-cast a lot):

create table a(t numeric);
create table b(t numeric);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

-Andy
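A non-authoritative sketch building on the last message, assuming PostgreSQL
9.3 or later (for LATERAL), the numeric tables above, and illustrative aliases
cand/nearest/lo/hi: the first query picks, for every row of b, the nearest a.t
by ordering the single candidate on each side by absolute distance, so an
index on a(t) keeps both inner probes cheap; the second query sketches the
first-order interpolation Rémi asked about, returning the bracketing times and
their linear weights.

-- Sketch: for every row of b, take the single nearest a.t on each
-- side and keep whichever is closer (ordering by absolute distance).
select b.t as b_t, nearest.t as a_t
from b
cross join lateral (
    select cand.t
    from (
        (select t from a where a.t >= b.t order by a.t asc  limit 1)
        union all
        (select t from a where a.t <  b.t order by a.t desc limit 1)
    ) as cand
    order by abs(cand.t - b.t)
    limit 1
) as nearest;

-- Sketch: bracketing rows of a plus linear weights for 1st-order
-- interpolation.
select b.t as b_t,
       lo.t as a_prev,
       hi.t as a_next,
       (hi.t - b.t) / nullif(hi.t - lo.t, 0) as weight_prev,
       (b.t - lo.t) / nullif(hi.t - lo.t, 0) as weight_next
from b
cross join lateral
    (select t from a where a.t <= b.t order by a.t desc limit 1) as lo
cross join lateral
    (select t from a where a.t >  b.t order by a.t asc  limit 1) as hi;

For 5.8 the second query yields a_prev = 5 and a_next = 6 with weights 0.2 and
0.8, matching the example in the original question. Rows of b outside the range
of a are dropped by the cross joins; a left join lateral ... on true would keep
them with nulls instead.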