Thread: efficient way to do "fuzzy" join

efficient way to do "fuzzy" join

From
Rémi Cura
Date:
Hey dear List,

I'm looking for some advice about the best way to perform a "fuzzy" join, that is, joining two tables based on approximate matching.

It is about temporal matching:
given a table A with rows containing data and a control_time (for instance 1; 5; 6; ... sec, not necessarily rounded or evenly spaced),

and another table B whose rows have no precise timing (e.g. control_time = 2.3; 5.8; 6.2).

How can I join every row of B to A based on min(@(A.control_time - B.control_time))
(that is, for every row of B, get the row of A that is temporally the closest),
in an efficient way?
(To be explicit, 2.3 would match to 1, 5.8 to 6, and 6.2 to 6.)

Optionally, how can interpolation be done efficiently (meaning one has to get the previous time and the next time for 1st-order interpolation, 2 before and 2 after for 2nd-order interpolation, and so on)?
(To be explicit, 5.8 would match to 5 and 6, the weights being 0.2 and 0.8 respectively.)
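
For a concrete check of those 1st-order weights, here is a minimal sketch (using the bracketing times 5 and 6 from the example above):

-- weight of the previous time (5) and of the next time (6) for the observation 5.8
SELECT (6 - 5.8) / (6 - 5) AS weight_prev,   -- 0.2
       (5.8 - 5) / (6 - 5) AS weight_next;   -- 0.8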


Currently my data is spatial, so I use a PostGIS function to interpolate a point on a line, but it is far from efficient or general, and I have no control over the interpolation (only the spatial values are interpolated).


Cheers,
Rémi-C

Re: efficient way to do "fuzzy" join

From
Andy Colson
Date:


Have you seen the range type?

http://www.postgresql.org/docs/9.3/static/rangetypes.html

Not fuzzy, but is indexable.
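
For instance, something like this (a sketch only, with made-up column and index names): give each row of A a range covering the gap up to its control_time, and index that range with GiST so containment queries can use the index.

-- hypothetical: each row of A carries a numrange from the previous control_time to its own
ALTER TABLE a ADD COLUMN valid numrange;
CREATE INDEX a_valid_gist ON a USING gist (valid);

-- which interval of A does a given B time fall into?
SELECT * FROM a WHERE valid @> 5.8::numeric;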

-Andy


Re: efficient way to do "fuzzy" join

From
Rémi Cura
Date:
Hey,
thanks for your answer.

I think you are right: a range type with an index could at least provide fast matching,
thus avoiding the numrow(A) * numrow(B) complexity.

Though I don't see how to use it to interpolate beyond 1st order.

Cheers,
Rémi-C



Re: efficient way to do "fuzzy" join

From
Andy Colson
Date:


Hum..  Would you like to set an upper bound on the number of seconds the
join would match?  Maybe range types could give you an indexed upper
bound ("match within +/- 2 seconds only"), then use another match to
find the actual min.  (I do something like this in PostGIS: use the bounding
box for a quick index lookup, then st_distance to find the nearest.)
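
Something along these lines, maybe (untested; the 2-second window and the tables a(t) / b(t) are just assumptions): the indexed BETWEEN prunes the candidates, and the ORDER BY picks the true nearest among them.

select b.t,
       (select a.t
        from a
        where a.t between b.t - 2 and b.t + 2
        order by abs(a.t - b.t)
        limit 1) as nearest_a_t
from b;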

I can see two rows in B matching the same row in A.  Would that be ok?

TableA
------
1
5
6

TableB
------
0.9
1.1
6.6
7.7

How should those two tables join?

-Andy


Re: efficient way to do "fuzzy" join

From
Andy Colson
Date:


Ok, here is a just-SQL way.  No ranges.  No idea if it's right.  A first
pass, so to speak.



create table a(t float, data text);
create table b(t float, data text);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);


select a.t, b.t
from (
   select t, least( least(t, mint), least(t, maxt)) as t2 from (
     select t,
      (select t from a where a.t >= b.t order by a.t limit 1) as mint,
      (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
   from b
   ) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)




The middle part is the magic:

select t,
  (select t from a where a.t >= b.t order by a.t limit 1) as mint,
  (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

The rest is just to make it usable.  If t is indexed, it'll probably be
fast too.
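
That is, plain btree indexes on the time columns would be enough, e.g.:

create index on a (t);
create index on b (t);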

-Andy




Re: efficient way to do "fuzzy" join

From
Andy Colson
Date:

Here is a guess with ranges:

select a.t,
       (select t from b
        where b.t <@ numrange(a.t-2, a.t+2, '[]')
        order by abs(a.t-b.t) limit 1)
from a;


It returns:
          t           t
----------  ----------
          1         2.3
          5         5.8
          6         5.8


which is different from the previous SQL, but it's not wrong: 6 is the same
distance from 5.8 as from 6.2, so either is a correct choice.

I had to change my tables (or type cast a lot):

create table a(t numeric);
create table b(t numeric);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);


-Andy



Re: efficient way to do "fuzzy" join

From
Rémi Cura
Date:
Wow, many thanks!

I had thought about the ORDER BY and LIMIT because it is the natural way to express the problem,
but I had discarded it for fear of such bad complexity
(theoretically, for each row of B, compute the distance to every row of A!).

And it's okay if 2 rows from B share the same joined row of A, because the interpolation will still be different.

Here is a test environment with realistic numbers; your solution is very fast, I have to raise my hat (4 sec!)
-------------------------------------------------------

--useful function to fill with random text
    CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
    RETURNS text AS $$
    SELECT array_to_string(
    ARRAY(
    SELECT
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
    FROM generate_series(1,$1)
    )
    ,'')
    $$ LANGUAGE sql;


--creating tables
    DROP TABLE IF EXISTS a;
    DROP TABLE IF EXISTS b;
    create table a(t float, data text);
    create table b(t float, data text);
    CREATE INDEX ON a (t);
    CREATE INDEX ON b (t);


--filling tables
    WITH the_serie AS (
        SELECT  s+random()/2 AS s, rc_random_string(100) aS data
        FROM generate_series(1,100000) AS s
       
    )
    insert into a SELECT s, data
    FROM the_serie;

    WITH the_serie AS (
        SELECT  s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
        FROM generate_series(1,100000) AS s
       
    )
    insert into b SELECT s, data
    FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing result
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
    select a.t As a_t, b.t as b_t
    from (
      select t, least( least(t, mint), least(t, maxt)) as t2 from (
        select t,
         (select t from a where a.t >= b.t order by a.t limit 1) as mint,
         (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
      from b
      ) as tmp
    ) as tmp2
    inner join a on (tmp2.t2 = a.t)
    inner join b on (tmp2.t = b.t)
--------------------------------------------------------






Re: efficient way to do "fuzzy" join

From
Rémi Cura
Date:
(Please note that this random-string function is NOT the right way to do it;
I should generate a random int and use it as an index into an array containing all the letters.)

Thanks a lot for this new version!
It seems to be slower than your first solution (no index use, I guess; I gave up after 5 minutes vs 5 sec for the previous one). Moreover, I can't make any assumption about a fixed interval (2 sec in your example). But I think I see where you are going.


After some tests, the fastest is using BETWEEN on the range bounds
(it is way faster than using <@, strangely).

Here is the code:
-------------------------------------------------------

--useful function to fill with random text
    CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
    RETURNS text AS $$
    SELECT array_to_string(
    ARRAY(
    SELECT
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
    FROM generate_series(1,$1)
    )
    ,'')
    $$ LANGUAGE sql;


--creating tables
    DROP TABLE IF EXISTS a;
    DROP TABLE IF EXISTS b;

    create table a(gid int, t numeric, r numrange, data text);
    create table b(gid int, t numeric, data text);
    CREATE INDEX ON a (t);
    CREATE INDEX ON b (t);
    --CREATE INDEX ON a USING spgist (r);
    CREATE INDEX ON a (r);

--filling tables
    WITH the_serie AS (
        SELECT s AS gid,  s+random()/2-0.5 AS s, rc_random_string(100) aS data
        FROM generate_series(1,100000) AS s
      
    )
    insert into a (gid, t,r, data) SELECT gid, s, numrange((lag(s,1) over(order by gid ASC))::numeric  ,s::numeric) , data
    FROM the_serie;
   

    WITH the_serie AS (
        SELECT  s as gid, s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
        FROM generate_series(1,100000) AS s
    )
    insert into b (gid, t, data) SELECT gid,s, data
    FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing the join with ranges

    --slow: 80 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        SELECT b.*
        FROM b LEFT JOIN a ON (b.t <@ a.r)
        ORDER BY b.gid ASC
        LIMIT 30;

    --slow: 80 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        SELECT b.*
        FROM a, b
        WHERE b.t <@ a.r;

    --fast: 3 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        SELECT b.*, a.data AS d2
        FROM a, b
        WHERE b.t BETWEEN lower(a.r) AND upper(a.r);

    --fast: 8 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        select a.t AS a_t, b.t AS b_t
        from (
          select t, least( least(t, mint), least(t, maxt)) as t2 from (
            select t,
             (select t from a where a.t >= b.t order by a.t limit 1) as mint,
             (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
          from b
          ) as tmp
        ) as tmp2
        inner join a on (tmp2.t2 = a.t)
        inner join b on (tmp2.t = b.t);
-------------------------------------------------------
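
(A possible reason <@ is slow here, offered as an untested aside: the script only builds a btree index on r, and range containment (<@ / @>) is indexable with GiST or SP-GiST but not with btree. Something like the following might let the <@ variants use an index scan:)

    CREATE INDEX ON a USING gist (r);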

Thanks again,
Rémi-C


2014-04-11 20:18 GMT+02:00 Rémi Cura <remi.cura@gmail.com>:
Wow many thanks!

I had thought about the order by and limit because it is the natural way to express the problem,
but I had discarded it for fear of suchbad complexity
(theoretically, for each row of B , compute the distance to every other row of A!)
.

And it's okay if 2 row from B share the same join to row from A, because when interpolating it will be different.

Here is the test env with realistic number, your solution is very fast, I have to raise my hat (4 sec!)
-------------------------------------------------------

--usefull function to fill with random text
    CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
    RETURNS text AS $$
    SELECT array_to_string(
    ARRAY(
    SELECT
    substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
    FROM generate_series(1,$1)
    )
    ,'')
    $$ LANGUAGE sql;


--creating tables
    DROP TABLE IF EXISTS a;
    DROP TABLE IF EXISTS b;

    create table a(t float, data text);
    create table b(t float, data text);
    CREATE INDEX ON a (t);
    CREATE INDEX ON b (t);


--filling tables
    WITH the_serie AS (
        SELECT  s+random()/2 AS s, rc_random_string(100) aS data
        FROM generate_series(1,100000) AS s
       
    )
    insert into a SELECT s, data
    FROM the_serie;

    WITH the_serie AS (
        SELECT  s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
        FROM generate_series(1,100000) AS s
       
    )
    insert into b SELECT s, data
    FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing result
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
    select a.t As a_t, b.t as b_t

    from (
      select t, least( least(t, mint), least(t, maxt)) as t2 from (
        select t,
         (select t from a where a.t >= b.t order by a.t limit 1) as mint,
         (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
      from b
      ) as tmp
    ) as tmp2
    inner join a on (tmp2.t2 = a.t)
    inner join b on (tmp2.t = b.t)
--------------------------------------------------------



2014-04-11 19:16 GMT+02:00 Andy Colson <andy@squeakycode.net>:

On 4/11/2014 7:50 AM, Rémi Cura wrote:
Hey dear List,

I'm looking for some advice about the best way to perform a "fuzzy"
join, that is joining two table based on approximate matching.

It is about temporal matching
given a table A with rows containing data and a control_time (for
instance 1 ; 5; 6;  .. sec, not necessarly rounded of evenly-spaced)

given another table B with lines on no precise timing (eg control_time =
2.3 ; 5.8 ; 6.2 for example)

How to join every row of B to A based on
min(@(A.control_time-B.control_time))
(that is, for every row of B, get the row of A that is temporaly the
closest),
in an efficient way?
(to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)

Optionnaly, how to get interpolation efficiently (meaning one has to get
the previous time and next time for 1 st order interpolation, 2 before
and 2 after for 2nd order interpolation, and so on)?
(to be explicit 5.8 would match to 5 and 6, the weight being 0.2 and 0.8
respectively)


Currently my data is spatial so I use Postgis function to interpolate a
point on a line, but is is far from efficient or general, and I don't
have control on interpolation (only the spatial values are interpolated).


Cheers,
Rémi-C


Ok, here is a just sql way.  No ranges.  No idea if its right.  A first pass, so to speak.



create table a(t float, data text);
create table b(t float, data text);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);


select a.t, b.t
from (
  select t, least( least(t, mint), least(t, maxt)) as t2 from (
    select t,
     (select t from a where a.t >= b.t order by a.t limit 1) as mint,
     (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
  from b
  ) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)




The middle part is the magic:

select t,
 (select t from a where a.t >= b.t order by a.t limit 1) as mint,
 (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

The rest is just to make it usable.  If t is indexed, it'll probably be fast too.

-Andy




Re: efficient way to do "fuzzy" join

From
Andy Colson
Date:

Ah, sorry about that.  I got pulled away to work on work stuff.  I was trying to figure out how to use an index on the
range query, but I'm not sure it would even work without adding a new column.

I've never had the need for ranges yet; this is the first time I've gotten to play with them.

I would not have thought about BETWEEN like that, good call.  I'd never have guessed it would be so fast.


If you can't use the fixed interval, then ranges are out.

I was thinking this could be improved:

select t,
  (select t from a where a.t >= b.t order by a.t limit 1) as mint,
  (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

It does two selects into a to find the nearest.  Given this:

create table a(t float);

insert into a values (1), (5), (6);

could you write a single query to find the number nearest 3.5?  If so we might cut the work by 50%.
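
For instance, something along these lines might do it (a sketch, not benchmarked; note it is still two index probes under the hood, just merged by a sort of at most two rows):

select t
from (
    (select t from a where t >= 3.5 order by t asc  limit 1)
    union all
    (select t from a where t <  3.5 order by t desc limit 1)
) as candidates
order by abs(t - 3.5)
limit 1;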

-Andy

PS: This list prefers you don't top post.


Re: efficient way to do "fuzzy" join

From
Rémi Cura
Date:




Hey,
the best I can come up with using your original idea is:
------------------------------------------------------------------
--fast-ish: 10 sec
DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
    SELECT lower_b_a.gid AS gid_b, lower_b_a.t AS t_b --, lower_b_a.data AS data_b
        , lower_b_a.gid_l_b AS gid_a_lower , a1.t AS t_a_lower --, a1.data AS data_a_lower
        , lower_b_a.gid_l_b - 1 AS gid_a_upper , a2.t AS t_a_upper --, a2.data AS data_a_upper
    FROM (
            SELECT b.gid, b.t
                , (SELECT gid FROM a WHERE a.t >= b.t ORDER BY a.t ASC LIMIT 1) AS gid_l_b
            FROM b) AS lower_b_a
        LEFT OUTER JOIN a AS a1 ON (a1.gid = gid_l_b)
        LEFT OUTER JOIN a AS a2 ON (a2.gid = gid_l_b - 1);
-------------------------------------------------------------------

As you suggested, it doesn't read table a twice, but only once (to find the closest lower value); the closest upper value is found by knowing it is in the row next to the closest lower value.

Yet it is still slower :-/

The way to go seems to be the numrange.

Thanks a lot for the help with this optimization!

Cheers,

Rémi-C

Re: efficient way to do "fuzzy" join

From
Rémi Cura
Date:
A little related bonus:

when doing the time join,
the next step is to interpolate to get a more accurate estimate:

-------------------------------------------------------------------------------------------
DROP FUNCTION IF EXISTS range_interpolate(nr anyrange, obs anyelement);
    CREATE OR REPLACE FUNCTION range_interpolate(nr anyrange, obs anyelement)
        RETURNS TABLE(lower_weight NUMERIC, upper_weight NUMERIC)
    AS $$
        --@param nr : a range
        --@param obs : an observation (value) of the same type as the range's subtype
        --@return the weights to apply to the lower and upper bounds of the range to reconstruct the value.

        --exception : a -inf or +inf bound gets weight 0, the other bound weight 1.
        --exception : range reduced to a point : returns weight 0.5 for each bound (the bounds are identical but the associated data may not be)
        SELECT
            CASE WHEN upper(nr) = lower(nr) THEN 0.5
                 WHEN lower_inf(nr) OR lower(nr) IS NULL THEN 0
                 WHEN upper_inf(nr) OR upper(nr) IS NULL THEN 1
                 ELSE (upper(nr) - obs) / (upper(nr) - lower(nr))
            END AS lower_weight,
            CASE WHEN upper(nr) = lower(nr) THEN 0.5
                 WHEN lower_inf(nr) OR lower(nr) IS NULL THEN 1
                 WHEN upper_inf(nr) OR upper(nr) IS NULL THEN 0
                 ELSE (obs - lower(nr)) / (upper(nr) - lower(nr))
            END AS upper_weight

        --testing :
        --SELECT * FROM range_interpolate(numrange(1,10), round(10,2))
    $$
    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT;
--------------------------------------------------------------
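
Used together with the time join above, the weights can blend the two bracketing rows of A. A sketch (the numeric column a.v is assumed for illustration; table a here is the (gid, t, r, data) version where r = numrange(previous t, t)):

    -- hypothetical: interpolate a value of A at each B time
    SELECT b.t,
           w.lower_weight * a_prev.v + w.upper_weight * a_next.v AS v_interp
    FROM b
    JOIN a AS a_next ON b.t <@ a_next.r
    JOIN a AS a_prev ON a_prev.gid = a_next.gid - 1
    CROSS JOIN LATERAL range_interpolate(a_next.r, b.t) AS w;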
Cheers,
Rémi-C