Home > mailing lists

Re: efficient way to do "fuzzy" join - Mailing list pgsql-general

From	Andy Colson
Subject	Re: efficient way to do "fuzzy" join
Date	April 11, 2014 19:11:00
Msg-id	5348140C.4030409@squeakycode.net Whole thread Raw
In response to	Re: efficient way to do "fuzzy" join (Rémi Cura <remi.cura@gmail.com>)
List	pgsql-general

Tree view

> 2014-04-11 17:09 GMT+02:00 Andy Colson <andy@squeakycode.net
> <mailto:andy@squeakycode.net>>:
>
>     On 4/11/2014 7:50 AM, Rémi Cura wrote:
>
>         Hey dear List,
>
>         I'm looking for some advice about the best way to perform a "fuzzy"
>         join, that is joining two table based on approximate matching.
>
>         It is about temporal matching
>         given a table A with rows containing data and a control_time (for
>         instance 1 ; 5; 6;  .. sec, not necessarly rounded of evenly-spaced)
>
>         given another table B with lines on no precise timing (eg
>         control_time =
>         2.3 ; 5.8 ; 6.2 for example)
>
>         How to join every row of B to A based on
>         min(@(A.control_time-B.__control_time))
>         (that is, for every row of B, get the row of A that is temporaly the
>         closest),
>         in an efficient way?
>         (to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)
>
>         Optionnaly, how to get interpolation efficiently (meaning one
>         has to get
>         the previous time and next time for 1 st order interpolation, 2
>         before
>         and 2 after for 2nd order interpolation, and so on)?
>         (to be explicit 5.8 would match to 5 and 6, the weight being 0.2
>         and 0.8
>         respectively)
>
>
>         Currently my data is spatial so I use Postgis function to
>         interpolate a
>         point on a line, but is is far from efficient or general, and I
>         don't
>         have control on interpolation (only the spatial values are
>         interpolated).
>
>
>         Cheers,
>         Rémi-C
>
>
>
>     Have you seen the range type?
>
>     http://www.postgresql.org/__docs/9.3/static/rangetypes.__html
>     <http://www.postgresql.org/docs/9.3/static/rangetypes.html>
>
>     Not fuzzy, but is indexable.
>
>     -Andy
>
>

On 4/11/2014 10:57 AM, Rémi Cura wrote:> Hey,
 > thanks for your answer.
 >
 > I think you are right, range type with index could at least provide a
 > fast matching,
 > thus avoiding the numrow(A) * numrow(B) complexity .
 >
 > Though I don't see how to use it to interpolate for more than 1st order.
 >
 > Cheers,
 > Rémi-C
 >
 >


Hum..  Would you like to set an upper bound on the number of seconds the
join would match?  Maybe range types could give you an indexed upper
bound ("match within +/- 2 seconds only"), then use another match to
find the actual min.  (I do something like this in PostGis, use bounding
box to do quick index lookup, then st_distance to find the nearest)

I can see two row's in A matching the same row in B.  Would that be ok?

TableA
------
1
5
6

TableB
------
0.9
1.1
6.6
7.7

How should those two tables join?

-Andy

pgsql-general by date:

From: Jack.O'Sullivan@tessella.com
Date: 11 April 2014, 18:59:44
Subject: Re: CLOB & BLOB limitations in PostgreSQL

From: Gaurav Jindal
Date: 11 April 2014, 19:19:02
Subject: Need query

Re: efficient way to do "fuzzy" join - Mailing list pgsql-general

Previous

Next