Re: efficient way to do "fuzzy" join - Mailing list pgsql-general

From Rémi Cura
Subject Re: efficient way to do "fuzzy" join
Date
Msg-id CAJvUf_s-N7eDJ9mSNUdkmNv4rMJ7iz38sa5Xi_=vfXCNR9ftbQ@mail.gmail.com
In response to Re: efficient way to do "fuzzy" join  (Andy Colson <andy@squeakycode.net>)
Responses Re: efficient way to do "fuzzy" join  (Rémi Cura <remi.cura@gmail.com>)
List pgsql-general



2014-04-12 15:04 GMT+02:00 Andy Colson <andy@squeakycode.net>:
On 04/12/2014 06:29 AM, Rémi Cura wrote:
(please note that this random string function is NOT the right way to
do it; I should generate random ints and use them as indexes into an
array containing all the letters)
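
For illustration, that array-index approach could look something like
this (a sketch only; the alphabet and string length are made up):

------------------------------------------------------------------
-- Sketch: draw random integers and use them as indexes into an
-- array of letters (alphabet and length here are assumptions).
SELECT string_agg(
           (ARRAY['a','b','c','d','e'])[1 + floor(random() * 5)::int],
           '')
FROM generate_series(1, 10);  -- one 10-character random string
------------------------------------------------------------------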

Thanks a lot for this new version! It seems to be slower than your
first solution (no index use, I guess; I gave up after 5 minutes vs 5
sec for the previous one). Moreover, I can't make any assumption about
a fixed interval (2 sec in your example). But I think I see where you
are going.


After some tests, the fastest is using BETWEEN and ranges (strangely,
it is much faster than using <@).
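
For reference, the two predicate shapes being compared look roughly
like this (a self-contained example with placeholder values, not the
real query):

------------------------------------------------------------------
-- Range containment vs. plain comparison; in my tests the BETWEEN
-- form was much faster. The literals here are placeholders.
SELECT 3.5::numeric <@ numrange(1, 5) AS with_containment,
       3.5::numeric BETWEEN 1 AND 5   AS with_between;
------------------------------------------------------------------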

Here is the code:

Ah, sorry about that.  I got pulled away by work stuff.  I was trying to figure out how to use an index on the range query, but I'm not sure it would even work without adding a new column.

I've never had the need for ranges yet, this is the first time I've gotten to play with them.

I would not have thought about BETWEEN like that, good call.  I'd have never guessed it would be so fast.


If you can't use the fixed interval, then ranges are out.

I was thinking this could be improved:


select t,
    (select t from a where a.t >= b.t order by a.t limit 1) as mint,
    (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b;

It does two subselects against a to find the nearest values.  Given this:

create table a(t float);


insert into a values (1), (5), (6);

Could you write a single query to find the number nearest 3.5?  If so, we might cut the work by 50%.
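
One possible shape of such a query (a sketch only; it makes a single
pass ordered by absolute distance, though as written it cannot use a
plain btree index on t):

------------------------------------------------------------------
-- Find the single a.t nearest to 3.5 with one scan.
SELECT t
FROM a
ORDER BY abs(t - 3.5)
LIMIT 1;
-- For the sample data (1, 5, 6) this returns 5, since |5 - 3.5| = 1.5
-- is the smallest distance.
------------------------------------------------------------------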

-Andy

PS: This list prefers you don't top post.

Hey,
the best I can come up with using your original idea is:
------------------------------------------------------------------
-- fast-ish: 10 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT lower_b_a.gid AS gid_b, lower_b_a.t AS t_b --, lower_b_a.data AS data_b
    , lower_b_a.gid_l_b AS gid_a_lower, a1.t AS t_a_lower --, a1.data AS data_a_lower
    , lower_b_a.gid_l_b - 1 AS gid_a_upper, a2.t AS t_a_upper --, a2.data AS data_a_upper
FROM (
        -- one subselect per row of b: gid of the first a.t >= b.t
        SELECT b.gid, b.t
            , (SELECT gid FROM a WHERE a.t >= b.t ORDER BY a.t ASC LIMIT 1) AS gid_l_b
        FROM b
    ) AS lower_b_a
    LEFT OUTER JOIN a AS a1 ON (a1.gid = lower_b_a.gid_l_b)
    -- the other bound sits in the adjacent row, one gid away
    LEFT OUTER JOIN a AS a2 ON (a2.gid = lower_b_a.gid_l_b - 1);
------------------------------------------------------------------

As you suggested, it doesn't read the table twice, but only once (to find the closest lower value). The closest upper value is found by knowing it is in the row adjacent (by gid) to the closest lower value.

Yet it is still slower :-/
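
For comparison, the same adjacent-row idea can be written in a single
pass with a window function (a sketch against the assumed schema
a(gid, t), b(gid, t); not benchmarked):

------------------------------------------------------------------
-- Sketch: pair each a.t with the next a.t via lead(), then join b
-- into the resulting intervals. Assumes a(gid, t) and b(gid, t).
SELECT b.gid, b.t, bounds.t_lower, bounds.t_upper
FROM b
JOIN (SELECT t AS t_lower,
             lead(t) OVER (ORDER BY t) AS t_upper
      FROM a) AS bounds
  ON b.t BETWEEN bounds.t_lower AND bounds.t_upper;
------------------------------------------------------------------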

The way to go seems to be the numrange.
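
For concreteness, one shape that numrange approach could take (a
sketch only; table and column names are invented, and it assumes a.t
is numeric):

------------------------------------------------------------------
-- Sketch: precompute consecutive intervals of a as numranges, then
-- join b with BETWEEN on their bounds (the fast form in these tests).
-- The last row of a yields an unbounded range and never matches.
DROP TABLE IF EXISTS a_ranges;
CREATE TABLE a_ranges AS
SELECT gid, numrange(t, lead(t) OVER (ORDER BY t)) AS rng
FROM a;

SELECT b.gid AS gid_b, a_ranges.gid AS gid_a
FROM b
JOIN a_ranges
  ON b.t BETWEEN lower(a_ranges.rng) AND upper(a_ranges.rng);
------------------------------------------------------------------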

Thanks a lot for the help with this optimization!

Cheers,

Rémi-C
