Re: efficient way to do "fuzzy" join - Mailing list pgsql-general

From Rémi Cura
Subject Re: efficient way to do "fuzzy" join
Date
Msg-id CAJvUf_s-N7eDJ9mSNUdkmNv4rMJ7iz38sa5Xi_=vfXCNR9ftbQ@mail.gmail.com
In response to Re: efficient way to do "fuzzy" join  (Andy Colson <andy@squeakycode.net>)
Responses Re: efficient way to do "fuzzy" join  (Rémi Cura <remi.cura@gmail.com>)
List pgsql-general



2014-04-12 15:04 GMT+02:00 Andy Colson <andy@squeakycode.net>:
On 04/12/2014 06:29 AM, Rémi Cura wrote:
(please note that this random string function is NOT the right way to
do it; I should generate random ints and use them as indexes into an
array containing all the letters)
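
For illustration, that array-index approach could look something like
this (a sketch only; the alphabet and string length are made up):

------------------------------------------------------------------
-- Sketch: draw random integers and use them as indexes into an
-- array of letters (alphabet and length here are assumptions).
SELECT string_agg(
           (ARRAY['a','b','c','d','e'])[1 + floor(random() * 5)::int],
           '')
FROM generate_series(1, 10);  -- one 10-character random string
------------------------------------------------------------------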

Thanks a lot for this new version! It seems to be slower than your
first solution (no index use, I guess; I gave up after 5 minutes vs 5
sec for the previous one). Moreover, I can't make any assumption about
a fixed interval (2 sec in your example). But I think I see where you
are going.


After some tests, the fastest is using BETWEEN and ranges (strangely,
it is much faster than using <@).
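
For reference, the two predicate shapes being compared look roughly
like this (a self-contained example with placeholder values, not the
real query):

------------------------------------------------------------------
-- Range containment vs. plain comparison; in my tests the BETWEEN
-- form was much faster. The literals here are placeholders.
SELECT 3.5::numeric <@ numrange(1, 5) AS with_containment,
       3.5::numeric BETWEEN 1 AND 5   AS with_between;
------------------------------------------------------------------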

Here is the code:

Ah, sorry about that.  I got pulled away by work stuff.  I was trying to figure out how to use an index on the range query, but I'm not sure it would even work without adding a new column.

I've never had the need for ranges yet, this is the first time I've gotten to play with them.

I would not have thought about BETWEEN like that, good call.  I'd have never guessed it would be so fast.


If you can't use the fixed interval, then ranges are out.

I was thinking this could be improved:


select t,
    (select t from a where a.t >= b.t order by a.t limit 1) as mint,
    (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b;

It does two subselects against a to find the nearest values.  Given this:

create table a(t float);


insert into a values (1), (5), (6);

Could you write a single query to find the number nearest 3.5?  If so, we might cut the work by 50%.
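
One possible shape of such a query (a sketch only; it makes a single
pass ordered by absolute distance, though as written it cannot use a
plain btree index on t):

------------------------------------------------------------------
-- Find the single a.t nearest to 3.5 with one scan.
SELECT t
FROM a
ORDER BY abs(t - 3.5)
LIMIT 1;
-- For the sample data (1, 5, 6) this returns 5, since |5 - 3.5| = 1.5
-- is the smallest distance.
------------------------------------------------------------------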

-Andy

PS: This list prefers you don't top post.

Hey,
the best I can come up with using your original idea is:
------------------------------------------------------------------
-- fast-ish: 10 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT lower_b_a.gid AS gid_b, lower_b_a.t AS t_b --, lower_b_a.data AS data_b
    , lower_b_a.gid_l_b AS gid_a_lower, a1.t AS t_a_lower --, a1.data AS data_a_lower
    , lower_b_a.gid_l_b - 1 AS gid_a_upper, a2.t AS t_a_upper --, a2.data AS data_a_upper
FROM (
        -- one subselect per row of b: gid of the first a.t >= b.t
        SELECT b.gid, b.t
            , (SELECT gid FROM a WHERE a.t >= b.t ORDER BY a.t ASC LIMIT 1) AS gid_l_b
        FROM b
    ) AS lower_b_a
    LEFT OUTER JOIN a AS a1 ON (a1.gid = lower_b_a.gid_l_b)
    -- the other bound sits in the adjacent row, one gid away
    LEFT OUTER JOIN a AS a2 ON (a2.gid = lower_b_a.gid_l_b - 1);
------------------------------------------------------------------

As you suggested, it doesn't read the table twice, but only once (to find the closest lower value). The closest upper value is found by knowing it is in the row adjacent (by gid) to the closest lower value.

Yet it is still slower :-/
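
For comparison, the same adjacent-row idea can be written in a single
pass with a window function (a sketch against the assumed schema
a(gid, t), b(gid, t); not benchmarked):

------------------------------------------------------------------
-- Sketch: pair each a.t with the next a.t via lead(), then join b
-- into the resulting intervals. Assumes a(gid, t) and b(gid, t).
SELECT b.gid, b.t, bounds.t_lower, bounds.t_upper
FROM b
JOIN (SELECT t AS t_lower,
             lead(t) OVER (ORDER BY t) AS t_upper
      FROM a) AS bounds
  ON b.t BETWEEN bounds.t_lower AND bounds.t_upper;
------------------------------------------------------------------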

The way to go seems to be the numrange.
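
For concreteness, one shape that numrange approach could take (a
sketch only; table and column names are invented, and it assumes a.t
is numeric):

------------------------------------------------------------------
-- Sketch: precompute consecutive intervals of a as numranges, then
-- join b with BETWEEN on their bounds (the fast form in these tests).
-- The last row of a yields an unbounded range and never matches.
DROP TABLE IF EXISTS a_ranges;
CREATE TABLE a_ranges AS
SELECT gid, numrange(t, lead(t) OVER (ORDER BY t)) AS rng
FROM a;

SELECT b.gid AS gid_b, a_ranges.gid AS gid_a
FROM b
JOIN a_ranges
  ON b.t BETWEEN lower(a_ranges.rng) AND upper(a_ranges.rng);
------------------------------------------------------------------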

Thanks a lot for the help with this optimization!

Cheers,

Rémi-C
