Thread: efficient way to do "fuzzy" join
Hey dear List,
I'm looking for some advice about the best way to perform a "fuzzy" join, that is, joining two tables based on approximate matching. It is about temporal matching:
given a table A with rows containing data and a control_time (for instance 1; 5; 6; ... sec, not necessarily rounded or evenly spaced),
and given another table B with rows at no precise timing (e.g. control_time = 2.3; 5.8; 6.2),
how do I join every row of B to A based on min(@(A.control_time-B.control_time))
(that is, for every row of B, get the row of A that is temporally the closest),
in an efficient way?
(To be explicit, 2.3 would match to 1, 5.8 to 6, and 6.2 to 6.)
Optionally, how do I get interpolation efficiently (meaning one has to get the previous time and the next time for 1st-order interpolation, 2 before and 2 after for 2nd-order interpolation, and so on)?
(To be explicit, 5.8 would match to 5 and 6, the weights being 0.2 and 0.8 respectively.)
Currently my data is spatial, so I use a PostGIS function to interpolate a point on a line, but it is far from efficient or general, and I don't have control over the interpolation (only the spatial values are interpolated).
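[Editorial sketch, not part of the original thread: on PostgreSQL 9.3+ the nearest-row join described above can also be written with a LATERAL subquery. The table/column names follow the example; note that ORDER BY abs(...) cannot use a plain btree index, so this states the intent more than a fast plan.]

```sql
-- For each row of B, fetch the single closest row of A by |time difference|.
SELECT b.control_time AS b_time, nearest.control_time AS a_time
FROM b
CROSS JOIN LATERAL (
  SELECT a.control_time
  FROM a
  ORDER BY abs(a.control_time - b.control_time)
  LIMIT 1
) AS nearest;
```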
Rémi-C
On 4/11/2014 7:50 AM, Rémi Cura wrote: [original message quoted, snipped]

Have you seen the range type?
http://www.postgresql.org/docs/9.3/static/rangetypes.html
Not fuzzy, but it is indexable.
-Andy
Hey,
thanks for your answer. I think you are right: a range type with an index could at least provide fast matching,
thus avoiding the numrow(A) * numrow(B) complexity.
Though I don't see how to use it to interpolate for more than 1st order.
Cheers,
Rémi-C
On 4/11/2014 10:57 AM, Rémi Cura wrote: [earlier messages quoted, snipped]

Hum..
Would you like to set an upper bound on the number of seconds the join would match? Maybe range types could give you an indexed upper bound ("match within +/- 2 seconds only"), then use another match to find the actual min. (I do something like this in PostGIS: use the bounding box to do a quick index lookup, then ST_Distance to find the nearest.)

I can see two rows in A matching the same row in B. Would that be OK?

TableA
------
1
5
6

TableB
------
0.9
1.1
6.6
7.7

How should those two tables join?

-Andy
On 4/11/2014 7:50 AM, Rémi Cura wrote: [original message quoted, snipped]

Ok, here is a just-SQL way. No ranges. No idea if it's right. A first pass, so to speak.

create table a(t float, data text);
create table b(t float, data text);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

select a.t, b.t
from (
  select t, least( least(t, mint), least(t, maxt)) as t2 from (
    select t,
      (select t from a where a.t >= b.t order by a.t limit 1) as mint,
      (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
    from b
  ) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)

The middle part is the magic:

select t,
  (select t from a where a.t >= b.t order by a.t limit 1) as mint,
  (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

The rest is just to make it usable. If t is indexed, it'll probably be fast too.

-Andy
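[Editorial note on the query above: since maxt < b.t <= mint, the expression `least(least(t, mint), least(t, maxt))` always reduces to `maxt`, i.e. it always picks the lower neighbour, so 5.8 would match 5 rather than 6. An untested sketch that picks the genuinely closer of the two candidates, using the same tables:]

```sql
-- For each b.t, pick whichever bracketing A time is closer;
-- fall back to the one that exists when the other is NULL.
select t as b_t,
       case
         when maxt is null then mint
         when mint is null then maxt
         when (mint - t) < (t - maxt) then mint
         else maxt
       end as nearest_a
from (
  select t,
    (select t from a where a.t >= b.t order by a.t limit 1) as mint,
    (select t from a where a.t <  b.t order by a.t desc limit 1) as maxt
  from b
) as tmp;
```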
On 4/11/2014 12:16 PM, Andy Colson wrote: [previous message quoted, snipped]

Here is a guess with ranges:

select a.t,
  (select t from b where b.t <@ numrange(a.t-2, a.t+2, '[]')
   order by abs(a.t-b.t) limit 1)
from a

It returns:

t          t
---------- ----------
1          2.3
5          5.8
6          5.8

which is different than the previous SQL, but it's not wrong: 6 is the same distance from 5.8 and 6.2, so both are correct choices.

I had to change my tables (or type-cast a lot):

create table a(t numeric);
create table b(t numeric);
insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

-Andy
Wow, many thanks!
I had thought about the ORDER BY and LIMIT, because it is the natural way to express the problem.
Here is the test env with realistic numbers; your solution is very fast, I have to raise my hat (4 sec!).
-------------------------------------------------------
--useful function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
RETURNS text AS $$
SELECT array_to_string(
ARRAY(
SELECT
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
FROM generate_series(1,$1)
)
,'')
$$ LANGUAGE sql;
--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
create table a(t float, data text);
create table b(t float, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);
--filling tables
WITH the_serie AS (
SELECT s+random()/2 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into a SELECT s, data
FROM the_serie;
WITH the_serie AS (
SELECT s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into b SELECT s, data
FROM the_serie;
ANALYZE a;
ANALYZE b;
--computing result
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
select a.t As a_t, b.t as b_t
from (
select t, least( least(t, mint), least(t, maxt)) as t2 from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t)
--------------------------------------------------------
(Please note that this random-string function is NOT the good way to do it;
I should draw a random int and use it as an index into an array containing all the letters.)

Thanks a lot for this new version!
It seems to be slower than your first solution (no index use, I guess; I gave up after 5 minutes vs 5 sec for the previous one). Moreover, I can't make assumptions about a fixed interval (2 sec in your example). But I think I see where you are going.

After some tests, the fastest is using BETWEEN and a range.
(It is way faster than using <@, strangely.)
Here is the code:
-------------------------------------------------------
--useful function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
RETURNS text AS $$
SELECT array_to_string(
ARRAY(
SELECT
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM (random()*36)::int + 1 FOR 1)
FROM generate_series(1,$1)
)
,'')
$$ LANGUAGE sql;
--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
create table a(gid int, t numeric, r numrange, data text);
create table b(gid int, t numeric, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);
--CREATE INDEX ON a USING spgist (r);
CREATE INDEX ON a (r);
--filling tables
WITH the_serie AS (
SELECT s AS gid, s+random()/2-0.5 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into a (gid, t,r, data) SELECT gid, s, numrange((lag(s,1) over(order by gid ASC))::numeric ,s::numeric) , data
FROM the_serie;
WITH the_serie AS (
SELECT s as gid, s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
FROM generate_series(1,100000) AS s
)
insert into b (gid, t, data) SELECT gid,s, data
FROM the_serie;
ANALYZE a;
ANALYZE b;
--computing join with range
--slow : 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM b LEFT JOIN a ON (b.t <@ a.r)
ORDER BY gid ASC
LIMIT 30;
--slow: 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM a,b
WHERE b.t <@ a.r;
--fast : 3sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.* , a.data as d2
FROM a,b
WHERE b.t BETWEEN lower(a.r) AND upper(a.r);
--fast : 8 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
select a.t As a_t, b.t as b_t
from (
select t, least( least(t, mint), least(t, maxt)) as t2 from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t);
-------------------------------------------------------
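[Editorial sketch, my reading rather than anything confirmed in the thread: a possible reason the `<@` variants were slow is that the script creates a plain btree index on a(r), and btree cannot answer range containment; a GiST index can.]

```sql
-- Swap the btree index on the range column for a GiST one so that
-- containment operators (<@, @>, &&) become indexable.
CREATE INDEX a_r_gist ON a USING gist (r);
ANALYZE a;
-- After this, WHERE b.t <@ a.r can use an index scan on a.
```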
Rémi-C
2014-04-11 20:18 GMT+02:00 Rémi Cura <remi.cura@gmail.com>: [earlier message quoted, snipped]

And it's OK if 2 rows from B share the same joined row from A, because when interpolating the results will be different.
(Theoretically, for each row of B, one could compute the distance to every other row of A! But I had discarded that for fear of such bad complexity.)
On 04/12/2014 06:29 AM, Rémi Cura wrote: [previous message quoted, snipped]

Ah, sorry about that. I got pulled away to work on work stuff. I was trying to figure out how to use an index on the range query, but I'm not sure, without adding a new column, if it would even work. I've never had the need for ranges yet; this is the first time I've gotten to play with them.

I would not have thought about BETWEEN like that, good call. I'd have never guessed it would be so fast. If you can't use the fixed interval, then ranges are out.

I was thinking this could be improved:

select t,
  (select t from a where a.t >= b.t order by a.t limit 1) as mint,
  (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

It does two selects into a to find the nearest. Given this:

create table a(t float);
insert into a values (1), (5), (6);

could you write a single query to find the number nearest 3.5? If so we might cut the work by 50%.

-Andy

PS: This list prefers you don't top post.
2014-04-12 15:04 GMT+02:00 Andy Colson <andy@squeakycode.net>: [previous message quoted, snipped]
Hey,
the best I can come up with using your original idea is:
------------------------------------------------------------------
--fast-ish: 10 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT lower_b_a.gid AS gid_b, lower_b_a.t AS t_b --, lower_b_a.data AS data_b
  , lower_b_a.gid_l_b AS gid_a_lower, a1.t AS t_a_lower --, a1.data AS data_a_lower
  , lower_b_a.gid_l_b - 1 AS gid_a_upper, a2.t AS t_a_upper --, a2.data AS data_a_upper
FROM (
  SELECT b.gid, b.t
    , (SELECT gid FROM a WHERE a.t >= b.t ORDER BY a.t ASC LIMIT 1) AS gid_l_b
  FROM b) AS lower_b_a
LEFT OUTER JOIN a AS a1 ON (a1.gid = lower_b_a.gid_l_b)
LEFT OUTER JOIN a AS a2 ON (a2.gid = lower_b_a.gid_l_b - 1);
-------------------------------------------------------------------
As you suggested, it doesn't read the A table twice, but only once (to find the closest lower value). The closest upper value is found by knowing it is in the row next to the closest lower value.
Yet it is still slower :-/
The way to go seems to be the numrange.
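[Editorial sketch, an alternative not tried in the thread and untested: merge A and B, sort once by t, and let window functions carry the bracketing A times to each B row, avoiding the per-row subselect entirely.]

```sql
-- One sort instead of one subselect per B row: tag each time with its
-- source table, then pull the nearest A time on each side of every B row.
SELECT t AS b_t, prev_a, next_a
FROM (
  SELECT t, src,
         max(CASE WHEN src = 'a' THEN t END)
           OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS prev_a,
         min(CASE WHEN src = 'a' THEN t END)
           OVER (ORDER BY t ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS next_a
  FROM (
    SELECT t, 'a' AS src FROM a
    UNION ALL
    SELECT t, 'b' AS src FROM b
  ) AS u
) AS w
WHERE src = 'b';
```

prev_a and next_a then feed a 1st-order interpolation directly.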
Thanks a lot for the help in this optimization !
Cheers,
Rémi-C
A little related bonus, for computing interpolation weights when doing the time-join:
-------------------------------------------------------------------------------------------
DROP FUNCTION IF EXISTS range_interpolate(nr anyrange, obs anyelement);
CREATE OR REPLACE FUNCTION range_interpolate(nr anyrange, obs anyelement)
RETURNS TABLE(lower_weight NUMERIC, upper_weight NUMERIC)
AS $$
--@param a range
--@param an observation (value) of the same type as the range
--@return the weights to apply to the lower and upper bounds of the range to get the value
--exception: -inf or +inf : that bound gets weight 0, the other 1
--exception: range is a point : returns weight 0.5 for each bound (they are identical, but the associated data may not be)
SELECT
  CASE WHEN upper(nr) = lower(nr) THEN 0.5
    WHEN lower_inf(nr) OR lower(nr) IS NULL THEN 0
    WHEN upper_inf(nr) OR upper(nr) IS NULL THEN 1
    ELSE (upper(nr) - obs) / (upper(nr) - lower(nr))
  END AS lower_weight,
  CASE WHEN upper(nr) = lower(nr) THEN 0.5
    WHEN lower_inf(nr) OR lower(nr) IS NULL THEN 1
    WHEN upper_inf(nr) OR upper(nr) IS NULL THEN 0
    ELSE (obs - lower(nr)) / (upper(nr) - lower(nr))
  END AS upper_weight
--testing :
--SELECT * FROM range_interpolate(numrange(1,10) , round(10,2))
$$
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT;
--------------------------------------------------------------
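[Editorial usage sketch, my assumption rather than thread content: combining the BETWEEN join from earlier with range_interpolate to linearly interpolate a hypothetical numeric column `v`; `v` does not exist in the test tables above and stands in for whatever per-row value A carries. Recall that a's range r on row gid spans [t(gid-1), t(gid)], so the lower bound's value lives on row gid-1.]

```sql
-- Hypothetical a(gid, t, r, v numeric): for each b.t inside a range,
-- blend the bracketing values with the returned weights.
SELECT b.t,
       w.lower_weight * a_prev.v + w.upper_weight * a_cur.v AS v_interp
FROM b
JOIN a AS a_cur  ON b.t BETWEEN lower(a_cur.r) AND upper(a_cur.r)
JOIN a AS a_prev ON a_prev.gid = a_cur.gid - 1   -- row holding the lower bound's value
CROSS JOIN LATERAL range_interpolate(a_cur.r, b.t) AS w;
```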
Rémi-C