[HACKERS] ASOF join - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject [HACKERS] ASOF join
Date
Msg-id bc494762-26bd-b100-e1f9-a97901ddad57@postgrespro.ru
Whole thread Raw
Responses Re: [HACKERS] ASOF join  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
I wonder if there were some discussion/attempts to add ASOF join to Postgres  (sorry, may be there is better term for it, I am refereeing KDB definition: http://code.kx.com/wiki/Reference/aj ).
Such kind of join can be useful when we need to associate two timeseries. It is quite popular in trading:

//join the eq price and size for the op trade time
a::aj[`underlyingSym`time;select time, underlyingSym, sym, putorcall, ex, cond, price, seqnum, contracts, contractsTraded from t;eqtrades];

...and not only. Below is one example of how people now manually coding ASOF join:

select
    count(*),
    count(*)
        filter (where timedelta_prev < -30),
    count(*)
        filter (where ride_prev = ride_next),
    ... -- many different aggregates
from
    (
        select
            p.provider_id,
            p.ts,
            (
                select extract(epoch from t.ts - p.ts)
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
                order by t.ts desc
                limit 1
            ) as timedelta_prev,
            (
                select extract(epoch from t.ts - p.ts)
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
                order by t.ts
                limit 1
            ) as timedelta,
            (
                select ride_id
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
                order by t.ts desc
                limit 1
            ) as ride_prev,
            (
                select ride_id
                from providers_positions t
                where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
                order by t.ts
                limit 1
            ) as ride_next
        from (
                 select
                     provider_id,
                     ts,
                     event_name
                 from
                     lvl2_681_parsed p
             ) p
        where
            p.event_name = 'GPS signal restored'
       -- offset 0
    ) z;

Without OFFSET 0 this query generates awful execution plans with hundreds (!) of subplans  corresponding to the subqueries.
Number of subplans (most of them are the same) is equal number of occurrences of timedelta, timedelta_prev, ... columns in target aggregates.
OFFSET 0 reduce number of subplans to 4. And I expect that using LATERAL join can reduce it to two and even without "OFFSET 0" trick.
But in any case - it is very complex and unnatural way of expressing this not so complex query.
With ASOF join is can be written much simpler.

Also Postgres is implementing this query using nested loop with index scan, which is definitely not the most efficient strategy.
The best way to implement ASOF join is to use something like mergejoin. Usually there are indexes for both timeseries, so what we need is to merge two ordered sets using ASOF join rules.
It will require minimal changes in SQL syntax, just adding ASOF keyword seems to be enough:

   select  * from Trades ASOF JOIN EqTrades USING (underlyingSym,time);

It seems to me that adding ASOF joins should not require huge amount of work and can be done with minimal number of changes in executor and optimizer.
But may be there are some problems/challenges which I do not realize now?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

pgsql-hackers by date:

Previous
From: Yuan Dong
Date:
Subject: [HACKERS] GiST API Adancement
Next
From: Petr Jelinek
Date:
Subject: Re: [HACKERS] Get stuck when dropping a subscription duringsynchronizing table