Re: [HACKERS] [PROPOSAL] Temporal query processing with range types - Mailing list pgsql-hackers
From | Peter Moser |
---|---|
Subject | Re: [HACKERS] [PROPOSAL] Temporal query processing with range types |
Date | |
Msg-id | CAHO0eLZ--LPCqaMpH1f6eKQZWNoU-BjYQ1p-Kj7VHJ=01vpsCQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [PROPOSAL] Temporal query processing with range types (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
|
List | pgsql-hackers |
On Thu, 2017-11-30 at 12:26 -0500, Robert Haas wrote: > I wonder if we could instead think about R NORMALIZE S ON R.x = S.y > WITH (R.time, S.time) as an ordinary join on R.x = S.y with some > extra processing afterwards. After finding all the join partners for > a row in R, extract all the lower and upper bounds from the S.time > fields of the join partners and use that to emit as many rows from R > as may be necessary. We have now a new approach to plan and execute NORMALIZE as a special join node of type NORMALIZE, an append-plan on the inner join path, and a merge-join executor. For the latter, we would need to extend nodeMergejoin.c with an point-in-range-containment join. That is, we create a new join path within sort_inner_and_outer (joinpath.c). First two projection nodes to extract the start- and end-timepoints of the range type used as interval, and above an append-plan to merge both subplans. In detail, outer_path is just sort, whereas inner_path is append of (B, ts) projection with (B, te) projection. Hereby, B is a set of non-temporal attributes used in join equality predicates, and [ts,te) forms the valid-time interval. Non-equality predicates must be handled separately as a filter step. Do you think, it is a good idea to extend the sort_inner_and_outer() with a new branch, where jointype == NORMALIZE and add the projection and append sub-paths there? > The main problem that I see with that approach is that it > seems desirable to separate the extra-processing-afterwards step into > a separate executor node, and there's no way for a separate type of > plan node to know when the join advances the outer side of the join. > That's the purpose that row_id() is serving as you have it; we'd have > to invent some other kind of signalling mechanism, which might not be > entirely trivial. :-( One advantage of the new approach is that row_id() is no longer needed. The executor knows that it is inside a new "group" after every NEXTOUTER, then the whole loop of the inner relation has the same row_id. The whole mechanism could be handled through the MergeJoinState struct. > If we could do it, though, it might be quite a bit more efficient, > because it would avoid scanning S twice and performing a UNION on the > results of those scans. Also, we wouldn't necessarily need to sort > the whole set of rows from R, which I suspect is unavoidable in your > implementation. We'd just need to sort the individual groups of rows > from S, and my guess is that in many practical cases those groups are > fairly small. We would need a special SEQSCAN node/executor, that does the projection and append steps in a single read. Do you have any suggestions how to implement this? > I wonder what identities hold for NORMALIZE. It does not seem to be > commutative, but it looks like it might be associative... i.e. (R > NORMALIZE S) NORMALIZE T produces the same output as R NORMALIZE (S > NORMALIZE T), perhaps? It is not commutative and also not associative. Assume relations that contain one tuple each as follows: R={[1, 100)}, S={[50, 100)}, and T={[10, 20)} (R NORMALIZE S) NORMALIZE T gives {[1, 10), [10, 20), [20,50), [50, 100)} while R NORMALIZE (S NORMALIZE T) gives {[1, 50), [50, 100)} Best regards, Anton, Johann, Michael, Peter
pgsql-hackers by date: