Range Merge Join v1 - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Range Merge Join v1 |
Date | |
Msg-id | CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg@mail.gmail.com Whole thread Raw |
List | pgsql-hackers |
======== Example: ======== Find different people using the same website at the same time: create table session(sessionid text, username text, during tstzrange); SELECT s1.username, s2.username, s1.during * s2.during FROM session s1, session s2 WHERE s1.during && s2.during AND s1.username < s2.username ===================================== Brief summary of previous discussion: ===================================== http://www.postgresql.org/message-id/1334554850.10878.52.camel@jdavis - can indexes solve it already (parameterized paths, etc.)? - planner complexity (track path keys, mergejoinable equality, etc.) - spatial join algorithm choice - compare range merge join against existing indexes to see if any indexing approach is viable ========== Externals: ========== No new syntax or other externals. Just improved execution strategy for joining ranges using the overlaps operator (&&). New syntax is possible, but I don't see a strong reason to support special range join syntax at this time. ============= Optimizer Design ============= I took the path of least resistance, so if I am doing something wrong please let me know. I hardwired it to look for the range overlaps operator when checking if it could do a merge join, and then add range_ops to restrictinfo->mergeopfamilies. Then, I have to suppress adding it as an equivalence class, because overlaps is not equality. It still adds single-member ECs to restrictinfo->right_ec and left_ec, but (I think) that's OK. Also, I have to prevent generating paths for right/full join, because the range join algorithm can't (currently) handle those. ============= Costing ============= Needs more consideration. Seems to do reasonable things in the few examples I tried. ============= Executor Design ============= See detailed comments in nodeMergejoin.c ============= Performance ============= Seems much better than index nest loop join when the join is not selective. I will post detailed numbers soon. If no index is available, range merge join is the only reasonable way to execute a range join. For instance, if the inner is not a leaf in the plan. ============= Alternatives: ============= It was suggested that I approach it as a general spatial-join problem. That has merit, but I rejected it for now because the algorithm that I think has the most promise is range-oriented anyway. By that I mean we would need to extract all of the dimensions into ranges first, and then perform the join. With that in mind, it makes sense to implement range joins first; and then later provide the APIs to get at the spatial dimensions so that we can implement other spatial joins as range joins. Another suggestion was to base it on hash join, but I never understood that proposal and it didn't seem to map very well to a spatial join. Yet another suggestion was to use some kind of temporary index. Some brief experiments I did indicated that it would be fairly slow (though more investigation might be useful here). Also, it doesn't provide any alternative to the nestloop-with-inner-index we already offer at the leaf level today. Regards, Jeff Davis
Attachment
pgsql-hackers by date: