Re: Implement missing join selectivity estimation for range types - Mailing list pgsql-hackers
| From | Haibo Yan |
|---|---|
| Subject | Re: Implement missing join selectivity estimation for range types |
| Date | |
| Msg-id | CABXr29Gts+FuMRkeA=JNAxB700W1EQW2F=W2wAjtJrGZDGTxGw@mail.gmail.com Whole thread |
| In response to | Re: Implement missing join selectivity estimation for range types (SCHOEMANS Maxime <maxime.schoemans@ulb.be>) |
| Responses |
Re: Implement missing join selectivity estimation for range types
|
| List | pgsql-hackers |
Hi Haibo,Thank you for picking this up again. I agree with the changes you madein v5, in particular scoping the patch to the three strict operators andreworking the tests to check plan structure rather than exact row counts.Attached is v6 as a 3-patch series building on your v5.Patch 1 is your range join selectivity patch with one small change: therange_cmp_bounds result in the merge walk is stored in a local cmpvariable to avoid calling it twice per iteration, as jian he suggested.
Hi Maxime,
I think patch 1 looks good to me. I do not see major issues there. One small note: the localized bool join_is_reversed; in my version was intentional. I left it that way because get_join_variables() wants a storage location, and I preferred to keep that use local and explicit rather than trying to reshape things around it.
Patch 2 adds the same estimation for multirange types, covering all typecombinations (multirange x multirange, multirange x range, range xmultirange). Since both range and multirange types use the same boundhistogram format and the same RangeBound representation, the corealgorithm is identical.
For patch 2, I am less convinced, especially for &&.
My concern is not so much whether the code runs, but whether the semantic argument is strong enough for the mixed range/multirange cases. The patch assumes that because range and multirange use the same bounds histogram format and the same RangeBound representation, the same estimator can be applied directly. I think that argument is much easier to make for << and >> than for &&.
For single ranges, && works nicely with the usual decomposition because if two single ranges do not overlap, then one must be entirely to the left or entirely to the right of the other. But for multiranges there is a third possibility: neither side is entirely left nor entirely right, and yet they still do not overlap because of an internal gap.
A = {[1,2), [100,101)}
B = [50,60)
Here:
A << B is false
A >> B is false
A && B is also false
So for multiranges, “not left and not right” does not imply overlap in the same way it does for single ranges. That makes me worry that reusing the same estimator logic for &&, especially in mixed range/multirange cases, may overestimate overlap because the overall lower/upper bounds do not capture the internal holes.
So I think patch 2 still needs a stronger justification there, and probably more targeted tests around sparse multiranges / hole cases, especially for &&.
Patch 3 removes the duplication between rangetypes_selfuncs.c andmultirangetypes_selfuncs.c that Tom raised as a concern. It makes the10 shared helper functions non-static, exports them via selfuncs.h,and deletes the copies from the multirange file. This covers all thepre-existing duplication between the two files, not just the functionsadded in this patch set.Regards,Maxime
For patch 3, I agree with the motivation: the duplication between rangetypes_selfuncs.c and multirangetypes_selfuncs.c is not ideal. But I am not convinced that exporting those helpers via selfuncs.h is the right boundary.
My preference would be something tighter:
keep the shared helper implementations in one place
add a backend-private internal header just for the range-family selfuncs code
include that internal header from rangetypes_selfuncs.c and multirangetypes_selfuncs.c
avoid widening visibility by turning a group of file-local helpers into broader extern declarations in selfuncs.h
Thanks again for working on this.
Regards,
Haibo
pgsql-hackers by date: