Re: Implement missing join selectivity estimation for range types - Mailing list pgsql-hackers

From Haibo Yan
Subject Re: Implement missing join selectivity estimation for range types
Date
Msg-id CABXr29HUT2m7YPpB4gOZHphYmwpwuEWFZ_p6oeyouhWw8Zc1Tw@mail.gmail.com
In response to Re: Implement missing join selectivity estimation for range types  (jian he <jian.universality@gmail.com>)
List pgsql-hackers
On Mon, Apr 6, 2026 at 3:12 PM jian he <jian.universality@gmail.com> wrote:
I cannot figure out why it aborts.

As Tom mentioned upthread about the test cases:
similar to the check_estimated_rows function in src/test/regress/sql/stats_ext.sql,
we can test it with something like:

create or replace function check_estimated_rows(text) returns table (ok bool)
language plpgsql as
$$
declare
    ln text;
    tmp text[];
    first_row bool := true;
begin
    for ln in
        execute format('explain analyze %s', $1)
    loop
        if first_row then
            first_row := false;
            tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
            return query select 0.2 < tmp[1]::float8 / tmp[2]::float8
                and tmp[1]::float8 / tmp[2]::float8 < 5;
        end if;
    end loop;
end;
$$;

select * from check_estimated_rows($$select * from test_range_join_1, test_range_join_2 where ir1 && ir2$$);
select * from check_estimated_rows($$select * from test_range_join_1, test_range_join_2 where ir1 << ir2$$);
select * from check_estimated_rows($$select * from test_range_join_1, test_range_join_2 where ir1 >> ir2$$);

Do you need 3 tables for the test? Because we need to actually run
the query and then compare the estimated rows
against the actually returned rows.
If you really execute a query joining 3 tables, it will take a lot of time.
So a two-table join with a WHERE qual should be fine?

/* Fast-forwards i and j to start of iteration */
+	for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+	for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);
+
+	/* Do the estimation on overlapping regions */
+	while (i < nhist1 && j < nhist2)
+	{
+		double		cur_sel1,
+					cur_sel2;
+		RangeBound	cur_sync;
+
+		if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
+			cur_sync = hist1[i++];
+		else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
+			cur_sync = hist2[j++];
+		else
+		{
+			/* If equal, skip one */
+			cur_sync = hist1[i];
+

In this part the range_cmp_bound_values() call in the "if / else if" is
computed twice; you can compute it once:
`
int cmp = range_cmp_bound_values(typcache, &hist1[i], &hist2[j]);
if (cmp < 0)
    ...
else if (cmp > 0)
    ...
else
    ...
`

Also, I think you can fold the following fast-forward loops into the main while loop:
+	for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+	for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);

Splitting range and multirange into 2 patches might be a good idea.
It seems the same function (calc_hist_join_selectivity), with the same
signature, appears in both src/backend/utils/adt/multirangetypes_selfuncs.c
and src/backend/utils/adt/rangetypes_selfuncs.c;
the complaint about this from a previous mail was not resolved.



Hi all,
I'd like to revive the discussion on improving selectivity estimation for range/range joins.
Attached is the v5 patch which teaches rangejoinsel to use range bound histograms for estimating the <<, >>, and && operators. Currently, the planner often falls back to a hardcoded default (like 0.005), which can lead to poor join ordering in complex queries.
In this version, I have intentionally excluded &< and &>.
a &< b essentially maps to upper(a) <= upper(b).
a &> b essentially maps to lower(a) >= lower(b).
Since these operators include equality (<= / >=) rather than strict inequality (< / >), their estimation is slightly more nuanced. I believe focusing on the strict inequality and overlap operators first allows us to deliver a clean, converged, and significantly beneficial improvement. We can discuss the best approach for the remaining operators once this foundation is in place.
Test Results
My local tests show that the planner now correctly identifies cases with zero or full selectivity, which were previously misestimated.
----------------------------------------------------------------------
CREATE TABLE t1 (id int, r int4range);
CREATE TABLE t2 (id int, r int4range);
INSERT INTO t1 SELECT g, int4range(g * 2 - 1, g * 2) FROM generate_series(1, 1000) g;
INSERT INTO t2 SELECT g, int4range(10000 + g * 200, 10000 + g * 200 + 100) FROM generate_series(1, 1000) g;
ANALYZE t1, t2;
----------------------------------------------------------------------

Real Selectivity:
----------------------------------------------------------------------
SELECT
    avg((t1.r << t2.r)::int) AS p_ll, -- Expected: 1.0
    avg((t1.r >> t2.r)::int) AS p_rr, -- Expected: 0.0
    avg((t1.r && t2.r)::int) AS p_ov  -- Expected: 0.0
FROM t1, t2;
-- Result: p_ll = 1.0, p_rr = 0.0, p_ov = 0.0
----------------------------------------------------------------------

Planner Improvements:
With the patch, the EXPLAIN output reflects these probabilities accurately:
For << (High Selectivity):
The planner correctly estimates rows=1000000 (1000 * 1000 * 1.0).
----------------------------------------------------------------------
Nested Loop (cost=29.50..15080.75 rows=1000000 width=61)
  Join Filter: (t1.r << t2.r)
----------------------------------------------------------------------
For && and >> (Near-Zero Selectivity):
The planner now correctly predicts rows=1 instead of using the default multiplier.
----------------------------------------------------------------------
Nested Loop (cost=0.00..15036.50 rows=1 width=36)
  Join Filter: (t1.r && t2.r)
----------------------------------------------------------------------

This improved estimation allows the optimizer to make much better decisions regarding join order and nesting when range columns are involved.
I look forward to your feedback.
Regards,
Haibo 
Attachment
