Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers
From | Ilia Evdokimov |
---|---|
Subject | Re: Use merge-based matching for MCVs in eqjoinsel |
Date | |
Msg-id | c3dbf2ab-d72d-4033-822a-60ad8023f499@tantorlabs.com Whole thread Raw |
In response to | Re: Use merge-based matching for MCVs in eqjoinsel (Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>) |
List | pgsql-hackers |
Hi hackers, On 10.09.2025 16:56, Ilia Evdokimov wrote: > Unfortunately, the JOB benchmark does not contain semi join nodes. > However, TPC-DS does. I'll look for the queries with slowest planner > times there and check them. > > I'll need some time to check both join and semi join cases with small > and large default_statistics_target. I'll share the results later. JOIN ============================== I’ve benchmarked the new implementation of eqjoinsel() with different values of default_statistics_target. On small targets (1, 5, 10, 25, 50, 75, 100) the results are all within statistical noise, and I did not observe any regressions. In my view, it’s reasonable to keep the current condition that the hash table is not used for default_statistics_target = 1. Raising that threshold does not seem useful. Here are the results for JOB queries (where the effect of semi join is not visible due to different data distributions): default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms) ------------------------------------------------------------------------------------------ 1 | 1.00 | 1846.643 | 1847.409 5 | 1.00 | 1836.391 | 1828.318 10 | 0.95 | 1841.750 | 1929.722 25 | 0.99 | 1873.172 | 1890.741 50 | 0.98 | 1869.897 | 1898.470 75 | 1.02 | 1969.368 | 1929.521 100 | 0.97 | 1857.890 | 1921.207 1000 | 1.14 | 2279.700 | 1997.102 2500 | 1.78 | 4682.658 | 2636.202 5000 | 6.45 | 15943.696 | 2471.242 7500 | 12.45 | 34350.855 | 2758.565 10000 | 20.52 | 62519.342 | 3046.819 SEMI JOIN ============================== Unfortunately, in TPC-DS it is not possible to clearly see improvements for semi joins. To address this, I designed a synthetic example where the data distribution forces the loop to run fully, without exiting early, which makes the effect on semi joins more visible. In this setup, I also ensured that the length of the MCV array is equal to the chosen default_statistics_target. CREATE TABLE t1 AS SELECT CASE WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 1 ELSE (g % 1000000) + 10000 END AS id FROM generate_series(1, 3000000) g; CREATE TABLE t2 AS SELECT CASE WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 10001 ELSE (g % 1000000) + 20000 END AS id FROM generate_series(1, 3000000) g; ANALYZE t1, t2; The results of the query are: SELECT * FROM t1 WHERE id IN (SELECT id FROM t2); default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms) ------------------------------------------------------------------------------------------ 1 | 1.12 | 1.191 | 1.062 5 | 1.02 | 0.493 | 0.481 10 | 0.92 | 0.431 | 0.471 25 | 1.27 | 0.393 | 0.309 50 | 1.04 | 0.432 | 0.416 75 | 0.96 | 0.398 | 0.415 100 | 0.95 | 0.450 | 0.473 1000 | 9.42 | 6.742 | 0.716 2500 | 19.15 | 21.621 | 1.129 5000 | 46.74 | 85.667 | 1.833 7500 | 73.26 | 194.806 | 2.659 10000 | 107.95 | 349.981 | 3.242 -- Best regards, Ilia Evdokimov, Tantor Labs LLC, https://tantorlabs.com
pgsql-hackers by date: