Re: Add a greedy join search algorithm to handle large join problems - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Add a greedy join search algorithm to handle large join problems |
| Date | |
| Msg-id | be284a7c-0200-446b-b690-2d65495710e3@vondra.me |
| In response to | Re: Add a greedy join search algorithm to handle large join problems (Chengpeng Yan <chengpeng_yan@Outlook.com>) |
| List | pgsql-hackers |
Hi,

On 12/16/25 15:38, Chengpeng Yan wrote:
> Recently, I have been testing the TPC-H SF=1 dataset using four simple
> greedy join-ordering strategies: join cardinality (estimated output rows),
> selectivity, estimated result size in bytes, and cheapest total path cost.
> These can be roughly seen as either output-oriented heuristics (rows /
> selectivity / result size), which try to optimize the shape of intermediate
> results, or a cost-oriented heuristic, which prefers the locally cheapest
> join step.
>
> The main goal of these experiments is to check whether the current greedy
> rules show obvious structural weaknesses, and to use the observed behavior
> as input for thinking about how a greedy rule might evolve. While there is
> unlikely to be a perfect greedy strategy, I am hoping to identify approaches
> that behave reasonably well across many cases and avoid clear pathological
> behavior.
>
> In the attached files, v3-0001 is identical to the previously submitted
> v2-0001 patch and contains the core implementation of the GOO algorithm.
> The v3-0002 patch adds testing-only code to evaluate different greedy rules,
> including a GUC (goo_greedy_strategy) used only for switching strategies
> during experiments.
>
> All tests were performed on the TPC-H SF=1 dataset. After loading the data,
> I ran the following commands before executing the benchmarks:
>
> ```
> VACUUM FREEZE ANALYZE;
> CHECKPOINT;
>
> ALTER SYSTEM SET join_collapse_limit = 100;
> ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
> ALTER SYSTEM SET statement_timeout = 600000;
> ALTER SYSTEM SET shared_buffers = '4GB';
> ALTER SYSTEM SET effective_cache_size = '8GB';
> ALTER SYSTEM SET work_mem = '1GB';
> SELECT pg_reload_conf();
> ```
>

You realize this does not actually change the shared buffers size, right?
Because that requires a restart, not just pg_reload_conf(). Are you sure
you're running with 4GB shared buffers?
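For reference, a quick way to verify what the running server actually uses is
pg_settings, which also flags parameters whose new value is still waiting for
a restart, for example:

```
-- shared_buffers (and related settings) as the server currently sees them;
-- pending_restart = true means ALTER SYSTEM updated the config file, but the
-- running value will only change after a restart.
SELECT name, setting, unit, pending_restart
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem');
```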
> The detailed benchmark results are summarized in tpch.pdf. Execution times
> are reported as ratios, using the DP-based optimizer's execution time as the
> baseline (1.0).
>
> The compressed archive tpch_tests_result.zip contains summary.csv, which is
> the raw data used to generate tpch.pdf and was produced by the run_job.sh
> script. It also includes files (xxx_plan.txt), which were generated by the
> run_analysis.sh script and record the EXPLAIN ANALYZE outputs for the same
> query under different join-ordering algorithms, to make plan differences
> easier to compare.
>

I'm not sure SF=1 is sufficient for making any clear conclusions. No matter
what the shared_buffers value is, it's going to fit into RAM, so the runs are
going to be fully cached.

> Based on the TPC-H results, my high-level observations are:
>
> * The three output-oriented greedy rules (rows, selectivity, result size)
>   show noticeable regressions compared to DP overall, with a relatively
>   large number of outliers.
> * Using total path cost as the greedy key produces results that are
>   generally closer to DP, but still shows some clear outliers.
>
> To understand why these regressions occur, I mainly looked at Q20 and Q7,
> which show particularly large and consistent regressions and expose
> different failure modes.
>
> In Q20, there is a join between partsupp and an aggregated lineitem
> subquery. For this join, the planner's rowcount estimate is wrong by orders
> of magnitude (tens of rows estimated versus hundreds of thousands actually
> produced). As a result, output-oriented greedy rules strongly prefer this
> join very early, because it appears to be extremely shrinking. In reality,
> it processes large inputs and produces a large intermediate, and this early
> misordering can significantly amplify downstream join costs. This makes Q20
> a clear outlier for output-based greedy rules when estimates are severely
> wrong.
>

I don't quite see how we could make good decisions if the estimates are this
wrong. Garbage in, garbage out. AFAICS this affects both the regular DP
planning, which heavily relies on the costing, and the heuristic approaches,
which still rely on estimates in some way. If the aggregate subquery is 1000x
off, why should any of the following decisions be right?

The approaches may have different tolerance for estimation errors - some may
be "defensive" and handle misestimates better, but the trade-off is that they
may pick slower plans when the estimates are exact. I don't think there's an
"optimal" approach picking the best plan in both cases. If there was, join
ordering wouldn't be such a challenge.

> Q7 exposes a different issue. The cost-based greedy rule tends to choose a
> locally cheap join early, but that join creates an intermediate which later
> joins become much more expensive to process. In this case, an early
> commitment under relatively weak constraints leads to a many-to-many
> intermediate that is only filtered after fact-table joins are applied. This
> illustrates how a purely cost-driven greedy rule can make locally reasonable
> decisions that turn out to be globally harmful.
>
> Taken together, these outliers suggest that all four single-metric greedy
> rules tested so far have structural limitations. Output-oriented rules
> appear fragile when join rowcount estimates are badly wrong, while
> cost-oriented greedy decisions can still lead to locally reasonable but
> globally poor plans.
>

I don't think Q20 is about greediness. Poor estimates are going to be an
issue for all approaches relying on them in some way.

The issue with Q7 seems pretty inherent to greedy algorithms. Picking
solutions that are optimal locally but not globally is the definition of
"greedy". I don't think it matters which exact metric is used, this flaw is
built-in. And I don't think it's solvable.

> One question this naturally raises is whether making irreversible greedy
> choices based only on a local ranking signal is sufficient, or whether some
> mechanism is needed to make the approach more robust and to limit the impact
> of such outliers.
>

Sufficient for what? Sufficient to replace DP or to replace GEQO?

I don't think we'd want to replace DP with such a greedy algorithm. With
accurate estimates and a modest number of joins, we should be able to find
the best join order (more or less). But this thread is not about replacing
DP - I see GOO more as a GEQO replacement, right? Or did the goal change?
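For context, the settings that currently decide when the planner abandons the
exhaustive search in favor of GEQO (and that would presumably also gate a
GOO-based replacement) are ordinary GUCs that can be checked directly, e.g.:

```
-- Defaults in current releases: the exhaustive (DP) search is used for
-- queries with fewer than geqo_threshold FROM items, GEQO at or above it;
-- join_collapse_limit and from_collapse_limit bound how large a single
-- join search problem can get.
SHOW geqo;                  -- on
SHOW geqo_threshold;        -- 12
SHOW join_collapse_limit;   -- 8
SHOW from_collapse_limit;   -- 8
```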
> As a next step, based on the current results, I plan to ignore selectivity
> (which performs poorly in many cases), treat rows as largely redundant with
> result_size, and move on to testing on the JOB benchmark. I also plan to
> compare the behavior of DP, GEQO, and GOO on JOB, and to use those results
> to better understand which signals are most useful for guiding greedy
> decisions.
>

I'm not sure ignoring selectivity entirely is a good plan, though. Yes, if
the selectivity/estimate is significantly off, the plan may be poor. But what
exactly do you plan to use instead? Aren't most of the metrics (output size,
...) derived from the selectivity/estimate anyway?

I agree the JOB benchmark may be a better fit for this. TPC-H is pretty
simple, and even for TPC-DS the number of joins is not all that high. Sorry
if my initial feedback suggested these are the proper benchmarks to evaluate
this.

I suggest the evaluation should focus on cases where we expect GOO to
actually be used in practice - as a replacement for GEQO. Only when it proves
useful/acceptable for that use case (for sufficiently many joins etc.) can we
start comparing with DP to figure out if we need to adjust the thresholds and
so on.

Another suggestion is to also test with larger data sets. The problems with
SF=10 or SF=100 may be very different, even with TPC-H. Also, consider
testing the "cold" case, as if nothing is cached, by restarting the Postgres
instance and dropping the page cache between runs.

> I would be very interested in hearing people's thoughts on these
> observations and on possible directions to explore next.
>

I realized the paper mentioned at the beginning of this thread is from 1998.
That doesn't make it wrong, but I was wondering if there are newer papers
about join order search with interesting alternative/better approaches. An
interesting paper I found is this CIDR21 paper:

Simplicity Done Right for Join Ordering
Axel Hertzschuch, Claudio Hartmann, Dirk Habich, Wolfgang Lehner
https://vldb.org/cidrdb/2021/simplicity-done-right-for-join-ordering.html
https://vldb.org/cidrdb/papers/2021/cidr2021_paper01.pdf

It does something similar to our approach, although it seems to be more a
replacement for the DP and not just for GEQO. It's meant to be cheaper, but
also more resilient to poor join orders, as it cares about an "upper bound"
(~worst case) for the join orders.

It goes beyond the scope of GOO, as the paper also talks about sampling to
determine some of the estimates. But maybe it'd be useful (better than GEQO)
even without that?

I find the focus on the "worst case" (and only trusting baserel estimates)
interesting. I wonder if it might help with the common nestloop problem,
where we opt to believe a very optimistic low cardinality estimate, only to
end up with a sequence of nestloops that never complete.


regards

--
Tomas Vondra