Re: Add a greedy join search algorithm to handle large join problems - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Add a greedy join search algorithm to handle large join problems
Msg-id be284a7c-0200-446b-b690-2d65495710e3@vondra.me
In response to Re: Add a greedy join search algorithm to handle large join problems  (Chengpeng Yan <chengpeng_yan@Outlook.com>)
List pgsql-hackers
Hi,

On 12/16/25 15:38, Chengpeng Yan wrote:
> Recently, I have been testing the TPC-H SF=1 dataset using four simple
> greedy join-ordering strategies: join cardinality (estimated output
> rows), selectivity, estimated result size in bytes, and cheapest total
> path cost. These can be roughly seen as either output-oriented
> heuristics (rows / selectivity / result size), which try to optimize the
> shape of intermediate results, or a cost-oriented heuristic, which
> prefers the locally cheapest join step.
> 
> The main goal of these experiments is to check whether the current
> greedy rules show obvious structural weaknesses, and to use the observed
> behavior as input for thinking about how a greedy rule might evolve.
> While there is unlikely to be a perfect greedy strategy, I am hoping to
> identify approaches that behave reasonably well across many cases and
> avoid clear pathological behavior.
> 
> In the attached files, v3-0001 is identical to the previously submitted
> v2-0001 patch and contains the core implementation of the GOO algorithm.
> The v3-0002 patch adds testing-only code to evaluate different greedy
> rules, including a GUC (goo_greedy_strategy) used only for switching
> strategies during experiments.
> 
> All tests were performed on the TPC-H SF=1 dataset. After loading the
> data, I ran the following commands before executing the benchmarks:
> 
> ```
> VACUUM FREEZE ANALYZE;
> CHECKPOINT;
> 
> ALTER SYSTEM SET join_collapse_limit = 100;
> ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
> ALTER SYSTEM SET statement_timeout = 600000;
> ALTER SYSTEM SET shared_buffers = '4GB';
> ALTER SYSTEM SET effective_cache_size = '8GB';
> ALTER SYSTEM SET work_mem = '1GB';
> SELECT pg_reload_conf();
> ```
> 

You realize this does not actually change shared buffers size, right?
Because that requires a restart, not just pg_reload_conf. Are you sure
you're running with 4GB shared buffers?
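
A quick way to check is pg_settings, which flags values that only take
effect after a restart:

```
-- If pending_restart is true, the server is still running with the old
-- shared_buffers value despite ALTER SYSTEM + pg_reload_conf().
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name = 'shared_buffers';
```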

> The detailed benchmark results are summarized in tpch.pdf. Execution
> times are reported as ratios, using the DP-based optimizer’s execution
> time as the baseline (1.0).
> 
> The compressed archive tpch_tests_result.zip contains summary.csv, which
> is the raw data used to generate tpch.pdf and was produced by the
> run_job.sh script. It also includes files (xxx_plan.txt), which were
> generated by the run_analysis.sh script and record the EXPLAIN ANALYZE
> outputs for the same query under different join-ordering algorithms, to
> make plan differences easier to compare.
> 

I'm not sure SF=1 is sufficient for drawing any clear conclusions. No
matter what the shared_buffers value is, the data set is going to fit
into RAM, and the runs are going to be fully cached.
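
For reference, it's easy to check how big the loaded data set actually
is (just an illustration, the exact number depends on indexes etc.):

```
-- Rough size of the loaded SF=1 database; at this scale it easily fits
-- into RAM, so every run is effectively warm-cache.
SELECT pg_size_pretty(pg_database_size(current_database()));
```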

> Based on the TPC-H results, my high-level observations are:
> * The three output-oriented greedy rules (rows, selectivity, result size)
> show noticeable regressions compared to DP overall, with a relatively
> large number of outliers.
> * Using total path cost as the greedy key produces results that are
> generally closer to DP, but still shows some clear outliers.
> 
> To understand why these regressions occur, I mainly looked at Q20 and
> Q7, which show particularly large and consistent regressions and expose
> different failure modes.
> 
> In Q20, there is a join between partsupp and an aggregated lineitem
> subquery. For this join, the planner’s rowcount estimate is wrong by
> orders of magnitude (tens of rows estimated versus hundreds of thousands
> actually produced). As a result, output-oriented greedy rules strongly
> prefer this join very early, because it appears to be extremely
> shrinking. In reality, it processes large inputs and produces a large
> intermediate, and this early misordering can significantly amplify
> downstream join costs. This makes Q20 a clear outlier for output-based
> greedy rules when estimates are severely wrong.
> 

I don't quite see how we could make good decisions if the estimates are
this wrong. Garbage in, garbage out. AFAICS this affects both the
regular DP planning, which relies heavily on the costing, and the
heuristic approaches, which still rely on the estimates in some way.

If the aggregate subquery is 1000x off, why should any of the following
decisions be right? The approaches may have different tolerance for
estimation errors - some may be "defensive" and handle misestimates
better, but the trade-off is that they may pick slower plans when the
estimates are exact.

I don't think there's an "optimal" approach picking the best plan in
both cases. If there was, join ordering wouldn't be such a challenge.
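
Just to illustrate how to look at this particular misestimate, the
aggregated lineitem subquery from Q20 can be run on its own and the
estimated vs. actual rows of the aggregate node compared (the date range
below is only an example, the real values come from the query
parameters):

```
-- Sketch of the Q20 aggregate subquery (parameter values are examples);
-- compare the planner's row estimate for the aggregate node with the
-- actual row count reported by ANALYZE.
EXPLAIN (ANALYZE, TIMING OFF)
SELECT l_partkey, l_suppkey, 0.5 * sum(l_quantity)
FROM lineitem
WHERE l_shipdate >= date '1994-01-01'
  AND l_shipdate <  date '1995-01-01'
GROUP BY l_partkey, l_suppkey;
```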

> Q7 exposes a different issue. The cost-based greedy rule tends to choose
> a locally cheap join early, but that join creates an intermediate which
> later joins become much more expensive to process. In this case, an
> early commitment under relatively weak constraints leads to a
> many-to-many intermediate that is only filtered after fact-table joins
> are applied. This illustrates how a purely cost-driven greedy rule can
> make locally reasonable decisions that turn out to be globally harmful.
> 
> Taken together, these outliers suggest that all four single-metric
> greedy rules tested so far have structural limitations. Output-oriented
> rules appear fragile when join rowcount estimates are badly wrong, while
> cost-oriented greedy decisions can still lead to locally reasonable but
> globally poor plans.
> 

I don't think Q20 is about greediness. Poor estimates are going to be an
issue for all approaches relying on them in some way.

The issue with Q7 seems pretty inherent to greedy algorithms. Picking
solutions that are optimal locally but not globally is the definition of
"greedy". I don't think it matters which exact metric is used, this flaw
is built-in. And I don't think it's solvable.

> One question this naturally raises is whether making irreversible greedy
> choices based only on a local ranking signal is sufficient, or whether
> some mechanism is needed to make the approach more robust and to limit
> the impact of such outliers.
> 

Sufficient for what? Sufficient to replace DP or to replace GEQO?

I don't think we'd want to replace DP with such a greedy algorithm. With
accurate estimates and a modest number of joins, we should be able to
find the best join order (more or less).

But this thread is not about replacing DP; I see GOO more as a GEQO
replacement, right? Or did the goal change?


> As a next step, based on the current results, I plan to ignore
> selectivity (which performs poorly in many cases), treat rows as largely
> redundant with result_size, and move on to testing on the JOB benchmark.
> I also plan to compare the behavior of DP, GEQO, and GOO on JOB, and to
> use those results to better understand which signals are most useful for
> guiding greedy decisions.
> 

I'm not sure ignoring selectivity entirely is a good plan, though. Yes,
if the selectivity/estimate is significantly off, the plan may be poor.
But what exactly do you plan to use instead? Aren't most of the other
metrics (output size, ...) derived from the selectivity/estimates anyway?
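
For example, the row count and width reported for a join node both come
out of the same selectivity machinery, and the estimated result size is
essentially rows times width, so these greedy keys are largely the same
signal in different units:

```
-- Illustration only: the rows/width shown for the join node are derived
-- from the planner's selectivity estimates, and the estimated result
-- size in bytes is roughly rows * width.
EXPLAIN (COSTS ON)
SELECT *
FROM orders o
JOIN lineitem l ON l.l_orderkey = o.o_orderkey;
```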

I agree the JOB benchmark may be a better fit for this. TPC-H is
pretty simple, and even for TPC-DS the number of joins is not all that
high. Sorry if my initial feedback suggested these are the proper
benchmarks to evaluate this.

I suggest the evaluation should focus on cases where we expect GOO to
actually be used in practice - as a replacement for GEQO. Only once it
proves useful/acceptable for that use case (for sufficiently many joins
etc.) should we start comparing with DP to figure out if we need to
adjust the thresholds and so on.
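
For context, these are the GUCs that currently decide when the planner
falls back to GEQO (defaults shown); presumably a GOO replacement would
be gated by the same kind of threshold:

```
SHOW geqo;                  -- on
SHOW geqo_threshold;        -- 12
SHOW join_collapse_limit;   -- 8
SHOW from_collapse_limit;   -- 8
```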

Another suggestion is to also test with larger data sets. The problems
with SF=10 or SF=100 may be very different, even with TPC-H. Also,
consider testing "cold" runs, i.e. with nothing cached, by restarting
the Postgres instance and dropping the page cache between runs.

> I would be very interested in hearing people’s thoughts on these
> observations and on possible directions to explore next.
> 

I realized the paper mentioned at the beginning of this thread is from
1998. That doesn't make it wrong, but I was wondering if there are some
newer papers about join order search, with interesting
alternative/better approaches.

An interesting paper I found is this CIDR21 paper:

Simplicity Done Right for Join Ordering
Axel Hertzschuch, Claudio Hartmann, Dirk Habich, Wolfgang Lehner
https://vldb.org/cidrdb/2021/simplicity-done-right-for-join-ordering.html
https://vldb.org/cidrdb/papers/2021/cidr2021_paper01.pdf

It does something similar to our approach, although it seems to be
more a replacement for DP and not just for GEQO. It's meant to be
cheaper, but also more resilient to poor join orders, as it cares about
the "upper bound" (~worst case) for the join orders.

It goes beyond the scope of GOO, as the paper also talks about sampling
to determine some of the estimates. But maybe it'd be useful (better
than GEQO) even without that?

I find the focus on the "worst case" (and only trusting baserel
estimates) interesting. I wonder if it might help with the common
nestloop problem, where we opt to believe a very optimistic low
cardinality estimate, only to end up with a sequence of nestloops that
never complete.


regards

-- 
Tomas Vondra



