Re: Add a greedy join search algorithm to handle large join problems - Mailing list pgsql-hackers
| From | Chengpeng Yan |
|---|---|
| Subject | Re: Add a greedy join search algorithm to handle large join problems |
| Date | |
| Msg-id | B38744DA-FFB0-4651-B099-28A475FC17A6@Outlook.com |
| In response to | Re: Add a greedy join search algorithm to handle large join problems (Chengpeng Yan <chengpeng_yan@Outlook.com>) |
| List | pgsql-hackers |
Hi,
Recently, I have been testing the JOB dataset using three greedy
strategies: result_size, cost, and combined. The result_size and cost
strategies are the same heuristics previously tested on TPC-H with SF =
1.
The combined strategy runs GOO twice, once greedy by cost and once by
result_size, producing two complete candidate plans, and then selects
the one with the lower final estimated total_cost. This allows GOO to
compare plans generated under different greedy criteria.
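For readers who have not followed the earlier versions, the following compact sketch illustrates the shape of the greedy loop and the combined selector. This is illustrative Python only, not the C code in the attached patch; `joinable` and the `key` callbacks are placeholders for the planner's join-legality check and its cardinality/cost estimates, and the sketch assumes a connected join graph (no cross products).

```python
# Illustrative sketch of GOO-style greedy join ordering; not the patch code.

def goo(rels, joinable, key):
    """Repeatedly merge the joinable pair of sub-plans that minimizes
    key(left, right) until a single join tree remains."""
    plans = [(r,) for r in rels]          # leaf sub-plans
    while len(plans) > 1:
        left, right = min(
            ((a, b) for i, a in enumerate(plans)
             for b in plans[i + 1:] if joinable(a, b)),
            key=lambda pair: key(*pair))
        plans.remove(left)
        plans.remove(right)
        plans.append((left, right))       # join tree as nested pairs
    return plans[0]

def goo_combined(rels, joinable, rows_key, cost_key, total_cost):
    """Run GOO once per greedy criterion and keep the candidate plan
    with the lower final estimated total_cost."""
    by_size = goo(rels, joinable, rows_key)
    by_cost = goo(rels, joinable, cost_key)
    return min((by_size, by_cost), key=total_cost)
```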
In the attached files, v4-0001 is identical to the previously submitted
v3-0001 patch and contains the core GOO implementation. All
experiment-related changes are in v4-0002, which adds simple, test-only
support for evaluating different greedy strategies (this experimental
code will be cleaned up later). The v4_tests archive contains scripts
for generating
v4_job.pdf (run_job.sh) and for collecting EXPLAIN ANALYZE results under
different greedy strategies (run_analysis.sh).
The detailed results are included in v4_job.pdf. The results report both
the measured execution times and their ratios relative to the DP
execution time, with color coding used to highlight outliers (<0.5× dark
green, 0.5–0.8× green, 1.2–2× light red, 2–10× red, ≥10× dark red).
Looking more closely at JOB, only a subset of queries reaches
PostgreSQL’s default geqo_threshold of 12 relations and therefore
exercises GEQO. This includes the 17-table queries (29a, 29b, 29c), the
14-table queries (28a, 28b, 28c, 33a, 33b, 33c), and the 12-table
queries (24a, 24b, 26a, 26b, 26c, 27a, 27b, 27c, 30a, 30b, 30c), for a
total of 20 queries.
I therefore also extracted results for this GEQO-relevant subset from
the full JOB runs. The detailed results are included in
v4_job_geqo_range.pdf, derived directly from v4_job.pdf. On this subset,
GOO variants tend to show more favorable behavior overall, particularly
in terms of tail robustness.
For reproducibility, JOB was run using the same configuration as in the
previous experiments. The following commands were executed after loading
the data and before running the benchmarks.
Settings requiring a restart (e.g., shared_buffers) were applied via a
server restart prior to benchmarking; other settings were applied using
ALTER SYSTEM followed by pg_reload_conf().
```sql
VACUUM FREEZE ANALYZE;
CHECKPOINT;
ALTER SYSTEM SET join_collapse_limit = 100;
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
ALTER SYSTEM SET statement_timeout = 600000;
ALTER SYSTEM SET shared_buffers = '4GB';
ALTER SYSTEM SET effective_cache_size = '8GB';
ALTER SYSTEM SET work_mem = '1GB';
SELECT pg_reload_conf();
```
Full JOB results (ratio = algo / DP, lower is better):
Overall summary

| algo | n | gmean | p90 | p95 | max |
|---|---|---|---|---|---|
| GOO(combined) | 113 | 0.953 | 1.94 | 3.15 | 8.68 |
| GOO(result_size) | 113 | 0.969 | 2.63 | 4.45 | 67.53 |
| GEQO | 113 | 1.066 | 1.50 | 2.06 | 16.81 |
| GOO(cost) | 113 | 1.496 | 5.80 | 16.20 | 431.32 |

Tail behavior

| algo | ≥2× | ≥5× | ≥10× |
|---|---|---|---|
| GOO(combined) | 11 | 3 | 0 |
| GOO(result_size) | 18 | 5 | 2 |
| GEQO | 9 | 2 | 2 |
| GOO(cost) | 30 | 15 | 7 |

Workload sum (total execution time, ms)

| algo | total_ms | ratio_to_dp |
|---|---|---|
| DP | 115710.85 | 1.000 |
| GOO(cost) | 798600.41 | 6.902 |
| GOO(result_size) | 139100.42 | 1.202 |
| GOO(combined) | 154027.19 | 1.331 |
| GEQO | 173464.38 | 1.499 |
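For reference, the aggregates above can be reproduced from the per-query ratios along the following lines. This is a sketch using the common nearest-rank percentile definition; the published tables come from the scripts in the v4_tests archive, which may use a slightly different convention.

```python
# Sketch: aggregate per-query ratios (ratio = algo time / DP time).
import math

def summarize(ratios):
    n = len(ratios)
    gmean = math.exp(sum(math.log(r) for r in ratios) / n)
    s = sorted(ratios)
    p90 = s[math.ceil(0.90 * n) - 1]   # nearest-rank percentile
    p95 = s[math.ceil(0.95 * n) - 1]
    tail = {t: sum(r >= t for r in ratios) for t in (2, 5, 10)}
    return gmean, p90, p95, max(ratios), tail
```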
Observations:
* GOO(cost) shows severe tail behavior (max 431×, p99 327×), with many
bad cases.
* GOO(result_size) performs reasonably on average (gmean 0.969) but has
a poor tail (max 67×).
* GOO(combined) is the most robust variant: it achieves the best gmean
(0.953), shows no regressions ≥10×, and significantly reduces the worst
case (max 8.68×), trading some average workload time for improved tail
behavior.
* GEQO is generally slower on JOB (gmean 1.066) and also exhibits ≥10×
regressions (max 16.8×).
Conclusions:
1. All non-DP algorithms have cases that outperform DP, which typically
arise in situations where the cost estimates no longer reflect the
relative execution costs of alternative plans, often due to severe
cardinality misestimation. This patch set does not attempt to fix
cardinality estimation; instead, it focuses on reducing tail risk among
greedy plans under the existing cost model.
2. Based on the current JOB results, further tuning of a single greedy
heuristic is not the immediate priority: each single-strategy variant
still shows corner cases with large regressions. The next direction is
to reduce catastrophic plans (tail risk), rather than to optimize one
specific greedy criterion.
3. GOO(combined) improves robustness by reducing extreme regressions.
Increasing plan diversity via multiple greedy criteria appears
effective: different heuristics fail for different reasons, and the
combined variant selects the cheapest plan among candidates from
different greedy strategies, helping avoid extreme outliers with minimal
additional planning overhead.
Next, I include brief notes on a small set of representative JOB queries
to illustrate where greedy join ordering behaves relatively well or
poorly. Under severe cardinality misestimation, neither DP-based nor
heuristic approaches can reliably select good plans (“garbage in,
garbage out”); different algorithms simply exhibit different tolerances
to such errors. These examples document how specific join-ordering
choices interact with misestimation and help identify where the current
GOO implementation is more or less fragile.
For regressions, I focus on GOO(result_size) (12b, 3b, 11b) and
GOO(combined) (15a, 17b). I also include a few cases where
GOO(result_size) outperforms DP (29c, 31a). The full plan-level
analyses are in the attached v4_bad_case_analysis.txt and are optional
background; the summary points below are sufficient for the main
conclusions.
## GOO(result_size) bad case analysis
TL;DR:
* 12b: Estimation error misleads the greedy choice, causing a highly
selective predicate to be applied too late; with accurate estimates, the
regression can be avoided.
* 3b: An inherent weakness of result_size-based greediness, where small
estimated results hide expensive scans or repeated execution.
* 11b: Row count underestimation leads to a non-materialized NLJ with
repeated execution; accurate estimates or materialization / Hash Join
largely avoid the regression.
## GOO(combined) bad case analysis
GOO(combined) regressions mainly fall into two categories:
1. failures of the cost model itself (15a), and
2. cases where cost and result_size share the same structural weakness
(17b).
TL;DR:
* 15a: The selector chooses the cost-greedy plan due to lower estimated
cost, even though the result_size plan is much faster in reality. This
reflects cost model unreliability; in such cases, plan diversity
collapses back to a single fragile choice. This is an expected
limitation when both candidate plans depend on the same cost signal;
suggestions on improving plan diversity or adding lightweight safeguards
are welcome.
* 17b: Both greedy strategies converge to similarly bad plans (~3× DP),
indicating a structural limitation of greedy enumeration rather than a
selector issue.
## Representative cases where GOO(result_size) outperforms DP
In addition to regressions, I include a small number of cases where
greedy join ordering outperforms DP. These cases are not meant to
suggest that greedy ordering makes intrinsically better decisions, but
to illustrate different failure modes: avoiding fan-out amplification
under severe underestimation (29c, 31a).
TL;DR:
* 29c: Under severe cardinality underestimation, DP introduces early
fan-out, producing large intermediates and repeated inner execution.
GOO(result_size) starts from a highly selective predicate and avoids the
fan-out explosion.
* 31a: DP chooses an early multiplying join under severe
underestimation, causing row counts to explode. GOO(result_size) delays
the multiplying join until a more selective prefix is built.
## Discussion and next steps
On JOB, the combined variant behaves reasonably well overall: it reduces
extreme regressions compared to the single-heuristic variants, and looks
competitive with GEQO on aggregate measures. This does not show that it
is “good enough” in general, but it seems like a promising direction and
worth broader evaluation.
The results are clearly workload- and query-dependent. TPC-H and TPC-DS
provide limited coverage of scenarios where GEQO is exercised, while
JOB, as a commonly used and well-recognized benchmark, includes a subset
of queries that reflect cases where GEQO is actually used. That said,
JOB is still limited in overall join graph size. I therefore plan to
explore more complex workloads with larger join graphs and queries
involving more relations to better approximate real-world GEQO usage.
Suggestions for suitable workloads or query sets would be very welcome.
Evaluating plan quality in isolation is inherently difficult, since join
ordering is only one component of plan selection, alongside cardinality
estimation, operator costing, and other planner decisions. As seen on
JOB, even DP does not always produce the fastest plan, suggesting that
improvements in join ordering alone do not consistently translate into
end-to-end plan quality gains. This makes it less clear how such
improvements should be measured, or how average and tail behavior should
be weighed.
Planning time and memory usage, in contrast, follow more directly from
the algorithmic structure, and I plan to add explicit measurements for
them in future work. Beyond plan quality, I am also interested in addressing
areas where GEQO is less satisfactory in practice, such as tunability
and graceful degradation (e.g., using DP up to a resource limit and
completing planning with a greedy step). Broader evaluation across
workloads should help clarify both aspects.
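To make the graceful-degradation idea concrete, a hypothetical control flow (not implemented in this patch set) could look like the sketch below; `dp_expand_level` and `goo` stand in for exhaustive DP expansion and the greedy step.

```python
# Hypothetical sketch of "DP up to a resource limit, then greedy";
# not implemented in this patch set.

def hybrid_join_search(base_rels, dp_expand_level, goo, max_subplans):
    # Level-1 "sub-plans" are the base relations themselves.
    subplans = {frozenset([r]): r for r in base_rels}
    level = 1
    # Enumerate exhaustively while the memo table stays within budget.
    while level < len(base_rels) and len(subplans) <= max_subplans:
        dp_expand_level(subplans, level)
        level += 1
    # Complete the join tree greedily from the sub-plans found so far.
    return goo(subplans)
```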
Given the current JOB results and the fact that this implementation
matches what seems to be a common baseline in industry systems and much
of the literature, my next step is to treat the current implementation
as a starting point and expand evaluation: more workloads and
configurations. I also plan to incorporate selectivity into the greedy
strategy and re-run the evaluations.
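For concreteness, one tentative formulation would plug a selectivity-based key into the same greedy loop sketched earlier, preferring the most reductive join; the exact shape is still open.

```python
# Tentative selectivity-based greedy key (planned experiment): prefer
# the pair whose estimated join selectivity is smallest.

def selectivity_key(rows, join_rows):
    """rows(p): estimated row count of sub-plan p;
    join_rows(a, b): estimated row count of a JOIN b."""
    def key(a, b):
        return join_rows(a, b) / max(rows(a) * rows(b), 1.0)
    return key
```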
There is also a large body of academic work proposing additional
mechanisms to improve robustness. Many of these ideas do not seem to
have a clear “production” consensus, and few systems implement them, so
I would prefer to explore them incrementally based on evidence: first
establish behavior of the baseline across a wider range of tests, then
evaluate a small number of well-motivated improvements and measure their
impact.
--
Best regards,
Chengpeng Yan