[PATCH] Reduce EXPLAIN ANALYZE overhead for row counting - Mailing list pgsql-hackers
| From | Hironobu SUZUKI |
|---|---|
| Subject | [PATCH] Reduce EXPLAIN ANALYZE overhead for row counting |
| Date | |
| Msg-id | 64996009-156f-414c-84f2-bea2da55e29f@interdb.jp |
| List | pgsql-hackers |
Hi hackers,
I would like to propose an approach to reduce EXPLAIN ANALYZE overhead
for row counting.
# Introduction
A critical issue with EXPLAIN ANALYZE is its significant performance
overhead. Therefore, I focused solely on counting actual rows, i.e.,
the case where TIMING, BUFFERUSAGE, and WALUSAGE are all FALSE. Through several
iterations of optimization attempts, I managed to reduce this overhead
from 26.60% down to 1.64%. (Please see Appendix A for benchmarks.)
The optimization consists of three phases:
+ Phase 1
- Action / Approach: Simplification of ExecProcNodeInstr() and
InstrStopNode().
- Resulting Overhead: 8.13%
- Status: Ready for immediate review.
+ Phase 2
- Action / Approach: Direct instrumentation inside the ExecXXX()
family (e.g., ExecSeqScan, ExecIndexScan).
- Resulting Overhead: 3.95%
- Issue: Current implementation causes a 0.47% performance
degradation for normal SELECT queries.
- Status: Proof-of-Concept (PoC)
+ Phase 3
- Action / Approach: Memory optimization incl. PlanState changes and
bit-packing.
- Resulting Overhead: 1.64%
- Issue: Inherits the performance degradation issue found in Phase 2.
- Status: Proof-of-Concept (PoC). JIT unsupported.
If implemented, this feature offers substantial benefits:
+ Enhanced Monitoring: Enables frequent use of tools like auto_explain
to better identify performance issues (see the configuration sketch
after this list).
+ Optimizer Improvement: Allows external extensions like Adaptive
Query Optimization (AQO) by PostgresPro to perform frequent,
low-impact training, leading to more accurate optimizers and
stimulating research.
+ Query Progress Observation: Makes it feasible to observe query
progress directly by snapshotting the executing plan and its actual
row counts.
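As a concrete illustration of the monitoring case, row-count-only
instrumentation corresponds to the following auto_explain settings (a
minimal sketch; all parameter names are existing auto_explain options,
and the threshold value is only an example):
```
# postgresql.conf -- log plans with actual row counts only; timing,
# buffer, and WAL instrumentation stay off, which is exactly the
# fast path this proposal targets.
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'  # example threshold
auto_explain.log_analyze = on         # collect actual row counts
auto_explain.log_timing = off         # per-node timing off
auto_explain.log_buffers = off        # buffer usage off (default)
auto_explain.log_wal = off            # WAL usage off (default)
```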
## Immediate Focus: Phase 1 Readiness
The initial step, Phase 1, is a non-intrusive, production-ready
solution that delivers substantial performance gains -- reducing the
original overhead from 26.60% to 8.13% (over 69% reduction). Its
isolation and minimal code changes make it the most actionable
component for immediate adoption into the main branch.
# Detailed Design Specification
Profiler results indicate that InstrStartNode(), InstrStopNode(), and
ExecProcNodeInstr() are the primary sources of overhead (see Appendix
B).
My approach systematically eliminates these sources while utilizing
the existing Instrumentation Framework. (A major architectural
overhaul is outside the scope of this proposal.)
Note that preliminary results indicate that inlining these key
functions reduces overhead by several percentage points in non-target
cases (e.g., when BUFFERUSAGE or WALUSAGE are TRUE). Therefore, all
subsequent phases assume inlining is applied.
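For illustration, the inlining could look like the following minimal
sketch, assuming the hot entry points move from instrument.c into
src/include/executor/instrument.h as static inlines (the attached
patches may inline them differently); the body shown is the existing
InstrStartNode() logic:
```
/* In instrument.h: definition moved from instrument.c so that callers
 * such as ExecProcNodeInstr() can inline it. */
static inline void
InstrStartNode(Instrumentation *instr)
{
    if (instr->need_timer &&
        !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
        elog(ERROR, "InstrStartNode called twice in a row");
}
```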
## Phase 1
### Approach: Simplifying ExecProcNodeInstr()
When ANALYZE is TRUE but other metrics are FALSE, we can skip
InstrStartNode() and reduce InstrStopNode() to just two operations:
incrementing tuplecount and setting running=true.
I introduced ExecProcNodeInstrRowCount() to execute only these minimal
statements:
```
static TupleTableSlot *
ExecProcNodeInstrRowCount(PlanState *node)
{
    TupleTableSlot *result;

    /* Skip calling InstrStartNode() */
    result = node->ExecProcNodeReal(node);

    /* Perform the essential steps from InstrStopNode() */
    if (!TupIsNull(result))
        node->instrument->tuplecount += 1.0;
    node->instrument->running = true;   /* Always true */

    return result;
}
```
ExecProcNodeFirst() is extended to switch between ExecProcNodeInstr
and ExecProcNodeInstrRowCount based on the requested metrics.
(While this proposal focuses solely on counting actual rows,
optimization of other metrics can be achieved without conflicts by
employing conditional branching within ExecProcNodeFirst().)
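A minimal sketch of this dispatch, assuming the row-count-only case is
detected via the existing need_timer, need_bufusage, and need_walusage
flags (the attached patch may detect it differently):
```
static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    check_stack_depth();

    if (node->instrument)
    {
        Instrumentation *instr = node->instrument;

        /* Row counting is the only active metric: use the fast wrapper */
        if (!instr->need_timer &&
            !instr->need_bufusage &&
            !instr->need_walusage)
            node->ExecProcNode = ExecProcNodeInstrRowCount;
        else
            node->ExecProcNode = ExecProcNodeInstr;
    }
    else
        node->ExecProcNode = node->ExecProcNodeReal;

    return node->ExecProcNode(node);
}
```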
### Results:
Overhead reduced to 8.13%. Crucially, this introduces no performance
degradation for normal (non-instrumented) execution.
## Phase 2
Phase 2 is a Proof-of-Concept (PoC) model.
### Approach: Direct Instrumentation in ExecXXX Functions
Phase 2 removes the wrapper function overhead by modifying every
ExecXXX function (e.g., ExecSeqScan, ExecIndexScan) to call a new,
inlined helper, ExecNodeRowcountEnd(), at the final step.
ExecSeqScan() Example:
```
/* BEFORE */
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
    SeqScanState *node = castNode(SeqScanState, pstate);

    return ExecScanExtended(&node->ss,
                            (ExecScanAccessMtd) SeqNext,
                            (ExecScanRecheckMtd) SeqRecheck,
                            NULL,
                            NULL,
                            NULL);
}

/* AFTER */
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
    TupleTableSlot *result;
    SeqScanState *node = castNode(SeqScanState, pstate);

    result = ExecScanExtended(&node->ss,
                              (ExecScanAccessMtd) SeqNext,
                              (ExecScanRecheckMtd) SeqRecheck,
                              NULL,
                              NULL,
                              NULL);
    return ExecNodeRowcountEnd(pstate, result);
}
```
ExecNodeRowcountEnd() Definition:
```
static pg_attribute_always_inline TupleTableSlot *
ExecNodeRowcountEnd(PlanState *node, TupleTableSlot *result)
{
    Instrumentation *const instr = node->instrument;

    /* Not instrumented at all: plain query execution */
    if (!instr)
        return result;
    /* Instrumented, but not on the row-count-only fast path */
    if (!instr->fast_path_instr)
        return result;

    /* count the returned tuples */
    if (!TupIsNull(result))
        instr->tuplecount += 1.0;
    instr->running = true;  /* Always true */

    return result;
}
```
I added a boolean flag, fast_path_instr, to the Instrumentation
struct. This is set to TRUE only when counting actual rows is the sole
active metric.
```
typedef struct Instrumentation
{
    /* true if we can bypass InstrStartNode and InstrStopNode */
    bool        fast_path_instr;

    /* Parameters set at node creation: */
    ...
} Instrumentation;
```
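A sketch of how fast_path_instr could be derived, assuming it is set
where the Instrumentation struct is initialized (e.g., InstrAlloc() in
instrument.c); the placement is an assumption, while the flags and
option bits are the existing ones from instrument.h:
```
/* The fast path is legal only when no other metric is requested. */
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->fast_path_instr = !instr->need_timer &&
                         !instr->need_bufusage &&
                         !instr->need_walusage;
```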
#### Execution Paths:
1. Standard SELECT: ExecNodeRowcountEnd is called but returns
immediately (if (!instr)).
+ Issue: The function call and check cause a performance degradation.
2. EXPLAIN ANALYZE (Fast Path): fast_path_instr is TRUE. Executes
minimal logic.
3. EXPLAIN ANALYZE (Full): Normal path. InstrStartNode runs,
ExecNodeRowcountEnd returns early (fast_path_instr is FALSE), then
InstrStopNode runs.
(See the attached figure: fast_path_instr.png)
### Results and Degradation Analysis:
Overhead reduced to 3.95%.
However, the non-instrumented baseline (Query-1) suffered a 0.47%
slowdown (approx. a 0.57 s increase) due to the unconditional call to
ExecNodeRowcountEnd in every node's hot path. While duplicating each
function (e.g., ExecSeqScan vs. an instrumented ExecSeqScanWithInstr,
sketched below) would avoid the call entirely, that duplication is
unmaintainable. I am seeking a cleaner solution.
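For reference, the rejected duplication approach would look roughly
like this (ExecSeqScanWithInstr is hypothetical), with init-time code
installing one variant or the other; repeating this for every node
type is what makes it unmaintainable:
```
/* Hypothetical instrumented twin of ExecSeqScan(); the plain variant
 * would then carry no ExecNodeRowcountEnd() call at all. */
static TupleTableSlot *
ExecSeqScanWithInstr(PlanState *pstate)
{
    SeqScanState *node = castNode(SeqScanState, pstate);
    TupleTableSlot *result;

    result = ExecScanExtended(&node->ss,
                              (ExecScanAccessMtd) SeqNext,
                              (ExecScanRecheckMtd) SeqRecheck,
                              NULL, NULL, NULL);
    return ExecNodeRowcountEnd(pstate, result);
}
```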
## Phase 3
Phase 3 is also a Proof-of-Concept (PoC) and feasibility study. The
implementation is currently ad hoc and does not support JIT.
The performance degradation observed here stems from the same reasons
as in Phase 2 (unconditional checks in the hot path).
### Approach: Reducing Register Loads
To further minimize overhead, I focused on memory and register
efficiency:
1. Consolidating Variables: tuplecount (bits 0-61), running (bit 62),
and fast_path_instr (bit 63) are packed into a single uint64 hot_instr.
2. Relocating the Variable: hot_instr is moved from Instrumentation to
PlanState.
Note: 62 bits are sufficient for tuplecount, since the existing
double-based counter is itself only exact up to its 53-bit mantissa.
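For reference, the masks used in the code below could be defined as
follows, matching the stated layout (the exact definitions in the PoC
patch may differ):
```
/* bit 63: fast_path_instr, bit 62: running, bits 0-61: tuplecount */
#define FAST_PATH_INSTR_MASK    (UINT64CONST(1) << 63)
#define RUNNING_MASK            (UINT64CONST(1) << 62)
#define TUPLECOUNT_MASK         (RUNNING_MASK - 1)
/* saturate instead of carrying into the running bit */
#define MAX_TUPLECOUNT          TUPLECOUNT_MASK
```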
PlanState Structure Modification:
```
typedef struct PlanState
{
    ...
    /* bit 63: fast_path_instr, bit 62: running, bits 0-61: tuplecount */
    uint64_t    hot_instr;

    Instrumentation *instrument;    /* Optional runtime stats for this node */
    WorkerInstrumentation *worker_instrument;   /* per-worker instrumentation */
    ...
} PlanState;
```
ExecNodeRowcountEnd() Implementation:
```
static pg_attribute_always_inline TupleTableSlot *
ExecNodeRowcountEnd(PlanState *node, TupleTableSlot *result)
{
    uint64_t    current = node->hot_instr;

    if ((current & FAST_PATH_INSTR_MASK) == 0)
        return result;

    /* count the returned tuples */
    if (!TupIsNull(result))
    {
        /* Prevent overflow into the flag bits */
        if ((current & TUPLECOUNT_MASK) != MAX_TUPLECOUNT)
            current += 1;       /* tuplecount++ */
    }
    current |= RUNNING_MASK;    /* running = true */

    /* a single store writes back all three fields at once */
    node->hot_instr = current;

    return result;
}
```
This reduces the number of data loads from cache/memory into registers
per returned tuple from five (Phase 2) to two (Phase 3): eliminating
the load of the Instrumentation struct's address saves one load, and
packing tuplecount, running, and fast_path_instr into a single word
collapses three separate field loads into one, saving two more.
### Results:
Overhead reduced to 1.64%.
As with Phase 2, a performance degradation in standard queries was
observed (1.11%). See Appendix A for a discussion on the variance of
these degradation metrics.
# Implementation Status
Independent patches corresponding to each phase are attached, along
with their current review status:
+ Phase 1 (Status: Ready for immediate review):
v1-0001-Introduce-ExecProcNodeInstrRowCount-for-reduced-A.patch
+ Phase 2 (Status: Proof-of-Concept (PoC)):
v1-0001-POC-for-direct-instrumentation-call-in-ExecXXX-no.patch
+ Phase 3 (Status: Proof-of-Concept (PoC)):
v1-0001-POC-for-bit-packing-instrumentation-into-PlanStat.patch
# Call for Feedback
Given that Phase 1 is non-intrusive and offers an immediate 69%
overhead reduction, I believe its stability makes this patch ready for
integration, and I primarily seek review for it in this thread.
I also welcome feedback on the following specific challenges:
+ General Idea: Thoughts on optimizing the counting of actual rows.
+ Phase 2 Implementation: Ideas to avoid the 0.47% performance
degradation in standard queries without code duplication.
+ Phase 3 Strategy: Comments on the hot_instr bit-packing approach and
PlanState modification.
Best regards,
Hironobu Suzuki
# Appendix
## A: Benchmark
To reproduce these results:
```
$ git clone https://github.com/s-hironobu/fast_explain_analyze.git
$ cd fast_explain_analyze
$ bash ./bench.sh setup
$ bash ./bench.sh benchmark
```
### A.1: Environment
|Component|Version|Note|
|-|:-|:-|
|CPU | M1 Max|ARM architecture|
|OS | macOS 15.5||
|Compiler |Apple clang version 17.0.0 | Target: arm64-apple-darwin24.5.0|
|PostgreSQL | 19dev | config options: --without-icu CFLAGS="-O3 -g"|
|postgresql.conf|Default settings, except: shared_buffers = 512MB, max_parallel_workers_per_gather = 0, max_parallel_workers = 0|Disable parallel queries.|
Note: Although the benchmark disables parallel queries, all patches
support them.
### A.2: Dataset
```
CREATE TABLE test1 (id int, data int);
CREATE INDEX test1_id_idx ON test1 (id);
CREATE TABLE test2 (id int PRIMARY KEY, data int);
CREATE TABLE test3 (id int PRIMARY KEY, data int);

INSERT INTO test1 (id, data) SELECT i, i % 51 FROM generate_series(1, 150000) AS i;
INSERT INTO test1 (id, data) SELECT i, i % 51 FROM generate_series(1, 5000) AS i;
INSERT INTO test2 (id, data) SELECT i, floor(random() * 50 + 1)::int FROM generate_series(1, 50000) AS i;
INSERT INTO test3 (id, data) SELECT i, i FROM generate_series(1, 100000) AS i;

ANALYZE;
```
### A.3: Queries
+ Query-1: Standard SELECT
+ Query-2: EXPLAIN ANALYZE with TIMING and BUFFERS set to FALSE (the
WAL option already defaults to FALSE)
```
-- Query-1
SELECT count(*) FROM test1 AS a, test2 AS b, test3 AS c WHERE a.id = c.id;

-- Query-2
EXPLAIN (ANALYZE TRUE, TIMING FALSE, BUFFERS FALSE)
SELECT count(*) FROM test1 AS a, test2 AS b, test3 AS c WHERE a.id = c.id;
```
### A.4: Results (Test3 size: 100,000 rows)
Query-1 and Query-2 were each executed 10 times. The results show the
average execution time (and variance) and the derived overheads.
+ Overhead Definition:

$$\text{Overhead} = 100 \times \frac{T(\text{Query-2}) - T(\text{Query-1})}{T(\text{Query-1})}$$

$$\text{Overhead based on Original} = 100 \times \frac{T(\text{Query-2}) - T(\text{Original Query-1})}{T(\text{Original Query-1})}$$

where $T(\cdot)$ is the average execution time.
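For example, plugging the Original row of the table below into the
first definition:

$$100 \times \frac{152.9188 - 120.7846}{120.7846} \approx 26.60\;[\%]$$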
| |Query-1 [sec] (Var. [$\text{sec}^{2}$])|Query-1 Degradation [%]|Query-2 [sec] (Var. [$\text{sec}^{2}$])|Overhead [%]|Overhead based on Original [%]|
|-|-:|-:|-:|-:|-:|
|Original |120.7846 (0.1260)|N/A |152.9188 (0.3539)|26.60|26.60|
|Phase 1 |120.7209 (0.0822)|N/A |130.6013 (0.0257)| 8.18| 8.13|
|Phase 2 |121.3559 (0.0007)|0.47|125.5515 (0.0039)| 3.46| 3.95|
|Phase 3 |122.1207 (0.1115)|1.11|122.7696 (0.1403)| 0.53| 1.64|
### A.5: Sensitivity Analysis
#### Test3 size: 35,000 rows
| |Query-1 [sec] (Var. [$\text{sec}^{2}$])|Query-1 Degradation [%]|Query-2 [sec] (Var. [$\text{sec}^{2}$])|Overhead [%]|Overhead based on Original [%]|
|-|-:|-:|-:|-:|-:|
|Original |45.3314 (0.0090)|N/A |58.4396 (0.0721)|28.92|28.92|
|Phase 1 |45.3670 (0.0115)|N/A |50.5628 (0.0133)|11.45|11.54|
|Phase 2 |45.8230 (0.0196)|1.08|48.4476 (0.0159)| 5.73| 6.87|
|Phase 3 |45.3970 (0.2161)|0.15|46.2231 (0.1641)| 1.82| 1.96|
#### Test3 size: 200,000 rows
| |Query-1 [sec] (Var. [$\text{sec}^{2}$])|Query-1 Degradation [%]|Query-2 [sec] (Var. [$\text{sec}^{2}$])|Overhead [%]|Overhead based on Original [%]|
|-|-:|-:|-:|-:|-:|
|Original |177.4159 (0.2237)|N/A |226.9384 (0.6501)|27.91|27.91|
|Phase 1 |176.2791 (0.1037)|N/A |192.8065 (0.3016)| 9.38| 8.67|
|Phase 2 |177.9841 (0.2546)|0.32|185.5116 (0.3097)| 4.23| 4.56|
|Phase 3 |178.8132 (0.2190)|0.78|181.2868 (0.4573)| 1.38| 2.18|
#### Discussion on Results
The measured results contain statistical variance due to environmental
factors. For example, the execution time for the original Query-1 is
slightly higher than that of Phase 1, which suggests measurement
noise.
Furthermore, the benchmark results are highly sensitive to the dataset
size and environmental factors. As demonstrated by reducing the size
of the test3 table from 100,000 rows to 35,000 rows, the Query-1
performance degradation in Phase 3 is sharply reduced from 1.11% to
0.15%.
Therefore, these benchmark figures should be interpreted not as
absolute performance metrics, but primarily as an indication of the
significant trend of overhead reduction achieved through the proposed
phases.
## B: Profiler Results
+ Measured on Ubuntu 24.04.
+ Test3 size: 35,000 rows
**Query-1:** SELECT
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
27.38     34.05     34.05 4000040003   0.00     0.00  ExecInterpExpr
15.68     53.55     19.50 2000000001   0.00     0.00  ExecNestLoop
11.44     67.77     14.22                             _init
 7.95     77.66      9.89 2000010000   0.00     0.00  tuplestore_gettupleslot
 7.02     86.39      8.73          2   4.37    54.90  ExecAgg
 6.68     94.70      8.31 2000010000   0.00     0.00  tuplestore_gettuple
 6.48    102.76      8.06 1999960000   0.00     0.00  ExecStoreMinimalTuple
 5.13    109.14      6.38 2000050000   0.00     0.00  ExecMaterial
 4.18    114.34      5.20 2000000001   0.00     0.00  fetch_input_tuple
...
```
**Query-2:** EXPLAIN ANALYZE (TIMING FALSE)
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
21.24     36.39     36.39 4000040003   0.00     0.00  ExecInterpExpr
15.53     62.99     26.60                             _init
12.08     83.69     20.70 2000000001   0.00     0.00  ExecNestLoop
 8.44     98.15     14.46 2000010016   0.00     0.00  tuplestore_gettupleslot
 6.01    108.44     10.29 4000215014   0.00     0.00  InstrStartNode
 5.43    117.74      9.30 2000050000   0.00     0.00  ExecMaterial
 5.32    126.86      9.12 4000215014   0.00     0.00  InstrStopNode
 5.26    135.87      9.01 2000050000   0.00     0.00  tuplestore_ateof
 4.13    142.94      7.07 4000215015   0.00     0.00  ExecProcNodeInstr
...
```