On 4/2/26 20:12, Tomas Vondra wrote:
> Hi,
>
> I can reproduce the performance getting much worse in 16, using the
> provided SQL scripts. This is what I see:
>
> 14: 1551.363 ms
> 15: 1385.414 ms
> 16: 161571.400 ms
> 17: 156434.543 ms
> 18: 159095.001 ms
>
> I'm attaching the explains for 15+16. I don't know what's causing it,
> but I have a couple interesting observations.
>
> 1) If I disable parallel query, the timings change to
>
> 14: 3990.439 ms
> 15: 3518.453 ms
> 16: 3606.460 ms
> 17: 3591.039 ms
> 18: 3617.872 ms
>
> So no regression in this case. It seems to be related to parallelism.
>
>
> 2) There seems to be an explosion of temporary files. We don't have that
> in explain, but I queried pg_stat_database before/after the query, and
> there's huge difference. Both start at
>
> temp_files | 112
> temp_bytes | 1942275280
>
> so 112 files, ~2GB disk space. But after the query, 15 says
>
> temp_files | 721
> temp_bytes | 2755839184
>
> while 16 has
>
> temp_files | 2078995
> temp_bytes | 70607906000
>
> 2M files and 70GB? Wow!
>
>
> 3) Indeed, before the query completes the pgsql_tmp directory has this:
>
> 63M pgsql_tmp3499395.0.fileset
> 63G pgsql_tmp3499395.1.fileset
> 95M pgsql_tmp3499395.2.fileset
> 95M pgsql_tmp3499395.3.fileset
> 127M pgsql_tmp3499395.4.fileset
>
> So I guess that's one of the parallel hash joins doing something, and
> consuming 63GB of disk space? I don't see anything suspicious in the
> plan, but I assume parallel HJ may not report the relevant stats.
>
> FWIW bumping up work_mem (to 64MB) solved this with the sample data.
>
> I suspect this is going to be something like the hash join explosion,
> where we just happen to add more and more batches. I don't have time to
> investigate this more at the moment.
>
FWIW I think that's what's happening. If I add an elog(WARNING) into
ExecParallelHashJoinSetUpBatches, I see this:

WARNING: 0x55dbe375a5e8 initializing 16 batches
WARNING: 0x7f3868a3a978 initializing 32 batches
WARNING: 0x7f3868a3ab80 initializing 4 batches
WARNING: 0x55dbe36148c0 initializing 4 batches
WARNING: 0x7f3868a3b230 initializing 16 batches
WARNING: 0x7f3868a3a978 initializing 64 batches
WARNING: 0x55dbe36144b0 initializing 128 batches
WARNING: 0x55dbe36144b0 initializing 256 batches
WARNING: 0x55dbe36144b0 initializing 512 batches
WARNING: 0x55dbe36144b0 initializing 1024 batches
WARNING: 0x7f3868a3a978 initializing 2048 batches
WARNING: 0x7f3868a3a978 initializing 4096 batches
WARNING: 0x55dbe36144b0 initializing 8192 batches
WARNING: 0x55dbe36144b0 initializing 16384 batches
WARNING: 0x55dbe36144b0 initializing 32768 batches
WARNING: 0x7f3868a3a978 initializing 65536 batches
WARNING: 0x55dbe36144b0 initializing 131072 batches
WARNING: 0x7f3868a3a978 initializing 262144 batches
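The progression in the log is consistent with plain doubling of the
batch count each time a batch overflows its memory budget. A quick
sanity-check sketch (the starting value of 4 is taken from the log
above; the exact trigger condition is an assumption here):

```python
# Sketch of repeated nbatch doubling, as seen in the WARNING log above.
# Assumption: each overflow doubles nbatch; start value 4 is from the log.
nbatch = 4
doublings = 0
while nbatch < 262144:
    nbatch *= 2
    doublings += 1
print(nbatch, doublings)  # 262144 batches after 16 doublings
```

So it only takes 16 overflow events to get from 4 batches to 256k.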

So we end up with 256k batches for this one join. I'm not sure exactly
how that maps to the 2M files from pg_stat_database, but it means ~0.5M
tuplestores and ~10GB of virtual memory (at least according to top).

I don't know what triggers the batch increase, but I still suspect it's
similar to the explosion we fixed (or at least mitigated) in PG18,
although that fix applied only to serial (non-parallel) hash joins.

regards
--
Tomas Vondra