Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
Date
Msg-id 2747373b-d188-43b1-8e49-66f9e23e3c24@vondra.me
Whole thread Raw
In response to Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)  (Adrian Mönnich <adrian.moennich@cern.ch>)
Responses Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
List pgsql-bugs
Hi,

I can reproduce the performance getting much worse in 16, using the
provided SQL scripts. This is what I see:

  14: 1551.363 ms
  15: 1385.414 ms
  16: 161571.400 ms
  17: 156434.543 ms
  18: 159095.001 ms

I'm attaching the explains for 15+16. I don't know what's causing it,
but I have a couple interesting observations.

1) If I disable parallel query, the timings change to

  14: 3990.439 ms
  15: 3518.453 ms
  16: 3606.460 ms
  17: 3591.039 ms
  18: 3617.872 ms

So no regression in this case. It seems to be related to parallelism.


2) There seems to be an explosion of temporary files. We don't have that
in explain, but I queried pg_stat_database before/after the query, and
there's huge difference. Both start at

  temp_files               | 112
  temp_bytes               | 1942275280

so 112 files, ~2GB disk space. But after the query, 15 says

  temp_files               | 721
  temp_bytes               | 2755839184

while 16 has

  temp_files               | 2078995
  temp_bytes               | 70607906000

2M files and 70GB? Wow!


3) Indeed, before the query completes the pgsql_tmp directory has this:

  63M    pgsql_tmp3499395.0.fileset
  63G    pgsql_tmp3499395.1.fileset
  95M    pgsql_tmp3499395.2.fileset
  95M    pgsql_tmp3499395.3.fileset
  127M    pgsql_tmp3499395.4.fileset

So I guess that's one of the parallel hash joins doing something, and
consuming 63GB of disk space? I don't see anything suspicious in the
plan, but I assume parallel HJ may not report the relevant stats.

FWIW bumping up work_mem (to 64MB) solved this with the sample data.

I suspect this is going to be something like the hash join explosion,
where we just happen to add more and more batches. I don't have time to
investigate this more at the moment.


regards

-- 
Tomas Vondra

Attachment

pgsql-bugs by date:

Previous
From: Adrian Mönnich
Date:
Subject: Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
Next
From: Tomas Vondra
Date:
Subject: Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)