Re: Performance Issue on Query 18 of TPC-H Benchmark - Mailing list pgsql-bugs

From Andrei Lepikhov
Subject Re: Performance Issue on Query 18 of TPC-H Benchmark
Date
Msg-id 98398f84-6079-47f9-958c-c7e323228cc7@gmail.com
Whole thread Raw
In response to Re: Performance Issue on Query 18 of TPC-H Benchmark  (Ba Jinsheng <bajinsheng@u.nus.edu>)
Responses Re: Performance Issue on Query 18 of TPC-H Benchmark
List pgsql-bugs
On 10/17/24 01:26, Ba Jinsheng wrote:
>  >I believe you are not allowing the optimizer to generate a different 
> aggregation path (Group Aggregate) because it requires a sort operation. 
> So I think this is not correct.
> 
> Yes, this is what I did. I though it is what you were asking? I have not 
> found another way to enforce HashAggregate, so I directly modified the 
> code. Can you eliberate why it is incorrect?
As I see, the main problem lies in the first aggregate (with HAVING 
clause) where hash aggregation seems preferable. To disable that, you 
can use some trick:
SET enable_hashagg = 'on';
SET enable_sort = 'off';
SET work_mem = '1GB';

In my case the optimiser have built aggregation:
->  HashAggregate  (cost=202639.35..208346.45 rows=126825 width=4)
     (actual time=3540.761..4224.547 rows=9 loops=3)
     Group Key: lineitem_1.l_orderkey
       Filter: (sum(lineitem_1.l_quantity) > '314'::numeric)
       Batches: 1  Memory Usage: 638993kB
       Rows Removed by Filter: 1499991
       Worker 0:  Batches: 1  Memory Usage: 638993kB
       Worker 1:  Batches: 1  Memory Usage: 638993kB
       ->  Seq Scan on lineitem lineitem_1
       (cost=0.00..172626.23 rows=6002623 width=9)
       (actual time=0.014..675.552 rows=6001215 loops=3)

Not sure it is the best plan possible in this case. I know only about 
re-optimisation feature which can provide correct number of groups 
estimation to aggregates as well as cardinality and work_mem and give 
the optimiser all correct estimations. But it is still an enterprise 
feature :(.

> Can I understand disabling parallelism is a good setup for finding 
> performance issues?
I usually use such such a switch to identify problems. Parallel workers 
use HashJoin more frequently and it sometimes cause non-linear behaviour 
of execution time, compared to sequental plan. At the same time, they 
donn't support parameterised paths and it may end up in interesting 
performance jumps ...

BTW, I also wonder why do you report to pgsql-bugs in presence of better 
fitted pgsql-performance thread?

-- 
regards, Andrei Lepikhov




pgsql-bugs by date:

Previous
From: David Rowley
Date:
Subject: Re: Performance Issue on Query 18 of TPC-H Benchmark
Next
From: Amit Langote
Date:
Subject: Re: BUG #18657: Using JSON_OBJECTAGG with volatile function leads to segfault