Thread: Re: optimize hashjoin

Re: optimize hashjoin
From: Tomas Vondra
Hi,

It seems you responded by writing a new message and just copying the
subject, which unfortunately doesn't set the headers used for threading
(e.g. in archives). Please just respond to the message.

Or maybe your client does not set the References/In-Reply-To headers
correctly. Not sure which mail client you're using.


On 8/22/24 14:08, bucoo wrote:
>> 0) The patch does not apply anymore, thanks to David committing a patch
>> yesterday. Attached is a patch rebased on top of current master.
>
> That patch is based on PG17. I have now rewritten it based on the master
> branch and added some comments.

Thanks. Yeah, patches should be based on "master".

>> 1) Wouldn't it be easier (and just as efficient) to use slots with
>> TTSOpsMinimalTuple, i.e. backed by a minimal tuple?
>
> Using a different kind of slot makes the ExecEvalExpr function report an
> error.

Hmm, I haven't tried it, so it's possible it wouldn't work.

>> 2) I find the naming of the new functions a bit confusing. We now have
>> the "original" functions working with slots, and then also functions
>> with "Tuple" working with tuples. Seems asymmetric.
>
> In the new patch the functions are renamed to ExecHashTableInsertSlot and
> ExecHashTableInsertTuple, and likewise
> ExecParallelHashTableInsertSlotCurrentBatch and
> ExecParallelHashTableInsertTupleCurrentBatch.

OK

>> 3) The code in ExecHashJoinOuterGetTuple is hard to understand, it'd
>> very much benefit from some comments. I'm a bit unsure if the likely()
>> and unlikely() hints really help.
>
> The new patch adds some comments.
>
> "Likely" and "unlikely" might offer only marginal help on some CPUs, and
> might not be beneficial at all on other platforms (I think).

Having such hints is an implicit suggestion they're beneficial. Otherwise
we'd just use them everywhere, but we don't - only a tiny fraction of
conditions have them.

>> 4) Is the hj_outerTupleBuffer buffer really needed / effective? I'd bet
>> just using palloc() will work just as well, thanks to the AllocSet
>> caching etc.
>
> The hj_outerTupleBuffer avoids reforming (materializing) the tuple in
> non-TTSOpsMinimalTuple scenarios; see ExecForceStoreMinimalTuple. I added
> some comments in the new patch.

AFAIK you mean this comment:

 * mtup is hold in hjstate->hj_outerTupleBuffer, so we can using
 * shouldFree as false to call ExecForceStoreMinimalTuple().
 *
 * When slot is TTSOpsMinimalTuple we can avoid realloc memory for
 * new MinimalTuple(reuse StringInfo to call ExecHashJoinGetSavedTuple).

But my point was that I don't think the palloc/repalloc should be very
expensive, once the AllocSet warms up a bit.

 * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
 * reform(materialize) tuple(see ExecForceStoreMinimalTuple).

Yeah, but doesn't that conflate two things - materialization and freeing
the memory? Just because materialization is expensive, is that a good
reason to abandon the memory management too?

>> Can you provide more information about the benchmark you did? What
>> hardware, what scale, PostgreSQL configuration, which of the 22 queries
>> are improved, etc.
>>
>> I ran TPC-H with 1GB and 10GB scales on two machines, and I see pretty
>> much no difference compared to master. However, it occurred to me the
>> patch only ever helps if we increase the number of batches during
>> execution, in which case we need to move tuples to the right batch.
>
> Only the parallel HashJoin speeds up, to ~2x (with all data cached in
> memory); not the full query, which includes a non-parallel HashJoin.
>
> The non-parallel HashJoin only speeds up when there is more than one
> batch, because this patch only optimizes reading batch tuples back into
> memory.

I'm sorry, but this does not answer *any* of the questions I asked.

Please provide enough info to reproduce the benefit - benchmark scale,
which query, which parameters, etc. Show explain / explain analyze of
the query without / with the patch, stuff like that.

I ran a number of TPC-H benchmarks with the patch and I never saw a
benefit of this scale.

regards

-- 
Tomas Vondra



Re: optimize hashjoin
From: "bucoo"
>  * mtup is hold in hjstate->hj_outerTupleBuffer, so we can using
>  * shouldFree as false to call ExecForceStoreMinimalTuple().
>  *
>  * When slot is TTSOpsMinimalTuple we can avoid realloc memory for
>  * new MinimalTuple(reuse StringInfo to call ExecHashJoinGetSavedTuple).
>
> But my point was that I don't think the palloc/repalloc should be very expensive, once the AllocSet warms up a bit.

Avoiding palloc/repalloc is just a side effect of avoiding reforming the tuple.

>  * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
>  * reform(materialize) tuple(see ExecForceStoreMinimalTuple).
>
> Yeah, but doesn't that conflate two things - materialization and freeing the memory? Only because materialization is
> expensive, is that a good reason to abandon the memory management too?

Currently, I haven't thought of a better way to avoid the reform.

> >
> >> Can you provide more information about the benchmark you did? What
> >> hardware, what scale, PostgreSQL configuration, which of the 22
> >> queries are improved, etc.
> >>
> >> I ran TPC-H with 1GB and 10GB scales on two machines, and I see
> >> pretty much no difference compared to master. However, it occurred to
> >> me the patch only ever helps if we increase the number of batches
> >> during execution, in which case we need to move tuples to the right batch.
> >
> > Only parallel HashJoin speed up to ~2x(all data cached in memory),
> >
> > not full query, include non-parallel HashJoin.
> >
> > non-parallel HashJoin only when batchs large then one will speed up,
> >
> > because this patch only optimize for read batchs tuples to memory.
> >
>
> I'm sorry, but this does not answer *any* of the questions I asked.
>
> Please provide enough info to reproduce the benefit - benchmark scale, which query, which
> parameters, etc. Show explain / explain analyze of the query without / with the patch, stuff
> like that.
>
> I ran a number of TPC-H benchmarks with the patch and I never saw a benefit of this scale.

After further testing, it turns out that the parallel hashjoin did not
improve performance. I might have compared it with a debug version at the
time. I apologize for that.

However, the non-parallel hashjoin indeed showed about a 10% performance
improvement.
Here is the testing information:

CPU: 13th Gen Intel(R) Core(TM) i7-13700
Memory: 32GB
SSD: UMIS REPEYJ512MKN1QWQ
Windows version: win11 23H2 22631.4037
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
OS version: rocky linux 9.4
TPCH: SF=8

SQL:
set max_parallel_workers_per_gather = 0;
set enable_mergejoin = off;
explain (verbose,analyze)
select count(*)
from lineitem, orders
where lineitem.l_orderkey = orders.o_orderkey;

Before patch:
Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=10591.679..10591.681 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.023..1974.365 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1074.155..1074.156 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.014..464.346 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.141 ms
 Execution Time: 10591.730 ms
(16 rows)

After patch:
Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=9826.105..9826.106 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.015..1989.389 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1086.357..1086.358 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.011..470.225 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.065 ms
 Execution Time: 9826.135 ms




Re: optimize hashjoin
From: Robert Haas
On Fri, Aug 23, 2024 at 7:02 AM bucoo <bucoo@sohu.com> wrote:
> However, the non-parallel hashjoin indeed showed about a 10% performance improvement.
>    ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
>    ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)

It's not a good idea to test performance with EXPLAIN ANALYZE,
generally speaking. And you usually need to test a few times and
average or something, rather than just a single test. But also, this
doesn't show the hash join being 10% faster. It shows the hash join
being essentially the same speed (1075ms unpatched, 1087ms patched),
and the aggregate node on top of it being faster.

Now, it does seem possible to me that changing one node could cause a
performance improvement for the node above it, but I don't quite see
why that would happen in this case.

--
Robert Haas
EDB: http://www.enterprisedb.com