Thread: Re: optimize hashjoin
Hi,

It seems you responded by writing a new message and just copying the
subject, which unfortunately doesn't set the headers used for threading
(e.g. in archives). Please respond to the message directly - or maybe
your client does not set the References/In-Reply-To headers correctly.
I'm not sure which mail client you're using.

On 8/22/24 14:08, bucoo wrote:
>> 0) The patch does not apply anymore, thanks to David committing a patch
>> yesterday. Attached is a patch rebased on top of current master.
>
> That patch is based on PG17. I have now rewritten it based on the master
> branch and added some comments.

Thanks. Yeah, patches should be based on "master".

>> 1) Wouldn't it be easier (and just as efficient) to use slots with
>> TTSOpsMinimalTuple, i.e. backed by a minimal tuple?
>
> With a different kind of slot, the ExecEvalExpr function reports an
> error.

Hmm, I haven't tried it, so it's possible it wouldn't work.

>> 2) I find the naming of the new functions a bit confusing. We now have
>> the "original" functions working with slots, and then also functions
>> with "Tuple" working with tuples. Seems asymmetric.
>
> In the new patch the functions are renamed to ExecHashTableInsertSlot
> and ExecHashTableInsertTuple, and also
> ExecParallelHashTableInsertSlotCurrentBatch and
> ExecParallelHashTableInsertTupleCurrentBatch.

OK

>> 3) The code in ExecHashJoinOuterGetTuple is hard to understand, it'd
>> very much benefit from some comments. I'm a bit unsure if the likely()
>> and unlikely() hints really help.
>
> The new patch adds some comments.
>
> "Likely" and "unlikely" might offer only marginal help on some CPUs, and
> might not be beneficial at all on other platforms (I think).

Having such hints is an implicit suggestion that they're beneficial.
Otherwise we'd just use them everywhere - but we don't, only a tiny
fraction of conditions have them.

>> 4) Is the hj_outerTupleBuffer buffer really needed / effective? I'd bet
>> just using palloc() will work just as well, thanks to the AllocSet
>> caching etc.
>
> The hj_outerTupleBuffer avoids reforming (materializing) the tuple in
> non-TTSOpsMinimalTuple scenarios, see ExecForceStoreMinimalTuple. I
> added some comments in the new patch.

AFAIK you mean this comment:

 * mtup is held in hjstate->hj_outerTupleBuffer, so we can pass
 * shouldFree as false when calling ExecForceStoreMinimalTuple().
 *
 * When the slot is TTSOpsMinimalTuple we can avoid reallocating memory
 * for a new MinimalTuple (reusing the StringInfo passed to
 * ExecHashJoinGetSavedTuple).

But my point was that I don't think the palloc/repalloc should be very
expensive, once the AllocSet warms up a bit.

 * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
 * reforming (materializing) the tuple (see ExecForceStoreMinimalTuple).

Yeah, but doesn't that conflate two things - materialization and freeing
the memory? Just because materialization is expensive, is that a good
reason to abandon the memory management too?

>> Can you provide more information about the benchmark you did? What
>> hardware, what scale, PostgreSQL configuration, which of the 22 queries
>> are improved, etc.
>>
>> I ran TPC-H with 1GB and 10GB scales on two machines, and I see pretty
>> much no difference compared to master. However, it occurred to me the
>> patch only ever helps if we increase the number of batches during
>> execution, in which case we need to move tuples to the right batch.
>
> Only parallel HashJoin speeds up to ~2x (all data cached in memory),
> not the full query, including non-parallel HashJoin.
>
> Non-parallel HashJoin only speeds up when the number of batches is
> larger than one, because this patch only optimizes reading batch tuples
> into memory.

I'm sorry, but this does not answer *any* of the questions I asked.
Please provide enough info to reproduce the benefit - benchmark scale,
which query, which parameters, etc. Show explain / explain analyze of
the query without / with the patch, stuff like that.

I ran a number of TPC-H benchmarks with the patch, and I never saw a
benefit of this scale.

regards

--
Tomas Vondra
> * mtup is held in hjstate->hj_outerTupleBuffer, so we can pass
> * shouldFree as false when calling ExecForceStoreMinimalTuple().
> *
> * When the slot is TTSOpsMinimalTuple we can avoid reallocating memory
> * for a new MinimalTuple (reusing the StringInfo passed to
> * ExecHashJoinGetSavedTuple).
>
> But my point was that I don't think the palloc/repalloc should be very
> expensive, once the AllocSet warms up a bit.

Avoiding memory palloc/repalloc is just a side effect of avoiding
reforming the tuple.

> * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
> * reforming (materializing) the tuple (see ExecForceStoreMinimalTuple).
>
> Yeah, but doesn't that conflate two things - materialization and
> freeing the memory? Just because materialization is expensive, is that
> a good reason to abandon the memory management too?

Currently, I haven't thought of a better way to avoid the reform.

> >> Can you provide more information about the benchmark you did? What
> >> hardware, what scale, PostgreSQL configuration, which of the 22
> >> queries are improved, etc.
> >>
> >> I ran TPC-H with 1GB and 10GB scales on two machines, and I see
> >> pretty much no difference compared to master. However, it occurred to
> >> me the patch only ever helps if we increase the number of batches
> >> during execution, in which case we need to move tuples to the right
> >> batch.
> >
> > Only parallel HashJoin speeds up to ~2x (all data cached in memory),
> > not the full query, including non-parallel HashJoin.
> >
> > Non-parallel HashJoin only speeds up when the number of batches is
> > larger than one, because this patch only optimizes reading batch
> > tuples into memory.
>
> I'm sorry, but this does not answer *any* of the questions I asked.
> Please provide enough info to reproduce the benefit - benchmark scale,
> which query, which parameters, etc. Show explain / explain analyze of
> the query without / with the patch, stuff like that.
>
> I ran a number of TPC-H benchmarks with the patch, and I never saw a
> benefit of this scale.
After further testing, it turns out that the parallel hashjoin did not
improve performance. I might have compared it with a debug version at
the time. I apologize for that.

However, the non-parallel hashjoin indeed showed about a 10% performance
improvement. Here is the testing information:

CPU: 13th Gen Intel(R) Core(TM) i7-13700
Memory: 32GB
SSD: UMIS REPEYJ512MKN1QWQ
Windows version: win11 23H2 22631.4037
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
OS version: rocky linux 9.4
TPCH: SF=8

SQL:
set max_parallel_workers_per_gather = 0;
set enable_mergejoin = off;
explain (verbose, analyze)
select count(*) from lineitem, orders
 where lineitem.l_orderkey = orders.o_orderkey;

Before patch:
 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=10591.679..10591.681 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.023..1974.365 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1074.155..1074.156 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.014..464.346 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.141 ms
 Execution Time: 10591.730 ms
(16 rows)

After patch:
 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=9826.105..9826.106 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.015..1989.389 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1086.357..1086.358 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.011..470.225 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.065 ms
 Execution Time: 9826.135 ms
On Fri, Aug 23, 2024 at 7:02 AM bucoo <bucoo@sohu.com> wrote:
> However, the non-parallel hashjoin indeed showed about a 10%
> performance improvement.

> ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
> ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)

It's not a good idea to test performance with EXPLAIN ANALYZE, generally
speaking. And you usually need to test a few times and average or
something, rather than just a single test.

But also, this doesn't show the hash join being 10% faster. It shows the
hash join being essentially the same speed (1075ms unpatched, 1087ms
patched), and the aggregate node on top of it being faster. Now, it does
seem possible to me that changing one node could cause a performance
improvement for the node above it, but I don't quite see why that would
happen in this case.

--
Robert Haas
EDB: http://www.enterprisedb.com
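One way to time the query repeatedly, along the lines Robert suggests,
is to drive it with pgbench and compare the average latency between the
two builds. A sketch, assuming a database named postgres with the TPC-H
tables loaded (join.sql is a hypothetical script file name):

```shell
# Write the query under test to a script file.
cat > join.sql <<'EOF'
select count(*) from lineitem, orders
 where lineitem.l_orderkey = orders.o_orderkey;
EOF

# -n skips pgbench's vacuum step, -f runs the custom script, -T 60
# repeats it for 60 seconds; compare the reported "latency average"
# between the master and patched builds.  Guarded so the sketch does
# not fail where no server is running.
if command -v pgbench >/dev/null 2>&1 && pg_isready -q 2>/dev/null; then
    pgbench -n -f join.sql -T 60 postgres
fi
```

Unlike EXPLAIN ANALYZE, this avoids per-node instrumentation overhead
and averages away run-to-run noise, at the cost of losing the per-node
timing breakdown.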