Re: tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: tweaking NTUP_PER_BUCKET
Date:
Msg-id: 53BC5CD6.4080502@fuzzy.cz
In response to: Re: tweaking NTUP_PER_BUCKET (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
On 8.7.2014 21:53, Jeff Janes wrote:
> On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> 
>> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large
>> outer relations it may be way cheaper to use higher NTUP_PER_BUCKET
>> values instead of increasing the number of batches (resulting in
>> repeated scans of the outer table). I think it's important (but
>> difficult) to keep these things somehow balanced.
>> 
>> With large work_mem values the amount of memory for buckets may be
>> quite significant. E.g. 800MB work_mem may easily give ~120MB of
>> memory taken by buckets with NTUP_PER_BUCKET=1. With
>> NTUP_PER_BUCKET=4 it's just ~30MB.
> 
> 
> That extra 90MB is memory well spent, in my book. Versus having to
> walk a 4-deep linked list which is scattered all over main memory
> just to find one entry we care about in it.

Yes, I share this view, although it really depends on how expensive it
is to rescan the outer relation (due to more batches) vs. a lookup in a
"deeper" hash table.

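To put rough numbers on the bucket memory: each bucket is a single
HashJoinTuple pointer (8 bytes on 64-bit), and nbuckets gets rounded up
to a power of two (as in ExecChooseHashTableSize). A minimal standalone
sketch, using a hypothetical inner relation of ~15M tuples so that the
figures above come out:

    #include <stdio.h>
    #include <stdint.h>

    /* round up to the next power of two, as done for nbuckets */
    static uint64_t next_pow2(uint64_t n)
    {
        uint64_t p = 1;

        while (p < n)
            p <<= 1;
        return p;
    }

    int main(void)
    {
        uint64_t ntuples = 15000000;    /* hypothetical inner relation */
        int      ratios[] = {1, 4};     /* NTUP_PER_BUCKET values */

        for (int i = 0; i < 2; i++)
        {
            uint64_t nbuckets = next_pow2(ntuples / ratios[i]);
            double   mb = (double) (nbuckets * sizeof(void *)) / (1024 * 1024);

            printf("NTUP_PER_BUCKET=%d -> %llu buckets, %.0f MB\n",
                   ratios[i], (unsigned long long) nbuckets, mb);
        }
        return 0;
    }

This prints 128MB for NTUP_PER_BUCKET=1 and 32MB for NTUP_PER_BUCKET=4,
which is roughly where the ~120MB vs. ~30MB figures above come from.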
The other thing is that this memory is currently unaccounted for, i.e.
the space for buckets is not counted within the work_mem limit (unless
I'm missing something in the code). I'm not sure why, and I believe it
should be, so I changed this in the patch.
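For illustration, the accounting boils down to something like this
(hashtable->nbuckets, spaceUsed and spaceAllowed are existing
HashJoinTable fields; where exactly this gets charged, and reacting by
increasing nbatch, is just a sketch of the idea, not the patch itself):

    /*
     * Charge the bucket array (nbuckets pointers of type HashJoinTuple)
     * against the work_mem budget, in addition to the tuples themselves.
     * spaceAllowed is the work_mem limit in bytes.
     */
    hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
    if (hashtable->spaceUsed > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);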

> It might cause some things that were very close to the edge to tip
> over into multi-pass hash joins, but even that is not necessarily a
> bad thing. (When I test with work_mem in the 20 to 50 MB range, I
> often find batches=2 to be ~30% faster than batches=1, I think because
> partitioning into main memory using sequential writes is not much of
> a burden, and building and walking two hash tables that both fit in
> L3 cache is much faster than building 1 hash table in main memory,
> and more than makes up for the work of partitioning. Presumably that
> situation doesn't apply to work_mem 900MB, but isn't NUMA about the
> same thing as L4 cache, in effect?).

Yes, I've seen cases where plans with (nbatch>1) were faster than plans
with (nbatch=1). I'm not entirely sure why :-( but I have two hypotheses
so far:

(a) Caching within the CPU (current CPUs have multiple MBs of L2
    cache), which may make a difference, especially in the size range
    you've mentioned.

(b) Lower tuple/bucket ratio - this may easily happen for example when
    the estimates are slightly lower than reality (either row count or
    row width) and the table narrowly exceeds work_mem, thus forcing
    batching. The resulting hash table has ~50% tuple/bucket on
    average, and thus is faster (see the sketch below).
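A back-of-the-envelope sketch of (b), with made-up numbers - the table
is sized for the planner's estimate, a slight underestimate trips the
work_mem limit, nbatch doubles, and each batch then fills the buckets
only about halfway:

    /* hypothetical numbers, purely to illustrate hypothesis (b) */
    double est_rows  = 1000000.0;          /* planner's estimate */
    double real_rows = 1100000.0;          /* slightly more in reality */
    double nbuckets  = est_rows;           /* sized for NTUP_PER_BUCKET=1 */

    int    nbatch    = 2;                  /* doubled after hitting work_mem */
    double per_batch = real_rows / nbatch; /* tuples hashed per batch */

    /* 0.55 tuples/bucket on average, about half the target of 1.0 */
    double fill      = per_batch / nbuckets;

So the accidental batching buys shorter bucket chains, at the price of
writing out and re-reading the batch files.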

> And if someone had a whole bunch of hash joins which were right in
> that anti-sweet spot, all they have to do is increase work_mem by (at
> most) 15% to get out of it. work_mem is basically impossible to tune,
> so I doubt anyone exists who has a reasonable setting for it which
> can't be increased by 15% and still be reasonable. And if someone does
> have it tuned so tightly, they probably won't be upgrading to new
> major versions without expecting to do some re-tuning.

Right. Still, it's not really nice to get slower hash joins after
upgrading to a new version - I somehow expect to get the same or better
performance, if possible. So I'd like to make it as efficient as
possible, within the given memory limit.

Sadly, the increase may be needed anyway, because the bucket memory now
counts against work_mem, as mentioned above. If committed, this should
probably be mentioned in the release notes or something.

regards
Tomas


