Re: tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: tweaking NTUP_PER_BUCKET
Date |
Msg-id | 53BC5CD6.4080502@fuzzy.cz
In response to | Re: tweaking NTUP_PER_BUCKET (Jeff Janes <jeff.janes@gmail.com>)
List | pgsql-hackers
On 8.7.2014 21:53, Jeff Janes wrote:
> On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large
>> outer relations it may be way cheaper to use higher NTUP_PER_BUCKET
>> values instead of increasing the number of batches (resulting in
>> repeated scans of the outer table). I think it's important (but
>> difficult) to keep these things somehow balanced.
>>
>> With large work_mem values the amount of memory for buckets may be
>> quite significant. E.g. 800MB work_mem may easily give ~120MB of
>> memory taken by buckets with NTUP_PER_BUCKET=1. With
>> NTUP_PER_BUCKET=4 it's just ~30MB.
>
> That extra 90MB is memory well spent, in my book. Versus having to
> walk a 4-deep linked list which is scattered all over main memory
> just to find one entry we care about in it.

Yes, I share this view, although it really depends on how expensive it
is to rescan the outer relation (due to more batches) vs. a lookup in a
"deeper" hash table.

The other thing is that this memory is currently unaccounted for, i.e.
the space for buckets is not counted within the work_mem limit (unless
I'm missing something in the code). I'm not sure why, and I believe it
should be, so I changed this in the patch.

> It might cause some things that were very close to the edge to tip
> over into multi-pass hash joins, but even that is not necessarily a
> bad thing. (When I test with work_mem in the 20 to 50 MB range, I
> often find batches=2 to be ~30% faster than batches=1, I think
> because partitioning into main memory using sequential writes is not
> much of a burden, and building and walking two hash tables that both
> fit in L3 cache is much faster than building 1 hash table in main
> memory, and more than makes up for the work of partitioning.
> Presumably that situation doesn't apply to work_mem 900MB, but isn't
> NUMA about the same thing as L4 cache, in effect?)

Yes, I've seen cases where plans with (nbatch>1) were faster than a
plan with (nbatch=1). I'm not entirely sure why :-( but I have two
hypotheses so far:

(a) Caching within the CPU (current CPUs have multiple MBs of L2
    cache), which may make a difference, especially in the size range
    you've mentioned.

(b) A lower tuple/bucket ratio - this may easily happen, for example,
    if the estimates are slightly lower than reality (either row count
    or row width) and the hash table narrowly exceeds work_mem, thus
    forcing batching. The resulting hash table has ~50% tuple/bucket
    on average, and is thus faster.

> And if someone had a whole bunch of hash joins which were right in
> that anti-sweet spot, all they have to do is increase work_mem by (at
> most) 15% to get out of it. work_mem is basically impossible to tune,
> so I doubt anyone exists who has a reasonable setting for it which
> can't be increased by 15% and still be reasonable. And if someone
> does have it tuned so tightly, they probably won't be upgrading to
> new major versions without expecting to do some re-tuning.

Right. Still, it's not really nice to get slower hash joins after
upgrading to a new version - I somehow expect to get the same or
better performance, if possible. So I'd like to make it as efficient
as possible, within the given memory limit.

Sadly, the increase may be needed anyway because of counting the
bucket memory into work_mem, as mentioned above. If committed, this
should probably be mentioned in the release notes or something.

regards
Tomas
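
For readers following the arithmetic quoted above (~120MB of buckets at
NTUP_PER_BUCKET=1 vs. ~30MB at NTUP_PER_BUCKET=4 for 800MB work_mem):
the bucket array is essentially an array of pointers, with roughly one
bucket per NTUP_PER_BUCKET tuples, rounded up to a power of two. The
sketch below is only a back-of-envelope illustration, not the actual
ExecChooseHashTableSize() logic; the ~50-byte tuple width and 8-byte
pointer size are assumptions for the example, not figures taken from
this thread.

#include <stdio.h>

/*
 * Back-of-envelope estimate of hash join bucket-array memory.
 * Simplified sketch: nbuckets ~= ntuples / ntup_per_bucket, rounded up
 * to the next power of two, with one pointer (8 bytes on 64-bit) per
 * bucket. This is an illustration only, not PostgreSQL's actual
 * sizing code.
 */
static size_t
bucket_array_bytes(size_t ntuples, int ntup_per_bucket)
{
    size_t  nbuckets = 1;

    while (nbuckets * (size_t) ntup_per_bucket < ntuples)
        nbuckets <<= 1;         /* round up to a power of two */

    return nbuckets * sizeof(void *);
}

int
main(void)
{
    /* assume ~800MB of ~50-byte tuples => roughly 16M tuples */
    size_t  ntuples = (800UL * 1024 * 1024) / 50;

    printf("NTUP_PER_BUCKET=1: %zu MB\n",
           bucket_array_bytes(ntuples, 1) / (1024 * 1024));
    printf("NTUP_PER_BUCKET=4: %zu MB\n",
           bucket_array_bytes(ntuples, 4) / (1024 * 1024));

    return 0;
}

Under these assumptions the sketch prints roughly 128 MB and 32 MB,
which is in the same ballpark as the ~120MB and ~30MB figures quoted
in the mail.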