
From: Jeff Janes
Subject: Re: tweaking NTUP_PER_BUCKET
Msg-id: CAMkU=1xDecxEuNL6xhSNUBqvHRm=e96F9H=hZp1UpMe5jDH-fg@mail.gmail.com
In response to: Re: tweaking NTUP_PER_BUCKET ("Tomas Vondra" <tv@fuzzy.cz>)
List: pgsql-hackers
On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 8 July 2014, 14:49, Robert Haas wrote:
> On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
>> once the table is built and there's free space in work_mem. The patch
>> mentioned above makes implementing this possible / rather simple.
>
> Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
> run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
> the common case is that work_mem will be large enough that everything
> fits; and if you do it that way, then you save yourself the trouble of
> rehashing later, which as you point out might lose if there are only a
> few probes.  If it turns out that you run short of memory, you can
> merge pairs of buckets up to three times, effectively doubling
> NTUP_PER_BUCKET each time.
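
(To make that idea concrete, here is a minimal sketch of what merging pairs of buckets might look like, assuming a power-of-two bucket array and bucketno = hash & (nbuckets - 1) assignment; the names are illustrative, not the actual nodeHash.c code.)

/*
 * Illustrative sketch only: collapse the bucket array to half its size,
 * doubling the effective NTUP_PER_BUCKET.  Assumes nbuckets is a power
 * of two and tuples were assigned with bucketno = hash & (nbuckets - 1),
 * so everything that lived in bucket i + nbuckets/2 belongs in bucket i
 * after the merge.
 */
typedef struct HashTupleStub
{
    struct HashTupleStub *next;    /* chain link within a bucket */
    unsigned int hashvalue;        /* saved hash of the tuple */
    /* ... tuple payload would follow ... */
} HashTupleStub;

static void
merge_bucket_pairs(HashTupleStub **buckets, int *nbuckets)
{
    int newnbuckets = *nbuckets / 2;
    int i;

    for (i = 0; i < newnbuckets; i++)
    {
        HashTupleStub *extra = buckets[i + newnbuckets];

        if (extra == NULL)
            continue;

        if (buckets[i] == NULL)
            buckets[i] = extra;           /* lower bucket was empty */
        else
        {
            HashTupleStub *tail = buckets[i];

            while (tail->next != NULL)    /* find end of lower chain */
                tail = tail->next;
            tail->next = extra;           /* splice the upper chain on */
        }
        buckets[i + newnbuckets] = NULL;
    }

    /* the caller could now shrink the bucket array to newnbuckets */
    *nbuckets = newnbuckets;
}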

Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
relations it may be way cheaper to use higher NTUP_PER_BUCKET values
instead of increasing the number of batches (resulting in repeated scans
of the outer table). I think it's important (but difficult) to keep these
things somehow balanced.

With large work_mem values the amount of memory for buckets may be quite
significant. E.g. 800MB work_mem may easily give ~120MB of memory taken by
buckets with NTUP_PER_BUCKET=1. With NTUP_PER_BUCKET=4 it's just ~30MB.
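
(For the record, the arithmetic behind those figures works out roughly as stated if you assume one 8-byte bucket pointer per bucket and an average of ~50 bytes per stored tuple; both numbers are assumptions here, just to make the example concrete.)

#include <stdio.h>

/*
 * Back-of-the-envelope illustration of the bucket-array overhead for an
 * 800MB hash table at different NTUP_PER_BUCKET settings.  The ~50-byte
 * average tuple and 8-byte bucket pointers are assumed, not measured.
 */
int
main(void)
{
    double work_mem_bytes = 800.0 * 1024 * 1024;  /* work_mem = 800MB */
    double bytes_per_tuple = 50.0;                /* assumed average  */
    double ntuples = work_mem_bytes / bytes_per_tuple;
    int    ntup_per_bucket;

    for (ntup_per_bucket = 1; ntup_per_bucket <= 4; ntup_per_bucket *= 2)
    {
        double nbuckets = ntuples / ntup_per_bucket;
        double bucket_mb = nbuckets * sizeof(void *) / (1024.0 * 1024.0);

        printf("NTUP_PER_BUCKET=%d -> ~%.0f MB of bucket pointers\n",
               ntup_per_bucket, bucket_mb);
    }
    return 0;
}

Under those assumptions it prints roughly 128 MB for NTUP_PER_BUCKET=1 and 32 MB for NTUP_PER_BUCKET=4, in the same ballpark as the ~120MB and ~30MB above.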


That extra 90MB is memory well spent, in my book, versus having to walk a 4-deep linked list that is scattered all over main memory just to find the one entry in it we care about.

It might cause some things that were very close to the edge to tip over into multi-pass hash joins, but even that is not necessarily a bad thing.  (When I test with work_mem in the 20 to 50 MB range, I often find batches=2 to be ~30% faster than batches=1, I think because partitioning into main memory using sequential writes is not much of a burden, and building and walking two hash tables that both fit in L3 cache is much faster than building one hash table in main memory, and more than makes up for the work of partitioning.  Presumably that situation doesn't apply to work_mem of 900MB, but isn't NUMA about the same thing as L4 cache, in effect?)

And if someone had a whole bunch of hash joins which were right in that anti-sweet spot, all they have to do is increase work_mem by (at most) 15% to get out of it.  work_mem is basically impossible to tune, so I doubt anyone exists who has a reasonable setting for it which can't be increased by 15% and still be reasonable.  And if someone does have it tuned so tightly, they probably won't be upgrading to new major versions without expecting to do some re-tuning.

Cheers,

Jeff
