Re: tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: tweaking NTUP_PER_BUCKET
Date:
Msg-id: a17d6217fe0c9e459cb45cb764ad727c.squirrel@sq.gransy.com
In response to: Re: tweaking NTUP_PER_BUCKET (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: tweaking NTUP_PER_BUCKET (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On 11 July 2014, 9:27, Simon Riggs wrote:
> On 9 July 2014 18:54, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
>>     of batches this requires)
>
> If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
> it save to recalculate the hash bucket at 10 instead?
> Resizing sounds like it will only be useful if we only just overflow our
> limit.
>
> If we release next version with this as a hardcoded change, my
> understanding is that memory usage for hash joins will leap upwards,
> even if the run time of queries reduces. It sounds like we need some
> kind of parameter to control this. "We made it faster" might not be
> true if we run this on servers that are already experiencing high
> memory pressure.

Sure. We certainly don't want to make things worse for environments with
memory pressure.

The current implementation has two issues regarding memory:

(1) The bucket array is not counted as used memory, i.e. it's not included
in work_mem (so we may overflow work_mem). I plan to fix this, to make the
work_mem accounting more accurate, as that's important for
NTUP_PER_BUCKET=1 (see the sketch below).

(2) There is significant palloc overhead, because each tuple is allocated
separately - see my message from yesterday, where the batch memory context
grew to 1.4GB for 700MB of tuple data. By densely packing the tuples (as
sketched below), I got that down to ~700MB (i.e. pretty much no overhead).
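
For (1), a sketch of roughly what I have in mind (using the HashJoinTable
fields from nodeHash.c; where exactly the check happens is still open):

    /* count the bucket array against work_mem, not just the tuples */
    hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
    if (hashtable->spaceUsed > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

The dense packing in (2) is essentially a trivial chunk allocator - instead
of palloc()ing each tuple separately, carve it out of a larger chunk. A
sketch (names are illustrative only; "chunks" would be a new HashJoinTable
field, and tuples larger than a chunk would need separate handling):

    #define HASH_CHUNK_SIZE (32 * 1024)

    typedef struct HashChunk
    {
        struct HashChunk *next;     /* chunks form a simple linked list */
        Size        used;           /* bytes already handed out */
        char        data[FLEXIBLE_ARRAY_MEMBER];
    } HashChunk;

    static void *
    dense_alloc(HashJoinTable hashtable, Size size)
    {
        HashChunk  *chunk = hashtable->chunks;

        size = MAXALIGN(size);

        /* start a new chunk if the current one can't fit this tuple */
        if (chunk == NULL || chunk->used + size > HASH_CHUNK_SIZE)
        {
            chunk = MemoryContextAlloc(hashtable->batchCxt,
                                offsetof(HashChunk, data) + HASH_CHUNK_SIZE);
            chunk->next = hashtable->chunks;
            chunk->used = 0;
            hashtable->chunks = chunk;
        }

        chunk->used += size;
        return chunk->data + chunk->used - size;
    }

This gets rid of the per-allocation palloc header and padding for every
tuple, which is where most of the overhead came from.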

The palloc overhead seems to be ~20B per tuple (on x86_64), so eliminating
it more than compensates for the ~8B per tuple required for
NTUP_PER_BUCKET=1. And fixing (1) makes the accounting more correct /
predictable.
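
To put rough numbers on it (just an illustration, ignoring the rounding of
nbuckets to a power of two): for an inner relation with 10M tuples, the
palloc overhead is about 10M * 20B = ~200MB, while the extra bucket
pointers for NTUP_PER_BUCKET=1 cost about 10M * 8B = ~80MB, so we still
end up saving roughly 120MB.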

It also improves the issue that palloc overhead is not counted into
work_mem at all (that's why I got ~1.4GB batch context with work_mem=1GB).

So in the end this should give us much lower memory usage for hash joins,
even if we switch to NTUP_PER_BUCKET=1 (although that's pretty much an
independent change). Does that seem reasonable?

Regarding a tunable to control this - I certainly don't want another GUC
that no one really knows how to set right. And I think it's unnecessary,
thanks to the palloc overhead / work_mem accounting fixes described above.

The one thing I'm not sure about is what to do in case of reaching the
work_mem limit (which should only happen with underestimated row count /
row width) - how to decide whether to shrink the hash table or increment
the number of batches. But this is not exclusive to NTUP_PER_BUCKET=1, it
may happen with whatever NTUP_PER_BUCKET value you choose.

The current code does not support resizing at all, so it always increments
the number of batches, but I feel an "interleaved" approach might be more
appropriate (nbuckets/2, nbatches*2, nbuckets/2, nbatches*2, ...). It'd be
nice to have some cost estimates ('how expensive is a rescan' vs. 'how
expensive is a resize'), but I'm not sure how to get that.
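
Just to illustrate the interleaving, the reaction to exceeding work_mem
might look about like this (a sketch only - ExecHashShrinkBuckets() and
the shrink_buckets_next flag don't exist, and the actual decision should
probably use those cost estimates instead of blindly alternating):

    if (hashtable->spaceUsed > hashtable->spaceAllowed)
    {
        if (hashtable->shrink_buckets_next && hashtable->nbuckets > 1024)
        {
            ExecHashShrinkBuckets(hashtable);       /* nbuckets /= 2 */
            hashtable->shrink_buckets_next = false;
        }
        else
        {
            ExecHashIncreaseNumBatches(hashtable);  /* nbatch *= 2 */
            hashtable->shrink_buckets_next = true;
        }
    }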

regards
Tomas



