Thread: tweaking NTUP_PER_BUCKET

tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

03 July 2014, 00:14:10

Hi,

while hacking on the 'dynamic nbucket' patch, scheduled for the next CF
(https://commitfest.postgresql.org/action/patch_view?id=1494) I was
repeatedly stumbling over NTUP_PER_BUCKET. I'd like to propose a change
in how we handle it.


TL;DR; version
--------------

I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
once the table is built and there's free space in work_mem. The patch
mentioned above makes implementing this possible / rather simple.


Long version
------------

I've seen discussions in the list in the past, but I don't recall any
numbers comparing impact of different values. So I did a simple test,
measuring the duration of a simple hashjoin with different hashtable
sizes and NTUP_PER_BUCKET values. In short, something like this:

 CREATE TABLE outer_table (id INT);
 CREATE TABLE inner_table (id INT, val_int INT, val_txt TEXT);

 INSERT INTO outer_table SELECT i
        FROM generate_series(1,50000000) s(i);

 INSERT INTO inner_table SELECT i, mod(i,10000), md5(i::text)
        FROM generate_series(1,10000000) s(i);

 ANALYZE outer_table;
 ANALYZE inner_table;

-- force hash join with a single batch

 set work_mem = '1GB';
 set enable_sort = off;
 set enable_indexscan = off;
 set enable_bitmapscan = off;
 set enable_mergejoin = off;
 set enable_nestloop = off;

-- query with various values in the WHERE condition, determining
-- the size of the hash table (single batch, thanks to work_mem)

 SELECT COUNT(i.val_txt)
   FROM outer_table o LEFT JOIN inner_table i ON (i.id  = o.id)
  WHERE val_int < 1000;

which leads to plans like this: http://explain.depesz.com/s/i7W

I've repeated this with with NTUP_PER_BUCKET between 1 and 10, and
various values in the WHERE condition, and I got this table,
illustrating impact of various NTUP_PER_BUCKET values for various
hashtable sizes:

  kB      NTUP=1  NTUP=2  NTUP=4  NTUP=5  NTUP=9  NTUP=10
  1407    6753    7218    7890    9324    9488    12574
  7032    9417    11586   17417   15820   25732   25402
  35157   13179   17101   27225   24763   43323   43175
  70313   14203   18475   29354   26690   46840   46676
  175782  14319   17458   25325   34542   37342   60949
  351563  16297   19928   29124   43364   43232   70014
  703125  19027   23125   33610   50283   50165   81398

(I've skipped columns for NTUP values that are almost exactly the same
as the adjacent columns due to the fact bucket count grows by doubling).

If you prefer charts to tables, attached are two showing the same data
(complete and "small hashtables"). Also, attached is a CSV with the
results, tracking number of buckets and durations for each hashtable
size / NTUP_PER_BUCKET combination.

I'm aware that the numbers and exact behaviour depends on the hardware
(e.g. L1/L2 cache, RAM etc.), and the query used to get these values is
somewhat simple (at least compared to some real-world values). But I
believe the general behaviour won't vary too much.

The first observation is that for large hashtables the difference
between NTUP=1 and NTUP=10 is much larger than for small ones. Not
really surprising - probably due to better hit ratios for CPU caches.

For the large hashtable the difference is significant, easily beating
price for the rebuild of the hashtable (increasing the nbuckets).


Therefore I'd like to propose dynamic increase of bucket count. The
patch mentioned at the beginning does pretty much exactly this for plans
with significantly underestimated cardinality of the hashtable (thus
causing inappropriately low bucket count and poor performance of the
join). However it allows dynamic approach to NTUP_PER_BUCKET too, so I
propose a change along these lines:

 1) consider lowering the NTUP_PER_BUCKET (see measurements below)

 2) do the initial nbucket/nbatch estimation as today (with whatever
    NTUP_PER_BUCKET value we end up using) and build the hashtable

 3) while building the hashtable, include space for buckets into
    spaceUsed while deciding whether to increase number of batches
    (the fact it's not already included seems wrong), but instead of
    using initial bucket count, do this:

    if (hashtable->spaceUsed + BUCKET_SPACE(hashtable)
                                    > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

    where BUCKET_SPACE is 'how many buckets would be appropriate, given
    current tuples per batch'.

 4) consider what is the "appropriate" number of buckets and rebuild
    the table if appropriate

    This is where the patch fixes the underestimated nbuckets value, but
    what we can do is look at how much free space remains in work_mem
    and use as many buckets as fit into work_mem (capped with tuples
    per batch, of course) if there's free space. Even if the initial
    cardinality estimate was correct (so we're below NTUP_PER_BUCKET)
    this may be a significant speedup.

    Say for example we have 10M tuples in the table - with the default
    NTUP_PER_BUCKET this means ~1M buckets. But there's significant
    performance difference between hashtable with 1 or 10 tuples per
    bucket (see below), and if we have 80MB in work_mem (instead of the
    default 8MB) this may be a huge gain.


Any comments / objections welcome.

regards
Tomas

Attachment

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

03 July 2014, 18:03:21

On 3.7.2014 02:13, Tomas Vondra wrote:
> Hi,
>
> while hacking on the 'dynamic nbucket' patch, scheduled for the next CF
> (https://commitfest.postgresql.org/action/patch_view?id=1494) I was
> repeatedly stumbling over NTUP_PER_BUCKET. I'd like to propose a change
> in how we handle it.
>
>
> TL;DR; version
> --------------
>
> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
> once the table is built and there's free space in work_mem. The patch
> mentioned above makes implementing this possible / rather simple.

Attached is v1 of this experimental patch. It's supposed to be applied
on top of v7 of this patch

   http://www.postgresql.org/message-id/53B59498.3000800@fuzzy.cz

as it shared most of the implementation. So apply it like this:

   patch -p1 < hashjoin-nbuckets-growth-v7.patch
   patch -p1 < hashjoin-dynamic-ntup-v1.patch

It implements the ideas outlined in the previous message, most importantly:

   (a) decreases NTUP_PER_BUCKET to 4

   (b) checks free work_mem when deciding whether to add batch

   (c) after building the batches, increases the number of buckets as
       much as possible, i.e. up to the number of batch tuples, but not
       exceeding work_mem

The improvements for the queries I tried so far are quite significant
(partially due to the NTUP_PER_BUCKET decrease, partially due to the
additional bucket count increase).

The rebuild is quite fast - the patch measures and reports this as a
WARNING, and the timings I've seen are ~12ms per 7MB (i.e. ~120ms for
70MB and ~1200ms for 700MB). Of course, this only makes sense when
compared to how much time it saved, but for the queries I tested so far
this was a good investment.

However it's likely there are queries where this may not be the case,
i.e. where rebuilding the hash table is not worth it. Let me know if you
can construct such query (I wasn't).

regards
Tomas

Attachment

Re: tweaking NTUP_PER_BUCKET

From

Stephen Frost

Date:

03 July 2014, 18:10:19

Tomas,

* Tomas Vondra (tv@fuzzy.cz) wrote:
> However it's likely there are queries where this may not be the case,
> i.e. where rebuilding the hash table is not worth it. Let me know if you
> can construct such query (I wasn't).

Thanks for working on this!  I've been thinking on this for a while and
this seems like it may be a good approach.  Have you considered a bloom
filter over the buckets..?  Also, I'd suggest you check the archives
from about this time last year for test cases that I was using which
showed cases where hashing the larger table was a better choice- those
same cases may also show regression here (or at least would be something
good to test).

Have you tried to work out what a 'worst case' regression for this
change would look like?  Also, how does the planning around this change?
Are we more likely now to hash the smaller table (I'd guess 'yes' just
based on the reduction in NTUP_PER_BUCKET, but did you make any changes
due to the rehashing cost?)?
Thanks,
    Stephen

Re: tweaking NTUP_PER_BUCKET

From

Atri Sharma

Date:

03 July 2014, 18:40:40

On Thu, Jul 3, 2014 at 11:40 PM, Stephen Frost <sfrost@snowman.net> wrote:

Tomas,

* Tomas Vondra (tv@fuzzy.cz) wrote:
> However it's likely there are queries where this may not be the case,
> i.e. where rebuilding the hash table is not worth it. Let me know if you
> can construct such query (I wasn't).

Thanks for working on this! I've been thinking on this for a while and
this seems like it may be a good approach. Have you considered a bloom
filter?

IIRC, last time when we tried doing bloom filters, I was short of some real world useful hash functions that we could use for building the bloom filter.

If we are restarting experiments on this, I would be glad to assist.

Regards,

Atri

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

03 July 2014, 18:50:42

Hi Stephen,

On 3.7.2014 20:10, Stephen Frost wrote:
> Tomas,
> 
> * Tomas Vondra (tv@fuzzy.cz) wrote:
>> However it's likely there are queries where this may not be the case,
>> i.e. where rebuilding the hash table is not worth it. Let me know if you
>> can construct such query (I wasn't).
> 
> Thanks for working on this! I've been thinking on this for a while
> and this seems like it may be a good approach. Have you considered a
> bloom filter over the buckets..? Also, I'd suggest you check the

I know you've experimented with it, but I haven't looked into that yet.

> archives from about this time last year for test cases that I was
> using which showed cases where hashing the larger table was a better
> choice- those same cases may also show regression here (or at least
> would be something good to test).

Good idea, I'll look at the test cases - thanks.

> Have you tried to work out what a 'worst case' regression for this 
> change would look like? Also, how does the planning around this
> change? Are we more likely now to hash the smaller table (I'd guess
> 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you
> make any changes due to the rehashing cost?)?

The case I was thinking about is underestimated cardinality of the inner
table and a small outer table. That'd lead to a large hash table and
very few lookups (thus making the rehash inefficient). I.e. something
like this:
 Hash Join    Seq Scan on small_table (rows=100) (actual rows=100)    Hash       Seq Scan on bad_estimate (rows=100)
(actualrows=1000000000)           Filter: ((a < 100) AND (b < 100))

But I wasn't able to reproduce this reasonably, because in practice
that'd lead to a nested loop or something like that (which is a planning
issue, impossible to fix in hashjoin code).

Tomas

Re: tweaking NTUP_PER_BUCKET

From

Greg Stark

Date:

03 July 2014, 18:52:31

On Thu, Jul 3, 2014 at 11:40 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> IIRC, last time when we tried doing bloom filters, I was short of some real
> world useful hash functions that we could use for building the bloom filter.

Last time was we wanted to use bloom filters in hash joins to filter
out tuples that won't match any of the future hash batches to reduce
the amount of tuples that need to be spilled to disk. However the
problem was that it was unclear for a given amount of memory usage how
to pick the right size bloom filter and how to model how much it would
save versus how much it would cost in reduced hash table size.

I think it just required some good empirical tests and hash join heavy
workloads to come up with some reasonable guesses. We don't need a
perfect model just some reasonable bloom filter size that we're pretty
sure will usually help more than it hurts.

-- 
greg

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

03 July 2014, 21:45:40

On 3.7.2014 20:50, Tomas Vondra wrote:
> Hi Stephen,
> 
> On 3.7.2014 20:10, Stephen Frost wrote:
>> Tomas,
>>
>> * Tomas Vondra (tv@fuzzy.cz) wrote:
>>> However it's likely there are queries where this may not be the case,
>>> i.e. where rebuilding the hash table is not worth it. Let me know if you
>>> can construct such query (I wasn't).
>>
>> Thanks for working on this! I've been thinking on this for a while
>> and this seems like it may be a good approach. Have you considered a
>> bloom filter over the buckets..? Also, I'd suggest you check the
>> archives from about this time last year for test cases that I was
>> using which showed cases where hashing the larger table was a better
>> choice- those same cases may also show regression here (or at least
>> would be something good to test).
> 
> Good idea, I'll look at the test cases - thanks.

I can't find the thread / test cases in the archives. I've found this
thread in hackers:

http://www.postgresql.org/message-id/CAOeZVif-R-iLF966wEipk5By-KhzVLOqpWqurpaK3p5fYw-Rdw@mail.gmail.com

Can you point me to the right one, please?

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Stephen Frost

Date:

06 July 2014, 04:47:57

* Greg Stark (stark@mit.edu) wrote:
> On Thu, Jul 3, 2014 at 11:40 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> > IIRC, last time when we tried doing bloom filters, I was short of some real
> > world useful hash functions that we could use for building the bloom filter.
>
> Last time was we wanted to use bloom filters in hash joins to filter
> out tuples that won't match any of the future hash batches to reduce
> the amount of tuples that need to be spilled to disk. However the
> problem was that it was unclear for a given amount of memory usage how
> to pick the right size bloom filter and how to model how much it would
> save versus how much it would cost in reduced hash table size.

Right.  There's only one hash function available, really, (we don't
currently support multiple hash functions..), unless we want to try and
re-hash the 32bit hash value that we get back (not against trying that,
but it isn't what I'd start with) and it would hopefully be sufficient
for this.

> I think it just required some good empirical tests and hash join heavy
> workloads to come up with some reasonable guesses. We don't need a
> perfect model just some reasonable bloom filter size that we're pretty
> sure will usually help more than it hurts.

This would help out a lot of things, really..  Perhaps what Tomas is
developing regarding test cases would help here also.
Thanks,
    Stephen

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

06 July 2014, 14:16:56

On 6.7.2014 06:47, Stephen Frost wrote:
> * Greg Stark (stark@mit.edu) wrote:
>> Last time was we wanted to use bloom filters in hash joins to
>> filter out tuples that won't match any of the future hash batches
>> to reduce the amount of tuples that need to be spilled to disk.
>> However the problem was that it was unclear for a given amount of
>> memory usage how to pick the right size bloom filter and how to
>> model how much it would save versus how much it would cost in
>> reduced hash table size.
> 
> Right. There's only one hash function available, really, (we don't 
> currently support multiple hash functions..), unless we want to try
> and re-hash the 32bit hash value that we get back (not against trying
> that, but it isn't what I'd start with) and it would hopefully be
> sufficient for this.

I don't really see a need to add more hashing functions. Use the hash
value itself as an input to the bloom filter seems just fine to me.

According to [1] the expected number of collisions is
  n - k + k * math.pow((1 - 1/k), n)

where 'k' is the number of possible values (2^32 in our case), and 'n'
is the number of inserts (i.e. values inserted into the table). With a
1GB work_mem and ~50B per row, that's ~20M rows. According to the
formula, that's ~50k collisions. In other words, 0.25% of false
positives. This seems more than sufficient for the bloom filter, and if
0.25% of false positives causes it to be inefficient I'd say the gains
are not that great anyway. (Unless my calculations are somehow flawed,
of course).

[1] https://www.cs.duke.edu/courses/cps102/spring10/Lectures/L-18.pdf

>> I think it just required some good empirical tests and hash join
>> heavy workloads to come up with some reasonable guesses. We don't
>> need a perfect model just some reasonable bloom filter size that
>> we're pretty sure will usually help more than it hurts.
> 
> This would help out a lot of things, really.. Perhaps what Tomas is 
> developing regarding test cases would help here also.

The test cases might be reused I guess. However I still don't have a
clear idea of how exactly the bloom filter would be implemented. In the
threads I found it was suggested to use per-bucket filtersm, which seems
really strange to me. I see two options:

(a) global filter
   - bloom filter sized according to estimates from planner   - ... so we're screwed if the estimates are significantly
wrong  - create the filter in ExecHashTableCreate   - insert all tuples in ExecHashTableInsert (all batches)

(b) per-batch filter
   - may be built after the batching is completed   - we have reliable numbers at this point (not just estimates)   -
per-batchfilters may be smaller (only fraction of tuples)

So the per-batch filter seems to be a better approach. The question is
the sizing of the bloom filter. According to [2] there are three rules
of thumb for sizing bloom filters:
  (1) 1B per item, to get 2% error rate  (2) 0.7 * bits of hash functions (~5 when using 1B / item)  (3) number of
hashesdominates the performance (more to compute,      more memory accesses)

When working with 20M rows, this means ~20MB and 5 hash functions. Not
that bad, although we don't know how expensive it's to build the filter.

[2] http://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html

However this is somewhat flawed because this assumes the 20M hashed
values are distinct, so if the inner table contains duplicate keys, we
might use smaller bloom filter. But we don't know that value (ndistinct
estimates are rather unreliable, but we might use hyperloglog counter to
do this).

Also, there are cases when we know there will be no misses (e.g. a join
on a FK), and in that case it's pointless to build the bloom filter.
Would be nice to get some flag from the planner whether to build it, a
threshold when to build it (e.g. if rowcount is > 10k, build the filter).

Doing something similar for the two patches I posted (tweaking the
nbuckets dynamically) might also benefit from such planning information,
although it's not that critical for them.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Stephen Frost

Date:

06 July 2014, 15:57:59

Tomas,

* Tomas Vondra (tv@fuzzy.cz) wrote:
> I can't find the thread / test cases in the archives. I've found this
> thread in hackers:
>
> http://www.postgresql.org/message-id/CAOeZVif-R-iLF966wEipk5By-KhzVLOqpWqurpaK3p5fYw-Rdw@mail.gmail.com
>
> Can you point me to the right one, please?

This:

http://www.postgresql.org/message-id/20130328235627.GV4361@tamriel.snowman.net

and I think there may have been other test cases on that thread
somewhere...  I certainly recall having at least a couple of them that I
was playing with.

I can probably find them around here somewhere too, if necessary.
Thanks,
    Stephen

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

06 July 2014, 16:44:54

On 6.7.2014 17:57, Stephen Frost wrote:
> Tomas,
> 
> * Tomas Vondra (tv@fuzzy.cz) wrote:
>> I can't find the thread / test cases in the archives. I've found this
>> thread in hackers:
>>
>> http://www.postgresql.org/message-id/CAOeZVif-R-iLF966wEipk5By-KhzVLOqpWqurpaK3p5fYw-Rdw@mail.gmail.com
>>
>> Can you point me to the right one, please?
> 
> This:
> 
> http://www.postgresql.org/message-id/20130328235627.GV4361@tamriel.snowman.net

Sadly, the link to the SQL does not work :-(
 http://snowman.net/~sfrost/test_case.sql => 404

> and I think there may have been other test cases on that thread
> somewhere...  I certainly recall having at least a couple of them that I
> was playing with.
> 
> I can probably find them around here somewhere too, if necessary.

If you could look for them, that'd be great.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Robert Haas

Date:

08 July 2014, 12:49:56

On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
> once the table is built and there's free space in work_mem. The patch
> mentioned above makes implementing this possible / rather simple.

Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
the common case is that work_mem will be large enough that everything
fits; and if you do it that way, then you save yourself the trouble of
rehashing later, which as you point out might lose if there are only a
few probes.  If it turns out that you run short of memory, you can
merge pairs of buckets up to three times, effectively doubling
NTUP_PER_BUCKET each time.

Yet another idea is to stick with your scheme, but do the dynamic
bucket splits lazily.  Set a flag on each bucket indicating whether or
not it needs a lazy split.  When someone actually probes the hash
table, if the flag is set for a particular bucket, move any tuples
that don't belong as we scan the bucket.  If we reach the end of the
bucket chain, clear the flag.

Not sure either of these are better (though I kind of like the first
one) but I thought I'd throw them out there...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: tweaking NTUP_PER_BUCKET

From

"Tomas Vondra"

Date:

08 July 2014, 13:35:19

On 8 Červenec 2014, 14:49, Robert Haas wrote:
> On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
>> once the table is built and there's free space in work_mem. The patch
>> mentioned above makes implementing this possible / rather simple.
>
> Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
> run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
> the common case is that work_mem will be large enough that everything
> fits; and if you do it that way, then you save yourself the trouble of
> rehashing later, which as you point out might lose if there are only a
> few probes.  If it turns out that you run short of memory, you can
> merge pairs of buckets up to three times, effectively doubling
> NTUP_PER_BUCKET each time.

Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
relations it may be way cheaper to use higher NTUP_PER_BUCKET values
instead of increasing the number of batches (resulting in repeated scans
of the outer table). I think it's important (but difficult) to keep these
things somehow balanced.

With large work_mem values the amount of memory for buckets may be quite
significant. E.g. 800MB work_mem may easily give ~120MB of memory taken by
buckets with NTUP_PER_BUCKET=1. With NTUP_PER_BUCKET=4 it's just ~30MB.

> Yet another idea is to stick with your scheme, but do the dynamic
> bucket splits lazily.  Set a flag on each bucket indicating whether or
> not it needs a lazy split.  When someone actually probes the hash
> table, if the flag is set for a particular bucket, move any tuples
> that don't belong as we scan the bucket.  If we reach the end of the
> bucket chain, clear the flag.

I don't think a lazy scheme makes much sense here. There are two clearly
separated phases - loading the table and search.

Also, it seems to me it can't work as described. Say we build a table and
then find out we need a table 4x the size. If I understand your proposal
correctly, you'd just resize buckets array, copy the existing buckets
(first 1/4 buckets) and set 'needs_split' flags. Correct?

Well, the problem is that while scanning a bucket you already need to know
which tuples belong there. But with the lazy approach, the tuples might
still be in the old (not yet split) buckets. So you'd have to search the
current bucket and all buckets that were not split yet. Or did I miss
something?

Tomas

Re: tweaking NTUP_PER_BUCKET

From

Robert Haas

Date:

08 July 2014, 14:16:23

On Tue, Jul 8, 2014 at 9:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 8 Červenec 2014, 14:49, Robert Haas wrote:
>> On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
>>> once the table is built and there's free space in work_mem. The patch
>>> mentioned above makes implementing this possible / rather simple.
>>
>> Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
>> run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
>> the common case is that work_mem will be large enough that everything
>> fits; and if you do it that way, then you save yourself the trouble of
>> rehashing later, which as you point out might lose if there are only a
>> few probes.  If it turns out that you run short of memory, you can
>> merge pairs of buckets up to three times, effectively doubling
>> NTUP_PER_BUCKET each time.
>
> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
> relations it may be way cheaper to use higher NTUP_PER_BUCKET values
> instead of increasing the number of batches (resulting in repeated scans
> of the outer table). I think it's important (but difficult) to keep these
> things somehow balanced.

Right, I think that's clear.  I'm just pointing out that you get to
decide: you can either start with a larger NTUP_PER_BUCKET and then
reduce it if you enough memory left, or you can start with a smaller
NTUP_PER_BUCKET and then increase it if you run short of memory.

>> Yet another idea is to stick with your scheme, but do the dynamic
>> bucket splits lazily.  Set a flag on each bucket indicating whether or
>> not it needs a lazy split.  When someone actually probes the hash
>> table, if the flag is set for a particular bucket, move any tuples
>> that don't belong as we scan the bucket.  If we reach the end of the
>> bucket chain, clear the flag.
>
> I don't think a lazy scheme makes much sense here. There are two clearly
> separated phases - loading the table and search.
>
> Also, it seems to me it can't work as described. Say we build a table and
> then find out we need a table 4x the size. If I understand your proposal
> correctly, you'd just resize buckets array, copy the existing buckets
> (first 1/4 buckets) and set 'needs_split' flags. Correct?
>
> Well, the problem is that while scanning a bucket you already need to know
> which tuples belong there. But with the lazy approach, the tuples might
> still be in the old (not yet split) buckets. So you'd have to search the
> current bucket and all buckets that were not split yet. Or did I miss
> something?

If you have, say, bucket 0..63 and you decide to change it to buckets
0..255, then any tuple that isn't in bucket N has to be in bucket N %
64.  More generally, any time you expand or contract by a power of two
it's pretty easy to figure out where tuples have to go.

But even if you do something that's not a power of two, like say a 10x
compaction of the buckets, there's still only two buckets that can
contain any particular hash value.  If the hash value is H, the old
number of buckets is O, and the new number of buckets is N, then the
tuple has got to be in bucket H % O or bucket H % N.  If bucket H % O
doesn't have the needs-split flag set, then it must be in H % N (if it
exists at all).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: tweaking NTUP_PER_BUCKET

From

"Tomas Vondra"

Date:

08 July 2014, 16:06:32

On 8 Červenec 2014, 16:16, Robert Haas wrote:
> On Tue, Jul 8, 2014 at 9:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
>> relations it may be way cheaper to use higher NTUP_PER_BUCKET values
>> instead of increasing the number of batches (resulting in repeated scans
>> of the outer table). I think it's important (but difficult) to keep
>> these
>> things somehow balanced.
>
> Right, I think that's clear.  I'm just pointing out that you get to
> decide: you can either start with a larger NTUP_PER_BUCKET and then
> reduce it if you enough memory left, or you can start with a smaller
> NTUP_PER_BUCKET and then increase it if you run short of memory.

I don't think those two approaches are equal.

With the approach I took, I can use a compromise value (NTUP=4) initially,
and only resize the hash table once at the end (while keeping the amount
of memory under work_mem all the time).

With the "NTUP=1 and increase in case of memory pressure" you have to
shrink the table immediately (to fit into work_mem), and if the hash table
gets split into multiple batches you're suddenly in a situation with lower
memory pressure and you may need to increase it again.

>>> Yet another idea is to stick with your scheme, but do the dynamic
>>> bucket splits lazily.  Set a flag on each bucket indicating whether or
>>> not it needs a lazy split.  When someone actually probes the hash
>>> table, if the flag is set for a particular bucket, move any tuples
>>> that don't belong as we scan the bucket.  If we reach the end of the
>>> bucket chain, clear the flag.
>>
>> I don't think a lazy scheme makes much sense here. There are two clearly
>> separated phases - loading the table and search.
>>
>> Also, it seems to me it can't work as described. Say we build a table
>> and
>> then find out we need a table 4x the size. If I understand your proposal
>> correctly, you'd just resize buckets array, copy the existing buckets
>> (first 1/4 buckets) and set 'needs_split' flags. Correct?
>>
>> Well, the problem is that while scanning a bucket you already need to
>> know
>> which tuples belong there. But with the lazy approach, the tuples might
>> still be in the old (not yet split) buckets. So you'd have to search the
>> current bucket and all buckets that were not split yet. Or did I miss
>> something?
>
> If you have, say, bucket 0..63 and you decide to change it to buckets
> 0..255, then any tuple that isn't in bucket N has to be in bucket N %
> 64.  More generally, any time you expand or contract by a power of two
> it's pretty easy to figure out where tuples have to go.

OK.

> But even if you do something that's not a power of two, like say a 10x
> compaction of the buckets, there's still only two buckets that can
> contain any particular hash value.  If the hash value is H, the old
> number of buckets is O, and the new number of buckets is N, then the
> tuple has got to be in bucket H % O or bucket H % N.  If bucket H % O
> doesn't have the needs-split flag set, then it must be in H % N (if it
> exists at all).

I wonder if this is really worth the effort - my guess is it's efficient
only if large portion of buckets is not visited (and thus does not need to
be split) at all. Not sure how common that is (our workloads certainly are
not like that).

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Robert Haas

Date:

08 July 2014, 17:00:36

On Tue, Jul 8, 2014 at 12:06 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 8 Červenec 2014, 16:16, Robert Haas wrote:
>> On Tue, Jul 8, 2014 at 9:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
>>> relations it may be way cheaper to use higher NTUP_PER_BUCKET values
>>> instead of increasing the number of batches (resulting in repeated scans
>>> of the outer table). I think it's important (but difficult) to keep
>>> these
>>> things somehow balanced.
>>
>> Right, I think that's clear.  I'm just pointing out that you get to
>> decide: you can either start with a larger NTUP_PER_BUCKET and then
>> reduce it if you enough memory left, or you can start with a smaller
>> NTUP_PER_BUCKET and then increase it if you run short of memory.
>
> I don't think those two approaches are equal.
>
> With the approach I took, I can use a compromise value (NTUP=4) initially,
> and only resize the hash table once at the end (while keeping the amount
> of memory under work_mem all the time).
>
> With the "NTUP=1 and increase in case of memory pressure" you have to
> shrink the table immediately (to fit into work_mem), and if the hash table
> gets split into multiple batches you're suddenly in a situation with lower
> memory pressure and you may need to increase it again.

True.  On the other hand, this really only comes into play when the
estimates are wrong.  If you know at the start how many tuples you're
going to need to store and how big they are, you determine whether
NTUP_PER_BUCKET=1 is going to work before you even start building the
hash table.  If it isn't, then you use fewer buckets right from the
start.  If we start by estimating a small value for NTUP_PER_BUCKET
and then let it float upward if we turn out to have more tuples than
expected, we're optimizing for the case where our statistics are
right.  If we start by estimating a larger value for NTUP_PER_BUCKET
than what we think we need to fit within work_mem, we're basically
guessing that our statistics are more likely to be wrong than to be
right.  I think.

> I wonder if this is really worth the effort - my guess is it's efficient
> only if large portion of buckets is not visited (and thus does not need to
> be split) at all. Not sure how common that is (our workloads certainly are
> not like that).

Yeah.  It may be a bad idea.  I threw it out there as a possible way
of trying to mitigate the worst case, which is when you trouble to
build the hash table and then make very few probes.  But that may not
be worth the complexity that this would introduce.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

08 July 2014, 18:28:58

On 8.7.2014 19:00, Robert Haas wrote:
> On Tue, Jul 8, 2014 at 12:06 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 8 Červenec 2014, 16:16, Robert Haas wrote:
>>
>>> Right, I think that's clear. I'm just pointing out that you get
>>> to decide: you can either start with a larger NTUP_PER_BUCKET and
>>> then reduce it if you enough memory left, or you can start with a
>>> smaller NTUP_PER_BUCKET and then increase it if you run short of
>>> memory.
>>
>> I don't think those two approaches are equal.
>>
>> With the approach I took, I can use a compromise value (NTUP=4) 
>> initially, and only resize the hash table once at the end (while
>> keeping the amount of memory under work_mem all the time).
>>
>> With the "NTUP=1 and increase in case of memory pressure" you have 
>> to shrink the table immediately (to fit into work_mem), and if the 
>> hash table gets split into multiple batches you're suddenly in a
>> situation with lower memory pressure and you may need to increase
>> it again.
> 
> True. On the other hand, this really only comes into play when the 
> estimates are wrong. If you know at the start how many tuples you're 
> going to need to store and how big they are, you determine whether 
> NTUP_PER_BUCKET=1 is going to work before you even start building
> the hash table. If it isn't, then you use fewer buckets right from
> the start. If we start by estimating a small value for
> NTUP_PER_BUCKET and then let it float upward if we turn out to have
> more tuples than expected, we're optimizing for the case where our
> statistics are right. If we start by estimating a larger value for
> NTUP_PER_BUCKET than what we think we need to fit within work_mem,
> we're basically guessing that our statistics are more likely to be
> wrong than to be right. I think.

Good point. The fist patch was targetted exactly at the wrongly
estimated queries. This patch attempts to apply the rehash to all plans,
and maybe there's a better way.

If the estimates are correct / not too off, we can use this information
to do the sizing 'right' at the beginning (without facing rehashes later).

Over-estimates are not a problem, because it won't make the hash table
slower (it'll be sized for more tuples) and we can't change the number
of batches anyway.

With under-estimates we have to decide whether to resize the hash or
increase the number of batches.

In both cases that matter (correct estimates and under-estimates) we
have to decide whether to increase the number of buckets or batches. I'm
not sure how to do that.

>> I wonder if this is really worth the effort - my guess is it's 
>> efficient only if large portion of buckets is not visited (and
>> thus does not need to be split) at all. Not sure how common that is
>> (our workloads certainly are not like that).
> 
> Yeah. It may be a bad idea. I threw it out there as a possible way of
> trying to mitigate the worst case, which is when you trouble to build
> the hash table and then make very few probes. But that may not be
> worth the complexity that this would introduce.

Let's keep it simple for now. I think the sizing question (explained
above) is more important and needs to be solved first.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Jeff Janes

Date:

08 July 2014, 19:53:44

On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 8 Červenec 2014, 14:49, Robert Haas wrote:
> On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
>> once the table is built and there's free space in work_mem. The patch
>> mentioned above makes implementing this possible / rather simple.
>
> Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
> run out of memory, increase NTUP_PER_BUCKET. I'd like to think that
> the common case is that work_mem will be large enough that everything
> fits; and if you do it that way, then you save yourself the trouble of
> rehashing later, which as you point out might lose if there are only a
> few probes. If it turns out that you run short of memory, you can
> merge pairs of buckets up to three times, effectively doubling
> NTUP_PER_BUCKET each time.

Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
relations it may be way cheaper to use higher NTUP_PER_BUCKET values
instead of increasing the number of batches (resulting in repeated scans
of the outer table). I think it's important (but difficult) to keep these
things somehow balanced.

With large work_mem values the amount of memory for buckets may be quite
significant. E.g. 800MB work_mem may easily give ~120MB of memory taken by
buckets with NTUP_PER_BUCKET=1. With NTUP_PER_BUCKET=4 it's just ~30MB.

That extra 90MB is memory well spent, in my book. Versus having to walk a 4-deep linked list which is scattered all over main memory just to find one entry we care about in it.

It might cause some things that were very close to the edge to tip over into multi-pass hash joins, but even that is not necessarily a bad thing. (When I test with work_mem in the 20 to 50 MB range, I often find batches=2 be ~30% faster than batches=1, I think because partitioning into main memory using sequential writes is not much of a burden, and building and walking two hash tables that both fit in L3 cache is much faster than building 1 hash table in main memory, and more than makes up for the work of partitioning. Presumably that situation doesn't apply to work_mem 900MB, but isn't NUMA about the same thing as L4 cache, in effect?).

And if someone had a whole bunch of hash joins which were right in that anti-sweet spot, all they have to do is increase work_mem by (at most) 15% to get out of it. work_mem is basically impossible to tune, so I doubt anyone exists who has a reasonable setting for it which can' be increased by 15% and still be reasonable. And if someone does have it tuned so tightly, they probably won't be upgrading to new major versions without expecting to do some re-tuning.

Cheers,

Jeff

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

08 July 2014, 21:04:38

On 8.7.2014 21:53, Jeff Janes wrote:
> On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> 
>> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large
>> outer relations it may be way cheaper to use higher NTUP_PER_BUCKET
>> values instead of increasing the number of batches (resulting in
>> repeated scans of the outer table). I think it's important (but
>> difficult) to keep these things somehow balanced.
>> 
>> With large work_mem values the amount of memory for buckets may be
>> quite significant. E.g. 800MB work_mem may easily give ~120MB of
>> memory taken by buckets with NTUP_PER_BUCKET=1. With
>> NTUP_PER_BUCKET=4 it's just ~30MB.
> 
> 
> That extra 90MB is memory well spent, in my book. Versus having to
> walk a 4-deep linked list which is scattered all over main memory
> just to find one entry we care about in it.

Yes, I share this view, although it really depends on how expensive it's
to rescan the outer relation (due to more batches) vs. lookup in a
"deeper" hash table.

The other thing is that this memory is currently unaccounted for, i.e.
the space for buckets is not counted within the work_mem limit (unless
I'm missing something in the code). I'm not sure why, and I believe it
should be, so I changer this in the patch.

> It might cause some things that were very close to the edge to tip
> over into multi-pass hash joins, but even that is not necessarily a
> bad thing. (When I test with work_mem in the 20 to 50 MB range, I
> often find batches=2 be ~30% faster than batches=1, I think because 
> partitioning into main memory using sequential writes is not much of
> a burden, and building and walking two hash tables that both fit in
> L3 cache is much faster than building 1 hash table in main memory,
> and more than makes up for the work of partitioning. Presumably that
> situation doesn't apply to work_mem 900MB, but isn't NUMA about the
> same thing as L4 cache, in effect?).

Yes, I've seen cases where plans with (nbatch>1) were faster than a plan
with (nbatch=1). I'm not entirely sure why :-( but I have two hypotheses
so far:

(a) Caching within CPU (current CPUs have multiple MBs of L2 cache),   which may make a difference, especially in the
sizerange you've   mentioned.

(b) Lower tuple/bucket ratio - this may easily happen for example if   the estimates are slighly lower than reality
(eitherrow count or   row width) and narrowly exceed work_mem, thus forcing batching.   The resulting hash table has
~50%tuple/bucket on average, and   thus is faster.

> And if someone had a whole bunch of hash joins which were right in
> that anti-sweet spot, all they have to do is increase work_mem by (at
> most) 15% to get out of it. work_mem is basically impossible to tune,
> so I doubt anyone exists who has a reasonable setting for it which
> can' be increased by 15% and still be reasonable. And if someone does
> have it tuned so tightly, they probably won't be upgrading to new
> major versions without expecting to do some re-tuning.

Right. Still, it's not really nice to get slower hash joins after
upgrading to a new version - I somehow expect to get the same or better
performance, if possible. So I'd like to make it as efficient as
possible, within the given memory limit.

Sadly, the increase may be needed anyway because of counting the bucket
memory into work_mem, as mentioned above. If committed, this should be
probably mentioned in release notes or something.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

08 July 2014, 21:17:09

Hi,

Thinking about this a bit more, do we really need to build the hash
table on the first pass? Why not to do this:

(1) batching   - read the tuples, stuff them into a simple list   - don't build the hash table yet

(2) building the hash table   - we have all the tuples in a simple list, batching is done   - we know exact row count,
cansize the table properly   - build the table
 

Also, maybe we could use a regular linear hash table [1], instead of
using the current implementation with NTUP_PER_BUCKET=1. (Although,
that'd be absolutely awful with duplicates.)

regards
Tomas

[1] http://en.wikipedia.org/wiki/Linear_probing

Re: tweaking NTUP_PER_BUCKET

From

Robert Haas

Date:

09 July 2014, 14:07:51

On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Thinking about this a bit more, do we really need to build the hash
> table on the first pass? Why not to do this:
>
> (1) batching
>     - read the tuples, stuff them into a simple list
>     - don't build the hash table yet
>
> (2) building the hash table
>     - we have all the tuples in a simple list, batching is done
>     - we know exact row count, can size the table properly
>     - build the table

We could do this, and in fact we could save quite a bit of memory if
we allocated say 1MB chunks and packed the tuples in tightly instead
of palloc-ing each one separately.  But I worry that rescanning the
data to build the hash table would slow things down too much.

> Also, maybe we could use a regular linear hash table [1], instead of
> using the current implementation with NTUP_PER_BUCKET=1. (Although,
> that'd be absolutely awful with duplicates.)

Linear probing is pretty awful unless your load factor is << 1.  You'd
probably want NTUP_PER_BUCKET=0.25, or something like that, which
would eat up a lot of memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

09 July 2014, 17:54:50

On 9.7.2014 16:07, Robert Haas wrote:
> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Thinking about this a bit more, do we really need to build the
>> hash table on the first pass? Why not to do this:
>> 
>> (1) batching - read the tuples, stuff them into a simple list -
>> don't build the hash table yet
>> 
>> (2) building the hash table - we have all the tuples in a simple
>> list, batching is done - we know exact row count, can size the
>> table properly - build the table
> 
> We could do this, and in fact we could save quite a bit of memory if 
> we allocated say 1MB chunks and packed the tuples in tightly instead 
> of palloc-ing each one separately.  But I worry that rescanning the 
> data to build the hash table would slow things down too much.

OK. I think we should shoot for no negative impact on well estimated
plans, and the rescans might violate that. I need to think about this
for some time, but my current plan is this:

(1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number   of batches this requires)

(2) build the batches as in the current patch

(3) if the hash table is too small, resize it

Which is not really different from the current patch.

>> Also, maybe we could use a regular linear hash table [1], instead
>> of using the current implementation with NTUP_PER_BUCKET=1.
>> (Although, that'd be absolutely awful with duplicates.)
> 
> Linear probing is pretty awful unless your load factor is << 1.
> You'd probably want NTUP_PER_BUCKET=0.25, or something like that,
> which would eat up a lot of memory.

My experience is that 0.5-0.75 load factor is quite OK, and
NTUP_PER_BUCKET=1 gives you ~0.75 load on average, so it's not that
different. Anyway, the inability to handle duplicies reasonably is
probably a show-stopper.

Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

10 July 2014, 19:34:08

On 9.7.2014 16:07, Robert Haas wrote:
> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Thinking about this a bit more, do we really need to build the hash
>> table on the first pass? Why not to do this:
>>
>> (1) batching
>>     - read the tuples, stuff them into a simple list
>>     - don't build the hash table yet
>>
>> (2) building the hash table
>>     - we have all the tuples in a simple list, batching is done
>>     - we know exact row count, can size the table properly
>>     - build the table
>
> We could do this, and in fact we could save quite a bit of memory if
> we allocated say 1MB chunks and packed the tuples in tightly instead
> of palloc-ing each one separately.  But I worry that rescanning the
> data to build the hash table would slow things down too much.

I did a quick test of how much memory we could save by this. The
attached patch densely packs the tuples into 32kB chunks (1MB seems way
too much because of small work_mem values, but I guess this might be
tuned based on number of tuples / work_mem size ...).

Tested on query like this (see the first message in this thread how to
generate the tables):

                             QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=2014697.64..2014697.65 rows=1 width=33) (actual
time=63796.270..63796.271 rows=1 loops=1)
   ->  Hash Left Join  (cost=318458.14..1889697.60 rows=50000016
width=33) (actual time=2865.656..61778.592 rows=50000000 loops=1)
         Hash Cond: (o.id = i.id)
         ->  Seq Scan on outer_table o  (cost=0.00..721239.16
rows=50000016 width=4) (actual time=0.033..2676.234 rows=50000000 loops=1)
         ->  Hash  (cost=193458.06..193458.06 rows=10000006 width=37)
(actual time=2855.408..2855.408 rows=10000000 loops=1)
               Buckets: 1048576  Batches: 1  Memory Usage: 703125kB
               ->  Seq Scan on inner_table i  (cost=0.00..193458.06
rows=10000006 width=37) (actual time=0.044..952.802 rows=10000000 loops=1)
 Planning time: 1.139 ms
 Execution time: 63889.056 ms
(9 rows)

I.e. it creates a single batch with ~700 MB of tuples. Without the
patch, top shows this:

     VIRT    RES    SHR S  %CPU %MEM   COMMAND
  2540356 1,356g   5936 R 100,0 17,6   postgres: EXPLAIN

and the MemoryContextStats added to MultiExecHash shows this:

HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11
chunks); 1448394448 used

So yeah, the overhead is pretty huge in this case - basicaly 100%
overhead, because the inner table row width is only ~40B. With wider
rows the overhead will be lower.

Now, with the patch it looks like this:

     VIRT    RES    SHR S  %CPU %MEM   COMMAND
  1835332 720200   6096 R 100,0  8,9   postgres: EXPLAIN

HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks);
729651520 used

So, pretty much no overhead at all. It was slightly faster too (~5%) but
I haven't done much testing so it might be measurement error.

This patch is pretty independent of the other changes discussed here
(tweaking NTUP_PER_BUCKET / nbuckets) so I'll keep it separate.

regards
Tomas

Attachment

hashjoin-alloc.patch

Re: tweaking NTUP_PER_BUCKET

From

Simon Riggs

Date:

11 July 2014, 07:27:49

On 9 July 2014 18:54, Tomas Vondra <tv@fuzzy.cz> wrote:

> (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
>     of batches this requires)

If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
it save to recalculate the hash bucket at 10 instead?
Resizing sounds like it will only be useful of we only just overflow our limit.

If we release next version with this as a hardcoded change, my
understanding is that memory usage for hash joins will leap upwards,
even if the run time of queries reduces. It sounds like we need some
kind of parameter to control this. "We made it faster" might not be
true if we run this on servers that are already experiencing high
memory pressure.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: tweaking NTUP_PER_BUCKET

From

"Tomas Vondra"

Date:

11 July 2014, 09:23:34

On 11 Červenec 2014, 9:27, Simon Riggs wrote:
> On 9 July 2014 18:54, Tomas Vondra <tv@fuzzy.cz> wrote:
>
>> (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
>>     of batches this requires)
>
> If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
> it save to recalculate the hash bucket at 10 instead?
> Resizing sounds like it will only be useful of we only just overflow our
> limit.
>
> If we release next version with this as a hardcoded change, my
> understanding is that memory usage for hash joins will leap upwards,
> even if the run time of queries reduces. It sounds like we need some
> kind of parameter to control this. "We made it faster" might not be
> true if we run this on servers that are already experiencing high
> memory pressure.

Sure. We certainly don't want to make things worse for environments with
memory pressure.

The current implementation has two issues regarding memory:

(1) It does not include buckets into used memory, i.e. it's not included
into work_mem (so we may overflow work_mem). I plan to fix this, to make
work_mem a bit more correct, as it's important for cases with
NTUP_PER_BUCKET=1.

(2) There's a significant palloc overhead, because of allocating each
tuple separately - see my message from yesterday, where I observed the
batch memory context to get 1.4GB memory for 700MB of tuple data. By
densely packing the tuples, I got down to ~700MB (i.e. pretty much no
overhead).

The palloc overhead seems to be 20B (on x86_64) per tuple, and eliminating
this it more than compensates for ~8B per tuple, required for
NTUP_PER_BUCKET=1. And fixing (1) makes it more correct / predictable.

It also improves the issue that palloc overhead is not counted into
work_mem at all (that's why I got ~1.4GB batch context with work_mem=1GB).

So in the end this should give us much lower memory usage for hash joins,
even if we switch to NTUP_PER_BUCKET=1 (although that's pretty much
independent change). Does that seem reasonable?

Regarding the tunable to control this - I certainly don't want another GUC
no one really knows how to set right. And I think it's unnecessary thanks
to the palloc overhead / work_mem accounting fix, described above.

The one thing I'm not sure about is what to do in case of reaching the
work_mem limit (which should only happen with underestimated row count /
row width) - how to decide whether to shrink the hash table or increment
the number of batches. But this is not exclusive to NTUP_PER_BUCKET=1, it
may happen with whatever NTUP_PER_BUCKET value you choose.

The current code does not support resizing at all, so it always increments
the number of batches, but I feel an "interleaved" approach might be more
appropriate (nbuckets/2, nbatches*2, nbuckets/2, nbatches*2, ...). It'd be
nice to have some cost estimates ('how expensive is a rescan' vs. 'how
expensive is a resize'), but I'm not sure how to get that.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Simon Riggs

Date:

11 July 2014, 10:38:12

On 11 July 2014 10:23, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 11 Červenec 2014, 9:27, Simon Riggs wrote:
>> On 9 July 2014 18:54, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>>> (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
>>>     of batches this requires)
>>
>> If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
>> it save to recalculate the hash bucket at 10 instead?
>> Resizing sounds like it will only be useful of we only just overflow our
>> limit.
>>
>> If we release next version with this as a hardcoded change, my
>> understanding is that memory usage for hash joins will leap upwards,
>> even if the run time of queries reduces. It sounds like we need some
>> kind of parameter to control this. "We made it faster" might not be
>> true if we run this on servers that are already experiencing high
>> memory pressure.
>
> Sure. We certainly don't want to make things worse for environments with
> memory pressure.
>
> The current implementation has two issues regarding memory:
>
> (1) It does not include buckets into used memory, i.e. it's not included
> into work_mem (so we may overflow work_mem). I plan to fix this, to make
> work_mem a bit more correct, as it's important for cases with
> NTUP_PER_BUCKET=1.
>
> (2) There's a significant palloc overhead, because of allocating each
> tuple separately - see my message from yesterday, where I observed the
> batch memory context to get 1.4GB memory for 700MB of tuple data. By
> densely packing the tuples, I got down to ~700MB (i.e. pretty much no
> overhead).
>
> The palloc overhead seems to be 20B (on x86_64) per tuple, and eliminating
> this it more than compensates for ~8B per tuple, required for
> NTUP_PER_BUCKET=1. And fixing (1) makes it more correct / predictable.
>
> It also improves the issue that palloc overhead is not counted into
> work_mem at all (that's why I got ~1.4GB batch context with work_mem=1GB).
>
> So in the end this should give us much lower memory usage for hash joins,
> even if we switch to NTUP_PER_BUCKET=1 (although that's pretty much
> independent change). Does that seem reasonable?

Yes, that alleviates my concern. Thanks.

> Regarding the tunable to control this - I certainly don't want another GUC
> no one really knows how to set right. And I think it's unnecessary thanks
> to the palloc overhead / work_mem accounting fix, described above.

Agreed, nor does anyone.

> The one thing I'm not sure about is what to do in case of reaching the
> work_mem limit (which should only happen with underestimated row count /
> row width) - how to decide whether to shrink the hash table or increment
> the number of batches. But this is not exclusive to NTUP_PER_BUCKET=1, it
> may happen with whatever NTUP_PER_BUCKET value you choose.
>
> The current code does not support resizing at all, so it always increments
> the number of batches, but I feel an "interleaved" approach might be more
> appropriate (nbuckets/2, nbatches*2, nbuckets/2, nbatches*2, ...). It'd be
> nice to have some cost estimates ('how expensive is a rescan' vs. 'how
> expensive is a resize'), but I'm not sure how to get that.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

11 July 2014, 17:25:53

On 10.7.2014 21:33, Tomas Vondra wrote:
> On 9.7.2014 16:07, Robert Haas wrote:
>> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Thinking about this a bit more, do we really need to build the hash
>>> table on the first pass? Why not to do this:
>>>
>>> (1) batching
>>>     - read the tuples, stuff them into a simple list
>>>     - don't build the hash table yet
>>>
>>> (2) building the hash table
>>>     - we have all the tuples in a simple list, batching is done
>>>     - we know exact row count, can size the table properly
>>>     - build the table
>>
>> We could do this, and in fact we could save quite a bit of memory if
>> we allocated say 1MB chunks and packed the tuples in tightly instead
>> of palloc-ing each one separately.  But I worry that rescanning the
>> data to build the hash table would slow things down too much.
> 
> I did a quick test of how much memory we could save by this. The 
> attached patch densely packs the tuples into 32kB chunks (1MB seems
> way too much because of small work_mem values, but I guess this might
> be tuned based on number of tuples / work_mem size ...).

Turns out getting this working properly will quite complicated. The
patch was only a quick attempt to see how much overhead there is, and
didn't solve one important details - batching.

The problem is that when increasing the number of batches, we need to
get rid of the tuples from one of them. Which means calling pfree() on
the tuples written to a temporary file, and that's not possible with the
dense allocation.

1) copy into new chunks (dead end)
----------------------------------

The first idea I had was to just "copy" everything into new chunks and
then throw away the old ones, but this way we might end up using 150% of
work_mem on average (when the two batches are about 1/2 the data each),
and in an extreme case up to 200% of work_mem (all tuples having the
same key and thus falling into the same batch). So I don't think that's
really a good approach.

2) walking through the tuples sequentially
------------------------------------------

The other option is not to walk the tuples through buckets, but by
walking throught the chunks - we know the tuples are stored as
HashJoinTuple/MinimalTuple, so it shouldn't be that difficult. And by
walking the tuples in the order they're stored, we may just rearrange
the tuples in already allocated memory. And maybe it could be even
faster than the current code, because of the sequential access to the
memory (as opposed to the random access through buckets) and CPU caches.
So I like this approach - it's simple, clean and shouldn't add any
overhead (memory or CPU).

3) batch-aware chunks
---------------------

I also think a "batch-aware" chunks might work. Say we're starting with
nbatch=N. Instead of allocating everything in a single chunk, we'll
allocate the tuples from the chunks according to a "hypothetical batch
number" - what batch would the tuple belong to if we had (nbatch=N*2).
So we'd have two chunks (or sequences of chunks), and we'd allocate the
tuples into them.

Then if we actually need to increase the number of batches, we know we
can just free one of the chunks, because all of the tuples are from the
'wrong' batche, and redistribute the remaining tuples into the new
chunks (according to the new hypothetical batch numbers).

I'm not sure how much overhead this would be, though. I think it can be
done quite efficiently, though, and it shouldn't have any impact at all,
if we don't do any additional batching (i.e. if the initial estimates
are ok).

Any other ideas how to tackle this?

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Simon Riggs

Date:

12 July 2014, 09:39:16

On 11 July 2014 18:25, Tomas Vondra <tv@fuzzy.cz> wrote:

> Turns out getting this working properly will quite complicated.

Lets keep this patch simple then. Later research can be another patch.

In terms of memory pressure, having larger joins go x4 faster has a
much more significant reducing effect on memory pressure than anything
else. So my earlier concerns seem less of a concern.

So lets just this change done and then do more later.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

12 July 2014, 11:43:48

On 12.7.2014 11:39, Simon Riggs wrote:
> On 11 July 2014 18:25, Tomas Vondra <tv@fuzzy.cz> wrote:
> 
>> Turns out getting this working properly will quite complicated.
> 
> Lets keep this patch simple then. Later research can be another patch.

Well, the dense allocation is independent to the NTUP_PER_BUCKET
changes, and only happened to be discussed here because it's related to
hash joins. My plan was to keep it as a separate patch, thus not making
the NTUP patch any more complex.

> In terms of memory pressure, having larger joins go x4 faster has a 
> much more significant reducing effect on memory pressure than
> anything else. So my earlier concerns seem less of a concern.

OK.

> So lets just this change done and then do more later.

There's no way back, sadly. The dense allocation turned into a
challenge. I like challenges. I have to solve it or I won't be able to
sleep.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Simon Riggs

Date:

13 July 2014, 10:27:48

On 12 July 2014 12:43, Tomas Vondra <tv@fuzzy.cz> wrote:

>> So lets just this change done and then do more later.
>
> There's no way back, sadly. The dense allocation turned into a
> challenge. I like challenges. I have to solve it or I won't be able to
> sleep.

I admire your tenacity, but how about we solve the first challenge
first and the second one second?

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

13 July 2014, 19:00:10

On 13.7.2014 12:27, Simon Riggs wrote:
> On 12 July 2014 12:43, Tomas Vondra <tv@fuzzy.cz> wrote:
> 
>>> So lets just this change done and then do more later.
>>
>> There's no way back, sadly. The dense allocation turned into a 
>> challenge. I like challenges. I have to solve it or I won't be able
>> to sleep.
> 
> I admire your tenacity, but how about we solve the first challenge 
> first and the second one second?

Yeah, I know. However the two patches deal with the same part of the
code, so it feels natural to work on both at the same time. And I
already had a solution in my head, and "only" needed to code it.

Also, I think the question of allocated memory / memory pressure is
important, and I'd like to avoid "this will be solved by the other
patch" only to find out that the other patch does not work.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

13 July 2014, 19:32:50

On 11.7.2014 19:25, Tomas Vondra wrote:
> 2) walking through the tuples sequentially
> ------------------------------------------
>
> The other option is not to walk the tuples through buckets, but by
> walking throught the chunks - we know the tuples are stored as
> HashJoinTuple/MinimalTuple, so it shouldn't be that difficult. And by
> walking the tuples in the order they're stored, we may just rearrange
> the tuples in already allocated memory. And maybe it could be even
> faster than the current code, because of the sequential access to the
> memory (as opposed to the random access through buckets) and CPU caches.
> So I like this approach - it's simple, clean and shouldn't add any
> overhead (memory or CPU).

So, attached is a patch that implements this. It's not very complicated
and the resulting code is surprisingly readable (IMHO comprable to the
previous code).

Basics of the patch:

(1) adds HashChunkData structure - a buffer used for dense allocation
    of tuples, 16kB by default

(2) adds 'chunks_batch' into HashJoinTableData, which is a linked list
    of HashChunkData

(3) the allocation itself is done in chunk_alloc - in short it either
    allocates the tuple in the current chunk, or adds a new one if
    needed

(4) ExecHashTableInsert calls chunk_alloc (instead of the regular
    MemoryContextAlloc)

(5) the main change is in ExecHashIncreaseNumBatches, which now can't
    do pfree (does not work with dense allocation), so it reads the
    tuples from chunks directly and simply rebuilds the buckets
    from scratch (thanks to this we only need one more chunk of memory,
    not a complete copy, and we don't need to duplicate buckets at all)

>From the tests I've done so far this seems to work perfectly - it still
saves the memory overhead, and I've seen no difference in performance.

The current patch only implemnents this for tuples in the main hash
table, not for skew buckets. I plan to do that, but it will require
separate chunks for each skew bucket (so we can remove it without
messing with all of them). The chunks for skew buckets should be smaller
(I'm thinking about ~4kB), because there'll be more of them, but OTOH
those buckets are for frequent values so the chunks should not remain empty.

But let's see if the current patch misses something important first.


regards
Tomas

Attachment

hashjoin-alloc-v3.patch

Re: tweaking NTUP_PER_BUCKET

From

Stephen Frost

Date:

14 July 2014, 04:29:16

Tomas,

* Tomas Vondra (tv@fuzzy.cz) wrote:
> On 6.7.2014 17:57, Stephen Frost wrote:
> > * Tomas Vondra (tv@fuzzy.cz) wrote:
> >> I can't find the thread / test cases in the archives. I've found this
> >> thread in hackers:
> >>
> >> http://www.postgresql.org/message-id/CAOeZVif-R-iLF966wEipk5By-KhzVLOqpWqurpaK3p5fYw-Rdw@mail.gmail.com
> >>
> >> Can you point me to the right one, please?
> >
> > This:
> >
> > http://www.postgresql.org/message-id/20130328235627.GV4361@tamriel.snowman.net
>
> Sadly, the link to the SQL does not work :-(
>
>   http://snowman.net/~sfrost/test_case.sql => 404
>
> > and I think there may have been other test cases on that thread
> > somewhere...  I certainly recall having at least a couple of them that I
> > was playing with.
> >
> > I can probably find them around here somewhere too, if necessary.
>
> If you could look for them, that'd be great.

cc'ing me directly helps me notice these...

I've added back test_case.sql and test_case2.sql.  Please check them out
and let me know- they're real-world cases we've run into.
Thanks!
    Stephen

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

19 July 2014, 18:16:20

On 14.7.2014 06:29, Stephen Frost wrote:
> Tomas,
> 
> * Tomas Vondra (tv@fuzzy.cz) wrote:
>> On 6.7.2014 17:57, Stephen Frost wrote:
>>> * Tomas Vondra (tv@fuzzy.cz) wrote:
>>>> I can't find the thread / test cases in the archives. I've found this
>>>> thread in hackers:
>>>>
>>>> http://www.postgresql.org/message-id/CAOeZVif-R-iLF966wEipk5By-KhzVLOqpWqurpaK3p5fYw-Rdw@mail.gmail.com
>>>>
>>>> Can you point me to the right one, please?
>>>
>>> This:
>>>
>>> http://www.postgresql.org/message-id/20130328235627.GV4361@tamriel.snowman.net
>>
>> Sadly, the link to the SQL does not work :-(
>>
>>   http://snowman.net/~sfrost/test_case.sql => 404
>>
>>> and I think there may have been other test cases on that thread
>>> somewhere...  I certainly recall having at least a couple of them that I
>>> was playing with.
>>>
>>> I can probably find them around here somewhere too, if necessary.
>>
>> If you could look for them, that'd be great.
> 
> cc'ing me directly helps me notice these...
> 
> I've added back test_case.sql and test_case2.sql.  Please check them out
> and let me know- they're real-world cases we've run into.

Hi Stephen,

I've reviewed the two test cases mentioned here, and sadly there's
nothing that can be 'fixed' by this patch. The problem here lies in the
planning stage, which decides to hash the large table - we can't fix
that in the executor.

I did three runs - with master, with the 'dense allocation' patch
applied, and with 'dense allocation + ntup patch' applied.

As the estimates are pretty exact here, the ntup patch effectively only
sets NTUP_PER_BUCKET=1 (and does not do any dynamic resize or so).

For the first testcase, it behaves like this:

####### master (no patches) ########
                            QUERY PLAN
-------------------------------------------------------------------------Hash Join  (cost=115030.10..116671.32
rows=41176width=24) (actual
 
time=980.363..1015.623 rows=13731 loops=1)  Hash Cond: (small_table.id_short = big_table.id_short)  ->  Seq Scan on
small_table (cost=0.00..714.76 rows=41176 width=24)
 
(actual time=0.015..2.563 rows=41176 loops=1)  ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=979.399..979.399 rows=4272271 loops=1)        Buckets: 524288  Batches: 1  Memory Usage: 150198kB        ->  Seq
Scanon big_table  (cost=0.00..61626.71 rows=4272271
 
width=4) (actual time=0.026..281.151 rows=4272271 loops=1)Planning time: 0.196 msExecution time: 1034.931 ms
(8 rows)

# without explain analyze

Time: 841,198 ms
Time: 825,000 ms

####### master + dense allocation ########
                            QUERY PLAN
-------------------------------------------------------------------------Hash Join  (cost=115030.10..116671.32
rows=41176width=24) (actual
 
time=778.116..815.254 rows=13731 loops=1)  Hash Cond: (small_table.id_short = big_table.id_short)  ->  Seq Scan on
small_table (cost=0.00..714.76 rows=41176 width=24)
 
(actual time=0.006..2.423 rows=41176 loops=1)  ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=775.803..775.803 rows=4272271 loops=1)        Buckets: 524288  Batches: 1  Memory Usage: 150198kB        ->  Seq
Scanon big_table  (cost=0.00..61626.71 rows=4272271
 
width=4) (actual time=0.005..227.274 rows=4272271 loops=1)Planning time: 0.061 msExecution time: 826.346 ms
(8 rows)

# without explain analyze

Time: 758,393 ms
Time: 734,329 ms

####### master + dense allocation + ntup=1 patch ########
                            QUERY PLAN
-------------------------------------------------------------------------Hash Join  (cost=115030.10..116465.44
rows=41176width=24) (actual
 
time=1020.288..1033.539 rows=13731 loops=1)  Hash Cond: (small_table.id_short = big_table.id_short)  ->  Seq Scan on
small_table (cost=0.00..714.76 rows=41176 width=24)
 
(actual time=0.015..2.214 rows=41176 loops=1)  ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=953.890..953.890 rows=4272271 loops=1)        Buckets: 8388608  Batches: 1  Memory Usage: 215734kB        ->  Seq
Scanon big_table  (cost=0.00..61626.71 rows=4272271
 
width=4) (actual time=0.014..241.223 rows=4272271 loops=1)Planning time: 0.193 msExecution time: 1055.351 ms
(8 rows)

# without explain analyze

Time: 836,050 ms
Time: 822,818 ms

#########################################################

For the second test case, the times are like this:

# master
Time: 2326,579 ms
Time: 2320,232 ms

# master + dense allocation
Time: 2207,254 ms
Time: 2251,691 ms

# master + dense allocation + ntup=1
Time: 2374,278 ms
Time: 2377,409 ms

So while the dense allocation shaves off ~5%, the time with both patches
is slightly higher (again, ~5% difference). After spending some time by
profiling the code, I believe it's because of worse cache hit ratio when
accessing the buckets. By decreasing NTUP_PER_BUCKET the second test
case now uses ~130MB of buckets, while originally it used only ~16MB).
That means slightly lower cache hit ratio when accessing
hashtable->buckets (e.g. in ExecHashTableInsert).

The resulting hash table should be much faster when doing lookups,
however the main issue illustrated by the test cases is that we're
hashing large table and scanning the small one, which means relatively
small amount of lookups. Which pretty much prevents getting anything
from the better lookup performance.

So, I think we need to fix the planning problem here, if possible. I
might try doing that, but if someone with better knowledge of the
planner could do that, that'd be great.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tom Lane

Date:

19 July 2014, 18:24:12

Tomas Vondra <tv@fuzzy.cz> writes:
> I've reviewed the two test cases mentioned here, and sadly there's
> nothing that can be 'fixed' by this patch. The problem here lies in the
> planning stage, which decides to hash the large table - we can't fix
> that in the executor.

We've heard a couple reports before of the planner deciding to hash a
larger table rather than a smaller one.  The only reason I can think of
for that is if the smaller table has many more duplicates, so that the
planner thinks the executor might end up traversing long hash chains.
The planner's estimates could easily be off in this area, of course.
estimate_hash_bucketsize() is the likely culprit if it's wrong.

Which test case are you seeing this in, exactly?
        regards, tom lane

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

19 July 2014, 18:24:27

On 13.7.2014 21:32, Tomas Vondra wrote:
> The current patch only implemnents this for tuples in the main hash 
> table, not for skew buckets. I plan to do that, but it will require 
> separate chunks for each skew bucket (so we can remove it without 
> messing with all of them). The chunks for skew buckets should be
> smaller (I'm thinking about ~4kB), because there'll be more of them,
> but OTOH those buckets are for frequent values so the chunks should
> not remain empty.

I've looked into extending the dense allocation to the skew buckets, and
I think we shouldn't do that. I got about 50% of the changes and then
just threw it out because it turned out quite pointless.

The amount of memory for skew buckets is limited to 2% of work mem, so
even with 100% overhead it'll use ~4% of work mem. So there's pretty
much nothing to gain here. So the additional complexity introduced by
the dense allocation seems pretty pointless.

I'm not entirely happy with the code allocating some memory densely and
some using traditional palloc, but it certainly seems cleaner than the
code I had.

So I think the patch is mostly OK as is.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

19 July 2014, 18:29:09

On 19.7.2014 20:24, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I've reviewed the two test cases mentioned here, and sadly there's
>> nothing that can be 'fixed' by this patch. The problem here lies in the
>> planning stage, which decides to hash the large table - we can't fix
>> that in the executor.
> 
> We've heard a couple reports before of the planner deciding to hash a
> larger table rather than a smaller one.  The only reason I can think of
> for that is if the smaller table has many more duplicates, so that the
> planner thinks the executor might end up traversing long hash chains.
> The planner's estimates could easily be off in this area, of course.
> estimate_hash_bucketsize() is the likely culprit if it's wrong.
> 
> Which test case are you seeing this in, exactly?

The two reported by Stephen here:


http://www.postgresql.org/message-id/20130328235627.GV4361@tamriel.snowman.net

Just download this (I had to gunzip it):
 http://snowman.net/~sfrost/test_case.sql http://snowman.net/~sfrost/test_case2.sql

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

19 July 2014, 21:08:06

On 19.7.2014 20:28, Tomas Vondra wrote:
> On 19.7.2014 20:24, Tom Lane wrote:
>> Tomas Vondra <tv@fuzzy.cz> writes:
>>> I've reviewed the two test cases mentioned here, and sadly there's
>>> nothing that can be 'fixed' by this patch. The problem here lies in the
>>> planning stage, which decides to hash the large table - we can't fix
>>> that in the executor.
>>
>> We've heard a couple reports before of the planner deciding to hash a
>> larger table rather than a smaller one.  The only reason I can think of
>> for that is if the smaller table has many more duplicates, so that the
>> planner thinks the executor might end up traversing long hash chains.
>> The planner's estimates could easily be off in this area, of course.
>> estimate_hash_bucketsize() is the likely culprit if it's wrong.
>>
>> Which test case are you seeing this in, exactly?
> 
> The two reported by Stephen here:
> 
> 
> http://www.postgresql.org/message-id/20130328235627.GV4361@tamriel.snowman.net
> 
> Just download this (I had to gunzip it):
> 
>   http://snowman.net/~sfrost/test_case.sql
>   http://snowman.net/~sfrost/test_case2.sql

Anyway, looking a bit at the distributions, I don't think the
distributions are significantly skewed. In the first testscase, I get this

test=# select cnt, count(*) from (select id_short, count(*) AS cnt from
small_table group by 1) foo group by cnt order by cnt;cnt | count
-----+-------  1 |    23  2 | 18795  3 |    13  4 |   726  5 |     4  6 |    93  8 |     4 10 |     1
(8 rows)

and in the second one I get this:

test=# select cnt, count(*) from (select id_short, count(*) AS cnt from
small_table group by 1) foo group by cnt order by cnt;cnt | count
-----+--------  1 | 182161  2 |   5582  3 |    371  4 |     28  5 |      2  6 |      3  9 |      1
(7 rows)

So while #1 contains most values twice, #2 is almost perfectly distinct.
In the big_table, the values are unique.

For the first case, a WARNING at the end of estimate_hash_bucketsize
says this:

WARNING:  nbuckets=8388608.00 estfract=0.000001
WARNING:  nbuckets=65536.00 estfract=0.000267

There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>= 1e-6), and ~10 tuples/bucket for the small_table (42k rows).

For the second case, I get this:

WARNING:  nbuckets=16777216.00 estfract=0.000001
WARNING:  nbuckets=262144.00 estfract=0.000100

i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.

That sounds really strange, because small_table in the second test case
is almost perfectly unique. And in the first test case, there are only
very few keys with >2 occurences.


By explicitly setting the selectivity in estimate_hash_bucketsize to
1e-6, I got these plans:

Test case #1
                         QUERY PLAN
---------------------------------------------------------------------Hash Join  (cost=1229.46..79288.95 rows=41176
width=24)(actual
 
time=50.397..634.986 rows=13731 loops=1)  Hash Cond: (big_table.id_short = small_table.id_short)  ->  Seq Scan on
big_table (cost=0.00..61626.71 rows=4272271 width=4)
 
(actual time=0.018..229.597 rows=4272271 loops=1)  ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual
time=31.653..31.653 rows=41176 loops=1)        Buckets: 65536  Batches: 1  Memory Usage: 3086kB        ->  Seq Scan on
small_table (cost=0.00..714.76 rows=41176
 
width=24) (actual time=0.012..13.341 rows=41176 loops=1)Planning time: 0.597 msExecution time: 635.498 ms
(8 rows)

Without explain analyze: 486 ms (original plan: >850ms).

Test case #2
                         QUERY PLAN
---------------------------------------------------------------------Hash Join  (cost=5240.21..220838.44 rows=194587
width=4)(actual
 
time=88.252..2143.457 rows=149540 loops=1)  Hash Cond: (big_table.id_short = small_table.id_short)  ->  Seq Scan on
big_table (cost=0.00..169569.72 rows=11755372
 
width=4) (actual time=0.018..533.955 rows=11755475 loops=1)  ->  Hash  (cost=2807.87..2807.87 rows=194587 width=4)
(actual
time=87.548..87.548 rows=194587 loops=1)        Buckets: 262144  Batches: 1  Memory Usage: 8889kB        ->  Seq Scan
onsmall_table  (cost=0.00..2807.87 rows=194587
 
width=4) (actual time=0.012..29.929 rows=194587 loops=1)Planning time: 0.438 msExecution time: 2146.818 ms
(8 rows)

Without explain analyze: 1600 ms (original plan: >2300ms)

The differences are fairly small - well, it's 30-40% faster, but it's
less than 1s in both cases. I'm wondering whether we could get into
similar problems with much longer queries and still get 30-40% improvement.

Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

19 July 2014, 22:13:24

On 19.7.2014 23:07, Tomas Vondra wrote:
> On 19.7.2014 20:28, Tomas Vondra wrote:
> For the first case, a WARNING at the end of estimate_hash_bucketsize
> says this:
> 
> WARNING:  nbuckets=8388608.00 estfract=0.000001
> WARNING:  nbuckets=65536.00 estfract=0.000267
> 
> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>> = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
> 
> For the second case, I get this:
> 
> WARNING:  nbuckets=16777216.00 estfract=0.000001
> WARNING:  nbuckets=262144.00 estfract=0.000100
> 
> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
> 
> That sounds really strange, because small_table in the second test case
> is almost perfectly unique. And in the first test case, there are only
> very few keys with >2 occurences.

FWIW, the "problem" seems to be this:
 /*  * Adjust estimated bucketsize upward to account for skewed  * distribution. */
 if (avgfreq > 0.0 && mcvfreq > avgfreq)   estfract *= mcvfreq / avgfreq;

Which is explained like in the estimate_hash_bucketsize comment like this:

* be nonuniformly occupied. If the other relation in the join has a key
* distribution similar to this one's, then the most-loaded buckets are
* exactly those that will be probed most often. Therefore, the "average"
* bucket size for costing purposes should really be taken as something *
close to the "worst case" bucket size. We try to estimate this by
* adjusting the fraction if there are too few distinct data values, and
* then scaling up by the ratio of the most common value's frequency to
* the average frequency.

The problem is that the two testcases don't behave like this, i.e. the
other relation does not have similar distribution (because there the
join key is unique). Actually, I wonder how frequently that happens and
I wouldn't be surprised if it was quite rare.

So maybe we shouldn't scale it the way we scale it now. For example we
could do this:
 if (avgfreq > 0.0 && mcvfreq > avgfreq)   estfract *= sqrt(mcvfreq / avgfreq);

Or instead of using the first MCV frequency (i.e. the frequency of the
most common value, which causes the unexpectedly hight tuple/bucket
values), use an average or median of the MCV frequenfies.

Either of these three changes "fixed" the first test case, some fixed
the second one. However it's pure guesswork, and I have how many plans
will be hit negatively by these changes.

What I think might work better is lowering the cost of searching small
hash tables, when the hash table fits into L1/L2... caches. The table I
posted in the first message in this thread illustrates that:
 kB      NTUP=1  NTUP=2  NTUP=4  NTUP=5  NTUP=9  NTUP=10 1407    6753    7218    7890    9324    9488    12574 7032
9417   11586   17417   15820   25732   25402 35157   13179   17101   27225   24763   43323   43175 70313   14203
18475  29354   26690   46840   46676 175782  14319   17458   25325   34542   37342   60949 351563  16297   19928
29124  43364   43232   70014 703125  19027   23125   33610   50283   50165   81398

So a small hash table (~1.4MB), fitting into L2 (measured on CPU with
~4MB cache) is influenced much less than large tables. I don't know
whether we should detect the CPU cache size somehow, or whether we
should assume some conservative size (say, 4MB sounds OK), or what. But
I think something like this might work
 if (hash_size < 4MB) {    /* take in only 10% of the difference */    estfract += (mcvfreq / avgfreq) * estfract *
0.1;} else if (hash_size > 16MB) {    estfract *= (mcvfreq / avgfreq); } else {    /* linear approximation between 10
and100% */    estfract += (mcvfreq / avgfreq) * estfract                * (0.1 + 0.9 * (hash_size - 4MB) / 12MB); }

Or maybe not, I'm not really sure. The problem is that the two test
cases really don't match the assumption that the outer relation has the
same distribution.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

20 July 2014, 12:57:18

On 20.7.2014 00:12, Tomas Vondra wrote:
> On 19.7.2014 23:07, Tomas Vondra wrote:
>> On 19.7.2014 20:28, Tomas Vondra wrote:
>> For the first case, a WARNING at the end of estimate_hash_bucketsize
>> says this:
>>
>> WARNING:  nbuckets=8388608.00 estfract=0.000001
>> WARNING:  nbuckets=65536.00 estfract=0.000267
>>
>> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>>> = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
>>
>> For the second case, I get this:
>>
>> WARNING:  nbuckets=16777216.00 estfract=0.000001
>> WARNING:  nbuckets=262144.00 estfract=0.000100
>>
>> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
>>
>> That sounds really strange, because small_table in the second test case
>> is almost perfectly unique. And in the first test case, there are only
>> very few keys with >2 occurences.
> 
> FWIW, the "problem" seems to be this:
> 
>   /*
>    * Adjust estimated bucketsize upward to account for skewed
>    * distribution. */
> 
>   if (avgfreq > 0.0 && mcvfreq > avgfreq)
>     estfract *= mcvfreq / avgfreq;
> 
> Which is explained like in the estimate_hash_bucketsize comment like this:
> 
> * be nonuniformly occupied. If the other relation in the join has a key
> * distribution similar to this one's, then the most-loaded buckets are
> * exactly those that will be probed most often. Therefore, the "average"
> * bucket size for costing purposes should really be taken as something *
> close to the "worst case" bucket size. We try to estimate this by
> * adjusting the fraction if there are too few distinct data values, and
> * then scaling up by the ratio of the most common value's frequency to
> * the average frequency.

After thinking about this a bit more, I'm starting to question the logic
behind the adjustment. The thing is, with NTUP_PER_BUCKET=1, all the
tuples in a bucket should have the same hash value and (hash collisions
aside) the same join key value. That means we're not really searching
through the bucket, we're merely iterating and returning all the tuples.

With NTUP_PER_BUCKET=10 it maybe made sense, because it was expected to
see multiple hash values in a single bucket. With NTUP_PER_BUCKET=1
we shouldn't really expect that. So I think we should rip the MCV
adjustment out ouf estimate_hash_bucketsize (and indeed, it fixed both
the test cases).

The only case where I think this might play role is when we can't use
the optimal number of buckets (achieving NTUP_PER_BUCKET=1), but in that
case the patched ExecChooseHashTableSize prefers to increase number of
batches. So that seems safe to me.

Or in what scenario (with NTUP_PER_BUCKET=1) does this still make sense?

Two more comments:

(a) Isn't the minimum selectivity enforced in estimate_hash_bucketsize   (1.0e-6) too high? I mean, With 1GB work_mem I
canget ~20M tuples   into the hash table, and in that case it's off by an order of   magnitude. Maybe 1e-9 would be
moreappropriate?

(b) Currently, ExecChooseHashTableSize sizes the hash table (nbuckets)   using ntuples, but maybe ndistinct (if
available)would be better?   It's tricky, though, because while both values are initially   estimates, ndistinct is
notoriouslyless reliable. Also, we   currently don't have means to count ndistinct while building the   table (as
opposedto ntuples, which is trivial), which is needed   when resizing the hash table in case the initial estimate was
off.  So currently the code optimizes for "ndistinct==ntuples" which may   produce larger nbuckets values, but ISTM
it'sthe right choice.

regards
Tomas

Re: tweaking NTUP_PER_BUCKET

From

Tomas Vondra

Date:

20 July 2014, 16:17:36

On 19.7.2014 20:24, Tomas Vondra wrote:
> On 13.7.2014 21:32, Tomas Vondra wrote:
>> The current patch only implemnents this for tuples in the main
>> hash table, not for skew buckets. I plan to do that, but it will
>> require separate chunks for each skew bucket (so we can remove it
>> without messing with all of them). The chunks for skew buckets
>> should be smaller (I'm thinking about ~4kB), because there'll be
>> more of them, but OTOH those buckets are for frequent values so the
>> chunks should not remain empty.
>
> I've looked into extending the dense allocation to the skew buckets,
> and I think we shouldn't do that. I got about 50% of the changes and
> then just threw it out because it turned out quite pointless.
>
> The amount of memory for skew buckets is limited to 2% of work mem,
> so even with 100% overhead it'll use ~4% of work mem. So there's
> pretty much nothing to gain here. So the additional complexity
> introduced by the dense allocation seems pretty pointless.
>
> I'm not entirely happy with the code allocating some memory densely
> and some using traditional palloc, but it certainly seems cleaner
> than the code I had.
>
> So I think the patch is mostly OK as is.

Attached is v4 of the patch, mostly with minor improvements and cleanups
(removed MemoryContextStats, code related to skew buckets).

regards
Tomas

Attachment

hashjoin-alloc-v4.patch

Re: tweaking NTUP_PER_BUCKET

From

Heikki Linnakangas

Date:

20 August 2014, 12:06:36

On 07/20/2014 07:17 PM, Tomas Vondra wrote:
> On 19.7.2014 20:24, Tomas Vondra wrote:
>> On 13.7.2014 21:32, Tomas Vondra wrote:
>>> The current patch only implemnents this for tuples in the main
>>> hash table, not for skew buckets. I plan to do that, but it will
>>> require separate chunks for each skew bucket (so we can remove it
>>> without messing with all of them). The chunks for skew buckets
>>> should be smaller (I'm thinking about ~4kB), because there'll be
>>> more of them, but OTOH those buckets are for frequent values so the
>>> chunks should not remain empty.
>>
>> I've looked into extending the dense allocation to the skew buckets,
>> and I think we shouldn't do that. I got about 50% of the changes and
>> then just threw it out because it turned out quite pointless.
>>
>> The amount of memory for skew buckets is limited to 2% of work mem,
>> so even with 100% overhead it'll use ~4% of work mem. So there's
>> pretty much nothing to gain here. So the additional complexity
>> introduced by the dense allocation seems pretty pointless.
>>
>> I'm not entirely happy with the code allocating some memory densely
>> and some using traditional palloc, but it certainly seems cleaner
>> than the code I had.
>>
>> So I think the patch is mostly OK as is.
>
> Attached is v4 of the patch, mostly with minor improvements and cleanups
> (removed MemoryContextStats, code related to skew buckets).

Can you remind me what kind of queries this is supposed to help with? 
Could you do some performance testing to demonstrate the effect? And 
also a worst-case scenario.

At a quick glance, I think you're missing a fairly obvious trick in 
ExecHashIncreaseNumBatches: if a chunk contains only a single tuple, you 
can avoid copying it to a new chunk and just link the old chunk to the 
new list instead. Not sure if ExecHashIncreaseNumBatches is called often 
enough for that to be noticeable, but at least it should help in extreme 
cases where you push around huge > 100MB tuples.

- Heikki

Re: tweaking NTUP_PER_BUCKET

From

"Tomas Vondra"

Date:

20 August 2014, 13:20:16

On 20 Srpen 2014, 14:05, Heikki Linnakangas wrote:
> On 07/20/2014 07:17 PM, Tomas Vondra wrote:
>> On 19.7.2014 20:24, Tomas Vondra wrote:
>>> On 13.7.2014 21:32, Tomas Vondra wrote:
>>>> The current patch only implemnents this for tuples in the main
>>>> hash table, not for skew buckets. I plan to do that, but it will
>>>> require separate chunks for each skew bucket (so we can remove it
>>>> without messing with all of them). The chunks for skew buckets
>>>> should be smaller (I'm thinking about ~4kB), because there'll be
>>>> more of them, but OTOH those buckets are for frequent values so the
>>>> chunks should not remain empty.
>>>
>>> I've looked into extending the dense allocation to the skew buckets,
>>> and I think we shouldn't do that. I got about 50% of the changes and
>>> then just threw it out because it turned out quite pointless.
>>>
>>> The amount of memory for skew buckets is limited to 2% of work mem,
>>> so even with 100% overhead it'll use ~4% of work mem. So there's
>>> pretty much nothing to gain here. So the additional complexity
>>> introduced by the dense allocation seems pretty pointless.
>>>
>>> I'm not entirely happy with the code allocating some memory densely
>>> and some using traditional palloc, but it certainly seems cleaner
>>> than the code I had.
>>>
>>> So I think the patch is mostly OK as is.
>>
>> Attached is v4 of the patch, mostly with minor improvements and cleanups
>> (removed MemoryContextStats, code related to skew buckets).
>
> Can you remind me what kind of queries this is supposed to help with?
> Could you do some performance testing to demonstrate the effect? And
> also a worst-case scenario.

The dense allocation? That was not really directed at any specific
queries, it was about lowering hashjoin memory requirements in general.

First to make it more correct with respect to work_mem (hashjoin does not
account for the palloc overhead, so the actual memory consumption might be
much higher than work_mem).

Second to compensate for the memory for more buckets due to
NTUP_PER_BUCKET, which is tweaked in the 'hashjoin dynamic nbuckets'
patch.

There are some numbers / more detailed analysis in
http://www.postgresql.org/message-id/a17d6217fe0c9e459cb45cb764ad727c.squirrel@sq.gransy.com

>
> At a quick glance, I think you're missing a fairly obvious trick in
> ExecHashIncreaseNumBatches: if a chunk contains only a single tuple, you
> can avoid copying it to a new chunk and just link the old chunk to the
> new list instead. Not sure if ExecHashIncreaseNumBatches is called often
> enough for that to be noticeable, but at least it should help in extreme
> cases where you push around huge > 100MB tuples.

Yeah, I thought about that too, but it seemed like a rare corner case. But
maybe you're right - it will also eliminate the 'fluctuation' (allocating
100MB chunk, which might get us over work_mem, ...). Not sure if this is
going to help, but it's easy enough to fix I guess.

The other optimization I'm thinking about is that when increasing number
of batches, the expectation is only about 1/2 the chunks will be
necessary. So instead of freeing the chunk, we might keep it and reuse it
later. That might lower the palloc overhead a bit (but it's already very
low, so I don't think that's measurable difference).

regards
Tomas