Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date:
Msg-id: 53B0A03C.3080805@fuzzy.cz
In response to: Re: bad estimation together with large work_mem generates terrible slow hash joins (Tomas Vondra <tv@fuzzy.cz>)
Responses: Re: bad estimation together with large work_mem generates terrible slow hash joins (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
On 26.6.2014 23:48, Tomas Vondra wrote:
> On 26.6.2014 20:43, Tomas Vondra wrote:
>> Attached is v2 of the patch, with some cleanups / minor improvements:
>>
>> * there's a single FIXME, related to counting tuples in the
>
> Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
> patch. This just adds a proper 'batch tuples' counter to the hash table.
>
> All comments, measurements on different queries etc. welcome. We'll
> certainly do a lot of testing, because this was a big issue for us.

Attached is v4 of the patch, with a few minor improvements. The only
thing worth mentioning is overflow protection, similar to what's done in
the ExecChooseHashTableSize() function. Otherwise it's mostly about
improving comments.
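
To illustrate what I mean by the overflow protection, here's a rough
standalone sketch (not the code from the patch - the names and the 1GB
cap, standing in for MaxAllocSize, are made up for illustration):

    #include <math.h>
    #include <stddef.h>

    /*
     * Illustrative sketch: clamp the number of buckets so the bucket
     * array can't exceed the memory limit or a single 1GB allocation,
     * no matter how wild the row estimate is.
     */
    static int
    choose_nbuckets(double ntuples, size_t work_mem_bytes, int ntup_per_bucket)
    {
        double  max_pointers = (double) work_mem_bytes / sizeof(void *);
        double  dbuckets;
        int     nbuckets;

        /* never exceed what a single 1GB allocation of pointers can hold */
        if (max_pointers > (double) (0x40000000 / sizeof(void *)))
            max_pointers = (double) (0x40000000 / sizeof(void *));

        /* desired number of buckets for the estimated tuple count */
        dbuckets = ceil(ntuples / ntup_per_bucket);
        if (dbuckets > max_pointers)
            dbuckets = max_pointers;

        nbuckets = (int) dbuckets;

        /* round down to a power of two, with a sane minimum */
        while (nbuckets & (nbuckets - 1))
            nbuckets &= nbuckets - 1;
        if (nbuckets < 1024)
            nbuckets = 1024;

        return nbuckets;
    }

The point is simply that the bucket count gets clamped before any
allocation happens, even with a completely bogus estimate.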

Also attached is a v4 with a GUC, making it easier to compare the effect
of the patch by simply setting "enable_hashjoin_bucket" to "off" (original
behaviour) or "on" (new behaviour).
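
For anyone skimming the thread, the "new behaviour" is essentially about
resizing the bucket array once the real number of tuples is known,
instead of trusting the initial estimate. Very roughly, the idea looks
like this (again just an illustrative sketch with made-up structures,
not the patch code):

    #include <stdlib.h>

    typedef struct HashTuple
    {
        struct HashTuple   *next;       /* next tuple in the same bucket */
        unsigned int        hashvalue;  /* hash of the join key */
        /* tuple data would follow here */
    } HashTuple;

    typedef struct HashTable
    {
        int         nbuckets;   /* number of buckets (power of 2) */
        HashTuple **buckets;    /* array of bucket chains */
    } HashTable;

    /* double the number of buckets and re-link all tuples */
    static void
    increase_nbuckets(HashTable *ht)
    {
        int         oldnbuckets = ht->nbuckets;
        int         newnbuckets = oldnbuckets * 2;
        HashTuple **newbuckets;
        int         i;

        newbuckets = calloc(newnbuckets, sizeof(HashTuple *));
        if (newbuckets == NULL)
            return;             /* out of memory - keep the old array */

        /* walk the old chains, moving each tuple into its new bucket */
        for (i = 0; i < oldnbuckets; i++)
        {
            HashTuple  *tuple = ht->buckets[i];

            while (tuple != NULL)
            {
                HashTuple  *next = tuple->next;
                int         bucketno = tuple->hashvalue & (newnbuckets - 1);

                tuple->next = newbuckets[bucketno];
                newbuckets[bucketno] = tuple;
                tuple = next;
            }
        }

        free(ht->buckets);
        ht->buckets = newbuckets;
        ht->nbuckets = newnbuckets;
    }

The re-linking is a single pass over tuples that are already in memory,
so the overhead should be fairly small compared to rescanning batches.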

And finally, there's an SQL script demonstrating the effect of the patch
with various work_mem settings. For example, what I see on my desktop is
this (averages from 3 runs):

===== SMALL WORK MEM (2MB) =====
          no dynamic buckets     dynamic buckets
query A   5945 ms                5969 ms
query B   6080 ms                5943 ms
query C   6531 ms                6822 ms
query D   6962 ms                6618 ms

===== MEDIUM WORK MEM (16MB) =====
          no dynamic buckets     dynamic buckets
query A   7955 ms                7944 ms
query B   9970 ms                7569 ms
query C   8643 ms                8560 ms
query D  33908 ms                7700 ms

===== LARGE WORK MEM (64MB) =====
          no dynamic buckets     dynamic buckets
query A   10235 ms               10233 ms
query B   32229 ms                9318 ms
query C   14500 ms               10554 ms
query D  213344 ms                9145 ms

Where "A" is "exactly estimated" and the other queries suffer by various
underestimates. My observations from this are:

(1) For small work_mem values it does not really matter, thanks to the
    caching effects (the whole hash table fits into the L2 CPU cache).

(2) For medium work_mem values (not really huge, but exceeding the CPU
    caches), the differences are negligible, except for the last query
    with the most severe underestimate. In that case the new behaviour
    is much faster.

(3) For large work_mem values, the speedup is pretty obvious and depends
    on how severe the underestimate is.

The question is why choose large work_mem values at all, when the smaller
values actually perform better here. Well, the example tables are not
perfectly representative. When the outer table is much larger and does
not fit into RAM that easily (which is the case with large fact tables or
large joins), the rescans caused by the additional batches are more
expensive and outweigh the caching benefits.

Also, work_mem is shared with other nodes (e.g. aggregates), and
decreasing it just because of hash joins would hurt them.

regards
Tomas

