Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: bad estimation together with large work_mem generates terrible slow hash joins
Msg-id c84680e818f6a0f4a26223cd750ff252.squirrel@2.emaily.eu
In response to Re: bad estimation together with large work_mem generates terrible slow hash joins  (Kevin Grittner <kgrittn@ymail.com>)
List pgsql-hackers
On 2 October 2014 at 2:20, Kevin Grittner wrote:
> Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 12.9.2014 23:22, Robert Haas wrote:
>
>>> My first thought is to revert to NTUP_PER_BUCKET=1, but it's
>>> certainly arguable. Either method, though, figures to be better than
>>> doing nothing, so let's do something.
>>
>> OK, but can we commit the remaining part first? Because it'll certainly
>> interact with this proposed part, and it's easier to tweak when the code
>> is already there. Instead of rebasing over and over.
>
> The patch applied and built without problems, and passes `make
> check-world`.  The only thing that caught my eye was the addition
> of debug code using printf() instead of logging at a DEBUG level.
> Is there any reason for that?

Not really. IIRC the main reason is that the other code in nodeHash.c uses
the same approach.
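
(That said, switching them to elog() at a DEBUG level would be trivial.
A minimal sketch, assuming the usual HJDEBUG guard in nodeHash.c; the
counters shown here are just an illustration, not necessarily what the
patch actually prints:)

#ifdef HJDEBUG
    /*
     * sketch only: report the hash table geometry at a DEBUG level
     * instead of writing to stdout with printf()
     */
    elog(DEBUG1, "hashjoin: nbuckets = %d, nbatch = %d",
         hashtable->nbuckets, hashtable->nbatch);
#endif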

> I still need to work through the (rather voluminous) email threads
> and the code to be totally comfortable with all the issues, but
> if Robert and Heikki are comfortable with it, I'm not gonna argue.

;-)

> Preliminary benchmarks look outstanding!  I tried to control
> carefully to ensure consistent results (by dropping, recreating,
> vacuuming, analyzing, and checkpointing before each test), but
> there were surprising outliers in the unpatched version.  It turned
> out that it was because slight differences in the random samples
> caused different numbers of buckets for both unpatched and patched,
> but the patched version dealt with the difference gracefully while
> the unpatched version's performance fluctuated randomly.

Can we get a bit more detail on the test cases? I did a number of tests
while working on the patch, and I generally tested two cases:

(a) well-estimated queries (i.e. with nbuckets matching NTUP_PER_BUCKET)

(b) mis-estimated queries, with various levels of accuracy (say, 10x -
1000x misestimates)

From the description, it seems you only tested (b) - is that correct?

The thing is that while the resize is very fast and happens only once,
it's not perfectly free. In my tests, this was always more than
compensated by the speedups (even in the weird cases reported by Stephen
Frost), so I think we're safe here.
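
(To make the cost concrete: the resize is essentially a single pass over
the tuples already loaded into memory, re-linking them into a larger
bucket array. A heavily simplified sketch follows; it is not the code
from the patch, which walks the dense-allocated chunks from 45f6240a and
uses a power-of-two mask rather than a plain modulo:)

static void
ResizeBucketsSketch(HashJoinTable hashtable, int nbuckets_new)
{
    HashJoinTuple *oldbuckets = hashtable->buckets;
    int            oldnbuckets = hashtable->nbuckets;
    int            i;

    /* allocate a new, larger (zeroed) bucket array */
    hashtable->buckets = (HashJoinTuple *)
        palloc0(nbuckets_new * sizeof(HashJoinTuple));
    hashtable->nbuckets = nbuckets_new;

    /* single pass: re-link every in-memory tuple into its new bucket */
    for (i = 0; i < oldnbuckets; i++)
    {
        HashJoinTuple tuple = oldbuckets[i];

        while (tuple != NULL)
        {
            HashJoinTuple next = tuple->next;
            int           bucketno = tuple->hashvalue % nbuckets_new;

            tuple->next = hashtable->buckets[bucketno];
            hashtable->buckets[bucketno] = tuple;
            tuple = next;
        }
    }

    pfree(oldbuckets);
}

That one pass is the "not perfectly free" part, but it happens at most
once per build.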

Also, I definitely recommend testing with different hash table sizes (say,
work_mem=256MB and a hash table just large enough to fit without batching).
The thing is that the effect of CPU caches is very different for small and
large hash tables. (This is not about work_mem alone, but about how much
memory the hash table actually uses - according to the results you posted,
it never gets over ~4.5MB.)


You tested against current HEAD, right? This patch was split into three
parts, two of which were already committed (45f6240a and 8cce08f1). The
logic of the patch was "this might take a bit of time/memory, but it's
compensated by these other changes". Maybe it would be interesting to run
the tests on the original code too?

Although, if this last part of the patch improves performance on its own,
we're fine - it'll improve things even more compared to the old code
(especially before lowering NTUP_PER_BUCKET from 10 to 1).

> My only concern from the benchmarks is that it seemed like there
> was a statistically significant increase in planning time:
>
> unpatched plan time average: 0.450 ms
> patched plan time average:   0.536 ms
>
> That *might* just be noise, but it seems likely to be real.  For
> the improvement in run time, I'd put up with an extra 86us in
> planning time per hash join; but if there's any way to shave some
> of that off, all the better.

I agree with Heikki that this is probably noise, because the patch does
not touch the planner at all.

The only thing I can think of is the few fields added to
HashJoinTableData. Maybe that makes the structure too large to fit into a
cache line, or something?
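
(If someone wants to poke at that theory, a throwaway program that just
prints the struct size and compares it against a 64-byte cache line would
answer it quickly. A hypothetical check, compiled against the server
headers, not anything meant for the tree:)

/* throwaway check: how big is HashJoinTableData vs. a 64-byte cache line? */
#include "postgres.h"
#include "executor/hashjoin.h"

#include <stdio.h>

int
main(void)
{
    printf("sizeof(HashJoinTableData) = %zu bytes (cache line: 64)\n",
           sizeof(HashJoinTableData));
    return 0;
}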

Tomas
