Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date:
Msg-id: 53F39983.2020303@fuzzy.cz
In response to: Re: bad estimation together with large work_mem generates terrible slow hash joins (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: bad estimation together with large work_mem generates terrible slow hash joins
List: pgsql-hackers
On 19.8.2014 19:05, Robert Haas wrote:
> On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 12.8.2014 00:30, Tomas Vondra wrote:
>>> On 11.8.2014 20:25, Robert Haas wrote:
>>>> It also strikes me that when there's only 1 batch, the set of bits
>>>> that map onto the batch number is zero-width, and one zero-width bit
>>>> range is as good as another. In other words, if you're only planning
>>>> to do one batch, you can easily grow the number of buckets on the
>>>> fly. Growing the number of buckets only becomes difficult once you
>>>> have more than one batch.
>>
>> ...
>>
>>> I was considering reversing the bits of the hash, because that's
>>> pretty much the simplest solution. But I think you're right it might
>>> actually work like this:
>>>
>>> * Are more batches needed?
>>>
>>> (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>>>
>>> (no) => go ahead, until processing all tuples or hitting work_mem
>>>
>>> (work_mem) => meh, use the same nbuckets above
>>>
>>> (all tuples) => compute optimal nbuckets / resize
>>>
>>> But I need to think about this a bit. So far it seems to me there's
>>> no way additional batches might benefit from increasing nbuckets
>>> further.
>>
>> I think this is a simple and solid solution, solving the batchno
>> computation issues quite nicely. Attached is v10 of the patch (bare
>> and combined with the dense allocation), which does this:
>>
>> 1) when we know we'll need batching, buckets are sized for full
>>    work_mem (using the estimated tuple width, etc.)
>>
>> 2) without batching, we estimate the 'right number of buckets' for
>>    the estimated number of tuples, and keep track of the optimal
>>    number as tuples are added to the hash table
>>
>>    - if we discover we need to start batching, we keep the current
>>      optimal value (which should be the same as the max number of
>>      buckets) and don't mess with it anymore (making it possible to
>>      compute batch IDs just like before)
>>
>>    - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
>>      is resized as part of the rebatch
>>
>>    - if the hash build completes without batching, we do the resize
>>
>> I believe the patch is pretty much perfect. I plan to do more
>> thorough testing on a wide range of queries in the next few days.
>>
>> I also removed the 'enable_hash_resize' GUC, because it would be more
>> complex to implement this properly after doing the resize as part of
>> rebatch etc. So either it would make the patch more complex, or it
>> wouldn't do what the name promises.
>
> A variety of trivial comments on this:
>
> PostgreSQL style is un-cuddled curly braces. Also, multi-line
> comments need to start with a line containing only /* and end with a
> line containing only */. In one place you've added curly braces
> around a single-line block that is otherwise unmodified; please don't
> do that. In one place, you have "becase" instead of "because". In
> another place, you write "add if after it" but it should say "add it
> after it" or maybe better "add the new one after it". Avoid using
> punctuation like "=>" in comments to illustrate the connection
> between sentences; instead, use a connecting word like "then" or
> "therefore" or whatever is appropriate; in this instance, a period
> followed by the start of a new sentence seems sufficient. Revert the
> removal of a single line of whitespace near the top of nodeHash.c.
>
> There are too many things marked XXX in this patch. They should
> either be fixed, if they are real problems, or they should be
> commented in a way that doesn't give rise to the idea that they're
> problems if they aren't.

OK, thanks for pointing this out. Attached is v11 of the patch (both
separate and combined with the dense allocation, as before). I fixed as
many of those issues as possible. All the XXX items were obsolete,
except for one in the chunk_alloc function. I have also removed one
constant.

> OK, now on to some more substantive stuff:
>
> 1. It's not clear to me what the overall effect of this patch on
> memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going
> to use, on the average, 10x as much bucket-header memory per tuple.
> Specifically, I think it means we'll use about 8 bytes of
> bucket-header memory per tuple instead of 0.8 bytes per tuple. If the
> tuples are narrow, that could be significant; concerns have been
> expressed about that here in the past. Increasing the number of
> buckets could also increase memory usage. On the other hand, the
> dense allocation stuff probably saves a ton of memory, so maybe we
> end up ahead overall, but I'm not sure. Your thoughts, and maybe some
> test results with narrow and wide tuples, would be appreciated.

The effect of the dense allocation was briefly discussed in this
thread, along with some quick measurements:

http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz

The dense allocation removes pretty much all the palloc overhead. For a
40B tuple, I got this before the dense allocation

    HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11 chunks); 1448394448 used

and this with the dense allocation patch applied

    HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks); 729651520 used

So it's pretty much a 50% reduction in memory consumption. Now, the
palloc header is >=16B, and removing it more than compensates for the
increased number of buckets (in the worst case there may be ~2 buckets
per tuple, i.e. 2x 8B pointers). I did a number of tests, and the
results are quite consistent with this.
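To spell out that back-of-envelope arithmetic (a rough sketch only: it
assumes a 40B tuple, a 16B palloc chunk header and 8B pointers, and it
ignores the allocator's power-of-2 chunk rounding, which is why the
measured saving above is larger than this naive estimate):

    /* Rough per-tuple memory, before and after the patch (assumptions above). */
    #include <stdio.h>

    int
    main(void)
    {
        double tuple_width   = 40.0;   /* assumed tuple size in bytes */
        double palloc_header = 16.0;   /* >= 16B chunk header per palloc'd tuple */
        double ptr_size      = 8.0;    /* one bucket-header pointer */

        /* master: NTUP_PER_BUCKET = 10, one palloc per tuple */
        double before = tuple_width + palloc_header + ptr_size / 10.0;

        /* patched: NTUP_PER_BUCKET = 1 (worst case ~2 buckets/tuple), dense allocation */
        double after = tuple_width + 2.0 * ptr_size;

        printf("per-tuple bytes, master:  ~%.1f\n", before);   /* ~56.8 */
        printf("per-tuple bytes, patched: ~%.1f\n", after);    /* ~56.0 */
        return 0;
    }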
> 2. But, on the positive side, modulo the memory utilization questions
> mentioned above, I would expect the impact on hash join performance
> to be positive. Going from 10 tuples per bucket to just 1 should
> help, and on cases where the actual load factor would have ended up
> much higher because of poor estimation, increasing the number of
> buckets on the fly should help even more. I haven't tested this,
> though.

Yes, those are the results I was getting. I've done a number of tests,
and not a single one showed a slowdown so far. Most well-estimated
queries were 2-3x faster, and the poorly estimated ones were orders of
magnitude faster. We're working on testing this on a range of
workloads; I'll post some results once I have them.

> I haven't had a chance to completely go through this yet, so these
> are just some initial thoughts.

Thanks!

Tomas
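P.S. For anyone skimming the thread, here is a rough sketch of the
bucket-count handling described above. It is illustrative only: the
names (SketchHashTable, sketch_insert, next_pow2, ...) are made up for
this example and it is not the actual nodeHash.c code.

    /*
     * Sketch: while there is a single batch, keep updating the "optimal"
     * number of buckets as tuples arrive; once batching starts the value
     * is frozen (so batch numbers stay stable), and the bucket array is
     * resized on the first rebatch or when the build finishes unbatched.
     */
    #include <stdlib.h>

    #define NTUP_PER_BUCKET 1

    static int
    next_pow2(double v)                 /* smallest power of two >= v */
    {
        int     n = 1;

        while (n < v)
            n *= 2;
        return n;
    }

    typedef struct SketchHashTable
    {
        int     nbuckets;               /* current size of the bucket array */
        int     nbuckets_optimal;       /* best size for the tuples seen so far */
        int     nbatch;                 /* number of batches (1 = no batching) */
        double  ntuples;                /* tuples inserted so far */
        void  **buckets;                /* bucket array (tuple chains elided) */
    } SketchHashTable;

    /* called for every tuple added to the in-memory hash table */
    static void
    sketch_insert(SketchHashTable *ht)
    {
        ht->ntuples += 1;

        /* with a single batch we are still free to grow the bucket count */
        if (ht->nbatch == 1)
            ht->nbuckets_optimal = next_pow2(ht->ntuples / NTUP_PER_BUCKET);
    }

    /* called when the build finishes unbatched, or on the first rebatch */
    static void
    sketch_resize(SketchHashTable *ht)
    {
        if (ht->nbuckets_optimal > ht->nbuckets)
        {
            free(ht->buckets);
            ht->buckets = calloc(ht->nbuckets_optimal, sizeof(void *));
            ht->nbuckets = ht->nbuckets_optimal;
            /* ...walk the dense-allocated chunks and re-insert every tuple... */
        }
    }

    int
    main(void)
    {
        SketchHashTable ht = {1024, 1024, 1, 0, NULL};

        ht.buckets = calloc(ht.nbuckets, sizeof(void *));
        for (int i = 0; i < 1000000; i++)
            sketch_insert(&ht);
        sketch_resize(&ht);             /* build done, batching never kicked in */
        free(ht.buckets);
        return 0;
    }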