
From: Tomas Vondra <tv@fuzzy.cz>
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Msg-id: 54134BD8.1000906@fuzzy.cz
In response to: Re: bad estimation together with large work_mem generates terrible slow hash joins (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

On 12.9.2014 18:49, Robert Haas wrote:
> On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Attached is the patch split as suggested:
>>>
>>> (a) hashjoin-nbuckets-v14a-size.patch
>>>
>>>     * NTUP_PER_BUCKET=1
>>>     * counting buckets towards work_mem
>>>     * changes in ExecChooseHashTableSize (due to the other changes)
>>
>> OK, I'm going to work on this one now.  I have some ideas about how to
>> simplify this a bit and will post an update in a few hours once I see
>> how that pans out.
> 
> Here's what I came up with. I believe this has the same semantics as 
> your patch for less code, but please check, because I might be
> wrong. The main simplifications I made were:
> 
> (1) In master, we decide whether or not to batch based on the size
> of the data, without knowing the number of buckets. If we want to 
> account for the bucket overhead, that doesn't work. But in your 
> patch, we basically compute the number of buckets we'd need for the
> single-batch case, then decide whether to do a single batch, and if
> yes, compute the same values over again with the same formula
> phrased differently. I revised that so that the single-batch case is 
> calculated only once. If everything fits, we're all set, and don't 
> need to recalculate anything.
> 
> (2) In your patch, once we decide we need to batch, you set nbatch
> as now, and then check whether the computed value is still adequate
> after subtracting bucket_bytes from hash_table_bytes. I revised that
> so that we make the initial batch estimate of dbatch by dividing 
> inner_rel_bytes by hash_table_bytes - bucket_bytes rather than 
> hash_table_bytes. I think that avoids the need to maybe increase 
> nbatch again afterwards.

Yes, I like those changes and I think your reasoning is correct in both
cases. It certainly makes the method shorter and more readable - I was
too "stuck" in the original logic, so thanks for improving this.

So +1 from me to those changes.
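
In other words, the revised flow boils down to roughly this standalone
sketch - it is not the actual ExecChooseHashTableSize() code (the
per-tuple overhead, the bucket-pointer size, and the omission of the
usual clamps and power-of-two rounding of nbuckets are stand-ins), just
an illustration of the decision order described above:

#include <stdio.h>
#include <math.h>

#define NTUP_PER_BUCKET 1

static void
choose_hash_table_size(double ntuples, int tupwidth, long work_mem_kb,
                       int *numbuckets, int *numbatches)
{
    long        hash_table_bytes = work_mem_kb * 1024L;
    long        tupsize = 48 + tupwidth;    /* stand-in per-tuple overhead */
    double      inner_rel_bytes = ntuples * tupsize;

    /* buckets needed for the single-batch case at NTUP_PER_BUCKET */
    long        nbuckets = (long) ceil(ntuples / NTUP_PER_BUCKET);
    long        bucket_bytes = nbuckets * (long) sizeof(void *);

    if (inner_rel_bytes + bucket_bytes <= hash_table_bytes)
    {
        /* everything fits: single batch, nothing to recompute */
        *numbuckets = (int) nbuckets;
        *numbatches = 1;
        return;
    }

    /*
     * Need to batch.  Size the bucket array for a full work_mem, then
     * estimate the batch count against the space left after the buckets,
     * i.e. divide by (hash_table_bytes - bucket_bytes), so the estimate
     * does not have to be bumped up again afterwards.
     */
    nbuckets = hash_table_bytes /
        (tupsize * NTUP_PER_BUCKET + (long) sizeof(void *));
    bucket_bytes = nbuckets * (long) sizeof(void *);

    double      dbatch = ceil(inner_rel_bytes /
                              (double) (hash_table_bytes - bucket_bytes));
    int         nbatch = 2;

    while (nbatch < (int) dbatch)       /* round up to a power of two */
        nbatch <<= 1;

    *numbuckets = (int) nbuckets;
    *numbatches = nbatch;
}

int
main(void)
{
    int         buckets;
    int         batches;

    /* e.g. 1M inner tuples of width 40 with work_mem = 4MB */
    choose_hash_table_size(1000000.0, 40, 4096, &buckets, &batches);
    printf("nbuckets = %d, nbatch = %d\n", buckets, batches);
    return 0;
}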

> I'm comfortable with this version if you are, but (maybe as a 
> follow-on commit) I think we could make this even a bit smarter. If 
> inner_rel_bytes + bucket_bytes > hash_table_bytes but
> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
> the number of buckets by 2x or 4x to make it fit. That would provide
> a bit of additional safety margin against cases where this patch
> might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting
for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
batching (because of smaller work_mem) actually significantly improves
performance. If we want to do this reasoning properly, deciding
whether to add batches or reduce the number of buckets, we need a better
heuristic - that's quite complex and I'd expect it to result in a quite
complex patch.

So -1 from me to this at this moment (it certainly needs much more
thought than a mere follow-on commit).

regards
Tomas


