Home > mailing lists

Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets - Mailing list pgsql-hackers

From	Bryce Cutt
Subject	Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date	March 20, 2009 23:11:53
Msg-id	1924d1180903201745i7e3f6876w2e108571e3b1843d@mail.gmail.com Whole thread Raw
In response to	Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers

Tree view

Not necessarily true.  Seeing as (when the statistics are correct) we
know each of these inner tuples will match with the largest amount of
outer tuples it is just as much of a win per inner tuple as when they
are unique.  There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets.  The potential win is still pretty high though.

- Bryce Cutt


On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Bryce Cutt <pandasuit@gmail.com> writes:
>>> Here is the new patch.
>>
>> Applied with revisions.  I undid some of the "optimizations" that
>> cluttered the code in order to save a cycle or two per tuple --- as per
>> previous discussion, that's not what the performance questions were
>> about.  Also, I did not like the terminology "in-memory"/"IM"; it seemed
>> confusing since the main hash table is in-memory too.  I revised the
>> code to consistently refer to the additional hash table as a "skew"
>> hashtable and the optimization in general as skew optimization.  Hope
>> that seems reasonable to you --- we could search-and-replace it to
>> something else if you'd prefer.
>>
>> For the moment, I didn't really do anything about teaching the planner
>> to account for this optimization in its cost estimates.  The initial
>> estimate of the number of MCVs that will be specially treated seems to
>> me to be too high (it's only accurate if the inner relation is unique),
>> but getting a more accurate estimate seems pretty hard, and it's not
>> clear it's worth the trouble.  Without that, though, you can't tell
>> what fraction of outer tuples will get the short-circuit treatment.
>
> If the inner relation isn't fairly close to unique you shouldn't be
> using this optimization in the first place.
>
> ...Robert
>

pgsql-hackers by date:

From: Robert Haas
Date: 20 March 2009, 21:51:44
Subject: Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

From: Andrew Gierth
Date: 20 March 2009, 23:23:20
Subject: contrib function naming, and upgrade issues

Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets - Mailing list pgsql-hackers

Previous

Next