Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets - Mailing list pgsql-hackers

From Bryce Cutt
Subject Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date
Msg-id 1924d1180903201745i7e3f6876w2e108571e3b1843d@mail.gmail.com
Whole thread Raw
In response to Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Not necessarily true.  Seeing as (when the statistics are correct) we
know each of these inner tuples will match with the largest amount of
outer tuples it is just as much of a win per inner tuple as when they
are unique.  There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets.  The potential win is still pretty high though.

- Bryce Cutt


On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Bryce Cutt <pandasuit@gmail.com> writes:
>>> Here is the new patch.
>>
>> Applied with revisions.  I undid some of the "optimizations" that
>> cluttered the code in order to save a cycle or two per tuple --- as per
>> previous discussion, that's not what the performance questions were
>> about.  Also, I did not like the terminology "in-memory"/"IM"; it seemed
>> confusing since the main hash table is in-memory too.  I revised the
>> code to consistently refer to the additional hash table as a "skew"
>> hashtable and the optimization in general as skew optimization.  Hope
>> that seems reasonable to you --- we could search-and-replace it to
>> something else if you'd prefer.
>>
>> For the moment, I didn't really do anything about teaching the planner
>> to account for this optimization in its cost estimates.  The initial
>> estimate of the number of MCVs that will be specially treated seems to
>> me to be too high (it's only accurate if the inner relation is unique),
>> but getting a more accurate estimate seems pretty hard, and it's not
>> clear it's worth the trouble.  Without that, though, you can't tell
>> what fraction of outer tuples will get the short-circuit treatment.
>
> If the inner relation isn't fairly close to unique you shouldn't be
> using this optimization in the first place.
>
> ...Robert
>


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Next
From: Andrew Gierth
Date:
Subject: contrib function naming, and upgrade issues