Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date
Msg-id 603c8f070901012003s60a6cfb9l1d91ba6326f85b0b@mail.gmail.com
In response to Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets  ("Bryce Cutt" <pandasuit@gmail.com>)
List pgsql-hackers
On Tue, Dec 30, 2008 at 12:29 AM, Bryce Cutt <pandasuit@gmail.com> wrote:
> Here is the next patch version.

Thanks for posting this update.  This is definitely getting better,
but I still see some style issues.  We can work on fixing those once
the rest of the details have been finalized.

However, one question in this area: doesn't
ExecHashFreezeNextMCVPartition actually deal with a most common TUPLE
partition, rather than a most common VALUE partition (and similarly
for ExecHashGetMCVPartition)?  I'm not quite sure what to do about
this, since the names are already quite long - is there some better
name for the functions and structure members than
MostCommonTuplePartition?  Maybe we could call it the in-memory
partition and abbreviate it to IMPartition throughout.  I think that
would make things clearer.
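
To make that concrete, here's roughly the shape I have in mind.  The
signatures below are invented purely for illustration (with a
placeholder typedef so the fragment stands alone); they're not taken
from the patch:

/* Naming sketch only -- invented signatures, not code from the patch. */
typedef struct HashJoinTableData *HashJoinTable;  /* stand-in for the real executor type */

/* was ExecHashGetMCVPartition: map a hash value to the in-memory partition */
extern int ExecHashGetIMPartition(HashJoinTable hashtable, unsigned int hashvalue);

/* was ExecHashFreezeNextMCVPartition: spill part of the in-memory partition to the batches */
extern void ExecHashFreezeNextIMPartition(HashJoinTable hashtable);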

> The code can now find the MCVs in more cases.  Even if the probe
> side is an operator other than a seq scan (such as another hashjoin),
> the code can now find the stats tuple for the underlying relation.

You're using varnoold in a way that directly contradicts the comment
in primnodes.h (essentially, that it's not used for anything other
than debugging).  I don't think this is a bad thing, but you have to
patch the comment.
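
For anyone following along, the lookup I understand this to be doing
is roughly along the lines below.  This is just my own sketch against
the catalog APIs as I remember them - relid is assumed to come from
resolving varnoold against the range table, and the exact calls and
signatures may not match what the patch actually does:

#include "postgres.h"
#include "catalog/pg_statistic.h"
#include "nodes/primnodes.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"

/*
 * Sketch only: fetch the MCV list for the attribute a Var originally
 * referenced.  "relid" is assumed to come from resolving var->varnoold
 * against the range table; var->varoattno is the original attribute
 * number.
 */
static bool
fetch_mcvs_for_var(Oid relid, Var *var)
{
    HeapTuple   statsTuple;
    Datum      *values;
    int         nvalues;
    float4     *numbers;
    int         nnumbers;
    bool        found = false;

    statsTuple = SearchSysCache(STATRELATT,
                                ObjectIdGetDatum(relid),
                                Int16GetDatum(var->varoattno),
                                0, 0);
    if (!HeapTupleIsValid(statsTuple))
        return false;           /* no statistics: skip the MCV optimization */

    if (get_attstatsslot(statsTuple, var->vartype, var->vartypmod,
                         STATISTIC_KIND_MCV, InvalidOid,
                         &values, &nvalues,
                         &numbers, &nnumbers))
    {
        /* values[] / numbers[] now hold the MCVs and their frequencies */
        found = true;
        free_attstatsslot(var->vartype, values, nvalues, numbers, nnumbers);
    }

    ReleaseSysCache(statsTuple);
    return found;
}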

Have you done any performance testing on the impact of this change?

> The new idea of limiting the number of MCVs to a percentage of memory
> has not been added yet.

That's a pretty important change, I think, though it would be nice to
have one of the committers chime in here.  For those who may not have
been following the thread closely, the current implementation's memory
usage can go quite a bit higher than work_mem: the in-memory open hash
table can itself be up to 1MB or so (with statistics_target = 10000),
it can hold up to work_mem worth of tuples, and each batch can contain
another work_mem worth of tuples on top of that.  The proposal is to
carve out 1-3% of work_mem for the in-memory hash table and leave the
rest for the batches, so that the number of batches is hopefully not
affected very much.  If it doesn't look like the whole MCV list will
fit, we'll take a shot at guessing how long a prefix of it will.
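
To put rough numbers on that (the figures and names below are made up
purely to illustrate the budgeting, not taken from the patch): with
work_mem at 1MB, a 2% carve-out is about 20kB, and at an assumed 100
bytes per tuple that's roughly 200 MCV tuples, so a 1000-entry MCV
list would be trimmed back to the prefix that fits.  In code form,
something like:

#include <stdio.h>

/*
 * Illustration only -- hypothetical numbers, not code or constants from
 * the patch.  Budget a small fraction of work_mem for the in-memory
 * (MCV) partition and keep only as many MCVs as fit in that budget.
 */
int
main(void)
{
    const long   work_mem_bytes = 1024L * 1024; /* work_mem = 1MB */
    const double im_fraction = 0.02;            /* 2% carve-out (1-3% proposed) */
    const long   est_tuple_size = 100;          /* assumed average tuple width */
    const int    mcv_list_length = 1000;        /* MCVs available in pg_statistic */

    long    im_budget = (long) (work_mem_bytes * im_fraction);
    long    max_mcv_tuples = im_budget / est_tuple_size;
    int     mcvs_to_use = (max_mcv_tuples < mcv_list_length)
                          ? (int) max_mcv_tuples : mcv_list_length;

    printf("in-memory partition budget: %ld bytes\n", im_budget);
    printf("MCVs kept: %d of %d\n", mcvs_to_use, mcv_list_length);
    return 0;
}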

...Robert

