Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets - Mailing list pgsql-hackers

From	Joshua Tolley
Subject	Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date	December 23, 2008 10:52:07
Msg-id	20081223145146.GA5882@uber
In response to	Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets ("Robert Haas" <robertmhaas@gmail.com>)
Responses	Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
List	pgsql-hackers

On Tue, Dec 23, 2008 at 09:22:27AM -0500, Robert Haas wrote:
> On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt <pandasuit@gmail.com> wrote:
> > Because there is no nice way in PostgreSQL (that I know of) to derive
> > a histogram after a join (on an intermediate result) currently
> > usingMostCommonValues is only enabled on a join when the outer (probe)
> > side is a table scan (seq scan only actually).  See
> > getMostCommonValues (soon to be called
> > ExecHashJoinGetMostCommonValues) for the logic that determines this.

So my test case of "do a whole bunch of hash joins in a test query"
isn't really valid. Makes sense. I did another, more haphazard test on a
query with fewer joins, and saw noticeable speedups.

> It's starting to seem to me that the case where this patch provides a
> benefit is so narrow that I'm not sure it's worth the extra code.

Not that anyone asked, but I don't consider myself qualified to render
judgement on that point. Code size is, I guess, a maintainability issue,
and I'm not terribly experienced maintaining PostgreSQL :)

> Is it realistic to think that the MCVs of the base relation might
> still be applicable to the joinrel?  It's certainly easy to think of
> counterexamples, but it might be a good approximation more often than
> not.

It's equivalent to our assumption that distributions of values in
columns in the same table are independent. Making that assumption in
this case would probably result in occasional dramatic speed
improvements similar to the ones we've seen in less complex joins,
offset by just-as-occasional dramatic slowdowns of similar magnitude. In
other words, it will increase the variance of our results.
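
To make the variance argument concrete, here is a rough sketch in Python.
It is not code from the patch; the helper names (most_common_values,
mcv_hit_fraction) and the toy data are assumptions made up for
illustration. It computes the MCVs of a base relation, then measures what
fraction of probe tuples would match those MCVs in two cases: a joinrel
that keeps the base relation's skew, and one where an earlier join has
filtered out the hot keys.

```python
from collections import Counter


def most_common_values(values, k):
    """Return the k most frequent values (a toy stand-in for stored MCV stats)."""
    return {v for v, _ in Counter(values).most_common(k)}


def mcv_hit_fraction(probe_values, mcvs):
    """Fraction of probe tuples whose join key is one of the MCVs."""
    probe_values = list(probe_values)
    if not probe_values:
        return 0.0
    hits = sum(1 for v in probe_values if v in mcvs)
    return hits / len(probe_values)


# Hypothetical base relation: key 1 is heavily skewed, key 2 mildly so.
base = [1] * 80 + [2] * 10 + list(range(3, 13))
mcvs = most_common_values(base, k=2)

# Case A: the intermediate result keeps the base relation's distribution.
joinrel_same = base
# Case B: an earlier join happened to eliminate the hot keys entirely.
joinrel_filtered = [v for v in base if v not in (1, 2)]

print(mcv_hit_fraction(joinrel_same, mcvs))      # 0.9
print(mcv_hit_fraction(joinrel_filtered, mcvs))  # 0.0
```

In case A nearly every probe tuple benefits from keeping the MCV build
tuples in memory; in case B that memory is reserved for keys that never
show up, which is the kind of slowdown I'd expect to see now and then.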

- Josh