Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets - Mailing list pgsql-hackers

From Lawrence, Ramon
Subject Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date
Msg-id 6EEA43D22289484890D119821101B1DF2C1885@exchange20.mercury.ad.ubc.ca
Whole thread Raw
In response to Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets  ("Bryce Cutt" <pandasuit@gmail.com>)
Responses Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
> Here is a cleaned-up version.  I fixed a number of whitespace issues,
> improved a few comments, and rearranged one set of nested if-else
> statements (hopefully without breaking anything in the process).
>
> Josh / eggyknap -
>
> Can you rerun your performance tests with this version of the patch?

To help with testing, we have constructed a patch specifically for
testing.  The patch is the same as Robert's version except that it
tracks and prints out statistics during the join on how many tuples are
affected and has the enable_hashjoin_usestatmcvs variable defined so
that it is easy to turn on/off skew handling.  This is useful as
although the patch reduces the number of I/Os performed, this
improvement may not be seen in some queries which are dominated by other
cost factors (non-skew joins, CPU time, time to scan input relations,
etc.).

The sample output looks like this:

LI-P
Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est.
Save: 16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98

The other change is that the system calculates the skew and will not use
the in-memory skew partition if the skew is less than 1%.

Finally, we have attached some performance results for the TPCH 10G data
set (skew factors z=1 and z=2).  For the Customer-Orders-Lineitem-Part
query that Josh was testing, we see no overall time difference that is
significant compared to experimental error (although there is I/O
benefit for the Lineitem-Part join).  This query cost is dominated by
the non-skew joins of Customer-Orders and Orders-Lineitem and output
tuple construction.

The joins with skew, Lineitem-Supplier and Lineitem-Part, show
significantly improved performance.  Note how the statistics show that
the percentage I/O savings is directly proportional to the skew.
However, the overall query time savings is always less than this as
there are other costs such as reading the relations, performing the hash
comparisons, building the output tuples, etc. that are unaffected by the
optimization.

At this point, we await further feedback on what is necessary to get
this patch accepted.  We would also like to thank Josh and Robert again
for their review time.

Sincerely,

Ramon Lawrence and Bryce Cutt

Attachment

pgsql-hackers by date:

Previous
From: Martijn van Oosterhout
Date:
Subject: Re: Updates of SE-PostgreSQL 8.4devel patches (r1403)
Next
From: Bruce Momjian
Date:
Subject: Re: solaris libpq threaded build fails