Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets - Mailing list pgsql-hackers
From | Bryce Cutt |
---|---|
Subject | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
Date | |
Msg-id | 1924d1180812222321t6ea99444xc01feaee47b26326@mail.gmail.com Whole thread Raw |
In response to | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets (Joshua Tolley <eggyknap@gmail.com>) |
Responses |
Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
("Robert Haas" <robertmhaas@gmail.com>)
|
List | pgsql-hackers |
Because there is no nice way in PostgreSQL (that I know of) to derive a histogram after a join (on an intermediate result) currently usingMostCommonValues is only enabled on a join when the outer (probe) side is a table scan (seq scan only actually). See getMostCommonValues (soon to be called ExecHashJoinGetMostCommonValues) for the logic that determines this. Here is the result of explain (on a 100MB version of PostgreSQL): "Hash Left Join (cost=16232.00..91035.00 rows=600000 width=526)" " Hash Cond: (l.l_partkey = p.p_partkey)" " -> Hash Left Join (cost=15368.00..75171.00 rows=600000 width=395)" " Hash Cond: (l.l_orderkey = o.o_orderkey)" " -> Seq Scan on lineitem l (cost=0.00..17867.00 rows=600000 width=125)" " -> Hash (cost=8073.00..8073.00 rows=150000 width=270)" " -> Hash Left Join (cost=700.50..8073.00 rows=150000 width=270)" " Hash Cond: (o.o_custkey = c.c_custkey)" " -> Seq Scan on orders o (cost=0.00..4185.00 rows=150000 width=109)" " -> Hash (cost=513.00..513.00 rows=15000 width=161)" " -> Seq Scan on customer c (cost=0.00..513.00 rows=15000 width=161)" " -> Hash (cost=614.00..614.00 rows=20000 width=131)" " -> Seq Scan on part p (cost=0.00..614.00 rows=20000 width=131)" If you take a look at the explain for that join you will see that the first of the relations joined are orders and customer on custkey. There is almost no skew in the o_custkey attribute of orders even in the Z2 dataset so the difference between hashjoin with and without usingMostCommonValues enabled is quite small. The second join performed is to join the result of the first join with lineitem on orderkey. There is no skew at all in the l_orderkey attribute of lineitem so usingMostCommonValues can not help at all. The third join performed is to join the result of the second join with part on partkey. There is a lot of skew in the l_partkey attribute of lineitem but because the probe side of the third join is an intermediate from the second join and not a seq scan the algorithm cannot figure out the MCVs of the probe side. So on the query presented almost no skew can be exploited on the first join and no other joins can have their skew exploited at all because of the order PostgreSQL does the joins in. Basically yes, you would not see any real benefit from using the most common values on this query. We experimented with sampling (mentioned in the paper) to make an educated guess of MCVs on intermediary results but found that because a random sample could not be obtained the results were always very inaccurate. I basically just read a percentage of tuples from the probe relation before partitioning the build relation, derived the MCVs in a single pass, wrote the tuples back out to a temp file (because reading back from here is less expensive than resetting the probe side tree), then did the join as usual while remembering to read back from my temp file before reading the rest of the probe side tuples. Our tests indicate that sampling is not likely a good solution for deriving MCVs from intermediary results. In the Java implementation of histojoin we experimented with exploiting skew in multiple joins of a star join with some success (detailed in paper). I am not sure how this would be accomplished nicely in PostgreSQL. If the cost operators knew how to order the joins to make the best use of skew in the relations PostgreSQL could use the benefits of histojoin more often if perhaps doing a join with skew first would have speed benefits over doing the smaller join first. This change could be a future addition to PostgreSQL if this patch is accepted. It relies on getting the stats tuple for the join during the planning phase (in the cost function) and estimating the benefit that would have on the join cost. - Bryce Cutt On Mon, Dec 22, 2008 at 6:15 AM, Joshua Tolley <eggyknap@gmail.com> wrote: > On Sun, Dec 21, 2008 at 10:25:59PM -0500, Robert Haas wrote: >> [Some performance testing.] > > I (finally!) have a chance to post my performance testing results... my > apologies for the really long delay. <Excuses omitted> > > Unfortunately I'm not seeing wonderful speedups with the particular > queries I did in this case. I generated three 1GB datasets, with skews > set at 1, 2, and 3. The test script I wrote turns on enable_usestatmcvs > and runs EXPLAIN ANALYZE on the same query five times. Then it turns > enable_usestatmcvs off, and runs the same query five more times. It does > this with each of the three datasets in turn, and then starts over at > the beginning until I tell it to quit. My results showed a statistically > significant improvement in speed only on the skew == 3 dataset. > > I did the same tests twice, once with default_statistics_target set to > 10, and once with it set to 100. I've attached boxplots of the total > query times as reported by EXPLAIN ANALYZE ("dst10" in the filename > indicates default_statistics_target was 10, and so on), my results > parsed out of the EXPLAIN ANALYZE output (test.filtered.10 and > test.filtered.100), the results of one-tailed Student's T tests of the > result set (ttests), and the R code to run the tests if anyone's really > interested (t.test.R). > > The results data includes six columns: the skew value, whether > enable_usestatmcvs was on or not (represented by a 1 or 0), total times > for each of the three joins that made up the query, and total time for > the query itself. The results above pay attention only to the total > query time. > > Finally, the query involved: > > SELECT * FROM lineitem l LEFT JOIN part p ON (p.p_partkey = l.l_partkey) > LEFT JOIN orders o ON (o.o_orderkey = l.l_orderky) LEFT JOIN customer c > ON (c.c_custkey = o.o_custkey); > > - Josh / eggyknap > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.9 (GNU/Linux) > > iEYEARECAAYFAklPoRYACgkQRiRfCGf1UMNUJgCcCxCRNXJS65nXqMsY2h6PENKF > YkQAoJlSlaaHd2L5dkFUAc8GPKfKezS5 > =KWfi > -----END PGP SIGNATURE----- > >
pgsql-hackers by date: