Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours! - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!
Date
Msg-id 4de4bd67-6bfe-2b44-9769-db750e169a5f@2ndquadrant.com
Whole thread Raw
In response to [HACKERS] TPC-H Q20 from 1 hour to 19 hours!  (Rafia Sabih <rafia.sabih@enterprisedb.com>)
Responses Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

I've been looking at this issue today, and so far I don't think it's a 
bug in the foreign key estimation. It seems mostly that the 9.5 
estimates were hopelessly bad, and the join estimation changes simply 
pushed it a tiny bit the wrong direction.

Although maybe there is a bug (or at least a change of behavior) in one 
case, but I'll get to that.

I've managed to extract a small part of Q20 that demonstrates the 
differences between versions quite nicely, I think. The part causing the 
trouble looks like this:

   explain select
       ps_suppkey
   from
       partsupp,
       (
           select
               l_partkey agg_partkey,
               l_suppkey agg_suppkey
           from
               lineitem
           where
               l_shipdate >= date '1997-01-01'
               and l_shipdate < date '1997-01-01' + interval '1' year
           group by
               l_partkey,
               l_suppkey
       ) agg_lineitem
   where
       agg_partkey = ps_partkey
       and agg_suppkey = ps_suppkey
       and ps_partkey in (
           select
               p_partkey
           from
               part
           where
               p_name like 'hot%'
       );

i.e. it aggregates the "lineitem" table, and then joins "partsupp" and 
"part" tables to it.

     "aggregated lineitem" <-> partsupp <-> part

I've collected estimates from four different variants of the query (see 
the attached exlain.sql):

1) SIMPLE
   - join directly to lineitem (without the aggregation)
   - remove the p_name LIKE pattern matching

2) SIMPLE+LIKE
   - like SIMPLE, but keep the LIKE condition

3) GROUPING
   - join to the aggregated lineitem table
   - remove the p_name LIKE pattern matching

4) GROUPING+LIKE
   - like GROUPING, but keep the LIKE condition

I've collected estimates on a 20GB data set, both from 9.5 (so without 
any of the FK estimation changes) and on master with different foreign 
keys between the tables.

no-keys  - no foreign keys between the three tables
lineitem - lineitem references partsupp
partsupp - partsupp references part
both     - both foreign keys

And the results look like this (actual row counts were collected on 9.5, 
but that should not matter - the results should be the same on all 
versions):

       branch     SIMPLE     SIMPLE+LIKE     GROUPING    GROUPING+LIKE
   --------------------------------------------------------------------
       actual  119994608         1311974     10897186           119238
          9.5       2863              35          160              160
      no-keys       2340              24          868              868
     lineitem  119994848         1229750          868              868
     partsupp       2340              24         1737               18
    both-keys  119994848         1212065         1737               18

This seems mostly sane, I guess, but let's look at various cases.

In the SIMPLE cases, the foreign key "lineitem->partsupp" makes a huge 
difference - the estimates are pretty exact, both with and without the 
LIKE condition. The "partsupp->part" key makes almost no difference, 
though - the minor differences (35/24 and 1229750/1212065) seem to be 
mostly due to minor differences in stats built by ANALYZE, particularly 
in histograms used by patternsel().

In the GROUPING cases, the situation is obviously much worse. The 
grouping makes it impossible to use the "lineitem->partsupp" foreign 
key, resulting in severe underestimates. The "partsupp->part" is used, 
but the difference is pretty negligible as it's a simple (one column) 
foreign key.

The change from 160 -> 868 is merely due to 84f9a35e3 changing how we 
estimate number of groups in a GROUP BY clause. In 9.5 we get this:

     ->  HashAggregate  (rows=1836028) (actual rows=10897186)

while since 9.6 we get this

     ->  GroupAggregate  (rows=9674242)

Not only is that much closer to the actual value than the 9.5 estimate, 
but it's almost exactly the factor between 160 and 868:

     9674242 / 1836028 = 5.27
     160 * 5.26 = 843

So I'd say the 160 vs. 868 is expected, although the result is still way 
off, of course.

Which brings me to the slightly suspicious bit. On 9.5, there's no 
difference between GROUP and GROUP+LIKE cases - the estimates are 
exactly the same in both cases. This is true too, but only without the 
foreign key between "partsupp" and "part", i.e. the two non-grouped 
relations in the join. And what's more, the difference (1737 vs. 16) is 
pretty much exactly 100x, which is the estimate for the LIKE condition.

So it kinda seems 9.5 does not apply this condition for semi-joins, 
while >=9.6 does that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: [HACKERS] Letting the client choose the protocol to use during aSASL exchange
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] parallel bitmapscan isn't exercised in regression tests