Re: Our trial to TPC-DS but optimizer made unreasonable plan - Mailing list pgsql-hackers
From | Kouhei Kaigai
Subject | Re: Our trial to TPC-DS but optimizer made unreasonable plan
Msg-id | 9A28C8860F777E439AA12E8AEA7694F801133F57@BPXM15GP.gisp.nec.co.jp
In response to | Re: Our trial to TPC-DS but optimizer made unreasonable plan (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses | Re: Our trial to TPC-DS but optimizer made unreasonable plan
List | pgsql-hackers
Here is one other thing I could learn from the TPC-DS benchmark.

The attached query is Q4 of TPC-DS, and the result was taken against SF=100. It took a long time to complete (about 30min); please see the attached EXPLAIN ANALYZE output.

Its core workload is in the CTE year_total. Its Append node has three underlying HashAggregate nodes, each of which also has a table join beneath it. Below is the first HashAggregate node. It consumes 268M rows, then generates 9M rows. The underlying GpuJoin takes 74sec to process the 268M rows, so we can guess the HashAggregate itself consumed roughly 400sec.

  HashAggregate  (cost=18194948.40..21477516.00 rows=262605408 width=178) (actual time=480652.856..488055.918 rows=9142442 loops=1)
    Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
    ->  Custom Scan (GpuJoin)  (cost=101342.54..9660272.64 rows=262605408 width=178) (actual time=2472.174..73908.894 rows=268562375 loops=1)
          Bulkload: On (density: 100.00%)
          Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows (287997024 -> 275041999, 95.50% expected 95.47%)
          Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 93.25% expected 91.18%)
          ->  Custom Scan (BulkScan) on store_sales  (cost=0.00..9649559.60 rows=287996960 width=38) (actual time=18.372..54522.803 rows=287997024 loops=1)
          ->  Seq Scan on date_dim  (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.015..15.533 rows=73049 loops=1)
          ->  Seq Scan on customer  (cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.018..582.373 rows=2000000 loops=1)

The other two HashAggregate nodes behave similarly. The second one consumed 281sec, including 30sec in its underlying GpuJoin. The third one consumed 138sec, including 25sec in its underlying GpuJoin.

Apart from my custom join implementation, it seems to me the HashAggregate node consumed much more time than expected. One characteristic of this workload is that the aggregation takes eight grouping keys. I suspect the cost of function invocations for hash-value computation and equality checks may be the culprit.

Let's dive into nodeAgg.c. ExecAgg() calls agg_fill_hash_table() to fill up the hash table with rows fetched from the underlying plan. While constructing the hash table, it calls hash functions (in TupleHashTableHash) and equality-check functions (in execTuplesMatch) repeatedly. Both of them use the FunctionCallX() interface, which passes arguments through an FmgrInfo structure. That is usually not the best way from a performance perspective. In particular, this workload has 268M input rows and 8 grouping keys, so roughly 268M (rows) x 8 (grouping keys) x 2 (hash + equal) = about 4.3B function calls via FmgrInfo will happen.

I think the SortSupport logic provides a reasonable model to solve this kind of problem. For example, btint4sortsupport() installs a function pointer to the fast version of the comparator (btint4fastcmp), which takes two Datum arguments directly, without the indirect call machinery. A similar mechanism would also make sense for the HashAggregate logic, to reduce the cost of function invocations.

Please comment on the idea I noticed here.

And, as an aside, if HashAggregate could pick up a hash value of the grouping keys from the target list of the underlying plan node (GpuJoin in this case), it might be possible to off-load the calculation of hash values to the GPU device. :-)
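To make the idea a bit more concrete, here is a rough standalone sketch in plain C. This is not backend code: the names (FakeFunctionCallInfo, fake_FunctionCall2, fast_int4eq, fast_int4hash, and the "hash support" callback types) are made up for illustration only. It just contrasts an FmgrInfo-style call convention, which marshals arguments into a call-info struct on every invocation, with a SortSupport-style direct two-Datum callback that a hypothetical hash-support API could hand to the hash aggregation code.

/*
 * Standalone sketch (NOT PostgreSQL source): contrasts an FmgrInfo-style
 * indirect call convention with a SortSupport-style direct function pointer,
 * to illustrate why a fast-path "hash support" callback could help
 * HashAggregate. All names below are simplified stand-ins.
 */
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t Datum;                /* stand-in for the backend's Datum */

/* --- FmgrInfo-style path: arguments marshalled into a call-info struct --- */
typedef struct FakeFunctionCallInfo
{
    Datum   args[2];
    int     nargs;
} FakeFunctionCallInfo;

typedef Datum (*FakePGFunction)(FakeFunctionCallInfo *fcinfo);

static Datum
fake_int4eq(FakeFunctionCallInfo *fcinfo)
{
    return (Datum) ((int32_t) fcinfo->args[0] == (int32_t) fcinfo->args[1]);
}

static Datum
fake_FunctionCall2(FakePGFunction fn, Datum a, Datum b)
{
    FakeFunctionCallInfo fcinfo;

    fcinfo.args[0] = a;                 /* per-call argument marshalling ... */
    fcinfo.args[1] = b;
    fcinfo.nargs = 2;
    return fn(&fcinfo);                 /* ... plus an indirect call */
}

/* --- SortSupport-style path: direct two-Datum equality / hash callbacks --- */
typedef int (*FastEqFunc)(Datum a, Datum b);
typedef uint32_t (*FastHashFunc)(Datum d);

static int
fast_int4eq(Datum a, Datum b)
{
    return (int32_t) a == (int32_t) b;  /* no fcinfo setup, no extra layer */
}

static uint32_t
fast_int4hash(Datum d)
{
    uint32_t h = (uint32_t) d;

    h ^= h >> 16;                       /* toy mixing only, for illustration */
    h *= 0x85ebca6bU;
    return h;
}

int
main(void)
{
    Datum   a = 42, b = 42;

    /* FmgrInfo-style: every per-row, per-column check pays the marshalling. */
    printf("fmgr-style equal: %lu\n",
           (unsigned long) fake_FunctionCall2(fake_int4eq, a, b));

    /* Fast-path style: fetch the raw callbacks once, then call them directly. */
    FastEqFunc   eq = fast_int4eq;
    FastHashFunc hf = fast_int4hash;

    printf("fast equal: %d, fast hash: %u\n", eq(a, b), hf(a));
    return 0;
}

The real savings would of course have to be measured in the backend; the point is only that handing agg_fill_hash_table() raw Datum callbacks, the way SortSupport hands btint4fastcmp to the sort code, removes one layer of argument marshalling from each of the roughly 4.3B calls in this workload.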
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Thursday, August 13, 2015 8:23 PM
> To: Greg Stark
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
>
> > On Thu, Aug 13, 2015 at 2:49 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > > In fact, cost of HashJoin underlying Sort node is:
> > >   ->  Hash Join  (cost=621264.91..752685.48 rows=1 width=132)
> > >
> > > On the other hands, NestedLoop on same place is:
> > >   ->  Nested Loop  (cost=0.00..752732.26 rows=1 width=132)
> > >
> > > Probably, small GUC adjustment may make optimizer to prefer HashJoin towards
> > > these kind of queries.
> >
> > With that kind of discrepancy I doubt adjusting GUCs will be sufficient
> >
> > > Do you have a good idea?
> >
> > Do you have EXPLAIN ANALYZE from the plan that finishes? Are there any
> > row estimates that are way off?
>
> Yes, EXPLAIN ANALYZE is attached.
>
> According to this, CTE year_total generates 384,208 rows. It is much smaller
> than estimation (4.78M rows), however, filter's selectivity of CTE Scan was
> not large as expectation.
> For example, the deepest CTE Scan returns 37923 rows and 26314 rows, even though
> 40 rows were expected. On the next level, relations join between 11324 rows and
> 9952 rows, towards to estimation of 40 rows x 8 rows.
> If NestLoop is placed instead of HashJoin, it will make an explosion of the number
> of loops.
>
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>
Attachment