Re: Our trial to TPC-DS but optimizer made unreasonable plan - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: Our trial to TPC-DS but optimizer made unreasonable plan
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F801133F57@BPXM15GP.gisp.nec.co.jp
In response to Re: Our trial to TPC-DS but optimizer made unreasonable plan  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses Re: Our trial to TPC-DS but optimizer made unreasonable plan
Re: Our trial to TPC-DS but optimizer made unreasonable plan
Re: Our trial to TPC-DS but optimizer made unreasonable plan
List pgsql-hackers
Here is one more thing I learned from the TPC-DS benchmark.

The attached query is Q4 of TPC-DS, and the result was measured at SF=100.
It took a long time to complete (about 30 min); please see the attached
EXPLAIN ANALYZE output.

Its core workload is in the CTE year_total. Its Append node has three
underlying HashAggregate nodes, each of which also performs a table join.

Below is the first HashAggregate node. It consumes 268M rows and generates
9M rows. The underlying GpuJoin takes 74 sec to deliver those 268M rows, so
we can estimate the HashAggregate itself consumed roughly 400 sec
(488 sec total minus 74 sec in the GpuJoin).

  HashAggregate  (cost=18194948.40..21477516.00 rows=262605408 width=178) (actual time=480652.856..488055.918 rows=9142442 loops=1)
    Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
    ->  Custom Scan (GpuJoin)  (cost=101342.54..9660272.64 rows=262605408 width=178) (actual time=2472.174..73908.894 rows=268562375 loops=1)
          Bulkload: On (density: 100.00%)
          Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows (287997024 -> 275041999, 95.50% expected 95.47%)
          Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 93.25% expected 91.18%)
          ->  Custom Scan (BulkScan) on store_sales  (cost=0.00..9649559.60 rows=287996960 width=38) (actual time=18.372..54522.803 rows=287997024 loops=1)
          ->  Seq Scan on date_dim  (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.015..15.533 rows=73049 loops=1)
          ->  Seq Scan on customer  (cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.018..582.373 rows=2000000 loops=1)
 

The other two HashAggregate nodes behave similarly. The second one consumed
281 sec, including 30 sec in its underlying GpuJoin. The third one consumed
138 sec, including 25 sec in its underlying GpuJoin.

Apart from my custom join implementation, it seems to me the HashAggregate
node consumed much more time than expected.

One characteristic of this workload is that the aggregation uses eight
grouping keys. I suspect the cost of the function invocations for hash-value
computation and equality checks may be the culprit.


Let's dive into nodeAgg.c.
ExecAgg() calls agg_fill_hash_table() to fill the hash table with rows
fetched from the underlying plan. While building the hash table, it calls
hash functions (in TupleHashTableHash) and equality-check functions (in
execTuplesMatch) repeatedly.
Both go through the FunctionCallX() interface, which passes arguments via
the FmgrInfo structure. That is usually not the best approach from a
performance perspective. In particular, this workload has 268M input rows
and 8 grouping keys, so roughly 268M (rows) x 8 (grouping keys) x 2 (for
hash/equal) = 4.3B function calls via FmgrInfo will happen.
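
To illustrate the shape of the problem, here is a simplified sketch of the
per-row hashing (loosely based on TupleHashTableHash in execGrouping.c; not
verbatim PostgreSQL source). Every grouping key of every input row goes
through FunctionCall1() and its FmgrInfo indirection, and execTuplesMatch()
repeats the pattern for the equality checks:

#include "postgres.h"
#include "fmgr.h"

/*
 * Simplified sketch: hash the grouping keys of one input row.
 * Each of the numCols keys is hashed via FunctionCall1(), i.e. an
 * indirect call whose argument is marshalled through FmgrInfo.
 */
static uint32
hash_grouping_keys(FmgrInfo *hashfunctions, int numCols,
                   Datum *values, bool *isnull)
{
    uint32      hashkey = 0;
    int         i;

    for (i = 0; i < numCols; i++)
    {
        /* rotate hashkey left 1 bit before mixing in the next column */
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);

        if (!isnull[i])
        {
            /* one FmgrInfo-mediated call per grouping key per row */
            uint32  hkey = DatumGetUInt32(
                               FunctionCall1(&hashfunctions[i], values[i]));

            hashkey ^= hkey;
        }
    }
    return hashkey;
}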

I think the SortSupport logic provides a reasonable way to solve this kind
of problem. For example, btint4sortsupport() supplies a function pointer to
a fast version of the comparator (btint4fastcmp), which takes two Datum
arguments directly, without the indirect memory references of the fmgr
interface.
A similar mechanism would also make sense for the HashAggregate logic, to
reduce the cost of function invocations.
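
Just to make the idea concrete, below is a rough, hypothetical sketch; none
of these names (HashSupportData, fasthash, int4fasthash) exist in PostgreSQL
today. The point is only that, as with SortSupport, a type's opclass could
hand the executor bare C function pointers, so the hash-table fill loop can
skip FunctionCall1() entirely:

#include "postgres.h"
#include "fmgr.h"

/* Hypothetical "hash support" state, analogous to SortSupportData */
typedef struct HashSupportData
{
    /* fast hash: takes a bare Datum, no FmgrInfo marshalling */
    uint32      (*fasthash) (Datum datum, struct HashSupportData *hsup);
    /* fast equality check, in the spirit of ApplySortComparator() */
    bool        (*fastequal) (Datum a, Datum b, struct HashSupportData *hsup);
    void       *extra;          /* per-type private state, if needed */
} HashSupportData;

typedef HashSupportData *HashSupport;

/*
 * Example fast hash for int4, analogous in spirit to btint4fastcmp.
 * (Trivial placeholder; a real implementation would use a proper
 * integer hash, but still without going through fmgr.)
 */
static uint32
int4fasthash(Datum datum, HashSupport hsup)
{
    return (uint32) DatumGetInt32(datum);
}

/* The hash-table fill loop could then hash grouping keys like this */
static uint32
hash_grouping_keys_fast(HashSupport hsup, int numCols,
                        Datum *values, bool *isnull)
{
    uint32      hashkey = 0;
    int         i;

    for (i = 0; i < numCols; i++)
    {
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
        if (!isnull[i])
            hashkey ^= hsup[i].fasthash(values[i], &hsup[i]);
    }
    return hashkey;
}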

Please comment on the idea outlined above.


And, as an aside, if HashAggregate could pick up a precomputed hash value of
the grouping keys from the target list of the underlying plan node (GpuJoin
in this case), it might even be possible to off-load the hash-value
calculation onto the GPU device. :-)

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Thursday, August 13, 2015 8:23 PM
> To: Greg Stark
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
> 
> > On Thu, Aug 13, 2015 at 2:49 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > > In fact, the cost of the HashJoin beneath the Sort node is:
> > >     ->  Hash Join  (cost=621264.91..752685.48 rows=1 width=132)
> > >
> > > On the other hand, the NestedLoop at the same place is:
> > >     ->  Nested Loop  (cost=0.00..752732.26 rows=1 width=132)
> > >
> > > Probably, a small GUC adjustment may make the optimizer prefer HashJoin for
> > > these kinds of queries.
> >
> > With that kind of discrepancy I doubt adjusting GUCs will be sufficient
> >
> > > Do you have a good idea?
> >
> > Do you have EXPLAIN ANALYZE from the plan that finishes? Are there any
> > row estimates that are way off?
> >
> Yes, EXPLAIN ANALYZE is attached.
> 
> According to this, CTE year_total generates 384,208 rows. That is much smaller
> than the estimate (4.78M rows); however, the selectivity of the filters on the
> CTE Scans was not as high as expected.
> For example, the deepest CTE Scans return 37923 rows and 26314 rows, even though
> 40 rows were expected. At the next level, the join processes 11324 rows against
> 9952 rows, versus an estimate of 40 rows x 8 rows.
> If a NestLoop is placed there instead of a HashJoin, it causes an explosion in
> the number of loops.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>

