longer-term optimizer musings - Mailing list pgsql-hackers

From: Erik Riedel
Subject: longer-term optimizer musings
In response to: Re: [HACKERS] optimizer and type question  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
[The changes discussed below are longer-term issues than my two
previous posts, which is why I am not attaching the patch for these
changes.  My hope is that the description of this experiment will
serve as input to whoever works on the optimizer in the future.

This should be considered a long-term "think about" and "probably
ignore as too specific" optimization.  You have been warned.  I just
wanted to brain-dump this, on the chance that it is useful somehow in
the future.   - Erik ]

Context: cost and size estimates inside the optimizer, particularly
Aggregate/Group nodes.

The basic idea is as follows: we look at the min and max values of all
the GROUP BY attributes, and if they have a small dynamic range then we
_know_ that the result of the aggregation will have a small number of
rows (bounded by the cross product of the ranges of these attributes).
This information is then used to make better cost estimates during
optimization.

e.g. if we have two attributes 'a' and 'b' and we know (from
pg_statistics) that all 'a' values are in the range 25 to 35 and all
'b' values are in the range 75 to 100 then a GROUP BY on columns 'a'
and 'b' cannot result in more than 250 [ (100 - 75) * (35 - 25) = 25 *
10 ] rows, no matter how many rows are input.  We might be wrong if
the statistics are out of date, but we are just doing size/cost
estimation, so that's ok.  Since aggregations are often very selective
(i.e. much smaller outputs than inputs), this can make a big
difference further up the tree.
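
As a tiny standalone illustration (this is not optimizer code, just the
arithmetic above written out in C, with the numbers from the example):

    #include <stdio.h>

    int main(void)
    {
        double a_min = 25, a_max = 35;      /* range of column 'a'        */
        double b_min = 75, b_max = 100;     /* range of column 'b'        */
        double input_rows = 579166;         /* rows feeding the GROUP BY  */

        /* cross product of the per-column ranges bounds the group count */
        double bound = (a_max - a_min) * (b_max - b_min);

        /* and the output can never have more rows than the input        */
        double estimate = bound < input_rows ? bound : input_rows;

        printf("estimated groups: %.0f\n", estimate);   /* prints 250 */
        return 0;
    }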

Gory details are as follows:

Table    = lineitem
+------------------------+----------------------------------+-------+
|              Field     |              Type                | Length|
+------------------------+----------------------------------+-------+
| l_orderkey             | int4 not null                    |     4 |
| l_partkey              | int4 not null                    |     4 |
| l_suppkey              | int4 not null                    |     4 |
| l_linenumber           | int4 not null                    |     4 |
| l_quantity             | float4 not null                  |     4 |
| l_extendedprice        | float4 not null                  |     4 |
| l_discount             | float4 not null                  |     4 |
| l_tax                  | float4 not null                  |     4 |
| l_returnflag           | char() not null                  |     1 |
| l_linestatus           | char() not null                  |     1 |
| l_shipdate             | date                             |     4 |
| l_commitdate           | date                             |     4 |
| l_receiptdate          | date                             |     4 |
| l_shipinstruct         | char() not null                  |    25 |
| l_shipmode             | char() not null                  |    10 |
| l_comment              | char() not null                  |    44 |
+------------------------+----------------------------------+-------+
Index:    lineitem_index_

--
-- Query 1
--
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, 
sum(l_extendedprice) as sum_base_price, 
sum(l_extendedprice*(1-l_discount)) as sum_disc_price, 
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, 
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, 
avg(l_discount) as avg_disc, count(*) as count_order 
from lineitem 
where l_shipdate <= '1998-09-02'
group by l_returnflag, l_linestatus 
order by l_returnflag, l_linestatus;

For our favorite query, the current code (with my selectivity fixes of
yesterday, but they aren't crucial to this discussion) produces the
original plan:

Sort  (cost=35623.88 size=0 width=0)
  ->  Aggregate  (cost=35623.88 size=0 width=0)
        ->  Group  (cost=35623.88 size=0 width=0)
              ->  Sort  (cost=35623.88 size=0 width=0)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

[Problem 1 - no cost estimates are done for sort nodes, and costs are
not passed up the query tree for Sorts, Groups, or Aggregates]

Now, let's say we have it actually add a cost for the sorts and pass
the size and width up the tree as we go:

Sort  (cost=713817.69 size=579166 width=60)
  ->  Aggregate  (cost=374720.78 size=579166 width=60)
        ->  Group  (cost=374720.78 size=579166 width=60)
              ->  Sort  (cost=374720.78 size=579166 width=60)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

We do this by adding several bits of code.  For sort nodes near line
470 of planmain.c, something like:
    sortplan->plan.cost = subplan->cost +
        cost_sort(XXX, subplan->plan_size, subplan->plan_width, false);

    /* sorting doesn't change size or width */
    sortplan->plan.plan_size = subplan->plan_size;
    sortplan->plan.plan_width = subplan->plan_width;

and for group nodes near line 480 of planmain.c, something like:
    /* pretty easy, right?  the size and width will not change    */
    /* from the sort node, so we can just carry those along       */
    /* and group has (practically) no cost, so just keep that too */
    grpplan->plan.cost = sortplan->plan.cost;
    grpplan->plan.plan_size = sortplan->plan.plan_size;
    grpplan->plan.plan_width = sortplan->plan.plan_width;

and something similar for the other sort nodes near line 410 of
planner.c, plus a similar pass-along of the cost values from the group
node through the aggregation node near line 260 of planner.c.
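
A sketch of that last pass-along (the names aggplan and grpplan here
are only illustrative, not the actual variable names in planner.c)
would be something like:

    /* at this stage we have no better estimate for the aggregate's */
    /* output, so just carry the child's values through unchanged   */
    aggplan->plan.cost = grpplan->plan.cost;
    aggplan->plan.plan_size = grpplan->plan.plan_size;
    aggplan->plan.plan_width = grpplan->plan.plan_width;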

And, of course, we'd have to check hash nodes, join nodes and all the
others to do the right thing, but I only looked at the ones necessary
for my query.

Adding all those parts, we now have the plan:

Sort  (cost=713817.69 size=579166 width=60)
  ->  Aggregate  (cost=374720.78 size=579166 width=60)
        ->  Group  (cost=374720.78 size=579166 width=60)
              ->  Sort  (cost=374720.78 size=579166 width=60)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

which actually includes the cost of the sorts and passes the size and
width up the tree.

[End discussion of Problem 1 - no sort costs or cost-passing]

[Craziness starts here]

[Problem 2 - the Aggregate node is way over-estimated by assuming the
output is as large as the input, but how can you do better?           ]

It turns out the above plan estimate isn't very satisfactory, because
the Aggregate actually reduces the size significantly, so the estimate
for the final Sort is way higher than it needs to be.  But how can we
estimate what happens at an Aggregation node?

For general types in the GROUP BY, there is probably not much we can
do, but if we can somehow "know" that the GROUP BY columns have a
limited dynamic range - e.g. the char(1) fields in my lineitem, or
integer fields with small ranges - then we should be able to do some
"back of the envelope" estimation and get a better idea of costs
higher up in the plan.  A quick look shows that this would be
applicable in 5 of the 17 TPC-D queries that I am looking at - not
huge applicability, but not single-query specific either.  What the
"normal" or "expected" types and columns for GROUP BYs out there are,
I don't know.

Anyway, so we do the estimation described above and get to the much
more accurate plan:

Sort  (cost=374734.87 size=153 width=60)
  ->  Aggregate  (cost=374720.78 size=153 width=60)
        ->  Group  (cost=374720.78 size=579166 width=60)
              ->  Sort  (cost=374720.78 size=579166 width=60)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

This works because pg_statistic knows that l_returnflag has at most 17
possible values (min 'A' to max 'R') and l_linestatus has at most 9
possible values (min 'F' to max 'O').  In fact, l_returnflag and
l_linestatus are categorical and have only 3 possible values each, but
the estimate of 153 (17 * 9) is much better than the 579166 that the
node must otherwise assume, and much closer to the actual result,
which has only 4 rows.

I get the above plan by further modification after line 260 of
planner.c.  I am not attaching the patch because it is ugly in spots
and I just did a "proof of concept"; a full implementation of this
idea would require much more thought about generality.

Basically, I go through the target list of the Agg node and identify
all the Var expressions.  I then figure out which columns in the
original relations these refer to, and look up the range of the
attributes in each of these columns (using the min and max values for
the attributes obtained via gethilokey() as discussed in my last post,
and doing the "strings map to the ASCII value of their first letter"
conversion that Tom suggested).  I multiply all these ranges together
(i.e. the cross product of all possible combinations that the GROUP BY
on all n columns could produce - call this x).  The size of the Agg
result is then the smaller of x and the input size (the output can't
be bigger than the input, and if there are only x unique combinations
of the GROUP BY attributes, then there will never be more than x rows
in the output of the GROUP BY).

I do this very conservatively: if I can't get statistics for one of
the columns, then I forget it and go with the old estimate (or one
could go with an estimate of 0, if one were feeling optimistic; it is
not clear what a "good" default estimate is here - maybe the 1/3 that
is used for selectivities would be as good as anything else).
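
For what it's worth, here is a rough standalone sketch of that
estimation step.  The ColRange struct and the function name are made
up purely for illustration; the real code walks the Agg node's target
list and calls gethilokey() for each column, as described above.

    #include <stdio.h>

    typedef struct
    {
        int    has_stats;   /* 0 if no pg_statistic entry for this column */
        double min_val;     /* low key, already mapped to a number        */
        double max_val;     /* high key, likewise                         */
    } ColRange;

    static double
    estimate_group_size(ColRange *cols, int ncols, double input_rows)
    {
        double product = 1.0;
        int    i;

        for (i = 0; i < ncols; i++)
        {
            /* be conservative: if any column lacks statistics, keep */
            /* the old estimate (the input size) instead of guessing */
            if (!cols[i].has_stats)
                return input_rows;
            product *= (cols[i].max_val - cols[i].min_val);
        }

        /* the result can never exceed the number of input rows */
        return product < input_rows ? product : input_rows;
    }

    int main(void)
    {
        /* l_returnflag 'A'..'R' and l_linestatus 'F'..'O', mapped to */
        /* the ASCII codes of their first letters as described above  */
        ColRange cols[2] = {
            { 1, 65.0, 82.0 },   /* 'A' = 65, 'R' = 82 -> range 17 */
            { 1, 70.0, 79.0 }    /* 'F' = 70, 'O' = 79 -> range 9  */
        };

        /* prints "estimate = 153 rows" for a 579166-row input */
        printf("estimate = %.0f rows\n",
               estimate_group_size(cols, 2, 579166.0));
        return 0;
    }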

[End discussion of Problem 2 - aggregate/group estimation]

As with my previous posts, this is most likely not a general solution;
it's just an idea that works (very well) for the query I am looking at
and has some general applicability.  I am sure that the above ignores
a number of "bigger picture" issues, but it does help the particular
query I care about.

Also note that none of this actually speeds up even my query; it only
makes the optimizer estimate much closer to the actual query cost
(which is what I care about for the work I am doing).

Maybe this will be of help in any future work on the optimizer.  Maybe
it is simply the rantings of a lunatic.

Enjoy.

Erik Riedel
Carnegie Mellon University
www.cs.cmu.edu/~riedel


