longer-term optimizer musings - Mailing list pgsql-hackers
From:           Erik Riedel
Subject:        longer-term optimizer musings
Date:
Msg-id:         Mqy5Us600gNtQmTHt8@andrew.cmu.edu
In response to: Re: [HACKERS] optimizer and type question (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: [HACKERS] longer-term optimizer musings
                Re: [HACKERS] longer-term optimizer musings
List:           pgsql-hackers
[ The changes discussed below are longer-term issues than my two
previous posts, which is why I am not attaching the patch for these
changes.  My hope is that the description of this experiment will serve
as input to whoever works on the optimizer in the future.  This should
be considered a long-term "think about" and "probably ignore as too
specific" optimization.  You have been warned.  I just wanted to
brain-dump this, in the chance that it is useful somehow in the future.
- Erik ]

Context: cost and size estimates inside the optimizer, particularly
Aggregate/Group nodes.

The basic idea is as follows: we look at the min and max values for all
the GROUP BY attributes, and if they have a small dynamic range then we
_know_ the result of the aggregation will have a small number of rows
(bounded by the product of the ranges of these attributes).  This
information is then used to make better cost estimates during
optimization.

For example, if we have two attributes 'a' and 'b' and we know (from
pg_statistic) that all 'a' values are in the range 25 to 35 and all 'b'
values are in the range 75 to 100, then a GROUP BY on columns 'a' and
'b' cannot result in more than 250 [ (100 - 75) * (35 - 25) = 25 * 10 ]
rows, no matter how many rows are input.  We might be wrong if the
statistics are out of date, but we are just doing size/cost estimation,
so that's OK.  Since aggregations are often very selective (i.e. much
smaller outputs than inputs), this can make a big difference further up
the tree.
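In C-ish pseudocode, the bound for that two-column example amounts to
something like this (a sketch only, with made-up variable names, not
code from my patch):

    /* sketch only: GROUP BY row bound for columns 'a' and 'b' */
    estimated_rows = (a_max - a_min) * (b_max - b_min);
    if (estimated_rows > input_rows)
        estimated_rows = input_rows;    /* never more rows than the input */

which caps the example above at 250 rows no matter how large the input
is.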
Gory details are as follows:

Table    = lineitem
+------------------------+----------------------------------+-------+
|          Field         |               Type               | Length|
+------------------------+----------------------------------+-------+
| l_orderkey             | int4 not null                    |     4 |
| l_partkey              | int4 not null                    |     4 |
| l_suppkey              | int4 not null                    |     4 |
| l_linenumber           | int4 not null                    |     4 |
| l_quantity             | float4 not null                  |     4 |
| l_extendedprice        | float4 not null                  |     4 |
| l_discount             | float4 not null                  |     4 |
| l_tax                  | float4 not null                  |     4 |
| l_returnflag           | char() not null                  |     1 |
| l_linestatus           | char() not null                  |     1 |
| l_shipdate             | date                             |     4 |
| l_commitdate           | date                             |     4 |
| l_receiptdate          | date                             |     4 |
| l_shipinstruct         | char() not null                  |    25 |
| l_shipmode             | char() not null                  |    10 |
| l_comment              | char() not null                  |    44 |
+------------------------+----------------------------------+-------+
Index:    lineitem_index_

--
-- Query 1
--
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc, count(*) as count_order
from lineitem
where l_shipdate <= '1998-09-02'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

For our favorite query, the current code (with my selectivity fixes of
yesterday, but they aren't crucial to this discussion) produces the
original plan:

Sort  (cost=35623.88 size=0 width=0)
  ->  Aggregate  (cost=35623.88 size=0 width=0)
        ->  Group  (cost=35623.88 size=0 width=0)
              ->  Sort  (cost=35623.88 size=0 width=0)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

[Problem 1 - no cost estimates are done for sort nodes, and costs are
not passed up the query tree for Sorts, Groups, or Aggregates]

Now, let's say we have it actually add a cost for the sorts and pass
the size and width up the tree as we go:

Sort  (cost=713817.69 size=579166 width=60)
  ->  Aggregate  (cost=374720.78 size=579166 width=60)
        ->  Group  (cost=374720.78 size=579166 width=60)
              ->  Sort  (cost=374720.78 size=579166 width=60)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

We do this by adding several bits of code.  For sort nodes near line
470 of planmain.c, something like:

    sortplan->plan.cost = subplan->cost +
        cost_sort(XXX, subplan->plan_size, subplan->plan_width, false);

    /* sorting doesn't change size or width */
    sortplan->plan.plan_size = subplan->plan_size;
    sortplan->plan.plan_width = subplan->plan_width;

and for group nodes near line 480 of planmain.c, something like:

    /* pretty easy, right?  the size and width will not change */
    /* from the sort node, so we can just carry those along */
    /* and group has (practically) no cost, so just keep that too */
    grpplan->plan.cost = sortplan->plan.cost;
    grpplan->plan.plan_size = sortplan->plan.plan_size;
    grpplan->plan.plan_width = sortplan->plan.plan_width;

and something similar for the other sort nodes near line 410 of
planner.c, and a similar pass-along of the cost values from the group
node through the aggregation node near line 260 of planner.c.  And, of
course, we'd have to check hash nodes, join nodes and all the others
to do the right thing, but I only looked at the ones necessary for my
query.
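For the aggregation-node pass-along just mentioned, the change would be
something like this (a sketch with made-up variable names, not the
exact code from my patch):

    /* hypothetical sketch: carry the group node's estimates through */
    /* the aggregate node unchanged, mirroring the snippets above */
    aggplan->plan.cost = grpplan->plan.cost;
    aggplan->plan.plan_size = grpplan->plan.plan_size;
    aggplan->plan.plan_width = grpplan->plan.plan_width;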
Adding all those parts, we now have the plan:

Sort  (cost=713817.69 size=579166 width=60)
  ->  Aggregate  (cost=374720.78 size=579166 width=60)
        ->  Group  (cost=374720.78 size=579166 width=60)
              ->  Sort  (cost=374720.78 size=579166 width=60)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

which actually includes the cost of the sorts and passes the size and
width up the tree.

[End discussion of Problem 1 - no sort costs or cost-passing]

[Craziness starts here]

[Problem 2 - the Aggregate node is way over-estimated by assuming the
output is as large as the input, but how can you do better?]

It turns out the above plan estimate isn't very satisfactory, because
the Aggregate actually significantly reduces the size, and the
estimate for the final Sort is way higher than it needs to be.  But
how can we estimate what happens at an Aggregation node?

For general types in the GROUP BY, there is probably not much we can
do, but if we can somehow "know" that the GROUP BY columns have a
limited dynamic range - e.g. the char(1) fields in my lineitem, or
integer fields with small ranges - then we should be able to do some
"back of the envelope" estimation and get a better idea of costs
higher up in the plan.  A quick look shows that this would be
applicable in 5 of the 17 queries from TPC-D that I am looking at -
not huge applicability, but not single-query specific either.  What
are the "normal" or "expected" types and columns for all the GROUP BYs
out there?  I don't know.

Anyway, so we do the estimation described above and get to the much
more accurate plan:

Sort  (cost=374734.87 size=153 width=60)
  ->  Aggregate  (cost=374720.78 size=153 width=60)
        ->  Group  (cost=374720.78 size=579166 width=60)
              ->  Sort  (cost=374720.78 size=579166 width=60)
                    ->  Seq Scan on lineitem  (cost=35623.88 size=579166 width=60)

This works because pg_statistic knows that l_returnflag has at most 17
possible values (min 'A' to max 'R') and l_linestatus has at most 9
possible values (min 'F' to max 'O').  In fact, l_returnflag and
l_linestatus are categorical and have only 3 possible values each, but
the estimate of 153 (17 * 9) is much better than the 579166 that the
node must otherwise assume, and much closer to the actual result,
which has only 4 rows.

I get the above plan by further modification after line 260 of
planner.c.  I am not attaching the patch because it is ugly in spots
and I just did a "proof of concept"; a full implementation of this
idea would require much more thinking about generality.

Basically, I go through the target list of the Agg node and identify
all the Var expressions.  I then figure out what columns in the
original relations these refer to.  I then look up the range of the
attributes in each of these columns (using the min and max values for
the attributes obtained via gethilokey() as discussed in my last post,
and doing the "strings map to the ASCII value of their first letter"
conversion that Tom suggested).  I multiply all these ranges together
(i.e. the cross product of all possible combinations that the GROUP BY
on all n columns could produce - call this x).  The size of the agg
result is then the smaller of this value or the input size (it can't
be bigger than the input, and if there are only x unique combinations
of the GROUP BY attributes, then there will never be more than x rows
in the output of the GROUP BY).

I do this very conservatively: if I can't get statistics for one of
the columns, then I forget it and go with the old estimate (or one
could go with an estimate of 0, if one were feeling optimistic - it is
not clear what a "good" default estimate is here; maybe the 1/3 that
is used for selectivities would be as good as anything else).
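Boiled down, the estimation step looks roughly like the function below
(again a sketch for illustration only, not the actual patch; the
function and its arguments are made up, with the ranges assumed to come
from the gethilokey() lookups described above):

    /*
     * Hypothetical sketch of the Agg-node size estimate described above.
     * column_ranges[i] holds (max - min) for the i-th GROUP BY column,
     * taken from the pg_statistic min/max values (with strings mapped to
     * the ASCII value of their first letter), or -1 when no statistics
     * are available for that column.
     */
    static int
    estimate_agg_size(int *column_ranges, int ncols, int input_size)
    {
        int     x = 1;      /* bound on distinct GROUP BY combinations */
        int     i;

        for (i = 0; i < ncols; i++)
        {
            /* no statistics: fall back to the old (input-size) estimate */
            if (column_ranges[i] < 0)
                return input_size;
            x *= column_ranges[i];
        }

        /* the aggregate output can never be larger than its input */
        return (x < input_size) ? x : input_size;
    }

For the query above this gives min(579166, 17 * 9) = 153, which is
where the size=153 in the plan comes from.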
[End discussion of Problem 2 - aggregate/group estimation]

As with my previous posts, this is most likely not a general solution;
it's just an idea that works (very well) for the query I am looking
at, and has some general applicability.  I am sure that the above
ignores a number of "bigger picture" issues, but it does help the
particular query I care about.  Also note that none of this actually
speeds up even my query - it only makes the optimizer estimate much
closer to the actual query cost (which is what I care about for the
work I am doing).

Maybe this will be of help in any future work on the optimizer.  Maybe
it is simply the rantings of a lunatic.  Enjoy.

Erik Riedel
Carnegie Mellon University
www.cs.cmu.edu/~riedel