Re: cost_sort() improvements - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: cost_sort() improvements
Date:
Msg-id: 2d09f0c0-5ad2-6f26-bb15-4d3231cc8e43@2ndquadrant.com
In response to: Re: cost_sort() improvements (Teodor Sigaev <teodor@sigaev.ru>)
Responses: Re: cost_sort() improvements
List: pgsql-hackers
On 07/12/2018 04:31 PM, Teodor Sigaev wrote:
>> At least [1] and [2] ran into those issues and raised
>> objections/questions about the correctness of the sort cost
>> estimation. The suggested patch tries to improve the current
>> estimation and resolve those issues.
>
> Sorry for the long delay, but the issue was even more complicated than
> I thought. When I tried to add cost_sort to the GROUP BY patch, I found
> it doesn't work as well as I had hoped :(. The root of the problem is
> that the suggested cost_sort improvement doesn't take the uniqueness of
> the first column into account, so its cost is always the same. I
> experimented with an all-identical column versus an all-unique column
> and found that execution time can differ significantly (up to 3 times
> on 1e7 randomly generated integer values). So I went back to the books
> and papers.
>
> So, I suggest a new algorithm based on ideas in [3], [4]. In the terms
> of those papers (reusing the notation from my previous email), let Xi
> be the number of tuples with key Ki; then the estimate is:
>
>     log(N! / (X1! * X2! * ...)) ~ sum(Xi * log(N / Xi))
>
> In our case all Xi are the same, because we currently have no estimate
> of group sizes, only an estimate of the number of groups. With all Xi
> equal to N / NumberOfGroups, the sum collapses to
> NumberOfGroups * (N / NumberOfGroups) * log(NumberOfGroups), so the
> formula becomes: N * log(NumberOfGroups). Next, to correctly estimate a
> multi-column sort we need to account for each column separately. Let k
> be a column number and Gk the number of groups defined by the first k
> columns:
>
>     N * sum( FCk * log(Gk) )
>
> where FCk is sum(Cj) over the first k columns. Gk+1 is defined as
> estimate_num_groups(NGk), i.e. it is recursive, which means that a
> comparison on the k-th column includes comparisons on all columns j < k.
>
> Next, I found that this formula gives a significant underestimate for
> N < 1e4, and using [5] (see Chapter 8.2 and Theorem 4.1) found that we
> can use the formula N + N * log(N), which effectively adds a multiplier
> in the simple case. It's unclear to me how to add that multiplier to
> the multi-column formula, so I just added a separate term proportional
> to N.

I'm sorry, but I'm getting lost in the notation and how you came up with
those formulas - I don't know where to look in the papers, books, etc.
Could you perhaps summarize the reasoning, or at least point me to the
relevant parts of the sources, so that I know which parts to focus on?

> As a demonstration, I've attached the results of some tests; the *.sh
> and *.plt files are scripts to reproduce them. Note that all charts are
> normalized, because the computed cost is not in milliseconds. p.pl is a
> parser for the JSON format of EXPLAIN ANALYZE.
>
> N test - sort of unique values with a varying number of tuples
> NN test - same as the previous one, but sorting a single value (all
> values identical)
> U test - fixed number of total values (1e7) but a varying number of
> unique values
>
> Hopefully the new estimate is much closer to the real sort timing. The
> new estimation function gives results close to the old estimation
> function on trivial examples, but is ~10% more expensive, and three of
> the regression tests don't pass; I'll look closer later. The patch
> doesn't include regression test changes.

Interesting. I wonder if there's a way to address the difference at the
lower end?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
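
[Editor's note: for readers trying to follow the proposed model, here is a
minimal, self-contained C sketch of the estimate as described in the quoted
text above: comparisons ~ N + N * sum_k( FCk * log(Gk) ). The function name,
parameter layout, and comments are hypothetical illustrations of the formula
in this email, not code taken from the actual patch.]

#include <math.h>

/*
 * Hypothetical sketch (not the actual patch) of the proposed estimate:
 *
 *     comparisons ~ N + N * sum_k( FC_k * log(G_k) )
 *
 * where N is the number of input tuples, G_k is the estimated number of
 * groups defined by the first k sort columns, FC_k is the cumulative
 * per-tuple comparison cost of columns 1..k (sum of C_j for j <= k), and
 * the leading N term is the small-N correction mentioned in the last
 * quoted paragraph.
 */
static double
sort_comparison_estimate(double ntuples,
                         const double *col_cmp_cost, /* C_j per column */
                         const double *ngroups,      /* G_k per prefix */
                         int ncols)
{
    double total = ntuples;     /* separate term proportional to N */
    double fck = 0.0;           /* FC_k: running sum of C_j */
    int    k;

    for (k = 0; k < ncols; k++)
    {
        fck += col_cmp_cost[k];
        /* a prefix with a single group contributes nothing: log(1) = 0 */
        if (ngroups[k] > 1.0)
            total += ntuples * fck * log(ngroups[k]);
    }

    return total;
}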
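
[Editor's note: as a quick sanity check of why this model distinguishes the
N test (all values unique) from the NN test (all values identical), a
hypothetical single-column call might look like the following. The per-column
comparison cost of 1.0 is an assumed placeholder.]

#include <stdio.h>

int
main(void)
{
    double cost = 1.0;      /* assumed per-column comparison cost C_1 */
    double n = 1e7;
    double g_unique = n;    /* N test: all values unique, so G_1 = N   */
    double g_same = 1.0;    /* NN test: all identical, so G_1 = 1      */

    printf("all unique: %g\n",
           sort_comparison_estimate(n, &cost, &g_unique, 1));
    printf("all same:   %g\n",
           sort_comparison_estimate(n, &cost, &g_same, 1));
    return 0;
}

With all values identical, log(G_1) = 0 and only the term proportional to N
survives, which matches the observation above that sorting identical values
runs far faster than sorting unique ones.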