Re: cost_sort() improvements - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: cost_sort() improvements
Date
Msg-id 2d09f0c0-5ad2-6f26-bb15-4d3231cc8e43@2ndquadrant.com
Whole thread Raw
In response to Re: cost_sort() improvements  (Teodor Sigaev <teodor@sigaev.ru>)
Responses Re: cost_sort() improvements
List pgsql-hackers
On 07/12/2018 04:31 PM, Teodor Sigaev wrote:
>> At least [1] and [2] hit into to that issues and have an
>> objections/questions about correctness of cost sort estimation.
>> Suggested patch tries to improve current estimation and solve that
>> issues.
> 
> Sorry for long delay but issue was even more complicated than I thought.
> When I tried to add cost_sort to GROUP BY patch I found it isn't work
> well as I hoped :(. The root of problem is suggested cost_sort
> improvement doesn't take into account uniqueness of first column and
> it's cost always the same. I tried to make experiments with all the same
> and all unique column and found that execution time could be
> significantly differ (up to 3 times on 1e7 randomly generated integer
> values). And I went to read book and papers.
> 
> So, I suggest new algorithm based on ideas in [3], [4]. In term of that
> papers, let I use designations  from previous my email and Xi - number
> of tuples with key Ki, then estimation is:
> log(N! / (X1! * X2! * ..))  ~  sum(Xi * log(N/Xi))
> In our case all Xi are the same because now we don't have an estimation of
> group sizes, we have only estimation of number of groups. In this case,
> formula becomes: N * log(NumberOfGroups). Next, to support correct
> estimation
> of multicolumn sort we need separately compute each column, so, let k is a
> column number, Gk - number of groups  defined by k columns:
> N * sum( FCk * log(Gk) )
> and FCk is a sum(Cj) over k columns. Gk+1 is defined as
> estimate_num_groups(NGk) - i.e. it's recursive, that's means that
> comparison of k-th columns includes all comparison j-th columns < k.
> 
> Next, I found that this formula gives significant underestimate with N <
> 1e4 and using [5] (See Chapter 8.2 and Theorem 4.1) found that we can
> use N + N * log(N) formula which actually adds a multiplier in simple
> case but it's unclear for me how to add that multimplier to multicolumn
> formula, so I added just a separate term proportional to N.
> 

I'm sorry, but I'm getting lost in the notation and how you came up with
those formulas - I don't know where to look in the papers, books etc.

Could you perhaps summarize the reasoning or at least point me to the
relevant parts of the sources, so that I know which parts to focus on?

> As demonstration, I add result of some test, *sh and *plt are scripts to
> reproduce results. Note, all charts are normalized because  computed
> cost as not a milliseconds. p.pl is a parser for JSON format of explain
> analyze.
> 
> N test - sort unique values with different number of tuples
> NN test - same as previous one but sort of single value (all the same
> values)
> U test - fixed number of total values (1e7) but differ number of unique
> values
> 
> Hope, new estimation much more close to real sort timing. New estimation
> function gives close result to old estimation function on trivial
> examples, but ~10% more expensive, and three of regression tests aren't
> passed, will look closer later. Patch doesn't include  regression test
> changes.

Interesting. I wonder if there's a way to address the difference at the
lower end?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: cost_sort() improvements
Next
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] plpgsql - additional extra checks