Re: 回复: An implementation of multi-key sort - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: 回复: An implementation of multi-key sort
Date
Msg-id a3093ea7-ef66-4ae5-a041-273b73b05e80@garret.ru
Whole thread Raw
In response to Re: 回复: An implementation of multi-key sort  (Yao Wang <yao-yw.wang@broadcom.com>)
Responses Re: 回复: An implementation of multi-key sort
Re: 回复: An implementation of multi-key sort
List pgsql-hackers
On 04/07/2024 3:45 pm, Yao Wang wrote:
> Generally, the benefit of mksort is mainly from duplicated values and sort
> keys: the more duplicated values and sort keys are, the bigger benefit it
> gets.
...
> 1. Use distinct stats info of table to enable mksort
>
> It's kind of heuristics: in optimizer, check Form_pg_statistic->stadistinct
> of a table via pg_statistics. Enable mksort only when it is less than a
> threshold.
>
> The hacked code works, which need to modify a couple of interfaces of
> optimizer. In addition, a complete solution should consider types and
> distinct values of all columns, which might be too complex, and the benefit
> seems not so big.


If mksort really provides advantage only when there are a lot of 
duplicates (for prefix keys?) and of small fraction of duplicates there 
is even some (small) regression
then IMHO taking in account in planner information about estimated 
number of distinct values seems to be really important. What was a 
problem with accessing this statistics and why it requires modification 
of optimizer interfaces? There is `get_variable_numdistinct` function 
which is defined and used only in selfuncs.c

Information about values distribution seems to be quite useful for  
choosing optimal sort algorithm. Not only for multi-key sort 
optimization. For example if we know min.max value of sort key and it is 
small, we can use O(N) algorithm for sorting. Also it can help to 
estimate when TOP-N search is preferable.

Right now Posgres creates special path for incremental sort. I am not 
sure if we also need to be separate path for mk-sort.
But IMHO if we need to change some optimizer interfaces to be able to 
take in account statistic and choose preferred sort algorithm at 
planning time, then it should be done.
If mksort can increase sort more than two times (for large number of 
duplicates), it will be nice to take it in account when choosing optimal 
plan.

Also in this case we do not need extra GUC for explicit enabling of 
mksort. There are too many parameters for optimizer and adding one more 
will make tuning more complex. So I prefer that decision is take buy 
optimizer itself based on the available information, especially if 
criteria seems to be obvious.


Best regards,
Konstantin




pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: tests fail on windows with default git settings
Next
From: Andres Freund
Date:
Subject: dropping privileges on windows vs privileged accounts