Re: 回复: An implementation of multi-key sort - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: 回复: An implementation of multi-key sort |
Date | |
Msg-id | bc6bece9-9bd6-444e-a933-4fe82a00fb10@enterprisedb.com Whole thread Raw |
In response to | Re: 回复: An implementation of multi-key sort (Konstantin Knizhnik <knizhnik@garret.ru>) |
List | pgsql-hackers |
On 7/7/24 08:32, Konstantin Knizhnik wrote: > > On 04/07/2024 3:45 pm, Yao Wang wrote: >> Generally, the benefit of mksort is mainly from duplicated values and >> sort >> keys: the more duplicated values and sort keys are, the bigger benefit it >> gets. > ... >> 1. Use distinct stats info of table to enable mksort >> >> It's kind of heuristics: in optimizer, check >> Form_pg_statistic->stadistinct >> of a table via pg_statistics. Enable mksort only when it is less than a >> threshold. >> >> The hacked code works, which need to modify a couple of interfaces of >> optimizer. In addition, a complete solution should consider types and >> distinct values of all columns, which might be too complex, and the >> benefit >> seems not so big. > > > If mksort really provides advantage only when there are a lot of > duplicates (for prefix keys?) and of small fraction of duplicates there > is even some (small) regression > then IMHO taking in account in planner information about estimated > number of distinct values seems to be really important. What was a > problem with accessing this statistics and why it requires modification > of optimizer interfaces? There is `get_variable_numdistinct` function > which is defined and used only in selfuncs.c > Yeah, I've been wondering about that too. But I'm also a bit unsure if using this known-unreliable statistics (especially with filters and multiple columns) would actually fix the regressions. > Information about values distribution seems to be quite useful for > choosing optimal sort algorithm. Not only for multi-key sort > optimization. For example if we know min.max value of sort key and it is > small, we can use O(N) algorithm for sorting. Also it can help to > estimate when TOP-N search is preferable. > This assumes the information is accurate / reliable, and I'm far from sure about that. > Right now Posgres creates special path for incremental sort. I am not > sure if we also need to be separate path for mk-sort. > But IMHO if we need to change some optimizer interfaces to be able to > take in account statistic and choose preferred sort algorithm at > planning time, then it should be done. > If mksort can increase sort more than two times (for large number of > duplicates), it will be nice to take it in account when choosing optimal > plan. > I did commit the incremental sort patch, and TBH I'm not convinced I'd do that again. It's a great optimization when it works (and it seems to work in plenty of cases), but we've also had a number of reports about significant regressions, where the incremental sort costing is quite off. Granted, it's often about cases where we already had issues and incremental sort just "exacerbates" that (say, with LIMIT queries), but that's kinda the point I'm trying to make - stats are inherently incomplete / simplified, and some plans are more sensitive to that. Which is why I'm wondering if we might do the decision based on some information collected at runtime. > Also in this case we do not need extra GUC for explicit enabling of > mksort. There are too many parameters for optimizer and adding one more > will make tuning more complex. So I prefer that decision is take buy > optimizer itself based on the available information, especially if > criteria seems to be obvious. > The GUC is very useful for testing, so let's keep it for now. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: