Re: Why don't we consider explicit Incremental Sort? - Mailing list pgsql-hackers
| From | Richard Guo |
| --- | --- |
| Subject | Re: Why don't we consider explicit Incremental Sort? |
| Date | |
| Msg-id | CAMbWs49+CXsrgbq0LD1at-5jR=AHHN0YtDy9YvgXAsMfndZe-w@mail.gmail.com |
| In response to | Re: Why don't we consider explicit Incremental Sort? (Tomas Vondra <tomas@vondra.me>) |
| Responses | Re: Why don't we consider explicit Incremental Sort? |
| List | pgsql-hackers |
On Fri, Sep 13, 2024 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
> Sure, an incremental sort can improve things if things go well, no doubt
> about that. But how significant can the improvement be, especially if we
> need to sort everything? In your example it's ~15%, which is nice, and
> maybe the benefits can be much larger if the incremental sort can do
> everything in memory, without flushing to disk.
>
> But what about the risks? What will happen if the groups are not this
> uniformly and independently sized, and stuff like that? Maybe it'll be
> costed well enough, I haven't tried.

I understand the concern and agree that there is a risk of regression if
there is skew in the number of rows per presorted group. There were
actually discussions about this during the work on commit 4a29eabd1;
please see [1].

Let me repeat David's demonstration and rerun his query on the current
master branch to see what happens:

create table ab (a int not null, b int not null);
insert into ab
  select 0, x from generate_series(1, 999000) x
  union all
  select x % 1000 + 1, 0 from generate_series(999001, 1000000) x;
create index on ab (a);
analyze ab;

So this is a table with a very large skew: group 0 has 999000 rows, while
the remaining groups, 1 through 1000, have just one row each.

-- on master
explain (analyze, timing off) select * from ab order by a,b;
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=2767.38..109344.55 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 33  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
   Pre-sorted Groups: 1  Sort Method: external merge  Average Disk: 17680kB  Peak Disk: 17680kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.829 ms
 Execution Time: 1028.892 ms
(8 rows)

-- manually disable incremental sort
set enable_incremental_sort to off;
explain (analyze, timing off) select * from ab order by a,b;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------
 Sort  (cost=127757.34..130257.34 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Sort Method: external merge  Disk: 17704kB
   ->  Seq Scan on ab  (cost=0.00..14425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.814 ms
 Execution Time: 765.198 ms
(6 rows)

Look, the regression still happens on current master: the incremental
sort plan takes ~1029 ms versus ~765 ms for the plain sort.

This is a question I've often pondered: each time we introduce a new
optimization, there are always potential cases where it could lead to
regressions. What should we do about this? I largely agree with David's
position, as stated in the commit message of 4a29eabd1:

"
That, of course, as with teaching the planner any new tricks, means an
increased likelihood that the planner will perform an incremental sort
when it's not the best method. Our standard escape hatch for these
cases is an enable_* GUC.
"

I know that ideally we should not rely on these enable_* GUCs, but in
reality it seems that sometimes we do not have a better solution.

[1] https://postgr.es/m/CAApHDvr1Sm+g9hbv4REOVuvQKeDWXcKUAhmbK5K+dfun0s9CvA@mail.gmail.com

Thanks
Richard
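As a quick sanity check on the claimed skew, a per-group row count on the same ab table makes the distribution explicit. This query is an added illustration, not part of David's original demonstration:

select a, count(*) as n
from ab
group by a
order by n desc
limit 3;

-- expected shape: a = 0 with 999000 rows, followed by single-row groups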
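On the enable_* escape hatch: if disabling incremental sort is only needed for one problematic query, the setting can be scoped to a single transaction with the standard SET LOCAL mechanism rather than left off for the whole session. A minimal sketch, assuming the same table and query as above:

begin;
set local enable_incremental_sort to off;  -- reverts automatically at commit or rollback
explain (analyze, timing off) select * from ab order by a, b;
commit;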