Re: Why don't we consider explicit Incremental Sort? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Why don't we consider explicit Incremental Sort?
Date
Msg-id 9aea8344-a33d-4e78-bf31-303e1a95d7df@vondra.me
Whole thread Raw
In response to Re: Why don't we consider explicit Incremental Sort?  (Richard Guo <guofenglinux@gmail.com>)
Responses Re: Why don't we consider explicit Incremental Sort?
Re: Why don't we consider explicit Incremental Sort?
List pgsql-hackers
On 9/23/24 05:03, Richard Guo wrote:
> On Sun, Sep 22, 2024 at 1:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> Just looking at the commit message:
>>
>>> The rationale is based on the assumption that incremental sort is
>>> always faster than full sort when there are presorted keys, a premise
>>> that has been applied in various parts of the code.  This assumption
>>> does not always hold, particularly in cases with a large skew in the
>>> number of rows within the presorted groups.
>>
>> My understanding is that the worst case for incremental sort is the
>> same as sort, i.e. only 1 presorted group, which is the same effort to
>> sort. Is there something I'm missing?
> 
> I was thinking that when there’s a large skew in the number of rows
> per pre-sorted group, incremental sort might underperform full sort.
> As mentioned in the comments for cost_incremental_sort, it has to
> detect the sort groups, plus it needs to perform tuplesort_reset after
> each group.  However, I doubt these factors would have a substantial
> impact on the performance of incremental sort.  So maybe in the worst
> skewed groups case, incremental sort may still perform similarly to
> full sort.
> 
> Another less obvious reason is that in cases of skewed groups, we may
> significantly underestimate the cost of incremental sort.  This could
> result in choosing a more expensive subpath under the sort.  Such as
> the example in [1], we end up with IndexScan->IncrementalSort rather
> than SeqScan->FullSort, while the former plan is more expensive to
> execute.  However, this point does not affect this patch, because for
> a mergejoin, we only consider outerrel's cheapest-total-cost when we
> assume that an explicit sort atop is needed, i.e., we do not have a
> chance to select which subpath to use in this case.
> 
> [1] https://postgr.es/m/CAMbWs49+CXsrgbq0LD1at-5jR=AHHN0YtDy9YvgXAsMfndZe-w@mail.gmail.com
> 

You don't need any skew. Consider this perfectly uniform dataset:

create table t (a int, b int);
insert into t select 100000 * random(), 100 * random()
  from generate_series(1,1000000) s(i);
create index on t (a);
vacuum analyze;
checkpoint;

explain analyze select * from t order by a, b;

                            QUERY PLAN
-----------------------------------------------------------------
 Sort  (cost=127757.34..130257.34 rows=1000000 width=8)
       (actual time=186.288..275.813 rows=1000000 loops=1)
   Sort Key: a, b
   Sort Method: external merge  Disk: 17704kB
   ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=8)
                (actual time=0.005..35.989 rows=1000000 loops=1)
 Planning Time: 0.075 ms
 Execution Time: 301.143 ms
(6 rows)


set enable_incremental_sort = on;
explain analyze select * from t order by a, b;

                            QUERY PLAN
-----------------------------------------------------------------
 Incremental Sort  (cost=1.03..68908.13 rows=1000000 width=8)
            (actual time=0.102..497.143 rows=1000000 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 27039  Sort Method: quicksort
     Average Memory: 25kB  Peak Memory: 25kB
   ->  Index Scan using t_a_idx on t
                   (cost=0.42..37244.25 rows=1000000 width=8)
            (actual time=0.050..376.403 rows=1000000 loops=1)
 Planning Time: 0.057 ms
 Execution Time: 519.301 ms
(7 rows)

Sure, the table is very small, but the example is not crazy. In fact,
this is the "nicest" example for estimation - it's perfectly random, no
correlations, no skew.


regards

-- 
Tomas Vondra



pgsql-hackers by date:

Previous
From: Andrei Lepikhov
Date:
Subject: Re: Incremental Sort Cost Estimation Instability
Next
From: Nathan Bossart
Date:
Subject: Re: miscellaneous pg_upgrade cleanup