Re: sqlsmith crash incremental sort - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: sqlsmith crash incremental sort
Date
Msg-id 20200417005420.ye342t2nljg4nuhk@development
Whole thread Raw
In response to Re: sqlsmith crash incremental sort  (James Coleman <jtc331@gmail.com>)
Responses Re: sqlsmith crash incremental sort  (James Coleman <jtc331@gmail.com>)
Re: sqlsmith crash incremental sort  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Apr 15, 2020 at 11:26:12AM -0400, James Coleman wrote:
>On Wed, Apr 15, 2020 at 10:47 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> ...
>>
>> Yeah. And I'm not even sure having that information would allow good
>> estimates e.g. for UNIONs of multiple relations etc.
>>
>> >> Another option is to use something as simple as checking for Vars with
>> >> varno==0 in cost_incremental_sort() and ignoring them somehow. We could
>> >> simply use some arbitrary estimate - by assuming the rows are unique or
>> >> something like that. Yes, I agree it's pretty ugly and I'd much rather
>> >> find a way to generate something sensible, but I'm not even sure we can
>> >> generate good estimate when doing UNION of data from different relations
>> >> and so on. The attached (ugly) patch does this.
>> >
>> >...therefore I think this is worth proceeding with.
>> >
>>
>> OK, then the question is what estimate to use in this case. Should we
>> assume 1 group or uniqueness? I'd assume a single group produces costs
>> slightly above regular sort, right?
>
>Originally I'd intuitively leaned towards assuming they were unique.
>But that would be the best case for memory/disk space usage, for
>example, and the costing for incremental sort is always (at least
>mildly) higher than regular sort if the number of groups is 1. That
>also guarantees the startup cost is higher than regular sort also.
>
>So I think using a number of groups estimate of 1, we just wouldn't
>choose an incremental sort ever in this case.
>
>Maybe that's the right choice? It'd certainly be the conservative
>choice. What are your thoughts on the trade-offs there?
>

I think we have essentially three options:

1) assuming there's just a single group

This should produce cost estimate higher than plain sort, disabling
incremental sort. I'd say this is "worst case" assumption. I think this
might be overly pessimistic, though.

2) assuming each row is a separate group

If (1) is worst case scenario, then this is probably the best case,
particularly when the query is sensitive to startup cost.


3) something in between

If (1) and (2) are worst/best-case scenarios, maybe we should pick
something in between. We have DEFAULT_NUM_DISTINCT (200) which
essentially says "we don't know what the number of groups is" so maybe
we should use that. Another option would be nrows/10, which is the cap
we use in estimate_num_groups without extended stats.


I was leaning towards (1) as "worst case" choice seems natural to
prevent possible regressions. But consider this:

   create table small (a int);
   create table large (a int);
   
   insert into small select mod(i,10) from generate_series(1,100) s(i);
   insert into large select mod(i,10) from generate_series(1,100000) s(i);

   analyze small;
   analyze large;

   explain select i from large union select i from small;

                                     QUERY PLAN
   -------------------------------------------------------------------------------
    Unique  (cost=11260.35..11760.85 rows=100100 width=4)
      ->  Sort  (cost=11260.35..11510.60 rows=100100 width=4)
            Sort Key: large.i
            ->  Append  (cost=0.00..2946.50 rows=100100 width=4)
                  ->  Seq Scan on large  (cost=0.00..1443.00 rows=100000 width=4)
                  ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=4)
   (6 rows)

The estimate fo number of groups is clearly bogus - we know there are
only 10 groups in each relation, but here we end up with 100100.

So perhaps we should do (2) to keep the behavior consistent?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Do we need to handle orphaned prepared transactions in theserver?
Next
From: James Coleman
Date:
Subject: Re: sqlsmith crash incremental sort