Re: sqlsmith crash incremental sort - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: sqlsmith crash incremental sort
Date:
Msg-id: 20200417005420.ye342t2nljg4nuhk@development
In response to: Re: sqlsmith crash incremental sort (James Coleman <jtc331@gmail.com>)
Responses: Re: sqlsmith crash incremental sort
           Re: sqlsmith crash incremental sort
List: pgsql-hackers
On Wed, Apr 15, 2020 at 11:26:12AM -0400, James Coleman wrote:
>On Wed, Apr 15, 2020 at 10:47 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> ...
>>
>> Yeah. And I'm not even sure having that information would allow good
>> estimates e.g. for UNIONs of multiple relations etc.
>>
>> >> Another option is to use something as simple as checking for Vars with
>> >> varno==0 in cost_incremental_sort() and ignoring them somehow. We could
>> >> simply use some arbitrary estimate - by assuming the rows are unique or
>> >> something like that. Yes, I agree it's pretty ugly and I'd much rather
>> >> find a way to generate something sensible, but I'm not even sure we can
>> >> generate good estimate when doing UNION of data from different relations
>> >> and so on. The attached (ugly) patch does this.
>> >
>> >...therefore I think this is worth proceeding with.
>> >
>>
>> OK, then the question is what estimate to use in this case. Should we
>> assume 1 group or uniqueness? I'd assume a single group produces costs
>> slightly above regular sort, right?
>
>Originally I'd intuitively leaned towards assuming they were unique.
>But that would be the best case for memory/disk space usage, for
>example, and the costing for incremental sort is always (at least
>mildly) higher than regular sort if the number of groups is 1. That
>also guarantees the startup cost is higher than regular sort also.
>
>So I think using a number of groups estimate of 1, we just wouldn't
>choose an incremental sort ever in this case.
>
>Maybe that's the right choice? It'd certainly be the conservative
>choice. What are your thoughts on the trade-offs there?
>

I think we have essentially three options:

1) assuming there's just a single group

This should produce a cost estimate higher than plain sort, disabling
incremental sort. I'd say this is the "worst case" assumption. I think
this might be overly pessimistic, though.

2) assuming each row is a separate group

If (1) is the worst-case scenario, then this is probably the best case,
particularly when the query is sensitive to startup cost.

3) something in between

If (1) and (2) are worst/best-case scenarios, maybe we should pick
something in between. We have DEFAULT_NUM_DISTINCT (200), which
essentially says "we don't know what the number of groups is", so maybe
we should use that. Another option would be nrows/10, which is the cap
we use in estimate_num_groups without extended stats.

I was leaning towards (1), as the "worst case" choice seems natural to
prevent possible regressions. But consider this:

    create table small (i int);
    create table large (i int);

    insert into small select mod(i,10) from generate_series(1,100) s(i);
    insert into large select mod(i,10) from generate_series(1,100000) s(i);

    analyze small;
    analyze large;

    explain select i from large union select i from small;

                                      QUERY PLAN
    -------------------------------------------------------------------------------
     Unique  (cost=11260.35..11760.85 rows=100100 width=4)
       ->  Sort  (cost=11260.35..11510.60 rows=100100 width=4)
             Sort Key: large.i
             ->  Append  (cost=0.00..2946.50 rows=100100 width=4)
                   ->  Seq Scan on large  (cost=0.00..1443.00 rows=100000 width=4)
                   ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=4)
    (6 rows)

The estimate of the number of groups is clearly bogus - we know there
are only 10 groups in each relation, but here we end up with 100100. So
perhaps we should do (2) to keep the behavior consistent?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
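[Editor's note: option (3) above amounts to falling back to DEFAULT_NUM_DISTINCT
when a presorted key contains a Var the planner cannot resolve. The sketch below
is illustrative only - it is not the patch attached in this thread; the helper
name estimate_presorted_groups is hypothetical, only top-level Vars are checked
for brevity, and the PostgreSQL 13-era four-argument estimate_num_groups()
signature is assumed.]

    /*
     * Illustrative sketch of option (3), not the actual patch:
     * if any presorted key is an unresolvable Var (varno == 0, e.g.
     * produced by a set operation), skip estimate_num_groups() and
     * fall back to DEFAULT_NUM_DISTINCT, capped by the input rows.
     */
    static double
    estimate_presorted_groups(PlannerInfo *root, List *presortedExprs,
                              double input_tuples)
    {
        ListCell   *lc;

        foreach(lc, presortedExprs)
        {
            Node       *expr = (Node *) lfirst(lc);

            /* unresolvable Var - use the arbitrary fallback estimate */
            if (IsA(expr, Var) && ((Var *) expr)->varno == 0)
                return Min(DEFAULT_NUM_DISTINCT, input_tuples);
        }

        /* otherwise estimate the number of presorted groups as usual */
        return estimate_num_groups(root, presortedExprs, input_tuples, NULL);
    }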