Re: sqlsmith crash incremental sort - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: sqlsmith crash incremental sort
Date
Msg-id 20200422225919.ucs5xnlgnr5ekhod@development
Whole thread Raw
In response to Re: sqlsmith crash incremental sort  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: sqlsmith crash incremental sort
List pgsql-hackers
On Mon, Apr 20, 2020 at 01:47:29AM +0200, Tomas Vondra wrote:
>On Sat, Apr 18, 2020 at 02:23:25PM -0400, James Coleman wrote:
>>On Thu, Apr 16, 2020 at 9:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>
>>>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>> I think we have essentially three options:
>>>> 1) assuming there's just a single group
>>>> 2) assuming each row is a separate group
>>>> 3) something in between
>>>> If (1) and (2) are worst/best-case scenarios, maybe we should pick
>>>> something in between. We have DEFAULT_NUM_DISTINCT (200) which
>>>> essentially says "we don't know what the number of groups is" so maybe
>>>> we should use that.
>>>
>>>I wouldn't recommend picking either the best or worst cases.
>>>
>>>Possibly DEFAULT_NUM_DISTINCT is a sane choice, though it's fair to
>>>wonder if it's quite applicable to the case where we already know
>>>we've grouped by some columns.
>>
>>Do you think defining a new default, say,
>>DEFAULT_NUM_DISTINCT_PRESORTED is preferred then? And choose some
>>value like "1/2 of the normal DEFAULT_NUM_DISTINCT groups" or some
>>such?
>>
>
>If we had a better intuition what a better value is, maybe. But I don't
>think we have that at all, so I'd just use the existing one.
>

I've pushed fix with the DEFAULT_NUM_DISTINCT. The input comes from a
set operation (which is where we call generate_append_tlist), so it's
probably fairly unique, so maybe we should use input_tuples. But it's
not guaranteed, so DEFAULT_NUM_DISTINCT seems reasonably defensive.

One detail I've changed is that instead of matching the expression
directly to a Var, it now calls pull_varnos() to also detect Vars
somewhere deeper. Lookig at examine_variable() it calls find_base_rel
for such case too, but I haven't tried constructing a query triggering
the issue.

One improvement I can think of is handling lists with only some
expressions containing varno 0. We could still call estimate_num_groups
for expressions with varno != 0, and multiply that by the estimate for
the other part (be it DEFAULT_NUM_DISTINCT). This might produce a higher
estimate than just using DEFAULT_NUM_DISTINCT directly, resulting in a
lower incremenal sort cost. But it's not clear to me if this can even
happen - AFAICS either all Vars have varno 0 or none, so I haven't done
this.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Parallel Append can break run-time partition pruning
Next
From: Tom Lane
Date:
Subject: Re: Parallel Append can break run-time partition pruning