Home > mailing lists

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser' - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date	November 25, 2024 19:45:48
Msg-id	61d7cb95-9194-42cf-8225-351703de9cb8@iki.fi Whole thread Raw
Responses	Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
List	pgsql-hackers

Tree view

On 24/09/2024 08:08, Andrei Lepikhov wrote:
> On 19/9/2024 09:55, Andrei Lepikhov wrote:
>> This wrong prediction makes things much worse if the query has more 
>> upper query blocks.
>> His question was: Why not consider the grouping column unique in the 
>> upper query block? It could improve estimations.
>> After a thorough investigation, I discovered that in commit  
>> 4767bc8ff2 most of the work was already done for DISTINCT clauses. So, 
>> why not do the same for grouping? A sketch of the patch is attached.
>> As I see it, grouping in this sense works quite similarly to DISTINCT, 
>> and we have no reason to ignore it. After applying the patch, you can 
>> see that prediction has been improved:
>>
>> Hash Right Join  (cost=5.62..162.56 rows=50 width=36)
>>
> A regression test is added into new version.
> The code looks tiny, simple and non-invasive - it will be easy to commit 
> or reject. So I add it to next commitfest.

Looks good at a quick glance.

> @@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
>      }
>  
>      /*
> -     * If there is a unique index or DISTINCT clause for the variable, assume
> -     * it is unique no matter what pg_statistic says; the statistics could be
> -     * out of date, or we might have found a partial unique index that proves
> -     * the var is unique for this query.  However, we'd better still believe
> -     * the null-fraction statistic.
> +     * If there is a unique index, DISTINCT or GROUP-BY clause for the variable,
> +     * assume it is unique no matter what pg_statistic says; the statistics
> +     * could be out of date, or we might have found a partial unique index that
> +     * proves the var is unique for this query.  However, we'd better still
> +     * believe the null-fraction statistic.
>       */
>      if (vardata->isunique)
>          stadistinct = -1.0 * (1.0 - stanullfrac);

I wonder about the "we'd better still believe the null-fraction 
statistic" part. It makes sense for a unique index, but a DISTINCT or 
GROUP BY collapses all the NULLs to a single row. So I think there's 
some more work to be done here.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

pgsql-hackers by date:

From: Srirama Kucherlapati
Date: 25 November 2024, 19:26:56
Subject: RE: AIX support

From: Kirill Reshke
Date: 25 November 2024, 19:52:17
Subject: Re: Truncate logs by max_log_size

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser' - Mailing list pgsql-hackers

Previous

Next