Thread: Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From
Heikki Linnakangas
Date:
On 24/09/2024 08:08, Andrei Lepikhov wrote:
> On 19/9/2024 09:55, Andrei Lepikhov wrote:
>> This wrong prediction makes things much worse if the query has more 
>> upper query blocks.
>> His question was: Why not consider the grouping column unique in the 
>> upper query block? It could improve estimates.
>> After a thorough investigation, I discovered that in commit  
>> 4767bc8ff2 most of the work was already done for DISTINCT clauses. So, 
>> why not do the same for grouping? A sketch of the patch is attached.
>> As I see it, grouping in this sense works quite similarly to DISTINCT, 
>> and we have no reason to ignore it. After applying the patch, you can 
>> see that the prediction has improved:
>>
>> Hash Right Join  (cost=5.62..162.56 rows=50 width=36)
>>
> A regression test is added in the new version.
> The code is tiny, simple and non-invasive - it will be easy to commit 
> or reject, so I have added it to the next commitfest.
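
(For readers without the original test case at hand: the quoted plan
comes from a query shape roughly like the sketch below - a grouped
subquery joined from an upper query block. The tables t_outer and
t_inner are purely illustrative, not the ones used in the thread.)

CREATE TABLE t_outer (x int);
CREATE TABLE t_inner (y int, payload int);
INSERT INTO t_outer SELECT gs FROM generate_series(1, 50) AS gs;
INSERT INTO t_inner SELECT gs % 50, gs FROM generate_series(1, 1000) AS gs;
VACUUM ANALYZE t_outer, t_inner;

EXPLAIN
SELECT *
FROM t_outer
LEFT JOIN (SELECT y FROM t_inner GROUP BY y) AS grp ON grp.y = t_outer.x;
-- The patch lets the planner treat grp.y as unique in the upper query
-- block when estimating this join.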

Looks good at a quick glance.

> @@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
>      }
>  
>      /*
> -     * If there is a unique index or DISTINCT clause for the variable, assume
> -     * it is unique no matter what pg_statistic says; the statistics could be
> -     * out of date, or we might have found a partial unique index that proves
> -     * the var is unique for this query.  However, we'd better still believe
> -     * the null-fraction statistic.
> +     * If there is a unique index, DISTINCT or GROUP-BY clause for the variable,
> +     * assume it is unique no matter what pg_statistic says; the statistics
> +     * could be out of date, or we might have found a partial unique index that
> +     * proves the var is unique for this query.  However, we'd better still
> +     * believe the null-fraction statistic.
>       */
>      if (vardata->isunique)
>          stadistinct = -1.0 * (1.0 - stanullfrac);

I wonder about the "we'd better still believe the null-fraction 
statistic" part. It makes sense for a unique index, but a DISTINCT or 
GROUP BY collapses all the NULLs to a single row. So I think there's 
some more work to be done here.
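
To make the difference concrete, here is a small sketch (the table
u_demo is only an illustration, not from the patch or the thread): a
unique index does not constrain NULLs, so the column can contain many
NULLs, while DISTINCT or GROUP BY collapses them into a single output
row.

CREATE TABLE u_demo (x int UNIQUE);
INSERT INTO u_demo SELECT gs FROM generate_series(1, 100) AS gs;
INSERT INTO u_demo SELECT NULL FROM generate_series(1, 100) AS gs;

SELECT count(*) FROM u_demo;                                -- 200 rows
SELECT count(*) FROM (SELECT DISTINCT x FROM u_demo) AS d;  -- 101 rows: 100 values plus one NULL group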

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From
Andrei Lepikhov
Date:
Thanks for taking a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:
> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>> +     * proves the var is unique for this query.  However, we'd better still
>> +     * believe the null-fraction statistic.
>>       */
>>      if (vardata->isunique)
>>          stadistinct = -1.0 * (1.0 - stanullfrac);
> 
> I wonder about the "we'd better still believe the null-fraction 
> statistic" part. It makes sense for a unique index, but a DISTINCT or 
> GROUP BY collapses all the NULLs to a single row. So I think there's 
> some more work to be done here.
IMO, in that particular case, it is not an issue: having GROUP-BY, we 
set the vardata->isunique field and do not recurse into the Var 
statistics inside the subquery - just as DISTINCT already does. So we 
have stanullfrac == 0, which means the optimiser doesn't account for 
the number of NULLs. In the case of a UNIQUE index, the optimiser will 
have the stanullfrac statistic and will count NULLs.
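
Rough numbers, as an illustration only (they match the gu_2 example
below): with 1000 non-NULL values and 100 NULLs, stanullfrac is about
0.09, so a unique-index column gives stadistinct = -1.0 * (1.0 - 0.09)
= -0.91, while for the output of a DISTINCT/GROUP-BY subquery
stanullfrac is 0 and stadistinct becomes -1.0. The statistic itself can
be checked like this:

SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'gu_2' AND attname = 'x';
-- null_frac here is the stanullfrac that get_variable_numdistinct() reads.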

But your question raises another one. Could we add to a node some 
vardata_extra structure, which could track specific conditions and let 
upper nodes take it into account alongside the Var statistic?
For example, we could keep the 'unique set of columns' knowledge in 
such a structure for the Aggregate node. It could also be a solution 
to the problem of counting NULLs generated by the RHS of OUTER JOINs 
in the query tree.
What's more, look at the query:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;
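
The statement that produced the plan below is not quoted in the
message; a query of roughly this shape - a grouped subquery grouped
again in the upper query block - reproduces the double aggregation:

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT x
FROM (SELECT x FROM gu_2
      WHERE x < 50 OR x IS NULL
      GROUP BY x) AS sub
GROUP BY x;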

 HashAggregate  (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
   Group Key: gu_2.x
   Batches: 1  Memory Usage: 40kB
   ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
         Group Key: gu_2.x
         Batches: 1  Memory Usage: 40kB
         ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
               Filter: ((x < '50'::double precision) OR (x IS NULL))
               Rows Removed by Filter: 951

Here we could also count the number of scanned NULLs separately in 
vardata_extra and use it in the upper GROUP-BY estimation.

-- 
regards, Andrei Lepikhov