Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser' - Mailing list pgsql-hackers

From Andrei Lepikhov
Subject Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date
Msg-id 41c5c820-bf91-4070-8261-fb87e91c5913@gmail.com
In response to Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Thanks for taking a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:
> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>> +     * proves the var is unique for this query.  However, we'd better still
>> +     * believe the null-fraction statistic.
>>       */
>>      if (vardata->isunique)
>>          stadistinct = -1.0 * (1.0 - stanullfrac);
> 
> I wonder about the "we'd better still believe the null-fraction 
> statistic" part. It makes sense for a unique index, but a DISTINCT or 
> GROUP BY collapses all the NULLs to a single row. So I think there's 
> some more work to be done here.
IMO, in that particular case it is not an issue: when there is a 
GROUP-BY, we set the vardata->isunique field and do not recurse into 
the Var statistics inside the subquery - just as DISTINCT already 
does. So we end up with stanullfrac == 0, which means the optimiser 
does not account for the number of NULLs. In the case of a UNIQUE 
index, the optimiser will have the stanullfrac statistic and will 
account for the NULLs.
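
To put rough numbers on it (taking the gu_2 table from the example 
below, with 1000 non-NULL and 100 NULL values): under the UNIQUE-index 
path the planner sees stanullfrac ~ 100/1100 ~ 0.09 and computes 
stadistinct = -1.0 * (1 - 0.09) ~ -0.91, i.e. about 1000 distinct 
values; under the GROUP-BY/DISTINCT path stanullfrac stays 0, so 
stadistinct = -1.0 and the column is treated as all-distinct.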

But your question raised another one. Could we attach to a node some 
vardata_extra, which would track specific conditions and let upper 
nodes take them into account alongside the Var statistics? For 
example, we could keep the 'unique set of columns' knowledge in such a 
structure for the Aggregate node. It could also be a solution to the 
problem of counting the NULLs generated by the RHS of OUTER JOINs in 
the query tree.
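
As a purely illustrative sketch of that OUTER JOIN case (the tables 
t1/t2 below are made up for the example, not taken from the patch): 
the NULLs the LEFT JOIN generates for unmatched rows are not reflected 
in any stanullfrac, so the upper GROUP-BY estimation has nothing to 
account for that extra NULL group.

CREATE TABLE t1 (id int);
CREATE TABLE t2 (id int, y int);
-- Hypothetical query: join-generated NULLs in t2.y are invisible to
-- the base-table statistics used for the grouping estimate.
SELECT t2.y
FROM t1 LEFT JOIN t2 ON t1.id = t2.id
GROUP BY t2.y;
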
What's more, look at the query:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;
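
The plan below is for a query along these lines (a nested GROUP-BY 
over gu_2.x; the exact EXPLAIN options may differ):

EXPLAIN (ANALYZE, TIMING OFF)
SELECT x
FROM (SELECT x FROM gu_2 WHERE x < 50 OR x IS NULL GROUP BY x) AS q
GROUP BY x;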

  HashAggregate  (cost=20.91..22.35 rows=144 width=4)
                 (actual rows=50 loops=1)
    Group Key: gu_2.x
    Batches: 1  Memory Usage: 40kB
    ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4)
                       (actual rows=50 loops=1)
          Group Key: gu_2.x
          Batches: 1  Memory Usage: 40kB
          ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4)
                                (actual rows=149 loops=1)
                Filter: ((x < '50'::double precision) OR (x IS NULL))
                Rows Removed by Filter: 951

Here we could also count the number of scanned NULLs separately in 
vardata_extra and use it in the upper GROUP-BY estimation.

-- 
regards, Andrei Lepikhov


