Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser' - Mailing list pgsql-hackers

From Andrei Lepikhov
Subject Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date
Msg-id 41c5c820-bf91-4070-8261-fb87e91c5913@gmail.com
In response to Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Thanks for taking a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:
> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>> +     * proves the var is unique for this query.  However, we'd better still
>> +     * believe the null-fraction statistic.
>>       */
>>      if (vardata->isunique)
>>          stadistinct = -1.0 * (1.0 - stanullfrac);
> 
> I wonder about the "we'd better still believe the null-fraction 
> statistic" part. It makes sense for a unique index, but a DISTINCT or 
> GROUP BY collapses all the NULLs to a single row. So I think there's 
> some more work to be done here.
IMO, in that particular case it is not an issue: when there is a 
GROUP-BY, we set the vardata->isunique field and do not recurse into 
the Var statistics inside the subquery - just as DISTINCT already 
does. So we end up with stanullfrac == 0, which means the optimiser 
does not account for the number of NULLs. In the case of a UNIQUE 
index, the optimiser will have the stanullfrac statistic and will 
account for the NULLs.
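
To put rough numbers on it (taking the gu_2 table from the example 
below, with 1000 non-NULL and 100 NULL values): under the UNIQUE-index 
path the planner sees stanullfrac ~ 100/1100 ~ 0.09 and computes 
stadistinct = -1.0 * (1 - 0.09) ~ -0.91, i.e. about 1000 distinct 
values; under the GROUP-BY/DISTINCT path stanullfrac stays 0, so 
stadistinct = -1.0 and the column is treated as all-distinct.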

But your question raised another one. Could we attach to a node some 
vardata_extra, which would track specific conditions and let upper 
nodes take them into account alongside the Var statistics? For 
example, we could keep the 'unique set of columns' knowledge in such a 
structure for the Aggregate node. It could also be a solution to the 
problem of counting the NULLs generated by the RHS of OUTER JOINs in 
the query tree.
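
As a purely illustrative sketch of that OUTER JOIN case (the tables 
t1/t2 below are made up for the example, not taken from the patch): 
the NULLs the LEFT JOIN generates for unmatched rows are not reflected 
in any stanullfrac, so the upper GROUP-BY estimation has nothing to 
account for that extra NULL group.

CREATE TABLE t1 (id int);
CREATE TABLE t2 (id int, y int);
-- Hypothetical query: join-generated NULLs in t2.y are invisible to
-- the base-table statistics used for the grouping estimate.
SELECT t2.y
FROM t1 LEFT JOIN t2 ON t1.id = t2.id
GROUP BY t2.y;
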
What's more, look at the query:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;
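
The plan below is for a query along these lines (a nested GROUP-BY 
over gu_2.x; the exact EXPLAIN options may differ):

EXPLAIN (ANALYZE, TIMING OFF)
SELECT x
FROM (SELECT x FROM gu_2 WHERE x < 50 OR x IS NULL GROUP BY x) AS q
GROUP BY x;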

  HashAggregate  (cost=20.91..22.35 rows=144 width=4)
                 (actual rows=50 loops=1)
    Group Key: gu_2.x
    Batches: 1  Memory Usage: 40kB
    ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4)
                       (actual rows=50 loops=1)
          Group Key: gu_2.x
          Batches: 1  Memory Usage: 40kB
          ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4)
                                (actual rows=149 loops=1)
                Filter: ((x < '50'::double precision) OR (x IS NULL))
                Rows Removed by Filter: 951

Here we could also count the number of scanned NULLs separately in 
vardata_extra and use it in the upper GROUP-BY estimation.

-- 
regards, Andrei Lepikhov


