Thread: Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From
Heikki Linnakangas
Date:
On 24/09/2024 08:08, Andrei Lepikhov wrote:
> On 19/9/2024 09:55, Andrei Lepikhov wrote:
>> This wrong prediction makes things much worse if the query has more 
>> upper query blocks.
>> His question was: Why not consider the grouping column unique in the 
>> upper query block? It could improve estimates.
>> After a thorough investigation, I discovered that in commit  
>> 4767bc8ff2 most of the work was already done for DISTINCT clauses. So, 
>> why not do the same for grouping? A sketch of the patch is attached.
>> As I see it, grouping in this sense works quite similarly to DISTINCT, 
>> and we have no reason to ignore it. After applying the patch, you can 
>> see that the prediction has improved:
>>
>> Hash Right Join  (cost=5.62..162.56 rows=50 width=36)
>>
> A regression test is added in the new version.
> The code is tiny, simple and non-invasive - it will be easy to commit 
> or reject, so I have added it to the next commitfest.
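
(For readers without the original test case at hand: the quoted plan
comes from a query shape roughly like the sketch below - a grouped
subquery joined from an upper query block. The tables t_outer and
t_inner are purely illustrative, not the ones used in the thread.)

CREATE TABLE t_outer (x int);
CREATE TABLE t_inner (y int, payload int);
INSERT INTO t_outer SELECT gs FROM generate_series(1, 50) AS gs;
INSERT INTO t_inner SELECT gs % 50, gs FROM generate_series(1, 1000) AS gs;
VACUUM ANALYZE t_outer, t_inner;

EXPLAIN
SELECT *
FROM t_outer
LEFT JOIN (SELECT y FROM t_inner GROUP BY y) AS grp ON grp.y = t_outer.x;
-- The patch lets the planner treat grp.y as unique in the upper query
-- block when estimating this join.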

Looks good at a quick glance.

> @@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
>      }
>  
>      /*
> -     * If there is a unique index or DISTINCT clause for the variable, assume
> -     * it is unique no matter what pg_statistic says; the statistics could be
> -     * out of date, or we might have found a partial unique index that proves
> -     * the var is unique for this query.  However, we'd better still believe
> -     * the null-fraction statistic.
> +     * If there is a unique index, DISTINCT or GROUP-BY clause for the variable,
> +     * assume it is unique no matter what pg_statistic says; the statistics
> +     * could be out of date, or we might have found a partial unique index that
> +     * proves the var is unique for this query.  However, we'd better still
> +     * believe the null-fraction statistic.
>       */
>      if (vardata->isunique)
>          stadistinct = -1.0 * (1.0 - stanullfrac);

I wonder about the "we'd better still believe the null-fraction 
statistic" part. It makes sense for a unique index, but a DISTINCT or 
GROUP BY collapses all the NULLs to a single row. So I think there's 
some more work to be done here.
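
To make the difference concrete, here is a small sketch (the table
u_demo is only an illustration, not from the patch or the thread): a
unique index does not constrain NULLs, so the column can contain many
NULLs, while DISTINCT or GROUP BY collapses them into a single output
row.

CREATE TABLE u_demo (x int UNIQUE);
INSERT INTO u_demo SELECT gs FROM generate_series(1, 100) AS gs;
INSERT INTO u_demo SELECT NULL FROM generate_series(1, 100) AS gs;

SELECT count(*) FROM u_demo;                                -- 200 rows
SELECT count(*) FROM (SELECT DISTINCT x FROM u_demo) AS d;  -- 101 rows: 100 values plus one NULL group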

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From
Andrei Lepikhov
Date:
Thanks for taking a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:
> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>> +     * proves the var is unique for this query.  However, we'd better still
>> +     * believe the null-fraction statistic.
>>       */
>>      if (vardata->isunique)
>>          stadistinct = -1.0 * (1.0 - stanullfrac);
> 
> I wonder about the "we'd better still believe the null-fraction 
> statistic" part. It makes sense for a unique index, but a DISTINCT or 
> GROUP BY collapses all the NULLs to a single row. So I think there's 
> some more work to be done here.
IMO, in that particular case, it is not an issue: having GROUP-BY, we 
set the vardata->isunique field and do not recurse into the Var 
statistics inside the subquery - just as DISTINCT already does. So we 
have stanullfrac == 0, which means the optimiser doesn't account for 
the number of NULLs. In the case of a UNIQUE index, the optimiser will 
have the stanullfrac statistic and will count NULLs.
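
Rough numbers, as an illustration only (they match the gu_2 example
below): with 1000 non-NULL values and 100 NULLs, stanullfrac is about
0.09, so a unique-index column gives stadistinct = -1.0 * (1.0 - 0.09)
= -0.91, while for the output of a DISTINCT/GROUP-BY subquery
stanullfrac is 0 and stadistinct becomes -1.0. The statistic itself can
be checked like this:

SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'gu_2' AND attname = 'x';
-- null_frac here is the stanullfrac that get_variable_numdistinct() reads.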

But your question raises another one. Could we add to a node some 
vardata_extra structure, which could track specific conditions and let 
upper nodes take it into account alongside the Var statistic?
For example, we could keep the 'unique set of columns' knowledge in 
such a structure for the Aggregate node. It could also be a solution 
to the problem of counting NULLs generated by the RHS of OUTER JOINs 
in the query tree.
What's more, look at the query:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;
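
The statement that produced the plan below is not quoted in the
message; a query of roughly this shape - a grouped subquery grouped
again in the upper query block - reproduces the double aggregation:

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT x
FROM (SELECT x FROM gu_2
      WHERE x < 50 OR x IS NULL
      GROUP BY x) AS sub
GROUP BY x;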

 HashAggregate  (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
   Group Key: gu_2.x
   Batches: 1  Memory Usage: 40kB
   ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
         Group Key: gu_2.x
         Batches: 1  Memory Usage: 40kB
         ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
               Filter: ((x < '50'::double precision) OR (x IS NULL))
               Rows Removed by Filter: 951

Here we could also count the number of scanned NULLs separately in 
vardata_extra and use it in the upper GROUP-BY estimation.

-- 
regards, Andrei Lepikhov