Re: Allow to collect statistics on virtual generated columns - Mailing list pgsql-hackers

From Yugo Nagata
Subject Re: Allow to collect statistics on virtual generated columns
Date
Msg-id 20250801002830.143b25971fb9594b89d96aee@sraoss.co.jp
Whole thread Raw
In response to Re: Allow to collect statistics on virtual generated columns  (Yugo Nagata <nagata@sraoss.co.jp>)
List pgsql-hackers
Hi,

On Tue, 24 Jun 2025 17:05:33 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> Instead, I'm thinking of an alternative approach: expanding the expression
> at the time statistics are collected.

I've attached a new patch in this approache.

This allows to collect statistics on virtual generated columns.

During ANALYZE, generation expressions are expanded, and statistics are computed
using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
are now exported from extended_stats.c. However, since they are no longer specific
to extended statistics, it might be better to move them to analyze.c and vacuum.h.

To enable the optimizer to make use of these statistics, a new field named
virtual_gencols is added to RelOptInfo. This field holds the expressions of
virtual generated columns in the table. In examine_variable(), if an expression
in a WHERE clause matches a virtual generated column, the corresponding statistics
are used for that expression.

Example:

- Before applying the patch, the cardinality estimate is erroneous.

test=# create table t (i int, j int generated always as (i*10) virtual);
CREATE TABLE
test=# insert into t select generate_series(1,1000);
INSERT 0 1000
test=# insert into t select 1 from generate_series(1,1000);
INSERT 0 1000
test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
   Filter: ((i * 10) = 10)
   Rows Removed by Filter: 999
   Buffers: shared hit=9
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.299 ms
 Execution Time: 0.948 ms
(8 rows)


- After applying the patch, the cardinality estimate is correct.

test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
   Filter: ((i * 10) = 10)
   Rows Removed by Filter: 999
   Buffers: shared hit=9
 Planning:
   Buffers: shared hit=6
 Planning Time: 0.374 ms
 Execution Time: 1.028 ms
(8 rows)


Note that the patch is still a work in progress, so documentation and tests are not included.

Regards,
Yugo Nagata


-- 
Yugo Nagata <nagata@sraoss.co.jp>

Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Assertion failure in pgbench
Next
From: Fujii Masao
Date:
Subject: Re: Assertion failure in pgbench