Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 20210324164226.GB15100@telsasoft.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Wed, Mar 24, 2021 at 01:24:46AM -0500, Justin Pryzby wrote:
> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
> > Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  

In HEAD:catalogs.sgml, pg_statistic_ext (the table) says "object":
|Name of the statistics object
|Owner of the statistics object
|An array of attribute numbers, indicating which table columns are covered by this statistics object;

But pg_stats_ext (the view) doesn't say "object", which sounds wrong:
|Name of extended statistics
|Owner of the extended statistics
|Names of the columns the extended statistics is defined on

Other pre-existing issues: should be singular "statistic":
doc/src/sgml/perform.sgml:     Another type of statistics stored for each column are most-common value
doc/src/sgml/ref/psql-ref.sgml:        The status of each kind of extended statistics is shown in a column

Pre-existing issues: doesn't say "object" but I think it should:
src/backend/commands/statscmds.c:                                        errmsg("statistics creation on system columns
isnot supported")));
 
src/backend/commands/statscmds.c:                                        errmsg("cannot have more than %d columns in
statistics",
src/backend/commands/statscmds.c:        * If we got here and the OID is not valid, it means the statistics does
src/backend/commands/statscmds.c: * Select a nonconflicting name for a new statistics.
src/backend/commands/statscmds.c: * Generate "name2" for a new statistics given the list of column names for it
src/backend/statistics/extended_stats.c:                /* compute statistics target for this statistics */
src/backend/statistics/extended_stats.c: * attributes the statistics is defined on, and then the default statistics
src/backend/statistics/mcv.c: * The input is the OID of the statistics, and there are no rows returned if

should say "for a statistics object" or "for statistics objects"
src/backend/statistics/extended_stats.c: * target for a statistics objects (from the object target, attribute targets

Your patch adds these:

Should say "object":
+        * Check if we actually have a matching statistics for the expression.
                                                                                                          
 
+               /* evaluate expressions (if the statistics has any) */
                                                                                                          
 
+        * for the extended statistics. The second option seems more reasonable.
                                                                                                          
 
+                * the statistics had all options enabled on the original version.
                                                                                                          
 
+                * But if the statistics is defined on just a single column, it has to
                                                                                                          
 
+       /* has the statistics expressions? */
                                                                                                          
 
+                       /* expression - see if it's in the statistics */
                                                                                                          
 
+                                        * column(s) the statistics depends on.  Also require all
                                                                                                          
 
+        * statistics is defined on more than one column/expression).
                                                                                                          
 
+        * statistics is useless, but harmless).
                                                                                                          
 
+        * If there are no simply-referenced columns, give the statistics an auto
                                                                                                          
 


+                        * Then the first statistics matches no expressions and 3 vars,
                                                                                                          
 
+                        * while the second statistics matches one expression and 1 var.
                                                                                                          
 
+                        * Currently the first statistics wins, which seems silly.
                                                                                                          
 

+                        * [(a+c), d]. But maybe it's better than failing to match the
                                                                                                          
 
+                        * second statistics?
                                                                                                          
 

I can make patches for these (separate patches for HEAD and your patch), but I
don't think your patch has to wait on it, since the user-facing documentation
is consistent with what's already there, and the rest are internal comments.

-- 
Justin



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: PoC/WIP: Extended statistics on expressions
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Custom compression methods