Re: Improve docs for n_distinct_inherited - Mailing list pgsql-hackers

From David Rowley
Subject Re: Improve docs for n_distinct_inherited
Date
Msg-id CAApHDvrRqVB6OesPq9Z0S_872u1+qq+FLyEFU9FVdiPUET66YA@mail.gmail.com
Whole thread Raw
In response to Improve docs for n_distinct_inherited  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Improve docs for n_distinct_inherited
List pgsql-hackers
On Thu, 8 May 2025 at 15:23, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Not liking the proposal, not sure it is even correct.  Somehow "children of inheritance parent tables" are omitted.

I don't see the quoted text anywhere in this area, so I'm not sure I
follow what you mean with the omission.

>  In both cases the only statistics affected are those of the table upon which the option is being read.  That can go
implied. Leaving "affects the statistics for the table itself, while" is causing most of the harm.
 

So the text, when it says "affects the statistics for the table
itself" is talking about stainherit==false statistics and the "the
statistics gathered which include the descendant tables of partitioned
tables and inheritance parent tables" which I added is trying to
describe the stainherits==true statistics.  You may have
misinterpreted that to think it was talking which table the option was
being applied to rather than where the sample rows were being gathered
from.

> In terms of wording - n_distinct must be used for all tables except partitioned tables and inheritance parent tables.
Those two acquire their override value from n_distinct_inherited.  So something like:
 
>
> diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
> index 24316bb816a..657e4c835ac 100644
> --- a/doc/src/sgml/ref/alter_table.sgml
> +++ b/doc/src/sgml/ref/alter_table.sgml
> @@ -339,9 +339,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
>        <literal>n_distinct_inherited</literal>, which override the
>        number-of-distinct-values estimates made by subsequent
>        <link linkend="sql-analyze"><command>ANALYZE</command></link>
> -      operations.  <literal>n_distinct</literal> affects the statistics for the table
> -      itself, while <literal>n_distinct_inherited</literal> affects the statistics
> -      gathered for the table plus its inheritance children.  When set to a
> +      operations.
> +      For partitioned tables and inheritance parents <command>ANALYZE</command>
> +      consults the <literal>n_distinct_inherited</literal> option and, if applicable,
> +      includes all descendants when calculating the table size for use in the
> +      number-of-distinct-values estimate.  For all other tables, <literal>n_distinct</literal>
> +      is used and the table size estimate is based on the table itself.
> +      When set to a
>        positive value, <command>ANALYZE</command> will assume that the column contains
>        exactly the specified number of distinct nonnull values.  When set to a
>        negative value, which must be greater
>
> It has a bit of forward-referencing to the following paragraph, but nothing surprising or novel.  The existing
wordingis already doing the same by alluding to descendants/children - but its just one data point that is involved so
letscall it out.
 

I can't follow your proposed text. It seems to indicate that ANALYZE
uses n_distinct_inherits to calculate the table size. ANALYZE does
nothing of the sort. I think you might have been confused by the
following text which talks about how *the planner* proportions
*negative* stadistinct values over the "table size", which really
should say estimated row count and not table size. table size has
nothing to do with this, but that seems like another issue.

As for my proposed patch and the text it changes, what this text is
meant to be talking about is which of ANALYZE's sample sets the
stadistinct estimate is overridden for. The two options for sample
sets are 1) stainherits==true and 2) stainherits==false. With
inheritance parent tables, ANALYZE gathers both sets and records both
sets in pg_statistic. Which one the planner uses depends on what it's
doing and whether you specify ONLY or not in your query.  For
partitioned tables, since they're empty, we don't bother with
stainherits==false stats.

I've attached another patch which adjusts the text to detail that
there are two sample row sets gathered for some cases, and then
mentions the two options that can be used to selectively overwrite
each sample's n_distinct estimate. I'd rather find a shorter set of
words to better explain this than what I'm proposing in v2, but the
current text has presumably confused one person (in the bug report)
and the v1 patch still seems to be a source of confusion.

David

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Improve explicit cursor handling in pg_stat_statements
Next
From: Daniel Gustafsson
Date:
Subject: Re: [PATCH] Fix tags in func.sgml