Thread: doc: Make selectivity example match wording

doc: Make selectivity example match wording

From
"David G. Johnston"
Date:
Hi,

Reposting this to its own thread.


    doc: make unique non-null join selectivity example match the prose
   
    The description of the computation for the unique, non-null,
    join selectivity describes a division by the maximum of two values,
    while the example shows a multiplication by their reciprocal.  While
    equivalent the max phrasing is easier to understand; which seems
    more important here than precisely adhering to the formula used
    in the code (for which either variant is still an approximation).
   
    While both num_distinct and num_rows are equal for a unique column
    both the concept and formula use row count (10,000) and the
    field num_distinct has already been set to mean the specific value
    present in the pg_stats table (i.e, -1), so use num_rows here.

David J.

Attachment

Re: doc: Make selectivity example match wording

From
"Dian M Fay"
Date:
On Thu Jun 9, 2022 at 11:57 AM EDT, David G. Johnston wrote:
> Reposting this to its own thread.
>
> https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com
>
>     doc: make unique non-null join selectivity example match the prose
>
>     The description of the computation for the unique, non-null,
>     join selectivity describes a division by the maximum of two values,
>     while the example shows a multiplication by their reciprocal.  While
>     equivalent the max phrasing is easier to understand; which seems
>     more important here than precisely adhering to the formula used
>     in the code (for which either variant is still an approximation).
>
>     While both num_distinct and num_rows are equal for a unique column
>     both the concept and formula use row count (10,000) and the
>     field num_distinct has already been set to mean the specific value
>     present in the pg_stats table (i.e, -1), so use num_rows here.

Pointing out that n_distinct = -1 is helpful but changing "because" to
"and" suggests that the missing MCV info is coincidental or a side
effect. Is there any case in which the stronger "because" wouldn't be
appropriate?

The second parenthetical (num_rows, not shown, but "tenk") took me a
minute to get since the row counts are only apparent on looking somewhat
closely at the other examples in the chapter. num_rows also isn't a
column in pg_stats which the "not shown" could be taken to imply; it's
sourced from somewhere else and only given as num_rows in this example.
How's '(as num_rowsN, 10,000 for both "tenk" example tables)'?

By "this value does get scaled in the non-unique case" do you mean it
relies on n_distinct as in the uncorrected algorithm listing? If so I
think it'd help to specify that.

You didn't take this line on but "This is, subtract the null
fraction..." omits the step of multiplying the complements of the null
fractions together before dividing.

Should n_distinct and num_rows be <structname>d in the text?



Re: doc: Make selectivity example match wording

From
"David G. Johnston"
Date:
On Sat, Jul 2, 2022 at 12:42 PM Dian M Fay <dian.m.fay@gmail.com> wrote:
On Thu Jun 9, 2022 at 11:57 AM EDT, David G. Johnston wrote:
> Reposting this to its own thread.
>
> https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com
>
>     doc: make unique non-null join selectivity example match the prose
>
>     The description of the computation for the unique, non-null,
>     join selectivity describes a division by the maximum of two values,
>     while the example shows a multiplication by their reciprocal.  While
>     equivalent the max phrasing is easier to understand; which seems
>     more important here than precisely adhering to the formula used
>     in the code (for which either variant is still an approximation).

Should n_distinct and num_rows be <structname>d in the text?

Thanks for the review.  I generally like everything you said but it made me realize that I still didn't really understand the intent behind the formula.  I spent way too much time working that out for myself, then turned what I found useful into this v2 patch.

It may need some semantic markup still but figured I'd see if the idea makes sense.

I basically rewrote, in a bit different style, the same material into the code comments, then proceeded to rework the proof that was already present there.

I did do this in somewhat of a vacuum.  I'm not inclined to learn this all start-to-end though.  If the abrupt style change is unwanted so be it.  I'm not really sure how much benefit the proof really provides.  The comments in the docs are probably sufficient for the code as well - just define why the three pieces of the formula exist and are packaged into a single multiplier called selectivity as an API choice.  I suspect once someone gets to that comment it is fair to assume some prior knowledge.  Admittedly, I didn't really come into this that way...

David J.

Attachment

Re: doc: Make selectivity example match wording

From
"Dian M Fay"
Date:
On Sat Jul 16, 2022 at 11:23 PM EDT, David G. Johnston wrote:
> Thanks for the review.  I generally like everything you said but it made me
> realize that I still didn't really understand the intent behind the
> formula.  I spent way too much time working that out for myself, then
> turned what I found useful into this v2 patch.
>
> It may need some semantic markup still but figured I'd see if the idea
> makes sense.
>
> I basically rewrote, in a bit different style, the same material into the
> code comments, then proceeded to rework the proof that was already present
> there.
>
> I did do this in somewhat of a vacuum.  I'm not inclined to learn this all
> start-to-end though.  If the abrupt style change is unwanted so be it.  I'm
> not really sure how much benefit the proof really provides.  The comments
> in the docs are probably sufficient for the code as well - just define why
> the three pieces of the formula exist and are packaged into a single
> multiplier called selectivity as an API choice.  I suspect once someone
> gets to that comment it is fair to assume some prior knowledge.
> Admittedly, I didn't really come into this that way...

Fair enough, I only know what I can glean from the comments in
eqjoinsel_inner and friends myself. I do think even this smaller change
is valuable because the current example talks about using an algorithm
based on the number of distinct values immediately after showing
n_distinct == -1, so making it clear that this case uses num_rows
instead is helpful.

"This value does get scaled in the non-unique case" again could be more
specific ("since here all values are unique, otherwise the calculation
uses num_distinct" perhaps?). But past that quibble I'm good.



Re: doc: Make selectivity example match wording

From
Bruce Momjian
Date:
On Sun, Jul 17, 2022 at 07:07:15PM -0400, Dian M Fay wrote:
> On Sat Jul 16, 2022 at 11:23 PM EDT, David G. Johnston wrote:
> > Thanks for the review.  I generally like everything you said but it made me
> > realize that I still didn't really understand the intent behind the
> > formula.  I spent way too much time working that out for myself, then
> > turned what I found useful into this v2 patch.
> >
> > It may need some semantic markup still but figured I'd see if the idea
> > makes sense.
> >
> > I basically rewrote, in a bit different style, the same material into the
> > code comments, then proceeded to rework the proof that was already present
> > there.
> >
> > I did do this in somewhat of a vacuum.  I'm not inclined to learn this all
> > start-to-end though.  If the abrupt style change is unwanted so be it.  I'm
> > not really sure how much benefit the proof really provides.  The comments
> > in the docs are probably sufficient for the code as well - just define why
> > the three pieces of the formula exist and are packaged into a single
> > multiplier called selectivity as an API choice.  I suspect once someone
> > gets to that comment it is fair to assume some prior knowledge.
> > Admittedly, I didn't really come into this that way...
> 
> Fair enough, I only know what I can glean from the comments in
> eqjoinsel_inner and friends myself. I do think even this smaller change
> is valuable because the current example talks about using an algorithm
> based on the number of distinct values immediately after showing
> n_distinct == -1, so making it clear that this case uses num_rows
> instead is helpful.
> 
> "This value does get scaled in the non-unique case" again could be more
> specific ("since here all values are unique, otherwise the calculation
> uses num_distinct" perhaps?). But past that quibble I'm good.

Patch applied to master.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.