Patch Review: Collect frequency statistics and selectivity estimation for arrays - Mailing list pgsql-hackers

From Nathan Boley
Subject Patch Review: Collect frequency statistics and selectivity estimation for arrays
Date
Msg-id CAHetpQSDG85YH41URD33+C34MnbSjkJgo0EG6cUhqYFb41HF6Q@mail.gmail.com
Whole thread Raw
Responses Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays
Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays
List pgsql-hackers
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539

I glanced through the code and played around with the feature and,
in general, it looks pretty good. But I have a some comments/questions.

Design:

First, it makes me uncomfortable that you are using the MCV and histogram slot
kinds in a way that is very different from other data types.

I realize that tsvector uses MCV in the same way that you do but:

1) I don't like that very much either.
2) TS vector is different in that equality ( in the btree sense )   doesn't make sense, whereas it does for arrays.

Using the histogram slot for the array lengths is also very surprising to me.

Why not just use a new STA_KIND? It's not like we are running out of
room, and this will be the second 'container' type that splits the container
and stores stats about the elements.


Second, I'm fairly unclear about what the actual underlying statistical
method is and what assumptions it makes. Before I can really review
the method, I will probably need to see some documentation, as I say below.
Do you have any tests/simulations that explore the estimation procedure
that I can look at? When will it fail? In what cases does it improve
estimation?

Documentation:

Seems a bit lacking - especially if you keep the non standard usage of
mcv/histograms. Also, it would be really nice to see some update to
the row-estimation-examples page ( in chapter 55 ).

The code comments are good in general, but there are not enough high level
comments . It would be really nice to have a comment that gives an overview
of what each of the following functions are supposed to do:

mcelem_array_selec
mcelem_array_contain_overlap_selec

and especially

mcelem_array_contained_selec

Also, in the compute_array_stats you reference compute_tsvector_stats for
the algorithm, but I think that it would be smarter to either copy the
relevant portions of the comment over or to reference a published document.
If compute_tsvector_stats changes, I doubt that anyone will remember to
update the comment.

Finally, it would be really nice to see, explicitly, and in
mathematical notation

1) The data that is being collected and summarized. ( the statistic )
2) The estimate being derived from the statistic ( ie the selectivity )   i) Any assumptions being made ( ie
independenceof elements within
 
an array )

For instance, the only note I could find that actually addressed the
estimator and
assumptions underneath it was

+              * mult * dist[i] / mcelem_dist[i] gives us probability probability
+              * of qual matching from assumption of independent element occurence
+              * with condition that length = i.

and that's pretty cryptic. That should be expanded upon and placed
more prominently.

A couple nit picks:

1) In calc_distr you go to some lengths to avoid round off errors. Since it is  certainly just the order of the
estimatethat matters, why not just  perform the calculation in log space?
 

2) compute_array_stats is really long. Could you break it up? ( I know that  the other analyze functions are the same
way,but why continue in
 
that vein? )

Best,
Nathan


pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: pg_class.relistemp
Next
From: "David E. Wheeler"
Date:
Subject: Re: pg_class.relistemp