Home > mailing lists

Patch Review: Collect frequency statistics and selectivity estimation for arrays - Mailing list pgsql-hackers

From	Nathan Boley
Subject	Patch Review: Collect frequency statistics and selectivity estimation for arrays
Date	July 14, 2011 22:13:40
Msg-id	CAHetpQSDG85YH41URD33+C34MnbSjkJgo0EG6cUhqYFb41HF6Q@mail.gmail.com Whole thread Raw
Responses	Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays
List	pgsql-hackers

Tree view

Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539

I glanced through the code and played around with the feature and,
in general, it looks pretty good. But I have a some comments/questions.

Design:

First, it makes me uncomfortable that you are using the MCV and histogram slot
kinds in a way that is very different from other data types.

I realize that tsvector uses MCV in the same way that you do but:

1) I don't like that very much either.
2) TS vector is different in that equality ( in the btree sense ) doesn't make sense, whereas it does for arrays.

Using the histogram slot for the array lengths is also very surprising to me.

Why not just use a new STA_KIND? It's not like we are running out of
room, and this will be the second 'container' type that splits the container
and stores stats about the elements.

Second, I'm fairly unclear about what the actual underlying statistical
method is and what assumptions it makes. Before I can really review
the method, I will probably need to see some documentation, as I say below.
Do you have any tests/simulations that explore the estimation procedure
that I can look at? When will it fail? In what cases does it improve
estimation?

Documentation:

Seems a bit lacking - especially if you keep the non standard usage of
mcv/histograms. Also, it would be really nice to see some update to
the row-estimation-examples page ( in chapter 55 ).

The code comments are good in general, but there are not enough high level
comments . It would be really nice to have a comment that gives an overview
of what each of the following functions are supposed to do:

mcelem_array_selec
mcelem_array_contain_overlap_selec

and especially

mcelem_array_contained_selec

Also, in the compute_array_stats you reference compute_tsvector_stats for
the algorithm, but I think that it would be smarter to either copy the
relevant portions of the comment over or to reference a published document.
If compute_tsvector_stats changes, I doubt that anyone will remember to
update the comment.

Finally, it would be really nice to see, explicitly, and in
mathematical notation

1) The data that is being collected and summarized. ( the statistic )
2) The estimate being derived from the statistic ( ie the selectivity ) i) Any assumptions being made ( ie
independenceof elements within

an array )

For instance, the only note I could find that actually addressed the
estimator and
assumptions underneath it was

+ * mult * dist[i] / mcelem_dist[i] gives us probability probability
+ * of qual matching from assumption of independent element occurence
+ * with condition that length = i.

and that's pretty cryptic. That should be expanded upon and placed
more prominently.

A couple nit picks:

1) In calc_distr you go to some lengths to avoid round off errors. Since it is certainly just the order of the
estimatethat matters, why not just perform the calculation in log space?

2) compute_array_stats is really long. Could you break it up? ( I know that the other analyze functions are the same
way,but why continue in

that vein? )

Best,
Nathan

pgsql-hackers by date:

From: Josh Berkus
Date: 14 July 2011, 22:06:33
Subject: Re: pg_class.relistemp

From: "David E. Wheeler"
Date: 14 July 2011, 22:13:57
Subject: Re: pg_class.relistemp

Patch Review: Collect frequency statistics and selectivity estimation for arrays - Mailing list pgsql-hackers

Previous

Next