Patch Review: Collect frequency statistics and selectivity estimation for arrays - Mailing list pgsql-hackers
From | Nathan Boley |
---|---|
Subject | Patch Review: Collect frequency statistics and selectivity estimation for arrays |
Date | |
Msg-id | CAHetpQSDG85YH41URD33+C34MnbSjkJgo0EG6cUhqYFb41HF6Q@mail.gmail.com Whole thread Raw |
Responses |
Re: Patch Review: Collect frequency statistics and
selectivity estimation for arrays
Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays |
List | pgsql-hackers |
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539 I glanced through the code and played around with the feature and, in general, it looks pretty good. But I have a some comments/questions. Design: First, it makes me uncomfortable that you are using the MCV and histogram slot kinds in a way that is very different from other data types. I realize that tsvector uses MCV in the same way that you do but: 1) I don't like that very much either. 2) TS vector is different in that equality ( in the btree sense ) doesn't make sense, whereas it does for arrays. Using the histogram slot for the array lengths is also very surprising to me. Why not just use a new STA_KIND? It's not like we are running out of room, and this will be the second 'container' type that splits the container and stores stats about the elements. Second, I'm fairly unclear about what the actual underlying statistical method is and what assumptions it makes. Before I can really review the method, I will probably need to see some documentation, as I say below. Do you have any tests/simulations that explore the estimation procedure that I can look at? When will it fail? In what cases does it improve estimation? Documentation: Seems a bit lacking - especially if you keep the non standard usage of mcv/histograms. Also, it would be really nice to see some update to the row-estimation-examples page ( in chapter 55 ). The code comments are good in general, but there are not enough high level comments . It would be really nice to have a comment that gives an overview of what each of the following functions are supposed to do: mcelem_array_selec mcelem_array_contain_overlap_selec and especially mcelem_array_contained_selec Also, in the compute_array_stats you reference compute_tsvector_stats for the algorithm, but I think that it would be smarter to either copy the relevant portions of the comment over or to reference a published document. If compute_tsvector_stats changes, I doubt that anyone will remember to update the comment. Finally, it would be really nice to see, explicitly, and in mathematical notation 1) The data that is being collected and summarized. ( the statistic ) 2) The estimate being derived from the statistic ( ie the selectivity ) i) Any assumptions being made ( ie independenceof elements within an array ) For instance, the only note I could find that actually addressed the estimator and assumptions underneath it was + * mult * dist[i] / mcelem_dist[i] gives us probability probability + * of qual matching from assumption of independent element occurence + * with condition that length = i. and that's pretty cryptic. That should be expanded upon and placed more prominently. A couple nit picks: 1) In calc_distr you go to some lengths to avoid round off errors. Since it is certainly just the order of the estimatethat matters, why not just perform the calculation in log space? 2) compute_array_stats is really long. Could you break it up? ( I know that the other analyze functions are the same way,but why continue in that vein? ) Best, Nathan
pgsql-hackers by date: