I wrote:
> The part of that that seems to be going off the rails is
> this selection of a cutoff frequency below which element values
> will be dropped:
>     cutoff_freq = 9 * element_no / bucket_width;
> The first thing I find suspicious here is that the calculation is
> based on element_no (the total number of array elements processed)
> and not nonnull_cnt (the maximum possible frequency). Is that
> really right?
I did some more digging and found that this calculation was introduced
(in the older tsvector code) in bc0f08092, which traces to this
discussion:
https://www.postgresql.org/message-id/flat/4BF4357E.6000505%40krogh.cc
So the use of element_no is correct, because what we need to consider
here is the total number of values fed to the LC (lossy counting)
algorithm.
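
To make that concrete, here's a minimal standalone sketch of the LC
bookkeeping as I understand it (toy constants and made-up names, not
the actual array_typanalyze.c code). The thing to notice is that both
the pruning step and the final cutoff are keyed to the total number
of values processed, not to anything about the entries still being
tracked:

#include <stdio.h>

#define BUCKET_WIDTH 50         /* toy value; the real code derives this */
#define MAX_TRACKED  128

typedef struct
{
    int     value;
    int     freq;               /* f: occurrences since (re)insertion */
    int     delta;              /* max occurrences forgotten by pruning */
    int     used;
} Tracker;

static Tracker track[MAX_TRACKED];

static Tracker *
lookup_or_insert(int value, int delta)
{
    int     i;

    for (i = 0; i < MAX_TRACKED; i++)
        if (track[i].used && track[i].value == value)
            return &track[i];
    for (i = 0; i < MAX_TRACKED; i++)
        if (!track[i].used)
        {
            track[i].used = 1;
            track[i].value = value;
            track[i].freq = 0;
            track[i].delta = delta;
            return &track[i];
        }
    return NULL;                /* table full; real code uses a hash table */
}

int
main(void)
{
    int     element_no = 0;     /* total values fed to the algorithm */
    int     b_current = 0;      /* completed buckets so far */
    int     cutoff_freq;
    int     i,
            j;

    /* toy stream: value 7 every other element, everything else unique */
    for (i = 0; i < 1000; i++)
    {
        int     v = (i % 2 == 0) ? 7 : i;

        element_no++;
        lookup_or_insert(v, b_current)->freq++;

        if (element_no % BUCKET_WIDTH == 0)
        {
            b_current = element_no / BUCKET_WIDTH;
            /* forget entries whose true count cannot exceed b_current */
            for (j = 0; j < MAX_TRACKED; j++)
                if (track[j].used &&
                    track[j].freq + track[j].delta <= b_current)
                    track[j].used = 0;
        }
    }

    /* the final filter under discussion: keyed off element_no */
    cutoff_freq = 9 * element_no / BUCKET_WIDTH;
    for (i = 0; i < MAX_TRACKED; i++)
        if (track[i].used && track[i].freq > cutoff_freq)
            printf("keep value %d: f = %d, delta = %d\n",
                   track[i].value, track[i].freq, track[i].delta);
    return 0;
}

With these numbers the cutoff works out to 9 * 1000 / 50 = 180, so
only the value occurring 500 times survives; shrink its share of the
stream enough and it falls off the same cliff, just at a different
scale.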
Also, my thought that maybe we should reject entries with f < 2
is bogus, because at the end of the algorithm f is not necessarily
the true count of occurrences of the value: some early occurrences
could have been forgotten via pruning. The "behavioral cliff" is
annoying but I'm not sure there is much to be done about it: having
a single (still-remembered) occurrence gets less and less significant
as the total input size increases, so sooner or later you are going
to hit a point where such values should be thrown away.
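
To spell out the forgetting effect with toy numbers (bucket_width =
10): a value occurring at stream positions 5 and 95 is entered with
f = 1 and delta = 0, pruned at the first bucket boundary because
f + delta = 1 <= b = 1, and forgotten; the 95th element re-enters it
with f = 1 and delta = 9. The final f = 1 understates the true count
of 2, and delta only bounds the possible loss, it can't recover it.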
So at this point I'm thinking that there is nothing wrong with
ANALYZE's algorithm, although I now see that there are some relevant
comments in ts_typanalyze.c that probably ought to be transposed into
array_typanalyze.c.
The idea of treating lack of MCELEM differently from complete
lack of stats still seems to have merit, though.
regards, tom lane