Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Once again: the whole algorithm is a ripoff from analyze.c, with the
> dllist instead of an array because you don't know how big a tracking
> list you will need, and with a hashtable because the tracking list will
> probably be large and looking things up linearly wouldn't be fun.
Hmmm ... I had forgotten how compute_minimal_stats() works, and looking
at it now I'm just itching to rewrite it. You have to realize that that
code was written with the assumptions that (1) it was only going to be
used for a few weird second-class-citizen data types, and (2) the stats
target would typically be around 10 or so. It really wasn't designed to
be industrial-strength code. (It was still a big improvement over what
we'd had before :-(.) So I'm not very comfortable with taking it as the
design base for something that does need to be industrial-strength.
Your point about not wanting the most-recently-added lexeme to drop off
first is a good one, but the approach is awfully fragile. Consider what
happens if (or when) all the lexemes in the list reach count 2 or more.
All other lexemes seen later will fight over the last list slot, and
have no chance of getting higher in the list; so the algorithm will
completely fail to adapt to any changes in the input statistics that
occur after that point is reached. I'm thinking that the idea of
periodically cutting off a bunch of low-scoring items, instead of trying
to do it "exactly", would actually be better because it'd still have a
chance of tracking new data even when the counts get larger.
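Very roughly, the pruning idea I mean looks like the toy sketch below.  This
is not analyze.c code and every name in it is invented for illustration; it
is just the "cut off the low scorers in a batch" scheme (essentially classic
lossy counting) written out so the shape is clear.

    #include <stdlib.h>
    #include <string.h>

    typedef struct TrackItem
    {
        char   *lexeme;
        int     count;
    } TrackItem;

    typedef struct Tracker
    {
        TrackItem  *items;
        int         nitems;
        int         nominal;    /* target list size (from stats target) */
        int         capacity;   /* allocated slots, >= nominal */
        int         prunes;     /* number of prune passes run so far */
    } Tracker;

    static void
    tracker_init(Tracker *t, int nominal)
    {
        t->nominal = nominal;
        t->capacity = nominal + nominal / 4 + 1;   /* a little slack */
        t->items = malloc(t->capacity * sizeof(TrackItem));
        t->nitems = 0;
        t->prunes = 0;
    }

    static void
    tracker_add(Tracker *t, const char *lexeme)
    {
        int     i;

        /* linear lookup to keep the sketch short; real code would hash */
        for (i = 0; i < t->nitems; i++)
        {
            if (strcmp(t->items[i].lexeme, lexeme) == 0)
            {
                t->items[i].count++;
                return;
            }
        }

        /* prune low scorers whenever the list overflows its nominal size */
        if (t->nitems >= t->nominal)
        {
            int     keep = 0;

            t->prunes++;
            for (i = 0; i < t->nitems; i++)
            {
                if (t->items[i].count > t->prunes)
                    t->items[keep++] = t->items[i];     /* survivor */
                else
                    free(t->items[i].lexeme);           /* discard */
            }
            t->nitems = keep;
        }

        /* grow in 2X jumps if even pruning didn't free up room */
        if (t->nitems >= t->capacity)
        {
            t->capacity *= 2;
            t->items = realloc(t->items, t->capacity * sizeof(TrackItem));
        }

        t->items[t->nitems].lexeme = strdup(lexeme);
        t->items[t->nitems].count = 1;
        t->nitems++;
    }

The point is that a newly seen lexeme always gets a slot of its own, and the
threshold for surviving a prune keeps rising, so the tracker can still turn
over its contents after everything in it has reached count 2 or more.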
I don't recommend doing two passes over the data because it's fairly
likely that tsvectors would be large enough to be toasted, which'd make
fetching them expensive. One idea is to start out with some reasonable
estimate of the max lexemes per tsvector (say a couple hundred) and
realloc the list bigger in sizable jumps (say 2X) when the estimate is
seen to be exceeded. Another possibility is to have a prescan that only
determines the width of the physically largest tsvector (you can get the
width without detoasting, so this would be quite cheap), and then
estimate the number of lexemes corresponding to that using some
fudge-factor guess about lexeme sizes, and then stick with the resulting
list size regardless of what the actual lexeme counts are. I kinda like
the latter since its behavior is order-independent.
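The arithmetic for the latter is trivial; something like the fragment below,
where the width would come from the cheap prescan (in the backend,
toast_raw_datum_size() gives you the raw datum size without detoasting) and
the bytes-per-lexeme divisor is a pure fudge factor I'm making up here, not
a measured number:

    #define LEXEME_WIDTH_GUESS  8       /* guessed avg bytes per lexeme */
    #define MIN_TRACKED         128     /* floor so tiny samples still work */

    static int
    estimate_track_size(size_t max_tsvector_bytes)
    {
        size_t  est = max_tsvector_bytes / LEXEME_WIDTH_GUESS;

        return (est < MIN_TRACKED) ? MIN_TRACKED : (int) est;
    }

Since the estimate depends only on the single widest tsvector in the sample,
you get the same list size no matter what order the rows are visited in,
which is what makes it order-independent.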
regards, tom lane