Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> The data structure I'd suggest is a simple array of pointers
>>> to the underlying hash table entries. Since you have a predetermined
>>> maximum number of lexemes to track, you can just palloc the array once
>>> --- you don't need the expansibility properties of a list.
>
>> The number of lexemes isn't predetermined. It's 2 * (longest tsvector
>> seen so far), and we don't know beforehand how long the longest tsvector is.
>
> Hmm, I had just assumed without looking too closely that it was stats
> target times a fudge factor. What is the rationale for doing it as
> above? I don't think I like the idea of the limit varying over the
> course of the scan --- that means that lexemes in different places
> in the input will have significantly different probabilities of
> surviving to the final result.
Well, clearly if the list is smaller than the longest tsvector,
inserting all the elements of that long tsvector will flush out every
other entry. Or, if we throw away the newly inserted entries instead,
some elements will never get a chance to climb up the list. I'm not
sure where the "times two" figure comes from, maybe it's just a fudge
factor, but the bottom line is that the minimum size needed depends on
the size of the longest tsvector.
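
To make the sizing argument concrete, here's a rough standalone sketch
of the idea (not the actual patch code; the names, the linear search
standing in for the hash table, and the prune-one-at-a-time policy are
all simplifications):

/*
 * Sketch only: track lexeme frequencies with a cap that grows to
 * 2 * (longest tsvector seen so far), evicting the least-frequent
 * entry when the cap is hit.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
	char	   *lexeme;
	int			count;
} LexemeStat;

static LexemeStat *track = NULL;	/* tracked entries */
static int	n_tracked = 0;
static int	track_max = 0;			/* cap: 2 * longest tsvector seen */

static int
cmp_count_desc(const void *a, const void *b)
{
	return ((const LexemeStat *) b)->count - ((const LexemeStat *) a)->count;
}

/* Feed one "tsvector" (array of lexemes) into the tracker. */
static void
track_lexemes(char **lexemes, int nlexemes)
{
	int			i;

	/* Grow the cap if this tsvector is the longest seen so far. */
	if (2 * nlexemes > track_max)
	{
		track_max = 2 * nlexemes;
		track = realloc(track, track_max * sizeof(LexemeStat));
	}

	for (i = 0; i < nlexemes; i++)
	{
		int			j;

		/* Linear search stands in for the hash-table lookup. */
		for (j = 0; j < n_tracked; j++)
		{
			if (strcmp(track[j].lexeme, lexemes[i]) == 0)
			{
				track[j].count++;
				break;
			}
		}
		if (j == n_tracked)
		{
			if (n_tracked == track_max)
			{
				/*
				 * Cap reached: drop the least-frequent entry.  Since the cap
				 * is at least 2 * nlexemes, at most nlexemes old entries can
				 * be evicted here, so a single long tsvector can't flush the
				 * whole list.
				 */
				qsort(track, n_tracked, sizeof(LexemeStat), cmp_count_desc);
				n_tracked--;
			}
			track[n_tracked].lexeme = lexemes[i];
			track[n_tracked].count = 1;
			n_tracked++;
		}
	}
}

int
main(void)
{
	char	   *doc1[] = {"postgres", "index", "hash"};
	char	   *doc2[] = {"postgres", "stats", "lexeme", "hash", "analyze"};
	int			i;

	track_lexemes(doc1, 3);
	track_lexemes(doc2, 5);

	for (i = 0; i < n_tracked; i++)
		printf("%s: %d\n", track[i].lexeme, track[i].count);
	return 0;
}

With a cap smaller than 2 * nlexemes you can see either failure mode
Tom asked about: the long tsvector wipes the list, or its own lexemes
never get tracked, depending on which side you evict from.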
(Jan is offline until Saturday...)
--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com