Re: gsoc, text search selectivity and dllist enhancments - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: gsoc, text search selectivity and dllist enhancments
Date
Msg-id 486E77E8.6010404@enterprisedb.com
Whole thread Raw
In response to Re: gsoc, text search selectivity and dllist enhancments  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: gsoc, text search selectivity and dllist enhancments  (Jan Urbański <j.urbanski@students.mimuw.edu.pl>)
List pgsql-hackers
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> The data structure I'd suggest is a simple array of pointers
>>> to the underlying hash table entries.  Since you have a predetermined
>>> maximum number of lexemes to track, you can just palloc the array once
>>> --- you don't need the expansibility properties of a list. 
> 
>> The number of lexemes isn't predetermined. It's 2 * (longest tsvector 
>> seen so far), and we don't know beforehand how long the longest tsvector is.
> 
> Hmm, I had just assumed without looking too closely that it was stats
> target times a fudge factor.  What is the rationale for doing it as
> above?  I don't think I like the idea of the limit varying over the
> course of the scan --- that means that lexemes in different places
> in the input will have significantly different probabilities of
> surviving to the final result.

Well, clearly if the list is smaller than the longest tsvector, 
inserting all elements of that long tsvector will flush out all other 
entries. Or if we throw away the newly inserted entries, some elements 
will never have a chance to climb up the list. I'm not sure where the 
"times two" figure comes from, maybe it's just a fudge factor, but the 
bottom line is that the minimum size needed depends on the size of the 
longest tsvector.

(Jan is offline until Saturday...)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [COMMITTERS] pgsql: Fix a couple of bugs in win32 shmem name generation: * Don't cut
Next
From: "David E. Wheeler"
Date:
Subject: Re: PATCH: CITEXT 2.0