Re: gsoc, text search selectivity and dllist enhancments - Mailing list pgsql-hackers

From Tom Lane
Subject Re: gsoc, text search selectivity and dllist enhancments
Date
Msg-id 19287.1215728376@sss.pgh.pa.us
In response to Re: gsoc, text search selectivity and dllist enhancments  (Jan Urbański <j.urbanski@students.mimuw.edu.pl>)
Responses Re: gsoc, text search selectivity and dllist enhancments  (Jan Urbański <j.urbanski@students.mimuw.edu.pl>)
List pgsql-hackers
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Tom Lane wrote:
>> The way I think it ought to work is that the number of lexemes stored in
>> the final pg_statistic entry is statistics_target times a constant
>> (perhaps 100).  I don't like having it vary depending on tsvector width

> I think the existing code puts at most statistics_target elements in a 
> pg_statistic tuple. In compute_minimal_stats() num_mcv starts with 
> stats->attr->attstattarget and is adjusted only downwards.
> My original thought was to keep that property for tsvectors (i.e. store 
> at most statistics_target lexemes) and advise people to set it high for 
> their tsvector columns (e.g. 100x their default).

Well, (1) the normal measure would be statistics_target *tsvectors*,
and we'd have to translate that to lexemes somehow; my proposal is just
to use a fixed constant instead of tsvector width as in your original
patch.  And (2) storing only statistics_target lexemes would be
uselessly small and would guarantee that people *have to* set a custom
target on tsvector columns to get useful results.  Obviously broken
defaults are not my bag.
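
For illustration only, the fixed-constant sizing rule discussed above could be sketched as follows. This is not code from the patch or from PostgreSQL: the names `LEXEMES_PER_TARGET` and `lexeme_slots` are hypothetical, and the value 100 is just the "perhaps 100" floated earlier in the thread.

```python
# Hypothetical constant; the thread only suggests "perhaps 100".
LEXEMES_PER_TARGET = 100


def lexeme_slots(statistics_target: int,
                 multiplier: int = LEXEMES_PER_TARGET) -> int:
    """Return how many lexemes to keep in the statistics entry.

    The count depends only on the statistics target and a fixed
    multiplier, not on the width of the sampled tsvectors.
    """
    if statistics_target < 0:
        raise ValueError("statistics_target must be non-negative")
    return statistics_target * multiplier


# A target of 10 yields 1000 lexeme slots under this rule; a multiplier
# of 1 models the "store only statistics_target lexemes" alternative.
slots_fixed_constant = lexeme_slots(10)
slots_target_only = lexeme_slots(10, multiplier=1)
```

With a multiplier of 1 the entry holds only as many lexemes as the target itself, which is the case described above as too small to be useful without a custom per-column setting.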

> Also, the existing code decides which elements are worth storing as most 
> common ones by discarding those that are not frequent enough (that's 
> where num_mcv can get adjusted downwards). I mimicked that for lexemes 
> but maybe it just doesn't make sense?

Well, that's not unreasonable either, if you can come up with a
reasonable definition of "not frequent enough"; but that adds another
variable to the discussion.
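
As a rough sketch of what a frequency cutoff might look like, the snippet below keeps at most `max_slots` of the most common lexemes and drops any that fall under a threshold. The function name, the `min_count` parameter, and its default are assumptions made for illustration; the thread leaves the definition of "not frequent enough" open, and this is not how compute_minimal_stats() is implemented.

```python
from collections import Counter


def most_common_lexemes(tsvectors, max_slots, min_count=2):
    """Pick up to max_slots lexemes, discarding infrequent ones.

    tsvectors: iterable of lexeme sequences, one per sampled row.
    min_count: hypothetical cutoff; a lexeme seen in fewer sampled
               rows than this is treated as "not frequent enough".
    """
    counts = Counter()
    for vec in tsvectors:
        # Count each lexeme at most once per tsvector.
        counts.update(set(vec))
    # Take the top max_slots candidates, then apply the cutoff, so the
    # result can only shrink below max_slots, never grow past it.
    return [(lex, n) for lex, n in counts.most_common(max_slots)
            if n >= min_count]


sample = [["cat", "dog"], ["cat", "fish"], ["cat", "dog"]]
kept = most_common_lexemes(sample, max_slots=3)
```

Here `kept` contains "cat" and "dog" but not "fish", which appears in only one sampled row; the number of stored entries is adjusted downwards from the three available slots to two.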
        regards, tom lane

