From: Tom Lane
Subject: Re: tsvector pg_stats seems quite a bit off.
Msg-id: 16212.1275014877@sss.pgh.pa.us
In response to: Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List: pgsql-hackers

Jan Urbański <wulczer@wulczer.org> writes:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't there seem to be something wrong in
>> how the statistics are collected?

> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
> ts_typanalyze.c

I re-scanned that paper and realized that there is indeed something
wrong with the way we are doing it.  The paper says (last sentence in
the definition of the algorithm, section 4.2):
    When a user requests a list of items with threshold s, we output
    those entries in D where f >= (s-e)N.

What we are actually doing is emitting every entry with f >= 2.  Since
e is fixed at 1/w, this effectively means that we are setting s to be
only fractionally greater than e, which means very large relative errors
in the estimates.
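
To put rough numbers on that (these are made-up figures for
illustration, not the actual values in ts_typanalyze.c):

#include <stdio.h>

int
main(void)
{
    double  w = 1000.0;     /* assumed bucket width */
    double  N = 1000000.0;  /* assumed number of lexemes scanned */
    double  e = 1.0 / w;    /* the paper's error bound, e = 1/w */

    /*
     * Emitting every entry with f >= 2 is equivalent to choosing s such
     * that (s - e) * N = 2, i.e. s = e + 2/N --- barely above e itself.
     */
    double  s = e + 2.0 / N;

    printf("e = %g, effective s = %g, e is %.1f%% of s\n",
           e, s, 100.0 * e / s);
    return 0;
}

That prints e = 0.001, effective s = 0.001002: the guaranteed error
bound is nearly the whole threshold.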

Or, if you want it explained another way: we are emitting words whose f
is very small and whose delta is very large, representing items that got
added to the scan very late.  These really shouldn't be there because
their true frequency is probably a lot less than the intended threshold.
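
In the paper's bookkeeping, an entry first seen while scanning bucket b
is stored with delta = b - 1, and all we know afterwards is that its
true count lies in [f, f + delta].  A hypothetical late arrival, using
the same made-up w = 1000 and N = 1000000 as above:

#include <stdio.h>

int
main(void)
{
    /* A made-up entry inserted in the very last bucket of the scan */
    double  N = 1000000.0;  /* lexemes scanned */
    int     f = 2;          /* occurrences counted since insertion */
    int     delta = 999;    /* current bucket number minus one */

    /* The algorithm only guarantees f <= true count <= f + delta */
    printf("claimed frequency %g, but true frequency may be up to %g\n",
           f / N, (f + delta) / N);
    return 0;
}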

The net effect of this is, first, that there are a lot of rather useless
entries in the MCV list whose claimed frequency is quite small, as
little as two occurrences.  Their true frequency could be quite a bit
more.  What's even worse is that we believe that the minimum of these
claimed frequencies is a reliable upper bound for the frequencies of
items not listed in the MCV list.  Both of these errors are manifest
in Jesper's description of his case, and they're also easy to reproduce
if you just take stats on any reasonable-size collection of documents.
Cranking up the stats target actually makes it worse, not better, since
low-frequency items are then more likely to get into the MCV list.

So I think we have to fix this.  The right thing is to select s and e
values that are actually meaningful in the paper's terms, then compute
w from that, and be sure to enforce the minimum f value specified in
the algorithm ... i.e., don't be afraid to throw away values in the
final D table.
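
For concreteness, here is a rough sketch of that selection and final
pruning (the specific s, the e = s/10 rule of thumb, and the struct are
all illustrative assumptions on my part, not a patch):

#include <math.h>
#include <stdio.h>

typedef struct
{
    const char *lexeme;
    int         f;          /* count since the entry was inserted */
} DictEntry;

int
main(void)
{
    double  s = 0.002;                  /* requested support threshold */
    double  e = s / 10.0;               /* keep the error well below s */
    int     w = (int) ceil(1.0 / e);    /* bucket width follows from e */
    double  N = 1000000.0;              /* lexemes scanned */
    double  cutoff = (s - e) * N;       /* the paper's emission rule */

    DictEntry   d[] = {{"common", 2400}, {"rare", 3}};
    int         i;

    printf("s = %g, e = %g, w = %d, cutoff f >= %g\n", s, e, w, cutoff);
    for (i = 0; i < 2; i++)
    {
        if (d[i].f >= cutoff)
            printf("keep %s (est. frequency %g)\n",
                   d[i].lexeme, d[i].f / N);
        else
            printf("throw away %s\n", d[i].lexeme);
    }
    return 0;
}

With those numbers the cutoff is f >= 1800, so "common" survives with an
estimated frequency of 0.0024 and "rare" is discarded rather than being
reported with a meaninglessly small count.
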
        regards, tom lane

