Re: tsvector pg_stats seems quite a bit off. - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: tsvector pg_stats seems quite a bit off.
Date:
Msg-id: 16212.1275014877@sss.pgh.pa.us
In response to: Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
Responses: Re: tsvector pg_stats seems quite a bit off.
           Re: tsvector pg_stats seems quite a bit off.
           Re: tsvector pg_stats seems quite a bit off.
List: pgsql-hackers
Jan Urbański <wulczer@wulczer.org> writes:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't there seem to be something wrong in
>> how the statistics are collected?

> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
> ts_typanalyze.c

I re-scanned that paper and realized that there is indeed something
wrong with the way we are doing it. The paper says (last sentence in the
definition of the algorithm, section 4.2):

	When a user requests a list of items with threshold s, we output
	those entries in D where f >= (s - e)N.

What we are actually doing is emitting every entry with f >= 2. Since e
is fixed at 1/w, this effectively means that we are setting s to be only
fractionally greater than e, which means very large relative errors in
the estimates.

Or, if you want it explained another way: we are emitting words whose f
is very small and whose delta is very large, representing items that got
added to the scan very late. These really shouldn't be there because
their true frequency is probably a lot less than the intended threshold.

The net effect of this is first that there are a lot of rather useless
entries in the MCV list whose claimed frequency is quite small, like as
little as two occurrences. Their true frequency could be quite a bit
more. What's even worse is that we believe that the minimum of these
claimed frequencies is a reliable upper bound for the frequencies of
items not listed in the MCV list.

Both of these errors are manifest in Jesper's description of his case,
and they're also easy to reproduce if you just take stats on any
reasonable-size collection of documents. Cranking up the stats target
actually makes it worse not better, since low-frequency items are then
more likely to get into the MCV list.

So I think we have to fix this. The right thing is to select s and e
values that are actually meaningful in the terms of the paper, then
compute w from that, and be sure to enforce the minimum f value
specified in the algorithm ... ie, don't be afraid to throw away values
in the final D table.

			regards, tom lane
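For readers following along, here is a rough, self-contained sketch of the
Lossy Counting algorithm from the Manku & Motwani paper cited above. It is
not the ts_typanalyze.c code: items are plain ints read from stdin, the
dictionary D is a naive fixed-size array, and the s and e values are made-up
example parameters. The part relevant to this discussion is the final loop,
which applies the paper's output rule f >= (s - e)N rather than a fixed
f >= 2 cutoff.

#include <stdio.h>

typedef struct
{
	int		item;	/* the tracked value */
	int		f;		/* observed count since the entry was created */
	int		delta;	/* maximum possible undercount at insertion time */
} LCEntry;

#define MAX_ENTRIES 4096	/* crude fixed cap instead of a growable table */

static LCEntry D[MAX_ENTRIES];
static int	n_entries = 0;

static LCEntry *
lookup(int item)
{
	int		i;

	for (i = 0; i < n_entries; i++)
		if (D[i].item == item)
			return &D[i];
	return NULL;
}

/* At each bucket boundary, drop entries that can no longer matter. */
static void
prune(int bucket)
{
	int		i,
			j = 0;

	for (i = 0; i < n_entries; i++)
		if (D[i].f + D[i].delta > bucket)
			D[j++] = D[i];
	n_entries = j;
}

int
main(void)
{
	double	s = 0.01;		/* support threshold -- made-up example value */
	double	e = 0.001;		/* error bound; bucket width w = 1/e */
	int		w = (int) (1.0 / e);
	int		N = 0;			/* number of stream elements seen so far */
	int		x,
			i;

	while (scanf("%d", &x) == 1)
	{
		LCEntry    *entry;
		int			bucket;

		N++;
		bucket = (N + w - 1) / w;	/* ceil(N / w), the current bucket id */

		entry = lookup(x);
		if (entry)
			entry->f++;
		else if (n_entries < MAX_ENTRIES)
		{
			D[n_entries].item = x;
			D[n_entries].f = 1;
			D[n_entries].delta = bucket - 1;
			n_entries++;
		}

		if (N % w == 0)
			prune(bucket);
	}

	/*
	 * The paper's final step: report only entries with f >= (s - e) * N.
	 * Emitting everything with f >= 2 instead amounts to choosing an s
	 * that is barely above e, which is the problem described above.
	 */
	for (i = 0; i < n_entries; i++)
		if (D[i].f >= (s - e) * N)
			printf("%d\tf=%d\tdelta=%d\n", D[i].item, D[i].f, D[i].delta);

	return 0;
}

With the (s - e)N cutoff in place, late-arriving entries with small f and
large delta are simply not reported, which is the behaviour the paper
intends and the email above is asking for.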