Re: tsvector pg_stats seems quite a bit off. - Mailing list pgsql-hackers
From: Jan Urbański
Subject: Re: tsvector pg_stats seems quite a bit off.
Date:
Msg-id: 4C0021B7.2090604@wulczer.org
In response to: Re: tsvector pg_stats seems quite a bit off. (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 28/05/10 04:47, Tom Lane wrote:
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it. The paper says (last sentence in
> the definition of the algorithm, section 4.2):
>
>     When a user requests a list of items with threshold s, we output
>     those entries in D where f >= (s-e)N.
>
> What we are actually doing is emitting every entry with f >= 2. Since
> e is fixed at 1/w, this effectively means that we are setting s to be
> only fractionally greater than e, which means very large relative errors
> in the estimates.

I gave it a thought and reread the paper, but since I already blundered
once, please verify my reasoning.

We follow the algorithm as written; the trouble starts when we want to
output the result. The paper says which items from the D structure
should be returned when the user asks for items with frequencies higher
than a threshold s. What we want to put in the statistics table are
items accompanied by their frequencies, so we need to do some reasoning
before we can construct the result.

Say we have an item I with count f (taken from our D structure), and let
the total number of entries be N. The question is: what is the highest
threshold the user could specify that would still make us output this
element? From

    f >= (s - e) * N

we get

    s <= (f / N) + e

So item I is output whenever the user asks for items occurring with
frequency (f / N) + e or lower. This means the corresponding value in
the statistics entry should be

    <I, (f / N) + e>

The thing is, this doesn't change much, because currently we are putting
(f / N) there, and e is set to 1 / (stats_target * 10), so the
difference would not be dramatic.

> Or, if you want it explained another way: we are emitting words whose f
> is very small and whose delta is very large, representing items that got
> added to the scan very late. These really shouldn't be there because
> their true frequency is probably a lot less than the intended threshold.
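[The threshold inversion above can be checked numerically. A minimal sketch in Python using exact rational arithmetic; the concrete values of f, N and e are made up for illustration:]

```python
from fractions import Fraction

# Made-up values: item I was counted f = 50 times out of N = 1000
# entries, with error parameter e = 1/100 (e = 1/w in the paper).
N = 1000
f = 50
e = Fraction(1, 100)

# Highest threshold s that still outputs I under the paper's rule
# f >= (s - e) * N, i.e. s <= f/N + e:
s_max = Fraction(f, N) + e

# At s = s_max the output rule fires; just above it, it does not.
assert f >= (s_max - e) * N
assert not (f >= (s_max + Fraction(1, 10**6) - e) * N)

print(float(s_max))  # 0.06
```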
Well, the idea of the algorithm is that if their true frequency were
bigger, they would have appeared earlier and survived the pruning, as I
understand it.

> The net effect of this is first that there are a lot of rather useless
> entries in the MCV list whose claimed frequency is quite small, like as
> little as two occurrences. Their true frequency could be quite a bit
> more. What's even worse is that we believe that the minimum of these
> claimed frequencies is a reliable upper bound for the frequencies of
> items not listed in the MCV list.

Per the algorithm it *is* the upper bound, if I got my maths right: the
last item in the list would not be returned if the requested frequency
were higher than the one associated with that item.

> So I think we have to fix this. The right thing is to select s and e
> values that are actually meaningful in the terms of the paper, then
> compute w from that, and be sure to enforce the minimum f value
> specified in the algorithm ... ie, don't be afraid to throw away values
> in the final D table.

I think we should definitely prune the table one last time, in the very
probable case that the loop ended somewhere in between two regular
prunes. As for selecting the algorithm's parameters: we don't get to
select s. We do get to select e, but that's it.

I have a feeling that our problems are caused by the fact that the
algorithm tries to answer the question "which elements occur more
frequently than X", whereas we actually want the answer to "what's the
frequency of each element". I've almost convinced myself that the
transformation between the answers to these two questions exists, but
this trouble report keeps giving me doubts.

Cheers,
Jan
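[For reference, a self-contained sketch of the lossy counting algorithm under discussion. This is illustrative Python, not the actual ts_typanalyze.c code, and the stream and parameter values are made up. It does one last prune after the loop ends and reports (f / N) + e as each surviving item's frequency, as proposed in the message:]

```python
import math

def lossy_count(stream, e):
    """Sketch of Manku & Motwani's lossy counting (not the PostgreSQL code).

    D maps item -> [f, delta]; w = ceil(1/e) is the bucket width.
    """
    w = math.ceil(1 / e)
    D = {}
    N = 0
    for item in stream:
        N += 1
        bucket = math.ceil(N / w)          # current bucket id, b_current
        if item in D:
            D[item][0] += 1
        else:
            D[item] = [1, bucket - 1]      # delta = b_current - 1
        if N % w == 0:                     # prune at each bucket boundary
            D = {k: v for k, v in D.items() if v[0] + v[1] > bucket}
    # One last prune, in case the stream ended between two regular prunes.
    bucket = math.ceil(N / w)
    D = {k: v for k, v in D.items() if v[0] + v[1] > bucket}
    # Report (f / N) + e: the highest threshold s that would output the item.
    return {k: v[0] / N + e for k, v in D.items()}

# Made-up stream: true frequencies 0.60, 0.25, 0.10, 0.05.
stream = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5
freqs = lossy_count(stream, e=0.05)
# Reported frequencies overestimate the true ones by at most e,
# e.g. freqs["a"] is 0.60 + 0.05 = 0.65 here.
```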