Thread: tsvector pg_stats seems quite a bit off.
Hi. I am working on getting full-text search to work and have come across something I think looks a bit strange. The document base is around 350,000 documents and I have set the statistics target on the tsvector column to 1000, since 100 seemed way off.

# ANALYZE verbose reference (document_tsvector);
INFO:  analyzing "reference"
INFO:  "reference": scanned 14486 of 14486 pages, containing 350174 live rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
ANALYZE

OK, so analyze examined almost all rows. Looking into "most_common_freqs" I find:

# select count(unnest) from (select unnest(most_common_freqs) from pg_stats where attname = 'document_tsvector') as foo;
 count
-------
  2810
(1 row)

But the distribution is very "flat" at the end; the last 128 values are exactly 1.00189e-05, which means that any term sitting outside the array would get an estimate of

1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

So far I have no idea whether this is bad or good, so here is a sample run of something that sits outside the "most_common_vals" array:

# explain analyze select id from efam.reference where document_tsvector @@ to_tsquery('searchterm') order by id limit 2000;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

So the query planner estimates that there are 2 rows -- exactly as predicted, works as expected -- but in fact 1612 rows match. ANALYZE sampled 6 of 7 rows in the table, and this term exists in 1612/350174 rows ~ freq 0.0046, which is way higher than the lower bound of 1.00189e-05; it should have been sitting somewhere around the middle of the 2810 values of the histogram collected.

So "most_common_vals" seems to contain a lot of values that should never have been kept in favour of other values that are more common.

In practice, just cranking the statistics target up high enough seems to solve the problem, but doesn't there seem to be something wrong in how the statistics are collected?

# select version();
                                                           version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit

Jesper
--
Jesper
Excerpts from Jesper Krogh's message of mié may 19 15:01:18 -0400 2010:

> But the distribution is very "flat" at the end; the last 128 values are
> exactly 1.00189e-05, which means that any term sitting outside the array
> would get an estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

I don't know if this is related, but tsvector stats are computed and stored per term, not per datum. This is different from all other datatypes. Maybe there's code somewhere that's assuming per-datum and coming up with the wrong estimates? Or maybe the tsvector-specific code contains a bug somewhere; maybe a rounding error?

--
Álvaro Herrera <alvherre@alvh.no-ip.org>
On 19/05/10 21:01, Jesper Krogh wrote:
> The document base is around 350,000 documents and I have set the
> statistics target on the tsvector column to 1000, since 100 seemed
> way off.

So for tsvectors the statistics target means more or less "at any time track at most 10 * <target> lexemes simultaneously", where "track" means keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors, and the factor of 10 is meant to approximate the average number of unique lexemes in a tsvector. If your documents are very large, this might not be a good approximation.

> # ANALYZE verbose reference (document_tsvector);
> INFO:  analyzing "reference"
> INFO:  "reference": scanned 14486 of 14486 pages, containing 350174 live
> rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
> ANALYZE
>
> OK, so analyze examined almost all rows. Looking into
> "most_common_freqs" I find
> # select count(unnest) from (select unnest(most_common_freqs) from
> pg_stats where attname = 'document_tsvector') as foo;
>  count
> -------
>   2810
> (1 row)

So the size of the most_common_freqs and most_common_vals rows in pg_statistic for tsvectors has an upper bound of <stats-target> * 10 (for the same reasons as mentioned before) and holds lexemes (not whole tsvectors). What also happens is that lexemes that were seen only once while going through the analysed set are discarded, so that's why you can actually get fewer entries in these arrays, even if your document set is big.

> But the distribution is very "flat" at the end; the last 128 values are
> exactly 1.00189e-05, which means that any term sitting outside the array
> would get an estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

Yeah, this might mean that you could try cranking up the stats target a lot, to make the set of simultaneously tracked lexemes larger (it will cost time and memory during analyse though). If the documents have completely different contents, what can happen is that almost all lexemes are seen only a few times and get removed during the pruning of the working set. I have seen similar behaviour while working on the typanalyze function for tsvectors.

> So far I have no idea whether this is bad or good, so here is a sample
> run of something that sits outside the "most_common_vals" array:
>
> [gathered statistics suck]

> So "most_common_vals" seems to contain a lot of values that should
> never have been kept in favour of other values that are more common.

> In practice, just cranking the statistics target up high enough seems
> to solve the problem, but doesn't there seem to be something wrong in
> how the statistics are collected?

The algorithm to determine the most common vals does not do it accurately. That would require keeping all lexemes from the analysed tsvectors in memory, which would be impractical. If you want to learn more about the algorithm being used, try reading http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments in ts_typanalyze.c

It would be interesting to know the average size of a tsvector in your document set (ie. how many unique lexemes a tsvector has on average).

In general, the tsvector typanalyze function is designed to suck less than the constant factor that had been used previously, but it only works really well for the most common lexemes (thus preventing most gross misestimates).
I'm not very surprised it misses the difference between 1612/350174 and 4/350174, and I'm quite happy that it gets it right if you set the stats target really high :o)

There's always the possibility that there's some stupid bug there, but I think you just set your expectations too high for the tsvector typanalyze function. If you could come up with a better way of doing tsvector stats, that would be awesome - currently it's just doing its best to prevent the most outrageous errors.

Cheers,
Jan
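A minimal, self-contained sketch of the Lossy Counting bookkeeping described above may help make the pruning behaviour concrete. It uses plain ints instead of lexemes and a linear array instead of a hash table, and MAX_TRACKED is just an arbitrary stand-in for the paper's size bound; the real implementation lives in ts_typanalyze.c and differs in the details.

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int		value;			/* the element (a lexeme in the real code) */
	int		f;				/* occurrence count so far */
	int		delta;			/* maximum possible undercount */
} Triple;

#define MAX_TRACKED 8192	/* arbitrary stand-in for "a few times w" */

static Triple D[MAX_TRACKED];
static int	d_len = 0;

static Triple *
lookup(int value)
{
	for (int i = 0; i < d_len; i++)
		if (D[i].value == value)
			return &D[i];
	return NULL;
}

/* drop entries whose count plus error can no longer exceed the bucket number */
static void
prune(int b_current)
{
	int		j = 0;

	for (int i = 0; i < d_len; i++)
		if (D[i].f + D[i].delta > b_current)
			D[j++] = D[i];
	d_len = j;
}

/* process a stream of n elements with bucket width w */
static void
lossy_count(const int *stream, int n, int w)
{
	int		b_current = 1;

	for (int i = 0; i < n; i++)
	{
		Triple	   *t = lookup(stream[i]);

		if (t)
			t->f++;
		else if (d_len < MAX_TRACKED)
		{
			D[d_len].value = stream[i];
			D[d_len].f = 1;
			D[d_len].delta = b_current - 1;
			d_len++;
		}

		if ((i + 1) % w == 0)	/* end of a bucket: prune, then move on */
		{
			prune(b_current);
			b_current++;
		}
	}
	/* D now holds the candidates; which of them to emit is the question below */
}

int
main(void)
{
	int		stream[20000];

	/* skewed toy input, just to exercise the loop */
	for (int i = 0; i < 20000; i++)
		stream[i] = rand() % (1 + rand() % 500);

	lossy_count(stream, 20000, 1000);	/* bucket width chosen arbitrarily here */

	for (int i = 0; i < d_len; i++)
		if (D[i].f >= 100)
			printf("value %d: f = %d, delta = %d\n", D[i].value, D[i].f, D[i].delta);
	return 0;
}

Every w elements the D structure is pruned, so an element that arrives late and is seen only once or twice is exactly the kind of entry that survives with a small f and a large delta -- which is what the rest of the thread turns out to be about.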
On 26/05/2010, at 01.16, Jan Urbański <wulczer@wulczer.org> wrote:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> The document base is around 350,000 documents and I have set the
>> statistics target on the tsvector column to 1000, since 100 seemed
>> way off.
>
> So for tsvectors the statistics target means more or less "at any time
> track at most 10 * <target> lexemes simultaneously", where "track" means
> keeping them in memory while going through the tuples being analysed.
>
> Remember that the measure is in lexemes, not whole tsvectors, and the
> factor of 10 is meant to approximate the average number of unique
> lexemes in a tsvector. If your documents are very large, this might not
> be a good approximation.

I just did an avg(length(document_tsvector)), which is 154. Doc count is 1.3m now in my sample set.

>> But the distribution is very "flat" at the end; the last 128 values
>> are exactly 1.00189e-05, which means that any term sitting outside
>> the array would get an estimate of
>> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
>
> Yeah, this might mean that you could try cranking up the stats target a
> lot, to make the set of simultaneously tracked lexemes larger (it will
> cost time and memory during analyse though). If the documents have
> completely different contents, what can happen is that almost all
> lexemes are seen only a few times and get removed during the pruning of
> the working set. I have seen similar behaviour while working on the
> typanalyze function for tsvectors.

I think I would prefer something less "magic". I can increase the statistics target and get more reliable data, but that also increases the number of tuples being picked out for analysis, which is really time consuming.

It also means that what gets stored as the lower bound of the histogram isn't anywhere near the real lower bound, more like the lower bound of the "artificial" histogram that remained after the last pruning.

I would suggest that the pruning in the end should be aware of this. Perhaps by keeping track of the least frequent value that never got pruned, and using that as the cutoff for the last pruning and as the lower bound?

Thanks a lot for the explanation; it fits fairly well with why I couldn't construct a simple test set that exhibits the problem.

>> So far I have no idea whether this is bad or good, so here is a sample
>> run of something that sits outside the "most_common_vals" array:
>>
>> [gathered statistics suck]
>
>> So "most_common_vals" seems to contain a lot of values that should
>> never have been kept in favour of other values that are more common.
>
>> In practice, just cranking the statistics target up high enough seems
>> to solve the problem, but doesn't there seem to be something wrong in
>> how the statistics are collected?
>
> The algorithm to determine the most common vals does not do it
> accurately. That would require keeping all lexemes from the analysed
> tsvectors in memory, which would be impractical. If you want to learn
> more about the algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments
> in ts_typanalyze.c

I'll do some reading.

Jesper
Jan Urbański <wulczer@wulczer.org> writes:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> In practice, just cranking the statistics target up high enough seems
>> to solve the problem, but doesn't there seem to be something wrong in
>> how the statistics are collected?

> The algorithm to determine the most common vals does not do it
> accurately. That would require keeping all lexemes from the analysed
> tsvectors in memory, which would be impractical. If you want to learn
> more about the algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments
> in ts_typanalyze.c

I re-scanned that paper and realized that there is indeed something wrong with the way we are doing it. The paper says (last sentence in the definition of the algorithm, section 4.2):

	When a user requests a list of items with threshold s, we output
	those entries in D where f >= (s-e)N.

What we are actually doing is emitting every entry with f >= 2. Since e is fixed at 1/w, this effectively means that we are setting s to be only fractionally greater than e, which means very large relative errors in the estimates.

Or, if you want it explained another way: we are emitting words whose f is very small and whose delta is very large, representing items that got added to the scan very late. These really shouldn't be there because their true frequency is probably a lot less than the intended threshold.

The net effect of this is first that there are a lot of rather useless entries in the MCV list whose claimed frequency is quite small, like as little as two occurrences. Their true frequency could be quite a bit more. What's even worse is that we believe that the minimum of these claimed frequencies is a reliable upper bound for the frequencies of items not listed in the MCV list.

Both of these errors are manifest in Jesper's description of his case, and they're also easy to reproduce if you just take stats on any reasonable-size collection of documents. Cranking up the stats target actually makes it worse not better, since low-frequency items are then more likely to get into the MCV list.

So I think we have to fix this. The right thing is to select s and e values that are actually meaningful in terms of the paper, then compute w from that, and be sure to enforce the minimum f value specified in the algorithm ... ie, don't be afraid to throw away values in the final D table.

			regards, tom lane
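A rough back-of-the-envelope calculation of how far the current behaviour sits from the paper's assumption; the row count and average tsvector length are assumptions reused from elsewhere in the thread, not measurements:

#include <stdio.h>

int
main(void)
{
	double	N = 154.0 * 35017;	/* assumed: avg lexemes per tsvector * rows sampled */
	double	w = 100 * 10;		/* pre-patch bucket width: statistics target (100) * 10 */
	double	e = 1.0 / w;		/* epsilon is fixed at 1/w */

	/* emitting every entry with f >= 2 is the same as using the paper's
	 * rule f >= (s - e) * N with s = e + 2/N */
	double	implied_s = e + 2.0 / N;

	printf("epsilon = %g, implied threshold s = %g\n", e, implied_s);
	printf("e/s = %.4f, whereas the paper assumes e << s\n", e / implied_s);
	return 0;
}

In other words, with the current bucket width the implied threshold s is essentially equal to e, while the paper's error guarantees are stated for e much smaller than s.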
On 28/05/10 04:47, Tom Lane wrote:
> Jan Urbański <wulczer@wulczer.org> writes:
>> On 19/05/10 21:01, Jesper Krogh wrote:
>>> In practice, just cranking the statistics target up high enough seems
>>> to solve the problem, but doesn't there seem to be something wrong in
>>> how the statistics are collected?
>
>> The algorithm to determine the most common vals does not do it
>> accurately. That would require keeping all lexemes from the analysed
>> tsvectors in memory, which would be impractical. If you want to learn
>> more about the algorithm being used, try reading
>> http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments
>> in ts_typanalyze.c
>
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it.

> So I think we have to fix this.

Hm, I'll try to take another look this evening (CEST).

Cheers,
Jan
On 28/05/10 04:47, Tom Lane wrote:
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it. The paper says (last sentence in
> the definition of the algorithm, section 4.2):
>
>	When a user requests a list of items with threshold s, we output
>	those entries in D where f >= (s-e)N.
>
> What we are actually doing is emitting every entry with f >= 2. Since
> e is fixed at 1/w, this effectively means that we are setting s to be
> only fractionally greater than e, which means very large relative errors
> in the estimates.

I gave it a thought and reread the paper, but since I already blundered once, please verify me on this.

We follow the algorithm as written; the trouble starts when we want to output the result. The paper says which items from the D structure should be returned when the user asks for items that have frequencies higher than a threshold s. What we want to put in the statistics table are items accompanied by their frequencies, so we need to do some reasoning before we can construct the result.

Say we have an item I with count f (taken from our D structure). The total number of entries is N. The question would be: what is the highest frequency the user could specify that would still make us output this element? From

	f >= (s - e) * N

we can say it's

	s <= (f / N) + e

So the item would be output for any requested frequency of (f / N) + e or less, which would mean that the corresponding value in the statistics entry should be

	< I, (f / N) + e >

The thing is, this doesn't change much, because currently we are putting (f / N) there, and e is set to 1 / (stats_target * 10), so the difference would not be dramatic.

> Or, if you want it explained another way: we are emitting words whose f
> is very small and whose delta is very large, representing items that got
> added to the scan very late. These really shouldn't be there because
> their true frequency is probably a lot less than the intended threshold.

Well, the idea of the algorithm is that if their frequency had been bigger, they would have appeared earlier and would have survived the pruning, as I understand it.

> The net effect of this is first that there are a lot of rather useless
> entries in the MCV list whose claimed frequency is quite small, like as
> little as two occurrences. Their true frequency could be quite a bit
> more. What's even worse is that we believe that the minimum of these
> claimed frequencies is a reliable upper bound for the frequencies of
> items not listed in the MCV list.

Per the algorithm it *is* the upper bound, if I got my maths correctly. The last item in the list would not be returned if the requested frequency were higher than the one associated with that item.

> So I think we have to fix this. The right thing is to select s and e
> values that are actually meaningful in terms of the paper, then
> compute w from that, and be sure to enforce the minimum f value
> specified in the algorithm ... ie, don't be afraid to throw away values
> in the final D table.

I think we should definitely prune the table one last time, in the very probable case that the loop ended in between two of the regularly scheduled prunes.

As for selecting the algorithm parameters: we don't get to select s. We do get to select e, but that's it. I have a feeling that our problems are caused by the fact that the algorithm tries to answer the question "which elements occur more frequently than X", while we actually want the answer to "what's the frequency of each element".
I've almost convinced myself that the transformation between the answers to these questions exists, but this trouble report keeps giving me doubts.

Cheers,
Jan
Jan Urbański <wulczer@wulczer.org> writes:
> We follow the algorithm as written; the trouble starts when we want to
> output the result. The paper says which items from the D structure
> should be returned when the user asks for items that have frequencies
> higher than a threshold s. What we want to put in the statistics table
> are items accompanied by their frequencies, so we need to do some
> reasoning before we can construct the result.

Well, the estimated frequency is still just f/N. The point is that we must filter out items with small f values because they're probably inaccurate --- in particular, anything with f < eN is completely untrustworthy. I agree that we currently aren't bothering to determine a specific s value, but we probably need to do that in order to have a clear understanding of what we are getting.

The idea that I was toying with is to assume a Zipfian distribution of the input (with some reasonable parameter), and use that to estimate what the frequency of the K'th element will be, where K is the target number of MCV entries or perhaps a bit more. Then use that estimate as the "s" value, and set e = s/10 or so, and then w = 1/e and continue as per the paper. If the eventual filtering results in a lot less than the target number of MCV entries (because the input wasn't so Zipfian), we lose, but at least we have accurate numbers for the entries we kept.

> I think we should definitely prune the table one last time, in the very
> probable case that the loop ended in between two of the regularly
> scheduled prunes.

The paper doesn't say that you need to do that. I suspect if you work through the math, you'll find out that the minimum-f filter skips anything that would have been pruned anyway.

			regards, tom lane
On 28/05/10 22:22, Tom Lane wrote:
> The idea that I was toying with is to assume a Zipfian distribution of
> the input (with some reasonable parameter), and use that to estimate
> what the frequency of the K'th element will be, where K is the target
> number of MCV entries or perhaps a bit more. Then use that estimate as
> the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
> per the paper. If the eventual filtering results in a lot less than the
> target number of MCV entries (because the input wasn't so Zipfian), we
> lose, but at least we have accurate numbers for the entries we kept.

I see what you mean, so the idea would be:

 * assume some value of W as the number of all words in the language
 * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic number
   and st is the statistics target, using Zipf's law
 * set e = s/10 and w = 1/e, that is 10/s
 * perform LC using that value of w
 * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
   is the total number of lexemes processed
 * create the MCELEM entries as (item, f/N)

Now I tried to substitute some numbers there, and so assuming the English language has ~1e6 words, H(W) is around 6.5. Let's assume the statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most probably be stopwords, so we will never see them in the input.

Using the above estimate, s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes.

After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went through 35017 rows (as it would be in Jesper's case, before he raised the stats target from 100 to 1000), we will remove lexemes with f < 0.054 * 35017 * 154, that is f < 291201.37

I wonder what would happen in Jesper's case if we did that... And I wonder how sound that maths is.

>> I think we should definitely prune the table one last time, in the very
>> probable case that the loop ended in between two of the regularly
>> scheduled prunes.
>
> The paper doesn't say that you need to do that. I suspect if you work
> through the math, you'll find out that the minimum-f filter skips
> anything that would have been pruned anyway.

Possible.

Cheers,
Jan
On 2010-05-28 23:47, Jan Urbański wrote:
> On 28/05/10 22:22, Tom Lane wrote:
>> The idea that I was toying with is to assume a Zipfian distribution of
>> the input (with some reasonable parameter), and use that to estimate
>> what the frequency of the K'th element will be, where K is the target
>> number of MCV entries or perhaps a bit more. Then use that estimate as
>> the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
>> per the paper. If the eventual filtering results in a lot less than the
>> target number of MCV entries (because the input wasn't so Zipfian), we
>> lose, but at least we have accurate numbers for the entries we kept.
>
> I see what you mean, so the idea would be:
>
>  * assume some value of W as the number of all words in the language
>  * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
>    number and st is the statistics target, using Zipf's law
>  * set e = s/10 and w = 1/e, that is 10/s
>  * perform LC using that value of w
>  * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
>    is the total number of lexemes processed
>  * create the MCELEM entries as (item, f/N)
>
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words, H(W) is around 6.5. Let's assume the
> statistics target to be 100.
>
> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

I think you should skip the assumption about stop-words; users may use something where they are needed in the index, or have a language other than the typical. (And they don't seem to influence the math that much.)

Isn't it the same "type" of logic that is used for collecting statistics for "array types", say integer arrays and text arrays?

> Using the above estimate, s ends up being 6.5/(100 + 10) = 0.06
>
> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes.

I'm not sure I get this one... does this mean that we prune every time we have collected 167 new datapoints? That would seem too often for me, since that would roughly be once per "row".

> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>
> So assuming that on average a tsvector has 154 elements and that we went
> through 35017 rows (as it would be in Jesper's case, before he raised
> the stats target from 100 to 1000), we will remove lexemes with
> f < 0.054 * 35017 * 154, that is f < 291201.37
>
> I wonder what would happen in Jesper's case if we did that... And I
> wonder how sound that maths is.

If it means that I would get an accurate MCE histogram for all things that have an occurrence in more than 5.4% of the rows (given the samples chosen), then I think that would be really reasonable.

I can "fairly easy" try out patches or do other kinds of testing.

--
Jesper
On 2010-05-28 04:47, Tom Lane wrote:
> Cranking up the stats target actually makes it worse not better, since
> low-frequency items are then more likely to get into the MCV list.

I should have been more precise in the wording. Cranking up the stats target gave me overall a "better plan", but that is because the range in the MCE histogram where the query plan for my sample query tipped from a "Bitmap Index Scan" on the gin index to an "Index Scan" on a btree index actually became reliable. This is more due to the nature of my application and test queries than anything to do with the correctness of the MCE histogram.

So cranking up the statistics target moved the problem to somewhere where it didn't matter that much to me.

--
Jesper
On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words, H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>
> I think you should skip the assumption about stop-words; users may use
> something where they are needed in the index, or have a language other
> than the typical. (And they don't seem to influence the math that much.)

Turns out it has nearly linear influence on the bucket width and the frequency necessary to survive the final pruning. I put some data in a spreadsheet; results below.

> Isn't it the same "type" of logic that is used for collecting statistics
> for "array types", say integer arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on the values of whole rows. ts_typanalyze is the only typanalyze function that takes the trouble of looping over the actual contents of each cell; all the others just compare whole arrays (which means that for a text[] field you will probably get a quite useless MCV entry).

>> Using the above estimate, s ends up being 6.5/(100 + 10) = 0.06
>>
>> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes.
>
> I'm not sure I get this one... does this mean that we prune every time
> we have collected 167 new datapoints? That would seem too often for me,
> since that would roughly be once per "row".

Hm, if we pick s to be 0.06, we say that the K'th word in the English language will have a frequency of 0.06, so if we want to have statistics with an error of s/10, we can prune every 167 lexemes (K is the statistics target, possibly + top_stopwords).

Hm, I am now thinking that maybe this theory is flawed, because tsvectors contain only *unique* words, and Zipf's law is talking about words in documents in general. Normally a word like "the" would appear lots of times in a document, but (even ignoring the fact that it's a stopword and so won't appear at all) in a tsvector it will be present only once. This may or may not be a problem; I'm not sure whether such "squashing" of occurrences as tsvectors do skews the distribution away from Zipfian or not. Anyway, figuring that out would require some more math and thinking, and to fix the problem at hand we can say Zipf is good enough.

>> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>>
>> So assuming that on average a tsvector has 154 elements and that we went
>> through 35017 rows (as it would be in Jesper's case, before he raised
>> the stats target from 100 to 1000), we will remove lexemes with
>> f < 0.054 * 35017 * 154, that is f < 291201.37
>>
>> I wonder what would happen in Jesper's case if we did that... And I
>> wonder how sound that maths is.
>
> If it means that I would get an accurate MCE histogram for all things
> that have an occurrence in more than 5.4% of the rows (given the samples
> chosen), then I think that would be really reasonable.

Here's what the spreadsheet spat out. The variables are:

 * the statistics target
 * top stopwords
 * error factor

where top stopwords is the number of top words in the English language that would be stopwords. You can also think about it as a smudge factor determining how much we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should discard inputs with frequency lower than the frequency of the X'th value in a Zipfian distribution. If you were to write out all English words and their frequencies (according to Zipf's law), the top Y of them would be stopwords. So we want to discard words with frequency lower than that of the (X + Y)'th word, and then we probably want to have some breathing space as well. That cutoff frequency is called s in the LC algorithm.

Error factor determines the relation between s and e, since apparently we want e to be proportional to s (e is the error from the LC algorithm). It directly determines the bucket width, since the larger the bucket, the more accurate the results will be, as there will be less pruning going on.

There are also constants: H(len(eng)) is the harmonic number from Zipf's law, which, assuming 1e6 words in English, is 6.5. The tsvector length and rows-in-sample values are just there to get concrete numbers out. They influence the final pruning frequency, because the rule is f < (s-e)N and N is the total number of lexemes seen.

The results are attached in a text (CSV) file, to preserve formatting. Based on them I'd like to propose both top_stopwords and error_factor to be 100. With your dataset this would mean pruning every 3076 lexemes and discarding from the result all lexemes with < 173507 occurrences. With the statistics target set to 1000 it would change to 16923 and 31546, respectively.

> I can "fairly easy" try out patches or do other kinds of testing.

I'll try to come up with a patch for you to try and fiddle with these values before Monday.

Cheers,
Jan
Jan Urbański <wulczer@wulczer.org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words, H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

> Using the above estimate, s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there. It's not possible for the 100'th most common word to have a frequency as high as 0.06 --- the ones above it presumably have larger frequencies, which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the more complex statements of Zipf's law. The interesting property is the rule "the k'th most common element occurs 1/k as often as the most common one". So if you suppose the most common lexeme has frequency 0.1, the 100'th most common should have frequency around 0.001. That's pretty crude of course but it seems like the right ballpark.

			regards, tom lane
Jan Urbański <wulczer@wulczer.org> writes:
> Hm, I am now thinking that maybe this theory is flawed, because tsvectors
> contain only *unique* words, and Zipf's law is talking about words in
> documents in general. Normally a word like "the" would appear lots of
> times in a document, but (even ignoring the fact that it's a stopword
> and so won't appear at all) in a tsvector it will be present only once.
> This may or may not be a problem; I'm not sure whether such "squashing"
> of occurrences as tsvectors do skews the distribution away from Zipfian
> or not.

Well, it's still going to approach a Zipfian distribution over a large number of documents. In any case we are not really depending on Zipf's law heavily with this approach. The worst-case result if it's wrong is that we end up with an MCE list shorter than our original target. I suggest we could try this and see if we notice that happening a lot.

			regards, tom lane
On 29/05/10 17:09, Tom Lane wrote:
> Jan Urbański <wulczer@wulczer.org> writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words, H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>
>> Using the above estimate, s ends up being 6.5/(100 + 10) = 0.06
>
> There is definitely something wrong with your math there. It's not
> possible for the 100'th most common word to have a frequency as high
> as 0.06 --- the ones above it presumably have larger frequencies,
> which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and error_factor = 10 would mean bucket_width = 7150 and a final prune value of 6787.

Jan
Jan Urbański <wulczer@wulczer.org> writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there. It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.

> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law with a 1/k distribution, the first frequency has to be about 0.07 to make the total come out to 1.0 for a reasonable number of words. So we could use s = 0.07 / K when we wanted a final list of K words. Some people (including the LC paper) prefer a higher exponent, ie 1/k^S with S around 1.25. That makes the F1 value around 0.22, which seems awfully high for the type of data we're working with, so I think the 1/k rule is probably what we want here.

			regards, tom lane
On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański <wulczer@wulczer.org> writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22, which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an exponent of 1: the frequency of the word with rank K is

	freq(K) = 1 / (K * H(W))

where H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and W is the number of words in English. Then I took the nth harmonic number expansion from the page on harmonic numbers:

	H(n) = ln(n) + 0.5772156649 + 1/(2n) - 1/(12n^2) + 1/(120n^4) - ...

Assuming 1 million words in English and ignoring the vanishing correction terms, we get H(1e6) = 14.3927, which would make the frequency of the K'th word 1/(14.3927 * K), that is 0.06948/K (let's say 0.07/K). Which brings me to the same result as yours, which in turn reassures me a lot ;) My previous result was wrong because I used the wrong logarithm base, go figure.

So with this, for a statistics target of 100 we would predict the frequency of the 100th word to be 0.0007.

Assuming 154*35017 lexemes in the input, the bucket width and the final pruning value depend only on the epsilon that we choose for the LC algorithm. So, if we want e to be equal to s, we'd prune every 1/s = 1/0.0007 = 1428 lexemes and would not discard anything from the result. If we want e to be s/2 we'd prune every 2857 lexemes and discard lexemes with counts < 1887. For s/3, s/4 etc the numbers look like this:

	e	bucket width	discard counts below
	s/1	1428		0
	s/2	2857		1887
	s/3	4285		2516
	s/4	5714		2831
	s/5	7142		3019
	s/6	8571		3145
	s/7	10000		3235
	s/8	11428		3302
	s/9	12857		3355

s/2 or s/3 look reasonable.

So, should I just write a patch that sets the bucket width and pruning count using 0.07 as the assumed frequency of the most common word and epsilon equal to s/2 or s/3?

Cheers,
Jan
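A quick numerical cross-check of the table above; the 1e6-word vocabulary, the 154-lexeme average and the 35017 sampled rows are the same assumptions used in the messages above:

#include <stdio.h>

int
main(void)
{
	double	H = 0.0;
	int		K = 100;			/* statistics target, as above */
	double	N = 154.0 * 35017;	/* lexemes per tsvector * rows sampled */

	for (long k = 1; k <= 1000000L; k++)	/* H(1e6), the Zipf normalizer */
		H += 1.0 / k;

	/* frequency of the K'th word under Zipf's law, rounded to 0.07/K above */
	double	s = 0.07 / K;

	printf("H(1e6) = %.4f, 1/H = %.5f, s = %g\n", H, 1.0 / H, s);
	printf("e\tbucket width\tdiscard counts below\n");
	for (int d = 1; d <= 9; d++)
	{
		double	e = s / d;
		double	w = 1.0 / e;			/* prune every w lexemes */
		double	cutoff = (s - e) * N;	/* f < (s-e)N gets thrown away */

		printf("s/%d\t%d\t\t%d\n", d, (int) w, (int) cutoff);
	}
	return 0;
}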
Jan Urbański <wulczer@wulczer.org> writes:
> [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10. Note the stated assumption e << s.

> So, should I just write a patch that sets the bucket width and pruning
> count using 0.07 as the assumed frequency of the most common word and
> epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for a first cut to experiment with.

			regards, tom lane
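Spelled out with the MCELEM target of 10 * statistics_target that ts_typanalyze uses, the suggested parameterization works out roughly as follows (illustrative numbers only; N is again the assumed input size from earlier messages, and the eventual committed code may differ in detail):

#include <stdio.h>

int
main(void)
{
	int		stats_target = 100;
	int		K = stats_target * 10;	/* desired number of MCELEM entries */
	double	s = 0.07 / K;			/* Zipf estimate of the K'th frequency */
	double	e = s / 10.0;			/* e << s, as in the LC paper */
	long	w = (long) (1.0 / e);	/* bucket width = 1/epsilon */
	double	N = 154.0 * 35017;		/* assumed number of lexemes scanned */

	/* keep only entries with f >= (s - e) * N, which for e = s/10 is 9*N/w */
	double	cutoff = (s - e) * N;

	printf("K = %d, s = %g, e = %g, bucket width = %ld\n", K, s, e, w);
	printf("final cutoff: f >= %.0f (= 9*N/w = %.0f)\n", cutoff, 9.0 * N / w);
	return 0;
}

For e = s/10 the final filter f >= (s - e) * N is the same thing as f >= 9 * N / w, which is convenient because both N and w are already known when the scan finishes.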
On 2010-05-29 15:56, Jan Urbański wrote:
> On 29/05/10 12:34, Jesper Krogh wrote:
>> On 2010-05-28 23:47, Jan Urbański wrote:
>>> On 28/05/10 22:22, Tom Lane wrote:
>>> Now I tried to substitute some numbers there, and so assuming the
>>> English language has ~1e6 words, H(W) is around 6.5. Let's assume the
>>> statistics target to be 100.
>>>
>>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
>>> most probably be stopwords, so we will never see them in the input.
>>
>> I think you should skip the assumption about stop-words; users may use
>> something where they are needed in the index, or have a language other
>> than the typical. (And they don't seem to influence the math that much.)
>
> Turns out it has nearly linear influence on the bucket width and the
> frequency necessary to survive the final pruning. I put some data in a
> spreadsheet; results below.

How about setting it to "some default" in the first analyze round, but setting it to the count of MCEs with a frequency of 1 in the subsequent analyze rounds?

>> Isn't it the same "type" of logic that is used for collecting
>> statistics for "array types", say integer arrays and text arrays?
>
> AFAIK statistics for everything other than tsvectors are built based on
> the values of whole rows. ts_typanalyze is the only typanalyze function
> that takes the trouble of looping over the actual contents of each cell;
> all the others just compare whole arrays (which means that for a text[]
> field you will probably get a quite useless MCV entry).

In another area I was thinking about modelling a complete tree structure where I would like to extract complete sub-branches as int[] of the node ids in the set and then index them using gin. That seems like a "really bad idea" based on the above information. Wouldn't it make sense to treat array types like the tsvectors?

> The results are attached in a text (CSV) file, to preserve formatting.
> Based on them I'd like to propose both top_stopwords and error_factor
> to be 100.

I know it is not perceived the correct way to do things, but I would really like to keep the "stop words" in the dataset and have something that is robust to that.

There are two issues behind that wish. One is that the application becomes more general: I really cannot stop my users from searching for stop-words, and they would expect the "full set" and not the "empty set" we get now. The other is that the list of stop words is by no means finite and would very much depend on the input data set.

I would try to add the stop-words to the dictionary, so they still work but don't occupy that much space in the actual index. That seems to solve the same task but with fewer issues for the users and more generalized code around it.

>> I can "fairly easy" try out patches or do other kinds of testing.
>
> I'll try to come up with a patch for you to try and fiddle with these
> values before Monday.

Excellent.
testdb=# explain select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

testdb=#

I would indeed have expected the first 50 rows ordered by id... trivial to extract.

--
Jesper
Jesper Krogh <jesper@krogh.cc> writes:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> AFAIK statistics for everything other than tsvectors are built based on
>> the values of whole rows.

> Wouldn't it make sense to treat array types like the tsvectors?

Yeah, I have a personal TODO item to look into that in the future.

>> The results are attached in a text (CSV) file, to preserve formatting.
>> Based on them I'd like to propose both top_stopwords and error_factor
>> to be 100.

> I know it is not perceived the correct way to do things, but I would
> really like to keep the "stop words" in the dataset and have
> something that is robust to that.

Any stop words would already have been eliminated in the transformation to tsvector (or not, if none were configured in the dictionary setup). We should not assume that there are any in what ts_typanalyze is seeing.

I think the only relevance of stopwords to the current problem is that *if* stopwords have been removed, we would see a Zipfian distribution with the first few entries removed, and I'm not sure if it's still really Zipfian afterwards. However, we only need the assumption of Zipfianness to compute a target frequency cutoff, so it's not like things will be completely broken if the distribution isn't quite Zipfian.

			regards, tom lane
> Jesper Krogh <jesper@krogh.cc> writes:
> > On 2010-05-29 15:56, Jan Urbański wrote:
> > > AFAIK statistics for everything other than tsvectors are built based
> > > on the values of whole rows.
>
> > Wouldn't it make sense to treat array types like the tsvectors?
>
> Yeah, I have a personal TODO item to look into that in the future.

There were plans to generalise the functions in ts_typanalyze and use LC for array types as well. If one day I find myself with a lot of free time I'd take a stab at that.

> > > The results are attached in a text (CSV) file, to preserve
> > > formatting. Based on them I'd like to propose both top_stopwords and
> > > error_factor to be 100.
>
> > I know it is not perceived the correct way to do things, but I would
> > really like to keep the "stop words" in the dataset and have
> > something that is robust to that.
>
> Any stop words would already have been eliminated in the transformation
> to tsvector (or not, if none were configured in the dictionary setup).
> We should not assume that there are any in what ts_typanalyze is seeing.

Yes, and as a side note, if you want to be indexing stopwords, just don't pass a stopword file when creating the text search dictionary (or pass a custom one).

> I think the only relevance of stopwords to the current problem is that
> *if* stopwords have been removed, we would see a Zipfian distribution
> with the first few entries removed, and I'm not sure if it's still
> really Zipfian afterwards. However, we only need the assumption of
> Zipfianness to compute a target frequency cutoff, so it's not like
> things will be completely broken if the distribution isn't quite
> Zipfian.

That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that probably doesn't matter much.

Jan
Jan Urbański <wulczer@wulczer.org> writes:
>> I think the only relevance of stopwords to the current problem is that
>> *if* stopwords have been removed, we would see a Zipfian distribution
>> with the first few entries removed, and I'm not sure if it's still
>> really Zipfian afterwards.

> That's why I was proposing to take s = 0.07 / (MCE-count + 10). But
> that probably doesn't matter much.

Oh, now I get the point of that. Yeah, it is probably a good idea. If the input doesn't have stopwords removed, the worst that will happen is we'll collect stats for an extra 10 or so lexemes, which will then get thrown away when they don't fit into the MCE list. +1.

			regards, tom lane
On 30/05/10 09:08, Jesper Krogh wrote:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> On 29/05/10 12:34, Jesper Krogh wrote:
>>> I can "fairly easy" try out patches or do other kinds of testing.
>>
>> I'll try to come up with a patch for you to try and fiddle with these
>> values before Monday.

Here's a patch against recent git, but it should apply to 8.4 sources as well. It would be interesting to measure the memory and time needed to analyse the table after applying it, because we will now be using a lot bigger bucket size and I haven't done any performance impact testing on it. I updated the initial comment block in compute_tsvector_stats, but the prose could probably be improved.

> testdb=# explain select id from testdb.reference where document_tsvector
> @@ plainto_tsquery('where') order by id limit 50;
> NOTICE:  text-search query contains only stop words or doesn't contain
> lexemes, ignored

That's orthogonal to the issue with the statistics collection; you just need to modify your stopwords list (for instance make it empty).

Cheers,
Jan
Jan Urbański <wulczer@wulczer.org> writes:
> Here's a patch against recent git, but it should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will now be using a lot
> bigger bucket size and I haven't done any performance impact testing on
> it.

I did a little bit of testing using a dataset I had handy (a couple hundred thousand publication titles) and found that ANALYZE seems to be noticeably but far from intolerably slower --- it's almost the same speed at statistics targets up to 100, and even at the max setting of 10000 it's only maybe 25% slower. However I'm not sure if this result will scale to very large document sets, so more testing would be a good idea.

I committed the attached revised version of the patch. Revisions are mostly minor but I did make two substantive changes:

* The patch changed the target number of mcelems from 10 * statistics_target to just statistics_target. I reverted that since I don't think it was intended; at least we hadn't discussed it.

* I modified the final processing to avoid one qsort step if there are fewer than num_mcelems hashtable entries that pass the cutoff frequency filter, and in any case to sort only those entries that pass it rather than all of them. With the significantly larger number of hashtable entries that will now be used, it seemed like a good thing to try to cut the qsort overhead.

			regards, tom lane

Index: ts_typanalyze.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/ts_typanalyze.c,v
retrieving revision 1.8
diff -c -r1.8 ts_typanalyze.c
*** ts_typanalyze.c	2 Jan 2010 16:57:53 -0000	1.8
--- ts_typanalyze.c	30 May 2010 21:42:30 -0000
***************
*** 92,112 ****
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
!  *	Let D be a set of triples (e, f, d), where e is an element value, f is
!  *	that element's frequency (occurrence count) and d is the maximum error in
!  *	f. We start with D empty and process the elements in batches of size
!  *	w. (The batch size is also known as "bucket size".) Let the current batch
!  *	number be b_current, starting with 1. For each element e we either
!  *	increment its f count, if it's already in D, or insert a new triple into D
!  *	with values (e, 1, b_current - 1). After processing each batch we prune D,
!  *	by removing from it all elements with f + d <= b_current. Finally, we
!  *	gather elements with largest f. The LC paper proves error bounds on f
!  *	dependent on the batch size w, and shows that the required table size
!  *	is no more than a few times w.
   *
!  *	We use a hashtable for the D structure and a bucket width of
!  *	statistics_target * 10, where 10 is an arbitrarily chosen constant,
!  *	meant to approximate the number of lexemes in a single tsvector.
   */
  static void
  compute_tsvector_stats(VacAttrStats *stats,
--- 92,140 ----
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
!  *	Let s be the threshold frequency for an item (the minimum frequency we
!  *	are interested in) and epsilon the error margin for the frequency. Let D
!  *	be a set of triples (e, f, delta), where e is an element value, f is that
!  *	element's frequency (actually, its current occurrence count) and delta is
!  *	the maximum error in f. We start with D empty and process the elements in
!  *	batches of size w. (The batch size is also known as "bucket size" and is
!  *	equal to 1/epsilon.) Let the current batch number be b_current, starting
!  *	with 1. For each element e we either increment its f count, if it's
!  *	already in D, or insert a new triple into D with values (e, 1, b_current
!  *	- 1). After processing each batch we prune D, by removing from it all
!  *	elements with f + delta <= b_current. After the algorithm finishes we
!  *	suppress all elements from D that do not satisfy f >= (s - epsilon) * N,
!  *	where N is the total number of elements in the input. We emit the
!  *	remaining elements with estimated frequency f/N. The LC paper proves
!  *	that this algorithm finds all elements with true frequency at least s,
!  *	and that no frequency is overestimated or is underestimated by more than
!  *	epsilon. Furthermore, given reasonable assumptions about the input
!  *	distribution, the required table size is no more than about 7 times w.
   *
!  *	We set s to be the estimated frequency of the K'th word in a natural
!  *	language's frequency table, where K is the target number of entries in
!  *	the MCELEM array plus an arbitrary constant, meant to reflect the fact
!  *	that the most common words in any language would usually be stopwords
!  *	so we will not actually see them in the input. We assume that the
!  *	distribution of word frequencies (including the stopwords) follows Zipf's
!  *	law with an exponent of 1.
!  *
!  *	Assuming Zipfian distribution, the frequency of the K'th word is equal
!  *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
!  *	words in the language. Putting W as one million, we get roughly 0.07/K.
!  *	Assuming top 10 words are stopwords gives s = 0.07/(K + 10). We set
!  *	epsilon = s/10, which gives bucket width w = (K + 10)/0.007 and
!  *	maximum expected hashtable size of about 1000 * (K + 10).
!  *
!  *	Note: in the above discussion, s, epsilon, and f/N are in terms of a
!  *	lexeme's frequency as a fraction of all lexemes seen in the input.
!  *	However, what we actually want to store in the finished pg_statistic
!  *	entry is each lexeme's frequency as a fraction of all rows that it occurs
!  *	in. Assuming that the input tsvectors are correctly constructed, no
!  *	lexeme occurs more than once per tsvector, so the final count f is a
!  *	correct estimate of the number of input tsvectors it occurs in, and we
!  *	need only change the divisor from N to nonnull_cnt to get the number we
!  *	want.
   */
  static void
  compute_tsvector_stats(VacAttrStats *stats,
***************
*** 133,151 ****
  	LexemeHashKey hash_key;
  	TrackItem  *item;
  
! 	/* We want statistics_target * 10 lexemes in the MCELEM array */
  	num_mcelem = stats->attr->attstattarget * 10;
  
  	/*
! 	 * We set bucket width equal to the target number of result lexemes. This
! 	 * is probably about right but perhaps might need to be scaled up or down
! 	 * a bit?
  	 */
! 	bucket_width = num_mcelem;
  
  	/*
  	 * Create the hashtable. It will be in local memory, so we don't need to
! 	 * worry about initial size too much. Also we don't need to pay any
  	 * attention to locking and memory management.
  	 */
  	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
--- 161,183 ----
  	LexemeHashKey hash_key;
  	TrackItem  *item;
  
! 	/*
! 	 * We want statistics_target * 10 lexemes in the MCELEM array. This
! 	 * multiplier is pretty arbitrary, but is meant to reflect the fact that
! 	 * the number of individual lexeme values tracked in pg_statistic ought
! 	 * to be more than the number of values for a simple scalar column.
! 	 */
  	num_mcelem = stats->attr->attstattarget * 10;
  
  	/*
! 	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
! 	 * comment above.
  	 */
! 	bucket_width = (num_mcelem + 10) * 1000 / 7;
  
  	/*
  	 * Create the hashtable. It will be in local memory, so we don't need to
! 	 * worry about overflowing the initial size. Also we don't need to pay any
  	 * attention to locking and memory management.
  	 */
  	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
***************
*** 155,167 ****
  	hash_ctl.match = lexeme_match;
  	hash_ctl.hcxt = CurrentMemoryContext;
  	lexemes_tab = hash_create("Analyzed lexemes table",
! 							  bucket_width * 4,
  							  &hash_ctl,
  					HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
  
  	/* Initialize counters. */
  	b_current = 1;
! 	lexeme_no = 1;
  
  	/* Loop over the tsvectors. */
  	for (vector_no = 0; vector_no < samplerows; vector_no++)
--- 187,199 ----
  	hash_ctl.match = lexeme_match;
  	hash_ctl.hcxt = CurrentMemoryContext;
  	lexemes_tab = hash_create("Analyzed lexemes table",
! 							  bucket_width * 7,
  							  &hash_ctl,
  					HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
  
  	/* Initialize counters. */
  	b_current = 1;
! 	lexeme_no = 0;
  
  	/* Loop over the tsvectors. */
  	for (vector_no = 0; vector_no < samplerows; vector_no++)
***************
*** 232,237 ****
--- 264,272 ----
  			item->delta = b_current - 1;
  		}
  
+ 		/* lexeme_no is the number of elements processed (ie N) */
+ 		lexeme_no++;
+ 
  		/* We prune the D structure after processing each bucket */
  		if (lexeme_no % bucket_width == 0)
  		{
***************
*** 240,246 ****
  			}
  
  			/* Advance to the next WordEntry in the tsvector */
- 			lexeme_no++;
  			curentryptr++;
  		}
  	}
--- 275,280 ----
***************
*** 252,257 ****
--- 286,292 ----
  		int			i;
  		TrackItem **sort_table;
  		int			track_len;
+ 		int			cutoff_freq;
  		int			minfreq,
  					maxfreq;
  
***************
*** 264,297 ****
  		stats->stadistinct = -1.0;
  
  		/*
! 		 * Determine the top-N lexemes by simply copying pointers from the
! 		 * hashtable into an array and applying qsort()
  		 */
! 		track_len = hash_get_num_entries(lexemes_tab);
  
! 		sort_table = (TrackItem **) palloc(sizeof(TrackItem *) * track_len);
  
  		hash_seq_init(&scan_status, lexemes_tab);
! 		i = 0;
  		while ((item = (TrackItem *) hash_seq_search(&scan_status)) != NULL)
  		{
! 			sort_table[i++] = item;
  		}
! 		Assert(i == track_len);
  
! 		qsort(sort_table, track_len, sizeof(TrackItem *),
! 			  trackitem_compare_frequencies_desc);
  
! 		/* Suppress any single-occurrence items */
! 		while (track_len > 0)
  		{
! 			if (sort_table[track_len - 1]->frequency > 1)
! 				break;
! 			track_len--;
  		}
! 
! 		/* Determine the number of most common lexemes to be stored */
! 		if (num_mcelem > track_len)
  			num_mcelem = track_len;
  
  		/* Generate MCELEM slot entry */
--- 299,349 ----
  		stats->stadistinct = -1.0;
  
  		/*
! 		 * Construct an array of the interesting hashtable items, that is,
! 		 * those meeting the cutoff frequency (s - epsilon)*N. Also identify
! 		 * the minimum and maximum frequencies among these items.
! 		 *
! 		 * Since epsilon = s/10 and bucket_width = 1/epsilon, the cutoff
! 		 * frequency is 9*N / bucket_width.
  		 */
! 		cutoff_freq = 9 * lexeme_no / bucket_width;
  
! 		i = hash_get_num_entries(lexemes_tab);	/* surely enough space */
! 		sort_table = (TrackItem **) palloc(sizeof(TrackItem *) * i);
  
  		hash_seq_init(&scan_status, lexemes_tab);
! 		track_len = 0;
! 		minfreq = lexeme_no;
! 		maxfreq = 0;
  		while ((item = (TrackItem *) hash_seq_search(&scan_status)) != NULL)
  		{
! 			if (item->frequency > cutoff_freq)
! 			{
! 				sort_table[track_len++] = item;
! 				minfreq = Min(minfreq, item->frequency);
! 				maxfreq = Max(maxfreq, item->frequency);
! 			}
  		}
! 		Assert(track_len <= i);
  
! 		/* emit some statistics for debug purposes */
! 		elog(DEBUG3, "tsvector_stats: target # mces = %d, bucket width = %d, "
! 			 "# lexemes = %d, hashtable size = %d, usable entries = %d",
! 			 num_mcelem, bucket_width, lexeme_no, i, track_len);
  
! 		/*
! 		 * If we obtained more lexemes than we really want, get rid of
! 		 * those with least frequencies. The easiest way is to qsort the
! 		 * array into descending frequency order and truncate the array.
! 		 */
! 		if (num_mcelem < track_len)
  		{
! 			qsort(sort_table, track_len, sizeof(TrackItem *),
! 				  trackitem_compare_frequencies_desc);
! 			/* reset minfreq to the smallest frequency we're keeping */
! 			minfreq = sort_table[num_mcelem - 1]->frequency;
  		}
! 		else
  			num_mcelem = track_len;
  
  		/* Generate MCELEM slot entry */
***************
*** 301,310 ****
  			Datum	   *mcelem_values;
  			float4	   *mcelem_freqs;
  
- 			/* Grab the minimal and maximal frequencies that will get stored */
- 			minfreq = sort_table[num_mcelem - 1]->frequency;
- 			maxfreq = sort_table[0]->frequency;
- 
  			/*
  			 * We want to store statistics sorted on the lexeme value using
  			 * first length, then byte-for-byte comparison. The reason for
--- 353,358 ----
***************
*** 334,339 ****
--- 382,391 ----
  			mcelem_values = (Datum *) palloc(num_mcelem * sizeof(Datum));
  			mcelem_freqs = (float4 *) palloc((num_mcelem + 2) * sizeof(float4));
  
+ 			/*
+ 			 * See comments above about use of nonnull_cnt as the divisor
+ 			 * for the final frequency estimates.
+ 			 */
  			for (i = 0; i < num_mcelem; i++)
  			{
  				TrackItem  *item = sort_table[i];
On 31/05/10 00:07, Tom Lane wrote:
> Jan Urbański <wulczer@wulczer.org> writes:
> I committed the attached revised version of the patch. Revisions are
> mostly minor but I did make two substantive changes:
>
> * The patch changed the target number of mcelems from 10 *
> statistics_target to just statistics_target. I reverted that since
> I don't think it was intended; at least we hadn't discussed it.

Yeah, that was accidental.

> * I modified the final processing to avoid one qsort step if there are
> fewer than num_mcelems hashtable entries that pass the cutoff frequency
> filter, and in any case to sort only those entries that pass it rather
> than all of them. With the significantly larger number of hashtable
> entries that will now be used, it seemed like a good thing to try to
> cut the qsort overhead.

Makes sense.

Thanks,
Jan
On 2010-05-30 20:02, Jan Urbański wrote:
> Here's a patch against recent git, but it should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will now be using a lot
> bigger bucket size and I haven't done any performance impact testing on
> it. I updated the initial comment block in compute_tsvector_stats, but
> the prose could probably be improved.

Just a small follow-up. I tried out the patch (or actually a fresh git checkout) and it now gives very accurate results for both the upper and lower end of the MCE histogram, with a lower cutoff that doesn't approach 2.

Thanks a lot.

--
Jesper
Jesper Krogh <jesper@krogh.cc> writes:
> Just a small follow-up. I tried out the patch (or actually a fresh git
> checkout) and it now gives very accurate results for both the upper and
> lower end of the MCE histogram, with a lower cutoff that doesn't
> approach 2.

Good. How much did the ANALYZE time change for your table?

			regards, tom lane
On 2010-05-31 20:38, Tom Lane wrote:
> Jesper Krogh <jesper@krogh.cc> writes:
>> Just a small follow-up. I tried out the patch (or actually a fresh git
>> checkout) and it now gives very accurate results for both the upper and
>> lower end of the MCE histogram, with a lower cutoff that doesn't
>> approach 2.
>
> Good. How much did the ANALYZE time change for your table?

1.3m documents.

New code (3 runs):
statistics target 1000 => 155s/124s/110s
statistics target 100  => 86s/55s/61s

Old code:
statistics target 1000 => 158s/101s/99s
statistics target 100  => 90s/29s/33s

Somehow I think that the first run is the relevant one; it's pretty much a "dead disk" test, and I wouldn't expect random sampling of tuples to have any sane caching effect in a production system. But it looks like the algorithm is "a bit" slower.

Thanks again.

Jesper

--
Jesper