Re: tsvector pg_stats seems quite a bit off. - Mailing list pgsql-hackers
From: Jan Urbański
Subject: Re: tsvector pg_stats seems quite a bit off.
Msg-id: 4BFC5A67.1070703@wulczer.org
In response to: tsvector pg_stats seems quite a bit off. (Jesper Krogh <jesper@krogh.cc>)
Responses: Re: tsvector pg_stats seems quite a bit off. (two replies)
List: pgsql-hackers
On 19/05/10 21:01, Jesper Krogh wrote:
> The document base is around 350.000 documents and I have set the
> statistics target on the tsvector column to 1000, since 100 seems
> way off.

So for tsvectors the statistics target means more or less "at any time, track at most 10 * &lt;target&gt; lexemes simultaneously", where "track" means keeping them in memory while going through the tuples being analysed. Remember that the measure is in lexemes, not whole tsvectors, and that the factor of 10 is meant to approximate the average number of unique lexemes in a tsvector. If your documents are very large, this might not be a good approximation.

> # ANALYZE verbose reference (document_tsvector);
> INFO: analyzing "reference"
> INFO: "reference": scanned 14486 of 14486 pages, containing 350174 live
> rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
> ANALYZE
>
> Ok, so analyze almost examined all rows. Looking into
> "most_common_freqs" I find
>
> # select count(unnest) from (select unnest(most_common_freqs) from
> pg_stats where attname = 'document_tsvector') as foo;
>  count
> -------
>   2810
> (1 row)

So the size of the most_common_freqs and most_common_vals arrays in pg_statistics for tsvectors has an upper bound of &lt;stats-target&gt; * 10 (for the same reason as mentioned before), and they hold lexemes (not whole tsvectors). What also happens is that lexemes that were seen only once while going through the analysed set are discarded, which is why you can actually get fewer entries in these arrays, even if your document set is big.

> But the distribution is very "flat" at the end; the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an
> estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

Yeah, this might mean that you could try cranking up the stats target a lot, to make the set of simultaneously tracked lexemes larger (it will cost time and memory during analyse, though). If the documents have completely different contents, what can happen is that almost all lexemes are seen only a few times and get removed during the pruning of the working set. I have seen similar behaviour while working on the typanalyze function for tsvectors.

> So far I have no idea if this is bad or good, so a couple of sample
> runs of stuff that is sitting outside the "most_common_vals" array:
>
> [gathered statistics suck]
>
> So the "most_common_vals" seems to contain a lot of values that should
> never have been kept in favor of other values that are more common.
> In practice, just cranking the statistics estimate up high enough
> seems to solve the problem, but doesn't there seem to be something
> wrong in how the statistics are collected?

The algorithm that determines the most common vals does not do it accurately. That would require keeping all lexemes from the analysed tsvectors in memory, which would be impractical. If you want to learn more about the algorithm being used, try reading http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments in ts_typanalyze.c.
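In case a concrete illustration helps, here is a rough Python sketch of the Lossy Counting scheme from that paper. It shows the idea only; it is not the actual C implementation in ts_typanalyze.c, and the names (lossy_count, epsilon, w, d) are mine:

import math

def lossy_count(stream, epsilon):
    """Approximate frequency counting (Manku & Motwani, VLDB 2002).

    Keeps roughly log(epsilon * N) / epsilon counters instead of
    one counter per distinct item.  Returns a dict mapping each
    tracked item to (count, delta), where delta bounds how much the
    count may be underestimated, plus the total number of items seen.
    """
    w = int(math.ceil(1.0 / epsilon))   # bucket width
    d = {}                              # item -> (count, delta)
    n = 0                               # items seen so far
    for item in stream:
        n += 1
        if item in d:
            count, delta = d[item]
            d[item] = (count + 1, delta)
        else:
            # delta = number of completed buckets: the maximum count
            # the item could have accumulated before being tracked
            d[item] = (1, (n - 1) // w)
        if n % w == 0:
            # bucket boundary: prune entries whose true count is
            # provably small; an item seen only once in the current
            # bucket always satisfies count + delta <= b and is
            # dropped here
            b = n // w
            d = dict((k, v) for k, v in d.items() if v[0] + v[1] > b)
    return d, n

To emit the most common items for a support threshold s, you would then keep everything with count >= (s - epsilon) * n; the paper guarantees that no item with true frequency of at least s * n is missed, and that counts are underestimated by at most epsilon * n. That pruning step is also why lexemes seen only once tend to vanish from the working set.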
It would be interesting to know the average size of a tsvector in your document set (i.e. how many unique lexemes a tsvector has on average).
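For measuring that, something like the following should do (assuming the table and column names from earlier in the thread; length() on a tsvector returns the number of lexemes it holds):

# select avg(length(document_tsvector)) from reference;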
In general, the tsvector typanalyze function is designed to suck less than the constant factor that was used previously, but it only works really well for the most common lexemes (thus preventing the most gross misestimates).

I'm not very surprised it misses the difference between 1612/350174 and 4/350174, and I'm quite happy that it gets that if you set the stats target really high :o)

There's always the possibility that there's some stupid bug there, but I think you just set your expectations too high for the tsvector typanalyze function. If you could come up with a better way of doing tsvector stats, that would be awesome - currently it's just doing its best to prevent the most outrageous errors.

Cheers,
Jan