Re: tsvector pg_stats seems quite a bit off. - Mailing list pgsql-hackers
From: Jan Urbański
Subject: Re: tsvector pg_stats seems quite a bit off.
Msg-id: 4BFC5A67.1070703@wulczer.org
In response to: tsvector pg_stats seems quite a bit off. (Jesper Krogh <jesper@krogh.cc>)
Responses: Re: tsvector pg_stats seems quite a bit off. (two replies)
List: pgsql-hackers
On 19/05/10 21:01, Jesper Krogh wrote:
> The document base is around 350.000 documents and I have set the
> statistics target on the tsvector column to 1000, since 100 seems
> way off.

So for tsvectors the statistics target means more or less "at any time, track at most 10 * &lt;target&gt; lexemes simultaneously", where "track" means keeping them in memory while going through the tuples being analysed. Remember that the measure is in lexemes, not whole tsvectors, and that the factor of 10 is meant to approximate the average number of unique lexemes in a tsvector. If your documents are very large, this might not be a good approximation.

> # ANALYZE verbose reference (document_tsvector);
> INFO: analyzing "reference"
> INFO: "reference": scanned 14486 of 14486 pages, containing 350174 live
> rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
> ANALYZE
>
> Ok, so analyze almost examined all rows. Looking into
> "most_common_freqs" I find
>
> # select count(unnest) from (select unnest(most_common_freqs) from
> pg_stats where attname = 'document_tsvector') as foo;
>  count
> -------
>   2810
> (1 row)

So the size of the most_common_freqs and most_common_vals arrays in pg_statistics for tsvectors has an upper bound of &lt;stats-target&gt; * 10 (for the same reason as mentioned before), and they hold lexemes (not whole tsvectors). What also happens is that lexemes that were seen only once while going through the analysed set are discarded, which is why you can actually get fewer entries in these arrays, even if your document set is big.

> But the distribution is very "flat" at the end; the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an
> estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

Yeah, this might mean that you could try cranking up the stats target a lot, to make the set of simultaneously tracked lexemes larger (it will cost time and memory during analyse, though). If the documents have completely different contents, what can happen is that almost all lexemes are seen only a few times and get removed during the pruning of the working set. I have seen similar behaviour while working on the typanalyze function for tsvectors.

> So far I have no idea if this is bad or good, so a couple of sample
> runs of stuff that is sitting outside the "most_common_vals" array:
>
> [gathered statistics suck]
>
> So the "most_common_vals" seems to contain a lot of values that should
> never have been kept in favor of other values that are more common.
> In practice, just cranking the statistics estimate up high enough
> seems to solve the problem, but doesn't there seem to be something
> wrong in how the statistics are collected?

The algorithm that determines the most common vals does not do it accurately. That would require keeping all lexemes from the analysed tsvectors in memory, which would be impractical. If you want to learn more about the algorithm being used, try reading http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments in ts_typanalyze.c.
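In case a concrete illustration helps, here is a rough Python sketch of the Lossy Counting scheme from that paper. It shows the idea only; it is not the actual C implementation in ts_typanalyze.c, and the names (lossy_count, epsilon, w, d) are mine:

import math

def lossy_count(stream, epsilon):
    """Approximate frequency counting (Manku & Motwani, VLDB 2002).

    Keeps roughly log(epsilon * N) / epsilon counters instead of
    one counter per distinct item.  Returns a dict mapping each
    tracked item to (count, delta), where delta bounds how much the
    count may be underestimated, plus the total number of items seen.
    """
    w = int(math.ceil(1.0 / epsilon))   # bucket width
    d = {}                              # item -> (count, delta)
    n = 0                               # items seen so far
    for item in stream:
        n += 1
        if item in d:
            count, delta = d[item]
            d[item] = (count + 1, delta)
        else:
            # delta = number of completed buckets: the maximum count
            # the item could have accumulated before being tracked
            d[item] = (1, (n - 1) // w)
        if n % w == 0:
            # bucket boundary: prune entries whose true count is
            # provably small; an item seen only once in the current
            # bucket always satisfies count + delta <= b and is
            # dropped here
            b = n // w
            d = dict((k, v) for k, v in d.items() if v[0] + v[1] > b)
    return d, n

To emit the most common items for a support threshold s, you would then keep everything with count >= (s - epsilon) * n; the paper guarantees that no item with true frequency of at least s * n is missed, and that counts are underestimated by at most epsilon * n. That pruning step is also why lexemes seen only once tend to vanish from the working set.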
It would be interesting to know the average size of a tsvector in your document set (i.e. how many unique lexemes a tsvector has on average).
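For measuring that, something like the following should do (assuming the table and column names from earlier in the thread; length() on a tsvector returns the number of lexemes it holds):

# select avg(length(document_tsvector)) from reference;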
In general, the tsvector typanalyze function is designed to suck less than the constant factor that was used previously, but it only works really well for the most common lexemes (thus preventing the most gross misestimates).

I'm not very surprised it misses the difference between 1612/350174 and 4/350174, and I'm quite happy that it gets that if you set the stats target really high :o)

There's always the possibility that there's some stupid bug there, but I think you just set your expectations too high for the tsvector typanalyze function. If you could come up with a better way of doing tsvector stats, that would be awesome - currently it's just doing its best to prevent the most outrageous errors.

Cheers,
Jan