From: Tom Lane
Subject: Re: tsvector pg_stats seems quite a bit off.
Msg-id: 19353.1275145753@sss.pgh.pa.us
In response to: Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List: pgsql-hackers

Jan Urbański <wulczer@wulczer.org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there.  It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.
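
As a quick back-of-the-envelope check (my own sketch, not part of the
original exchange): if the 100th word really had frequency 0.06 and
frequencies followed Zipf's law (f_k proportional to 1/k), the top 100
words alone would account for many times the whole corpus:

    # Sketch only: under f_k = f_1/k, a 100th-place frequency of 0.06
    # implies f_1 = 0.06 * 100 = 6.0, and the top-100 frequencies sum
    # to far more than 1.0, which is impossible.
    f_100 = 0.06
    total = sum(f_100 * 100 / k for k in range(1, 101))
    print(total)   # about 31.1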

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law.  The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common one".
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.001.  That's pretty crude
of course but it seems like the right ballpark.
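
For illustration, the same estimate as a few lines of Python (my
sketch; the 0.1 figure for the most common lexeme is the assumption
stated above):

    # Zipf rule of thumb: the k'th most common lexeme occurs 1/k as
    # often as the most common one, i.e. f_k = f_1 / k.
    f_1 = 0.1            # assumed frequency of the most common lexeme
    stats_target = 100
    s = f_1 / stats_target
    print(s)             # 0.001 -- rough frequency of the 100th lexeme
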
        regards, tom lane

