From: Tom Lane
Subject: Re: tsvector pg_stats seems quite a bit off.
Msg-id: 19353.1275145753@sss.pgh.pa.us
In response to: Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List: pgsql-hackers

Jan Urbański <wulczer@wulczer.org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there.  It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.
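
As a quick back-of-the-envelope check (my own sketch, not part of the
original exchange): if the 100th word really had frequency 0.06 and
frequencies followed Zipf's law (f_k proportional to 1/k), the top 100
words alone would account for many times the whole corpus:

    # Sketch only: under f_k = f_1/k, a 100th-place frequency of 0.06
    # implies f_1 = 0.06 * 100 = 6.0, and the top-100 frequencies sum
    # to far more than 1.0, which is impossible.
    f_100 = 0.06
    total = sum(f_100 * 100 / k for k in range(1, 101))
    print(total)   # about 31.1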

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law.  The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common one".
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.001.  That's pretty crude
of course but it seems like the right ballpark.
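
For illustration, the same estimate as a few lines of Python (my
sketch; the 0.1 figure for the most common lexeme is the assumption
stated above):

    # Zipf rule of thumb: the k'th most common lexeme occurs 1/k as
    # often as the most common one, i.e. f_k = f_1 / k.
    f_1 = 0.1            # assumed frequency of the most common lexeme
    stats_target = 100
    s = f_1 / stats_target
    print(s)             # 0.001 -- rough frequency of the 100th lexeme
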
        regards, tom lane

