Re: tsvector pg_stats seems quite a bit off. - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: tsvector pg_stats seems quite a bit off.
Date	May 29, 2010 12:12:52
Msg-id	19403.1275145960@sss.pgh.pa.us
In response to	Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>)
List	pgsql-hackers

Jan Urbański <wulczer@wulczer.org> writes:
> Hm, I am now thinking that maybe this theory is flawed, because tsvectors
> contain only *unique* words, and Zipf's law is talking about words in
> documents in general. Normally a word like "the" would appear lots of
> times in a document, but (even ignoring the fact that it's a stopword
> and so won't appear at all) in a tsvector it will be present only once.
> This may or may not be a problem, not sure if such "squashing" of
> occurrences as tsvectors do skews the distribution away from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents.  In any case we are not really depending on Zipf's
law heavily with this approach.  The worst-case result if it's wrong
is that we end up with an MCE list shorter than our original target.
I suggest we could try this and see if we notice that happening a lot.
        regards, tom lane
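
One rough way to probe the question raised above (does squashing repeated words within a document push the distribution away from Zipfian?) is a small simulation. The sketch below is not from the thread. It assumes words are drawn independently from a Zipf law with exponent 1, which real text only approximates, and the names `zipf_weights`, `simulate`, and `top_shares` as well as all parameter values are hypothetical, chosen for illustration only. It counts each word twice: once per raw occurrence, and once per document containing it, which mimics a tsvector holding only unique words.

```python
import random
from collections import Counter


def zipf_weights(vocab_size, s=1.0):
    """Unnormalized Zipf weights: the word of rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, vocab_size + 1)]


def simulate(num_docs=2000, words_per_doc=200, vocab_size=5000, seed=0):
    """Draw synthetic documents and count words two ways.

    Returns (raw, doc_freq):
      raw      counts every occurrence of every word;
      doc_freq counts each word at most once per document
               (an assumption standing in for a tsvector's unique lexemes).
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    vocab = list(range(vocab_size))    # word ids; id 0 is the most likely word
    weights = zipf_weights(vocab_size)

    raw = Counter()
    doc_freq = Counter()
    for _ in range(num_docs):
        words = rng.choices(vocab, weights=weights, k=words_per_doc)
        raw.update(words)              # every occurrence counts
        doc_freq.update(set(words))    # duplicates within a document squashed
    return raw, doc_freq


def top_shares(counter, n=10):
    """Share of the total count held by each of the n most common words."""
    total = sum(counter.values())
    return [count / total for _, count in counter.most_common(n)]


if __name__ == "__main__":
    raw, doc_freq = simulate()
    print("raw occurrence shares:   ", [round(x, 4) for x in top_shares(raw)])
    print("per-document shares:     ", [round(x, 4) for x in top_shares(doc_freq)])
```

Comparing the two lists shows how much the head of the distribution flattens once each document contributes a word at most once. This only illustrates the idealized model; real documents and real ANALYZE sampling may behave differently.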
