Normalization in text search ranking - Mailing list pgsql-general

From	Tim van der Linden
Subject	Normalization in text search ranking
Date	May 4, 2014 04:27:10
Msg-id	20140504102643.76a0a82e655d767ee74c208d@shisaa.jp
List	pgsql-general

Hi all

Another question regarding full text, this time about ranking.
The ts_rank() and ts_rank_cd() functions accept a normalization integer/bit mask.

In the documentation the different integers are laid out to some extent, and it is said that some take into account the
document length (1 and 2) while others take into account the number of unique words (8 and 16).

To illustrate my following questions, take this tsvector:

'ate':9 'cat':3 'fat':2,11

Now, I was wondering how document length and unique words are calculated (from a high-level perspective).

Am I correct in saying that, when counting the document length, the total number of pointers is summed up, meaning that
in the above tsvector we have 4 words (resulting in an integer of 4 by which the float is divided)?

And when counting unique words, the result for the above tsvector would be 3, counting only the actual lexemes in
there and not the number of pointers?
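
A minimal Python sketch of the two counts as described in the question above. It is an illustration only, not
PostgreSQL's implementation: parse_tsvector is a hypothetical helper that handles only plain integer positions (no
weight labels), and counting a lexeme without positions as one word is an assumption.

```python
def parse_tsvector(text):
    """Parse a simple tsvector literal such as "'ate':9 'cat':3 'fat':2,11".

    Returns a dict mapping each lexeme to its list of positions.
    Hypothetical helper: weight labels (e.g. 2A) are not supported.
    """
    entries = {}
    for token in text.split():
        lexeme, _, positions = token.partition(":")
        lexeme = lexeme.strip("'")
        entries[lexeme] = [int(p) for p in positions.split(",")] if positions else []
    return entries


vec = parse_tsvector("'ate':9 'cat':3 'fat':2,11")

# "Document length" as assumed in the question: sum of all positions (pointers).
# Assumption: a lexeme stored without positions still counts once.
total_positions = sum(max(len(pos), 1) for pos in vec.values())

# "Unique words" as assumed in the question: number of distinct lexemes.
unique_lexemes = len(vec)

print(total_positions, unique_lexemes)
```

For the example tsvector this gives 4 for the position count ('fat' contributes two) and 3 for the lexeme count.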

Also, final question: if you use integer 8 or 16 to influence the calculated ranking float, you would actually "punish"
documents that contain more unique words? Meaning that this is just another way of giving shorter documents precedence
over longer ones?
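
To make the "punishing" effect concrete, here is a rough Python sketch of the two plain division options, 2 (divide by
document length) and 8 (divide by number of unique words), applied to a made-up raw rank. The function name, the raw
rank of 0.1 and the document sizes are assumptions for illustration; the logarithmic options 1 and 16 are left out.

```python
NORM_LENGTH = 2   # divide the rank by the document length
NORM_UNIQUE = 8   # divide the rank by the number of unique words


def normalize(rank, doc_length, unique_words, flags):
    """Apply the selected divisions to a raw rank (illustrative sketch only)."""
    if flags & NORM_LENGTH and doc_length > 0:
        rank /= doc_length
    if flags & NORM_UNIQUE and unique_words > 0:
        rank /= unique_words
    return rank


raw_rank = 0.1  # made-up value, identical for both documents

# Short document: 4 positions, 3 unique lexemes (the tsvector above).
short_doc = normalize(raw_rank, 4, 3, NORM_UNIQUE)

# Hypothetical longer document: 40 positions, 30 unique lexemes.
long_doc = normalize(raw_rank, 40, 30, NORM_UNIQUE)

print(short_doc > long_doc)
```

Under these assumptions the document with more unique lexemes ends up with the lower rank, even though both started
from the same raw value.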

Thanks again!

Cheers,
Tim
