Thread: tsearch2 document and word limit

tsearch2 document and word limit

From
"David Beavan"
Date:
Hi

I have been toying with the implementation of tsearch2 to index some large
text documents. I have run into problems where I am up against limits:

     * No more than 255 occurrences of a particular word are indexed.
     * Word positions greater than 16384 are added as position 16384 and end
       up as one occurrence.

These are problematic because I need to rank based on number of word
occurrences, and these limits are preventing this.

Does anybody have any suggestions as to how this could be worked around? Is
the limit due to GiST? Would OpenFTS help (I'm guessing not)?

Failing that, does anybody have experience of combining another text indexing
package with PostgreSQL?

Dave



Re: tsearch2 document and word limit

From
Teodor Sigaev
Date:
Sorry, but there is no way around this except patching the tsearch2 sources.

These limits come from tsearch2 (not GiST). They exist mainly to save storage
space and to reduce rank calculation time. Our (Oleg's and my) experience with
search engines shows that full position information for long documents isn't
very important for ranking.
Did you try normalizing the rank by the length of the document?

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-ref.html:
...
Both of these ranking functions take an integer normalization option that
specifies whether a document's length should impact its rank. This is often
desirable, since a hundred-word document with five instances of a search word is
probably more relevant than a thousand-word document with five instances. The
option can have the values:
     * 0 (the default) ignores document length.
     * 1 divides the rank by the logarithm of the length.
     * 2 divides the rank by the length itself.
...
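
As a rough illustration of the three options quoted above, here is a minimal Python sketch. It is not tsearch2's code: the function name `normalize_rank`, the use of the natural logarithm, and the handling of very short documents are all assumptions made for the example.

```python
import math


def normalize_rank(rank, length, option=0):
    """Apply one of the three length normalizations to a raw rank.

    option 0: ignore document length
    option 1: divide by the logarithm of the length
    option 2: divide by the length itself
    """
    if option == 0:
        return rank
    if option == 1:
        # Assumption: natural log. For lengths of 0 or 1 the logarithm is
        # undefined or zero, so the rank is returned unchanged.
        if length <= 1:
            return rank
        return rank / math.log(length)
    if option == 2:
        # Assumption: an empty document keeps its raw rank.
        return rank / length if length > 0 else rank
    raise ValueError("option must be 0, 1 or 2")


# Two documents with the same hypothetical raw rank but different lengths.
raw_rank = 0.5
for option in (0, 1, 2):
    short_doc = normalize_rank(raw_rank, 100, option)
    long_doc = normalize_rank(raw_rank, 1000, option)
    print(option, short_doc, long_doc)
```

With options 1 and 2 the hundred-word document comes out ahead of the thousand-word one, which is the effect the documentation describes.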



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch2 document and word limit

From
"David Beavan"
Date:
Yep, I understand what you are saying.

I can live with a max of 255 occurrences, although it isn't desirable. My
main issue is the max position of the lexeme.

Take for example a document which is longer than 16384 words. Although words
beyond that point are seen and enter the index, they are marked as position
16384. As such, if there are multiple occurrences beyond that point, they are
seen as one entry, skewing the frequency.
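
To make the skew concrete, here is a toy Python model of the behaviour described above. It is not tsearch2's code: the function name `indexed_positions` and both constants are assumptions taken from the limits reported in this thread (positions clamped to 16384, at most 255 positions kept per word).

```python
MAX_POSITION = 16384     # assumed clamp value, as reported in this thread
MAX_OCCURRENCES = 255    # assumed per-word cap, as reported in this thread


def indexed_positions(words, target):
    """Return the positions a limited index would keep for `target`.

    Toy model only: positions are 1-based, anything past MAX_POSITION is
    clamped to MAX_POSITION, clamped duplicates collapse into one entry,
    and at most MAX_OCCURRENCES positions are kept.
    """
    kept = []
    for pos, word in enumerate(words, start=1):
        if word != target:
            continue
        clamped = min(pos, MAX_POSITION)
        if kept and kept[-1] == clamped:
            continue  # later occurrences collapse into the clamped entry
        if len(kept) == MAX_OCCURRENCES:
            break  # per-word cap reached, the rest is dropped
        kept.append(clamped)
    return kept


# A 20000-word document with "needle" at three early and four late positions.
doc = ["filler"] * 20000
for pos in (10, 500, 9000, 17000, 18000, 19000, 20000):
    doc[pos - 1] = "needle"

true_count = doc.count("needle")
kept = indexed_positions(doc, "needle")
print(true_count, len(kept), kept)  # 7 4 [10, 500, 9000, 16384]
```

The four occurrences past word 16384 are stored as a single entry, so a ranking based on the number of stored positions sees 4 occurrences where the document really has 7.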

I understand the decisions behind the limits, but would you consider
addressing this limit in future releases? I really don't think I have the
skills to make the necessary modifications myself.

Dave

>From: Teodor Sigaev <teodor@sigaev.ru>
>To: David Beavan <davidbeavan@hotmail.com>
>CC: pgsql-general@postgresql.org
>Subject: Re: [GENERAL] tsearch2 document and word limit
>Date: Thu, 27 Jan 2005 17:41:42 +0300
>