Re: [HACKERS] Index greater than 8k - Mailing list pgsql-general

From Oleg Bartunov
Subject Re: [HACKERS] Index greater than 8k
Date
Msg-id Pine.GSO.4.63.0611031151070.8413@ra.sai.msu.su
Whole thread Raw
In response to Re: [HACKERS] Index greater than 8k  ("Gregory S. Williamson" <gsw@globexplorer.com>)
Responses Re: [HACKERS] Index greater than 8k  ("Joshua D. Drake" <jd@commandprompt.com>)
List pgsql-general
Gregory,

All you described you do with fti is possible with tsearch2. Just need
some think, of course. If you don't need stemming, just don't use it,
if you need something like %txt%, just write simple dictionary, which
produce any substrings from input word.

Oleg
On Tue, 31 Oct 2006, Gregory S. Williamson wrote:

> I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not
scalewell for what you need etc., etc. 
>
> But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't
gowith tesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't
followstemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit
ofa pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions
ofrows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this
dataset,partitioned into unequal grouings (about 90 total). 
>
> HTH (but doubt it for reasons that undoubtedly be made clear ;-)
>
> Greg Williamson
> DBA
> GlobeXplorer LLC
>
>
> -----Original Message-----
> From:    pgsql-general-owner@postgresql.org on behalf of Joshua D. Drake
> Sent:    Tue 10/31/2006 7:46 PM
> To:    Teodor Sigaev
> Cc:    Darcy Buskermolen; PgSQL General; PostgreSQL-development
> Subject:    Re: [HACKERS] [GENERAL] Index greater than 8k
>
> Teodor Sigaev wrote:
>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>>> sent a self contained test case directly to  Teodor  which shows the
>>> error.
>>> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
>> pg_trgm is designed to find similar words and use technique known as
>> trigrams. This will  work good on small pieces of text such as words or
>> set expression. But all big texts (on the same language) will be similar
>> :(. So, I didn't take care about guarantee that index tuple's size
>> limitation. In principle, it's possible to modify pg_trgm to have such
>> guarantee, but index becomes lossy - all tuples gotten  from index
>> should be checked by table's tuple evaluation.
>
> We are trying to get something faster than ~ '%foo%';
>
> Which Tsearch2 does not give us :)
>
> Joshua D. Drake
>
>
>
>>
>> If you want to search similar documents I can recommend to have a look
>> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
>> pretty close to trigrams and metrics of similarity is the same, but uses
>> another signature calculations. And, there are some tips and trics:
>> removing HTML marking,removing punctuation, lowercasing text and so on -
>> it's interesting and complex task.
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

pgsql-general by date:

Previous
From: "surabhi.ahuja"
Date:
Subject: start up and shut down script
Next
From: Peter
Date:
Subject: Re: Weird double single quote issue