Re: [HACKERS] Index greater than 8k - Mailing list pgsql-general
From | Gregory S. Williamson |
---|---|
Subject | Re: [HACKERS] Index greater than 8k |
Date | |
Msg-id | 71E37EF6B7DCC1499CEA0316A256832802B3E7BA@loki.wc.globexplorer.net Whole thread Raw |
In response to | NOTICE: word is too long INSERT 0 3014 ("Joshua D. Drake" <jd@commandprompt.com>) |
Responses |
Re: [HACKERS] Index greater than 8k
|
List | pgsql-general |
I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not scalewell for what you need etc., etc. But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go withtesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't followstemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit ofa pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions ofrows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this dataset,partitioned into unequal grouings (about 90 total). HTH (but doubt it for reasons that undoubtedly be made clear ;-) Greg Williamson DBA GlobeXplorer LLC -----Original Message----- From: pgsql-general-owner@postgresql.org on behalf of Joshua D. Drake Sent: Tue 10/31/2006 7:46 PM To: Teodor Sigaev Cc: Darcy Buskermolen; PgSQL General; PostgreSQL-development Subject: Re: [HACKERS] [GENERAL] Index greater than 8k Teodor Sigaev wrote: >> The problem as I remember it is pg_tgrm not tsearch2 directly, I've >> sent a self contained test case directly to Teodor which shows the >> error. >> 'ERROR: index row requires 8792 bytes, maximum size is 8191' > Uh, I see. But I'm really surprised why do you use pg_trgm on big text? > pg_trgm is designed to find similar words and use technique known as > trigrams. This will work good on small pieces of text such as words or > set expression. But all big texts (on the same language) will be similar > :(. So, I didn't take care about guarantee that index tuple's size > limitation. In principle, it's possible to modify pg_trgm to have such > guarantee, but index becomes lossy - all tuples gotten from index > should be checked by table's tuple evaluation. We are trying to get something faster than ~ '%foo%'; Which Tsearch2 does not give us :) Joshua D. Drake > > If you want to search similar documents I can recommend to have a look > to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's > pretty close to trigrams and metrics of similarity is the same, but uses > another signature calculations. And, there are some tips and trics: > removing HTML marking,removing punctuation, lowercasing text and so on - > it's interesting and complex task. -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ------------------------------------------------------- Click link below if it is SPAM gsw@globexplorer.com "https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=454815f5242304846743324&user=gsw@globexplorer.com&retrain=spam&template=history&history_page=1" !DSPAM:454815f5242304846743324! -------------------------------------------------------
pgsql-general by date: