Thread: full text indexing
Hi In my PostgreSQL database I have a lot of newspaper articles (size: 100mb now, growing beyond 1gb within few months). I wan't to use full text indexing so that users can search the articles with a keyword and have the results in less than one second. How do I accomplish that? Does PostgreSQL have this feature? Which 3rd party indexing tools are available that easily interoperate with PostgreSQL? Thanks, Poul L. Christiansen Dynamic Paper
-> In my PostgreSQL database I have a lot of newspaper articles (size: -> 100mb now, growing beyond 1gb within few months). -> I wan't to use full text indexing so that users can search the articles -> with a keyword and have the results in less than one second. -> -> How do I accomplish that? -> -> Does PostgreSQL have this feature? -> -> Which 3rd party indexing tools are available that easily interoperate -> with PostgreSQL? I wouldn't say "easy" -- you have to have a static image of your database on disk somewhere -- but 'glimpse' works quite well and is fast enough to do 1gb that quickly. http://glimpse.cs.arizona.edu/ I have a tool to let you do searches on a glimpse database from within AOLserver at http://www.idyll.org/~t/www-tools/. It's been working out quite well for me. cheers, --titus
> Hi > > In my PostgreSQL database I have a lot of newspaper articles (size: > 100mb now, growing beyond 1gb within few months). > I wan't to use full text indexing so that users can search the articles > with a keyword and have the results in less than one second. > > How do I accomplish that? It's complicated :-) > Does PostgreSQL have this feature? Nope. > Which 3rd party indexing tools are available that easily interoperate > with PostgreSQL? There is some code in contrib called FTI (Full Text Index - no doubt).. I re-wrote it for my uses but ended up not using because I was doing so many sorts and joins, it made a scan (even an index scan) of a table with 3 million rows in it very slow. However id you were just searching the keyword table itself it was very, very fast. The FTI trigger in the contrib breaks the words down to 2 letter bits (for substring searching) -- mine doesn't, it only indexes whole words without duplicates and looks at a list of words not to index (words like a, an, the, anything else you want -- I think it has the 300 most used English words in there already) It's drawback is speed, it does take a few seconds to index on INSERT (and UPDATE) -- that's if your text fields are 30ish k (very close to the max PG can store).. I would say you're going to run into the 32k limit pretty quick with newspaper articles -- I index resumes and I've run into it many times (32k of text really isn't all that much).. Anyway, I'll try and get that trigger together that I did and send it to the PG guys to see if it's worthy of being added to contrib -- I'll send you a copy in private if you'd like. Note: I think there have been more people to re-write that trigger, I haven't seen anything else though.. Good luck! <End of long-winded response> -Mitch
"Mitch Vincent" <mitch@venux.net> writes: > It's complicated :-) > Yup. :-) > I would say you're going to run into the 32k limit pretty quick with > newspaper articles -- I index resumes and I've run into it many times (32k > of text really isn't all that much).. Looking forward to TOAST... > > Anyway, I'll try and get that trigger together that I did and send it to the > PG guys to see if it's worthy of being added to contrib -- I'll send you a > copy in private if you'd like. Note: I think there have been more people to > re-write that trigger, I haven't seen anything else though.. > I would like to have a look on it. We have similar "simple" full text search with a keyword table and usage of stop words. But I would really like to try some thirdparty add-on. I used Verity full text search for awhile with Sybase. It let you do joins and such with proxy tables that were the interface to Verity full text index. Very nice from a usage perspective compared to what I'm doing now... regards, Gunnar
There was an article in the Nov '95 Dr. Dobb's Journal about a Canadian phone company that did a full text index on a gazillion phone records and retrieved them quickly enough to run an inquiry operation. It was an interesting departure in building indices in that it created a directory for a text instance, such as a phone number, word, street number, etc, then attached the record number/s where found. Couldn't retrieve the article (it's on a $99 CD) but found the program listings and have attached the file. I've never used it the idea was intriguing and thought I would someday put it to use. Good Luck Bill >===== Original Message From "Poul L. Christiansen" <poulc@cs.auc.dk> ===== >Hi > >In my PostgreSQL database I have a lot of newspaper articles (size: >100mb now, growing beyond 1gb within few months). >I wan't to use full text indexing so that users can search the articles >with a keyword and have the results in less than one second. > >How do I accomplish that? > >Does PostgreSQL have this feature? > >Which 3rd party indexing tools are available that easily interoperate >with PostgreSQL? > >Thanks, >Poul L. Christiansen >Dynamic Paper
Attachment
Interesting...I've used that very idea (with some variations) to implement the keyword search in the Genome channel. See the "Keyword Search" in http://compbio.ornl.gov/channel/ cheers, Morey Parang ORNL On Wed, Sep 27, 2000 at 08:33:31PM -0400, Bill Barnes wrote: > There was an article in the Nov '95 Dr. Dobb's Journal about a Canadian phone > company that did a full text index on a gazillion phone records and retrieved > them quickly enough to run an inquiry operation. > It was an interesting departure in building indices in that it created a > directory for a text instance, such as a phone number, word, street number, > etc, then attached the record number/s where found. > Couldn't retrieve the article (it's on a $99 CD) but found the program > listings and have attached the file. I've never used it the idea was > intriguing and thought I would someday put it to use. > > Good Luck > Bill > > >===== Original Message From "Poul L. Christiansen" <poulc@cs.auc.dk> ===== > >Hi > > > >In my PostgreSQL database I have a lot of newspaper articles (size: > >100mb now, growing beyond 1gb within few months). > >I wan't to use full text indexing so that users can search the articles > >with a keyword and have the results in less than one second. > > > >How do I accomplish that? > > > >Does PostgreSQL have this feature? > > > >Which 3rd party indexing tools are available that easily interoperate > >with PostgreSQL? > > > >Thanks, > >Poul L. Christiansen > >Dynamic Paper
Hi, After taking a look to many way from fast search into full text (flat field with regrep, UDMSearch, and other stuff like contrib FTI) i have developped my own indexer and fast search. It's still specific to my database but can be easily ported to any other database structure. It also use stop word The first load is very slow for big database but I have had incremental indexing and then following load are speeder. Here is the usage of the indexer : Usage: ./lincat-indexer.pl [-c -i file -h host -p port] -t table -f field -l language -u user -s file -d dbname -c => drop search tables and indexes, reset all indexing -i file => do incremental index with file -v => print a dot for each word processed -h host => database hostname [default: localhost] -p port => database service port [default: 5432] -t table => table to index -f field => field to index -l fr => language to use (can be fr, en, ...) -u user => postgres user who need read grant on search tables -s file => file with a list of all words not to be indexed -d dbname => name of the database -q => quiet mode, do not print anything else than error message I also have a search script that can be use to test the index and from you can cut and past some code to add to your programme. usage is ./lincat-search.pl "essai test". I don't remember exactly what it does with search (AND/OR) but this depend on what you want as result, so hack it. Of course it's Perl... Search is very fast ! And in any way you can easily hack it to match your needs. Let me know if your are interested I will post it to the list. Regards, Gilles DAROLD "Poul L. Christiansen" wrote: > Hi > > In my PostgreSQL database I have a lot of newspaper articles (size: > 100mb now, growing beyond 1gb within few months). > I wan't to use full text indexing so that users can search the articles > with a keyword and have the results in less than one second. > > How do I accomplish that? > > Does PostgreSQL have this feature? > > Which 3rd party indexing tools are available that easily interoperate > with PostgreSQL? > > Thanks, > Poul L. Christiansen > Dynamic Paper
On Thu, 28 Sep 2000, Gilles DAROLD wrote: > Hi, > > After taking a look to many way from fast search into full text (flat > field with regrep, UDMSearch, > and other stuff like contrib FTI) i have developped my own indexer and > fast search. > It's still specific to my database but can be easily ported to any other > database structure. > It also use stop word > > The first load is very slow for big database but I have had incremental > indexing and then > following load are speeder. > > Here is the usage of the indexer : > > Usage: ./lincat-indexer.pl [-c -i file -h host -p port] -t table -f field > -l language -u user -s file -d dbname > -c => drop search tables and indexes, reset all indexing > -i file => do incremental index with file > -v => print a dot for each word processed > -h host => database hostname [default: localhost] > -p port => database service port [default: 5432] > -t table => table to index > -f field => field to index > -l fr => language to use (can be fr, en, ...) > -u user => postgres user who need read grant on search tables > -s file => file with a list of all words not to be indexed > -d dbname => name of the database > -q => quiet mode, do not print anything else than error > message > > I also have a search script that can be use to test the index and from you > can cut and past > some code to add to your programme. > > usage is ./lincat-search.pl "essai test". > > I don't remember exactly what it does with search (AND/OR) but this depend > on what you want > as result, so hack it. > > Of course it's Perl... Search is very fast ! > And in any way you can easily hack it to match your needs. > > Let me know if your are interested I will post it to the list. I'm interested. I was starting to write a script that would store in another table the information of certain words with the location, so when I look, I do it on the table indexed and not with regex in a blob of TEXT. Anyway I would like to see what you have. -- "And I'm happy, because you make me feel good, about me." - Melvin Udall ----------------------------------------------------------------- Mart�n Marqu�s email: martin@math.unl.edu.ar Santa Fe - Argentina http://math.unl.edu.ar/~martin/ Administrador de sistemas en math.unl.edu.ar -----------------------------------------------------------------
> "Martin A. Marques" wrote: > > Anyway I would like to see what you have. Hi, Sorry, I was not able to reply the last week-end, here are the files. They must be reviewed to match your needs, because this is specific scripts for one of my programs. Especially to use incremental index where you must find a way to create an history file of new or updated tuples to index. I think there's much things to do to perform a speedest indexing process. If you have problem to know what it does let me know... Regards, Gilles DAROLD
Attachment
Mitch Vincent wrote: > > > Hi > > > > In my PostgreSQL database I have a lot of newspaper articles (size: > > 100mb now, growing beyond 1gb within few months). > > I wan't to use full text indexing so that users can search the articles > > with a keyword and have the results in less than one second. > > > > How do I accomplish that? > > It's complicated :-) > > > Does PostgreSQL have this feature? > > Nope. > > > Which 3rd party indexing tools are available that easily interoperate > > with PostgreSQL? > > There is some code in contrib called FTI (Full Text Index - no doubt).. I > re-wrote it for my uses but ended up not using because I was doing so many > sorts and joins, it made a scan (even an index scan) of a table with 3 > million rows in it very slow. However id you were just searching the keyword > table itself it was very, very fast. > > The FTI trigger in the contrib breaks the words down to 2 letter bits (for > substring searching) -- mine doesn't, it only indexes whole words without > duplicates and looks at a list of words not to index (words like a, an, the, > anything else you want -- I think it has the 300 most used English words in > there already) > > It's drawback is speed, it does take a few seconds to index on INSERT (and > UPDATE) -- that's if your text fields are 30ish k (very close to the max PG > can store).. > > I would say you're going to run into the 32k limit pretty quick with > newspaper articles -- I index resumes and I've run into it many times (32k > of text really isn't all that much).. I split the articles up. When pg 7.1 is out, I will merge them together again :) > > Anyway, I'll try and get that trigger together that I did and send it to the > PG guys to see if it's worthy of being added to contrib -- I'll send you a > copy in private if you'd like. Note: I think there have been more people to > re-write that trigger, I haven't seen anything else though.. > Where can i find the FTI code? I looked in ftp://ftp.postgresql.org/pub/contrib/ , but no luck :( Thanks for the reply, Poul L. Christiansen
Hi All, Someone from this list has been distributing kak.htm virus. Please scan your disk to ensure that all your files are free from this virus before continuing posting your messages. Thanks Regards