Thread: full text indexing

full text indexing

From
"Poul L. Christiansen"
Date:
Hi

In my PostgreSQL database I have a lot of newspaper articles (size:
100mb now, growing beyond 1gb within few months).
I wan't to use full text indexing so that users can search the articles
with a keyword and have the results in less than one second.

How do I accomplish that?

Does PostgreSQL have this feature?

Which 3rd party indexing tools are available that easily interoperate
with PostgreSQL?

Thanks,
Poul L. Christiansen
Dynamic Paper

Re: full text indexing

From
Titus Brown
Date:
-> In my PostgreSQL database I have a lot of newspaper articles (size:
-> 100mb now, growing beyond 1gb within few months).
-> I wan't to use full text indexing so that users can search the articles
-> with a keyword and have the results in less than one second.
->
-> How do I accomplish that?
->
-> Does PostgreSQL have this feature?
->
-> Which 3rd party indexing tools are available that easily interoperate
-> with PostgreSQL?

I wouldn't say "easy" -- you have to have a static image of your database
on disk somewhere -- but 'glimpse' works quite well and is fast enough
to do 1gb that quickly.

http://glimpse.cs.arizona.edu/

I have a tool to let you do searches on a glimpse database from within
AOLserver at http://www.idyll.org/~t/www-tools/.  It's been working out
quite well for me.

cheers,
--titus

Re: full text indexing

From
"Mitch Vincent"
Date:
> Hi
>
> In my PostgreSQL database I have a lot of newspaper articles (size:
> 100mb now, growing beyond 1gb within few months).
> I wan't to use full text indexing so that users can search the articles
> with a keyword and have the results in less than one second.
>
> How do I accomplish that?

It's complicated :-)

> Does PostgreSQL have this feature?

Nope.

> Which 3rd party indexing tools are available that easily interoperate
> with PostgreSQL?

There is some code in contrib called FTI (Full Text Index - no doubt).. I
re-wrote it for my uses but ended up not using because I was doing so many
sorts and joins, it made a scan (even an index scan) of a table with 3
million rows in it very slow. However id you were just searching the keyword
table itself it was very, very fast.

The FTI trigger in the contrib breaks the words down to 2 letter bits (for
substring searching) -- mine doesn't, it only indexes whole words without
duplicates and looks at a list of words not to index (words like a, an, the,
anything else you want -- I think it has the 300 most used English words in
there already)

It's drawback is speed, it does take a few seconds to index on INSERT (and
UPDATE) -- that's if your text fields are 30ish k (very close to the max PG
can store)..

I would say you're going to run into the 32k limit pretty quick with
newspaper articles -- I index resumes and I've run into it many times (32k
of text really isn't all that much)..

Anyway, I'll try and get that trigger together that I did and send it to the
PG guys to see if it's worthy of being added to contrib -- I'll send you a
copy in private if you'd like. Note: I think there have been more people to
re-write that trigger, I haven't seen anything else though..

Good luck!

<End of long-winded response>

-Mitch



Re: full text indexing

From
Gunnar R|nning
Date:
"Mitch Vincent" <mitch@venux.net> writes:

> It's complicated :-)
>

Yup. :-)

> I would say you're going to run into the 32k limit pretty quick with
> newspaper articles -- I index resumes and I've run into it many times (32k
> of text really isn't all that much)..

Looking forward to TOAST...

>
> Anyway, I'll try and get that trigger together that I did and send it to the
> PG guys to see if it's worthy of being added to contrib -- I'll send you a
> copy in private if you'd like. Note: I think there have been more people to
> re-write that trigger, I haven't seen anything else though..
>

I would like to have a look on it. We have similar "simple" full text
search with a keyword table and usage of stop words. But I would really
like to try some thirdparty add-on. I used Verity full text search for
awhile with Sybase.  It let you do joins and such with proxy tables that
were the interface to Verity full text index. Very nice from a usage
perspective compared to what I'm doing now...

regards,

    Gunnar

RE: full text indexing

From
Bill Barnes
Date:
There was an article in the Nov '95 Dr. Dobb's Journal about a Canadian phone
company that did a full text index on a gazillion phone records and retrieved
them quickly enough to run an inquiry operation.
It was an interesting departure in building indices in that it created a
directory for a text instance, such as a phone number, word, street number,
etc, then attached the record number/s where found.
Couldn't retrieve the article (it's on a $99 CD) but found the program
listings and have attached the file.  I've never used it the idea was
intriguing and thought I would someday put it to use.

Good Luck
Bill

>===== Original Message From "Poul L. Christiansen" <poulc@cs.auc.dk> =====
>Hi
>
>In my PostgreSQL database I have a lot of newspaper articles (size:
>100mb now, growing beyond 1gb within few months).
>I wan't to use full text indexing so that users can search the articles
>with a keyword and have the results in less than one second.
>
>How do I accomplish that?
>
>Does PostgreSQL have this feature?
>
>Which 3rd party indexing tools are available that easily interoperate
>with PostgreSQL?
>
>Thanks,
>Poul L. Christiansen
>Dynamic Paper

Attachment

Re: full text indexing

From
mjp@ornl.gov
Date:
Interesting...I've used that very idea (with some variations)
to implement the keyword search in the Genome channel. See
the "Keyword Search" in http://compbio.ornl.gov/channel/

cheers,
Morey Parang
ORNL

On Wed, Sep 27, 2000 at 08:33:31PM -0400, Bill Barnes wrote:
> There was an article in the Nov '95 Dr. Dobb's Journal about a Canadian phone
> company that did a full text index on a gazillion phone records and retrieved
> them quickly enough to run an inquiry operation.
> It was an interesting departure in building indices in that it created a
> directory for a text instance, such as a phone number, word, street number,
> etc, then attached the record number/s where found.
> Couldn't retrieve the article (it's on a $99 CD) but found the program
> listings and have attached the file.  I've never used it the idea was
> intriguing and thought I would someday put it to use.
>
> Good Luck
> Bill
>
> >===== Original Message From "Poul L. Christiansen" <poulc@cs.auc.dk> =====
> >Hi
> >
> >In my PostgreSQL database I have a lot of newspaper articles (size:
> >100mb now, growing beyond 1gb within few months).
> >I wan't to use full text indexing so that users can search the articles
> >with a keyword and have the results in less than one second.
> >
> >How do I accomplish that?
> >
> >Does PostgreSQL have this feature?
> >
> >Which 3rd party indexing tools are available that easily interoperate
> >with PostgreSQL?
> >
> >Thanks,
> >Poul L. Christiansen
> >Dynamic Paper



Re: full text indexing

From
Gilles DAROLD
Date:
Hi,

After taking a look to many way from fast search into full text (flat
field with regrep, UDMSearch,
and other stuff like contrib  FTI) i have developped my own indexer and
fast search.
It's still specific to my database but can be easily ported to any other
database structure.
It also use stop word

The first load is very slow for big database but I have had incremental
indexing and then
following load are speeder.

Here is the usage of the indexer :

Usage: ./lincat-indexer.pl [-c -i file -h host -p port] -t table -f field
-l language -u user -s file -d dbname
        -c        => drop search tables and indexes, reset all indexing
        -i file   => do incremental index with file
        -v        => print a dot for each word processed
        -h host   => database hostname [default: localhost]
        -p port   => database service port [default: 5432]
        -t table  => table to index
        -f field  => field to index
        -l fr     => language to use (can be fr, en, ...)
        -u user   => postgres user who need read grant on search tables
        -s file   => file with a list of all words not to be indexed
        -d dbname => name of the database
        -q        => quiet mode, do not print anything else than error
message

I also have a search script that can be use to test the index and from you
can cut and past
some code to add to your programme.

usage is ./lincat-search.pl "essai test".

I don't remember exactly what it does with search (AND/OR) but this depend
on what you want
as result, so hack it.

Of course it's Perl... Search is very fast !
And in any way you can easily hack it to match your needs.

Let me know if your are interested I will post it to the list.

Regards,

Gilles DAROLD

"Poul L. Christiansen" wrote:

> Hi
>
> In my PostgreSQL database I have a lot of newspaper articles (size:
> 100mb now, growing beyond 1gb within few months).
> I wan't to use full text indexing so that users can search the articles
> with a keyword and have the results in less than one second.
>
> How do I accomplish that?
>
> Does PostgreSQL have this feature?
>
> Which 3rd party indexing tools are available that easily interoperate
> with PostgreSQL?
>
> Thanks,
> Poul L. Christiansen
> Dynamic Paper


Re: full text indexing

From
"Martin A. Marques"
Date:
On Thu, 28 Sep 2000, Gilles DAROLD wrote:
> Hi,
>
> After taking a look to many way from fast search into full text (flat
> field with regrep, UDMSearch,
> and other stuff like contrib  FTI) i have developped my own indexer and
> fast search.
> It's still specific to my database but can be easily ported to any other
> database structure.
> It also use stop word
>
> The first load is very slow for big database but I have had incremental
> indexing and then
> following load are speeder.
>
> Here is the usage of the indexer :
>
> Usage: ./lincat-indexer.pl [-c -i file -h host -p port] -t table -f field
> -l language -u user -s file -d dbname
>         -c        => drop search tables and indexes, reset all indexing
>         -i file   => do incremental index with file
>         -v        => print a dot for each word processed
>         -h host   => database hostname [default: localhost]
>         -p port   => database service port [default: 5432]
>         -t table  => table to index
>         -f field  => field to index
>         -l fr     => language to use (can be fr, en, ...)
>         -u user   => postgres user who need read grant on search tables
>         -s file   => file with a list of all words not to be indexed
>         -d dbname => name of the database
>         -q        => quiet mode, do not print anything else than error
> message
>
> I also have a search script that can be use to test the index and from you
> can cut and past
> some code to add to your programme.
>
> usage is ./lincat-search.pl "essai test".
>
> I don't remember exactly what it does with search (AND/OR) but this depend
> on what you want
> as result, so hack it.
>
> Of course it's Perl... Search is very fast !
> And in any way you can easily hack it to match your needs.
>
> Let me know if your are interested I will post it to the list.

I'm interested. I was starting to write a script that would store in another
table the information of certain words with the location, so when I look, I
do it on the table indexed and not with regex in a blob of TEXT.

Anyway I would like to see what you have.

--
"And I'm happy, because you make me feel good, about me." - Melvin Udall
-----------------------------------------------------------------
Mart�n Marqu�s            email:     martin@math.unl.edu.ar
Santa Fe - Argentina        http://math.unl.edu.ar/~martin/
Administrador de sistemas en math.unl.edu.ar
-----------------------------------------------------------------

Re: full text indexing

From
Gilles DAROLD
Date:
> "Martin A. Marques" wrote:
>
> Anyway I would like to see what you have.

Hi,

Sorry, I was not able to reply the last week-end, here are the files.

They must be reviewed to match your needs, because this is specific
scripts for one of my programs. Especially to use incremental index
where you must find a way to create an history file of new or updated
tuples to index.

I think there's much things to do to perform a speedest indexing process.

If you have problem to know what it does let me know...

Regards,

Gilles DAROLD

Attachment

Re: full text indexing

From
"Poul L. Christiansen"
Date:
Mitch Vincent wrote:
>
> > Hi
> >
> > In my PostgreSQL database I have a lot of newspaper articles (size:
> > 100mb now, growing beyond 1gb within few months).
> > I wan't to use full text indexing so that users can search the articles
> > with a keyword and have the results in less than one second.
> >
> > How do I accomplish that?
>
> It's complicated :-)
>
> > Does PostgreSQL have this feature?
>
> Nope.
>
> > Which 3rd party indexing tools are available that easily interoperate
> > with PostgreSQL?
>
> There is some code in contrib called FTI (Full Text Index - no doubt).. I
> re-wrote it for my uses but ended up not using because I was doing so many
> sorts and joins, it made a scan (even an index scan) of a table with 3
> million rows in it very slow. However id you were just searching the keyword
> table itself it was very, very fast.
>
> The FTI trigger in the contrib breaks the words down to 2 letter bits (for
> substring searching) -- mine doesn't, it only indexes whole words without
> duplicates and looks at a list of words not to index (words like a, an, the,
> anything else you want -- I think it has the 300 most used English words in
> there already)
>
> It's drawback is speed, it does take a few seconds to index on INSERT (and
> UPDATE) -- that's if your text fields are 30ish k (very close to the max PG
> can store)..
>
> I would say you're going to run into the 32k limit pretty quick with
> newspaper articles -- I index resumes and I've run into it many times (32k
> of text really isn't all that much)..

I split the articles up. When pg 7.1 is out, I will merge them together
again :)

>
> Anyway, I'll try and get that trigger together that I did and send it to the
> PG guys to see if it's worthy of being added to contrib -- I'll send you a
> copy in private if you'd like. Note: I think there have been more people to
> re-write that trigger, I haven't seen anything else though..
>

Where can i find the FTI code?

I looked in ftp://ftp.postgresql.org/pub/contrib/ , but no luck :(

Thanks for the reply,
Poul L. Christiansen

kak.htm virus

From
"CSIB"
Date:
Hi All,

Someone from this list has been distributing kak.htm virus. Please scan your
disk to ensure that all your files are free from this virus before
continuing posting your messages.


Thanks


Regards