Thread: Question regarding contrib/fulltextindexing

Question regarding contrib/fulltextindexing

From
"Derek Barrett"
Date:
Hi Everybody,

I've been using Postgres for a few months now and have been very impressed with it so far. The user community is also
veryhelpful! 

I was wondering if the fulltextindex module would be useful in my situation.

On the document page,

http://techdocs.postgresql.org/techdocs/fulltextindexing.php

the explanation is that the module works well with like queries like this:

SELECT *
FROM table
WHERE
f1.string ~ '^perth
AND f2.string ~ '^stralia'

In my situation, I need to match exact words, so I've used regular expressions to search on a varchar(10000) field:

SELECT *
FROM table
WHERE description ~* ('[^a-zA-Z0-9]($keyword[$x])[^a-zA-Z0-9]');

Would this module still be useful in my situation?


Thanks,


Derek
--
_______________________________________________
Get your free email from http://www.graffiti.net

Powered by Outblaze

Re: Question regarding contrib/fulltextindexing

From
Andrew McMillan
Date:
On Sun, 2002-07-14 at 08:53, Derek Barrett wrote:
>
> In my situation, I need to match exact words, so I've used regular expressions to search on a varchar(10000) field:
>
> SELECT *
> FROM table
> WHERE description ~* ('[^a-zA-Z0-9]($keyword[$x])[^a-zA-Z0-9]');
>
> Would this module still be useful in my situation?

I was doing word search with a modified version of the fulltextindex
code, but in the end I found it easier to write a perl program to do
build the index table - the trigger approach was more work to manage
than it seemed it should be.

Regards,
                    Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267
       Are you enrolled at http://schoolreunions.co.nz/ yet?


Re: Question regarding contrib/fulltextindexing

From
"Derek Barrett"
Date:
Thanks Andrew, that solution seems to be easier in my mind.


Okay, so I create a lookup table for the description field.

CREATE TABLE lookup_description (

pk            integer
id            integer
search_word   varchar(50)

)

(pk is the primary key, id being a foreign key)


(By the way, any recommendation in sizing the search_word field? Are there strings that are large enough that aren't
worthindexing?)  

My user INSERTS the following string:

"The quick brown fox jumped over the moon and another fox."

In my code, I use a for loop, and take the text string and put it into an array, and fill up the lookup_description
table.Of course, I will create a noisewords filter to remove words like the, a, an from this list. What about duplicate
words?Should those be filtered out as well? In this example, fox is duplicated. I would assume that leaving in the
duplicates,might be useful later if I decide to implement a relevency-type of search engine. (Rank the results based on
howmany times, "fox" is found). 

TABLE lookup_description

pk id search_word
-- -- -----------
1  1  quick
2  1  brown
3  1  fox
4  1  jumped
5  1  over
6  1  moon
7  1  another
8  1  fox

Then I can create an index on the search_word column. Later when I do my SELECT query, I will join this lookup query to
themain query. 

Is that the idea?


Derek

----- Original Message -----
From: Andrew McMillan <andrew@catalyst.net.nz>
Date: 14 Jul 2002 12:16:59 +1200
To: Derek Barrett <derekbarrett@graffiti.net>
Subject: Re: [NOVICE] Question regarding contrib/fulltextindexing


> On Sun, 2002-07-14 at 08:53, Derek Barrett wrote:
> >
> > In my situation, I need to match exact words, so I've used regular expressions to search on a varchar(10000) field:
> >
> > SELECT *
> > FROM table
> > WHERE description ~* ('[^a-zA-Z0-9]($keyword[$x])[^a-zA-Z0-9]');
> >
> > Would this module still be useful in my situation?
>
> I was doing word search with a modified version of the fulltextindex
> code, but in the end I found it easier to write a perl program to do
> build the index table - the trigger approach was more work to manage
> than it seemed it should be.
>
> Regards,
>                     Andrew.
> --
> --------------------------------------------------------------------
> Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
> WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
> DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267
>        Are you enrolled at http://schoolreunions.co.nz/ yet?
>
>

--
_______________________________________________
Get your free email from http://www.graffiti.net

Powered by Outblaze

Re: Question regarding contrib/fulltextindexing

From
Andrew McMillan
Date:
On Mon, 2002-07-15 at 10:41, Derek Barrett wrote:
> Thanks Andrew, that solution seems to be easier in my mind.
>
>
> Okay, so I create a lookup table for the description field.
>
> CREATE TABLE lookup_description (
>
> pk            integer
> id            integer
> search_word   varchar(50)
>
> )
>
> (pk is the primary key, id being a foreign key)
>
>
> (By the way, any recommendation in sizing the search_word field? Are there strings that are large enough that aren't
worthindexing?)  
>
> My user INSERTS the following string:
>
> "The quick brown fox jumped over the moon and another fox."
>
> In my code, I use a for loop, and take the text string and put it into an array, and fill up the lookup_description
table.Of course, I will create a noisewords filter to remove words like the, a, an from this list. What about duplicate
words?Should those be filtered out as well? In this example, fox is duplicated. I would assume that leaving in the
duplicates,might be useful later if I decide to implement a relevency-type of search engine. (Rank the results based on
howmany times, "fox" is found). 
>
> TABLE lookup_description
>
> pk id search_word
> -- -- -----------
> 1  1  quick
> 2  1  brown
> 3  1  fox
> 4  1  jumped
> 5  1  over
> 6  1  moon
> 7  1  another
> 8  1  fox
>
> Then I can create an index on the search_word column. Later when I do my SELECT query, I will join this lookup query
tothe main query. 
>
> Is that the idea?

Yes, that's the basic idea.  A few points though:
1) the index table doesn't need a special 'primary key' field - what
value does that add?

2) the search_word can be defined as 'text', so you don't need to
specify the maximum length.

3) Realistically there is little advantage searching for words longer
than (say) 15 characters, but they also appear very infrequently.  You
have to ensure that your search functionality does the same sort of
transformations to searched for words as it did when inserting words
into the index table.  This means that if your insert truncates, or
doesn't insert some words in a stop list (if, of, a ...) then your
search functionality needs to do the same.

4) I usually have a stop_list of words that don't get indexed too:
CREATE TABLE stop_list (
  stop_word TEXT PRIMARY KEY
);
I usually load this into a hash in my perl script that is doing the
loading, but it can be more complex to deal with this appropriately in
the actual searching code - you need a first pass through the search
list to remove stop_list words, and a second pass to build the search
query.

5) you need to create an index on search_word:
CREATE INDEX lookup_description_srch
        ON lookup_description ( search_word );


Hope this helps,
                    Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267
       Are you enrolled at http://schoolreunions.co.nz/ yet?


Re: Question regarding contrib/fulltextindexing

From
"Derek Barrett"
Date:
Excellent!

This solution will work out perfectly.

The good thing is that I already have the code that does the
noise word filter on the SELECT side, so just basically some cut
and paste there to add it to the INSERT side.

Thanks for taking the time out for the tips and for clearing that up Andrew. Golden.

Best,

Derek


----- Original Message -----
From: Andrew McMillan <andrew@catalyst.net.nz>
Date: 15 Jul 2002 11:39:12 +1200
To: Derek Barrett <derekbarrett@graffiti.net>
Subject: Re: [NOVICE] Question regarding contrib/fulltextindexing


> On Mon, 2002-07-15 at 10:41, Derek Barrett wrote:
> > Thanks Andrew, that solution seems to be easier in my mind.
> >
> >
> > Okay, so I create a lookup table for the description field.
> >
> > CREATE TABLE lookup_description (
> >
> > pk            integer
> > id            integer
> > search_word   varchar(50)
> >
> > )
> >
> > (pk is the primary key, id being a foreign key)
> >
> >
> > (By the way, any recommendation in sizing the search_word field? Are there strings that are large enough that
aren'tworth indexing?)  
> >
> > My user INSERTS the following string:
> >
> > "The quick brown fox jumped over the moon and another fox."
> >
> > In my code, I use a for loop, and take the text string and put it into an array, and fill up the lookup_description
table.Of course, I will create a noisewords filter to remove words like the, a, an from this list. What about duplicate
words?Should those be filtered out as well? In this example, fox is duplicated. I would assume that leaving in the
duplicates,might be useful later if I decide to implement a relevency-type of search engine. (Rank the results based on
howmany times, "fox" is found). 
> >
> > TABLE lookup_description
> >
> > pk id search_word
> > -- -- -----------
> > 1  1  quick
> > 2  1  brown
> > 3  1  fox
> > 4  1  jumped
> > 5  1  over
> > 6  1  moon
> > 7  1  another
> > 8  1  fox
> >
> > Then I can create an index on the search_word column. Later when I do my SELECT query, I will join this lookup
queryto the main query. 
> >
> > Is that the idea?
>
> Yes, that's the basic idea.  A few points though:
> 1) the index table doesn't need a special 'primary key' field - what
> value does that add?
>
> 2) the search_word can be defined as 'text', so you don't need to
> specify the maximum length.
>
> 3) Realistically there is little advantage searching for words longer
> than (say) 15 characters, but they also appear very infrequently.  You
> have to ensure that your search functionality does the same sort of
> transformations to searched for words as it did when inserting words
> into the index table.  This means that if your insert truncates, or
> doesn't insert some words in a stop list (if, of, a ...) then your
> search functionality needs to do the same.
>
> 4) I usually have a stop_list of words that don't get indexed too:
> CREATE TABLE stop_list (
>   stop_word TEXT PRIMARY KEY
> );
> I usually load this into a hash in my perl script that is doing the
> loading, but it can be more complex to deal with this appropriately in
> the actual searching code - you need a first pass through the search
> list to remove stop_list words, and a second pass to build the search
> query.
>
> 5) you need to create an index on search_word:
> CREATE INDEX lookup_description_srch
>         ON lookup_description ( search_word );
>
>
> Hope this helps,
>                     Andrew.
> --
> --------------------------------------------------------------------
> Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
> WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
> DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267
>        Are you enrolled at http://schoolreunions.co.nz/ yet?
>
>

--
_______________________________________________
Get your free email from http://www.graffiti.net

Powered by Outblaze