Thread: Full text search strategy for names

Full text search strategy for names

From

Rick Schumeyer

Date:

16 April 2009, 22:43:19

I want to be able to search a list of articles for title words as well as author names. I understand how to do the title words with the full text searching. But I'm not sure the best strategy for the names. The full text parser "parses" the names giving undesirable results.

For example,

select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1

Is there a way to tell the parser to index the words in a column without trying to determine word roots?

Or what is the best way to index names for efficient searching?

Re: Full text search strategy for names

From

Ron Mayer

Date:

17 April 2009, 00:57:23

Rick Schumeyer wrote:
> I want to be able to search a list of articles for title words as well
> as author names.... I'm not sure the best strategy for the names.  The
> full text parser "parses" the names giving undesirable results.
>
> For example,
>
> select to_tsvector('claude Jones');
>     to_tsvector
> --------------------
>  'jone':2 'claud':1
>
>
> Is there a way to tell the parser to index the words in a column without
> trying to determine word roots?

> Or what is the best way to index names for efficient searching?

I've got a similar question; but would love a dictionary that
could give me an efficient index that considers that
Bill/William and Bob/Rob/Robert and Khadaffi/Qaddafi might
share the same roots.  Ideally it'd return exact matches
first, followed by the similar terms.

I kludged up some many step queries to try to do this; but
wonder if this would work better as a tsearch dictionary,
and wonder even more if I'm re-inventing something that's
already out there.

Anyone know of such a tsearch dictionary that is aware of
the roots of names?

Re: Full text search strategy for names

From

Tom Lane

Date:

17 April 2009, 02:05:29

Rick Schumeyer <rschumeyer@gmail.com> writes:
> I want to be able to search a list of articles for title words as well as
> author names.  I understand how to do the title words with the full text
> searching.  But I'm not sure the best strategy for the names.  The full text
> parser "parses" the names giving undesirable results.

> For example,

> select to_tsvector('claude Jones');
>     to_tsvector
> --------------------
>  'jone':2 'claud':1

Er ... why is that a problem?  As long as the index stems the words in
the same way that the query does, why should you care?

            regards, tom lane

Re: Full text search strategy for names

From

Rick Schumeyer

Date:

17 April 2009, 08:02:54

You can get extra (undesirable) results, depending on the name. For example, if you are searching for the last name of "Ricks", you will also find all authors whose first name is "Rick".

I also noticed that the directions for indexing multiple columns don't seem to be quite right.

In section 12.2.2:

UPDATE pgweb SET textsearchable_index_col =
     to_tsvector('english', coalesce(title,'') || coalesce(body,''));

I found that the last word of title is joined with the first word of body, which gives strange results. I ended up added a space which gave better results:

 to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));

On Fri, Apr 17, 2009 at 1:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Schumeyer <rschumeyer@gmail.com> writes:
> I want to be able to search a list of articles for title words as well as
> author names. I understand how to do the title words with the full text
> searching. But I'm not sure the best strategy for the names. The full text
> parser "parses" the names giving undesirable results.

> For example,

> select to_tsvector('claude Jones');
> to_tsvector
> --------------------
> 'jone':2 'claud':1

Er ... why is that a problem? As long as the index stems the words in
the same way that the query does, why should you care?

regards, tom lane

Re: Full text search strategy for names

From

Grzegorz Jaśkiewicz

Date:

17 April 2009, 08:15:17

so it looks to me, like you are trying to use wrong tool for the job.
Why not just normalize names to #3 ?

Re: Full text search strategy for names

From

John DeSoi

Date:

18 April 2009, 09:54:17

On Apr 17, 2009, at 7:02 AM, Rick Schumeyer wrote:

> You can get extra (undesirable) results, depending on the name.  For
> example, if you are searching for the last name of "Ricks", you will
> also find all authors whose first name is "Rick"

If you can process the names separately from the rest of the text, try

select to_tsvector('simple', 'claude Jones');

      to_tsvector
----------------------
  'jones':2 'claude':1





John DeSoi, Ph.D.

Re: Full text search strategy for names

From

Tom Lane

Date:

19 April 2009, 17:38:57

Rick Schumeyer <rschumeyer@gmail.com> writes:
> I also noticed that the directions for indexing multiple columns don't seem
> to be quite right.

> In section 12.2.2:

> UPDATE pgweb SET textsearchable_index_col =
>      to_tsvector('english', coalesce(title,'') || coalesce(body,''));

Yeah, this would be better done with a space in the middle.  I've
corrected the documentation --- thanks for the comment!

(Note that you could do it in other ways that don't require this,
ie feed the columns to to_tsvector() separately; but in the terms of
this particular example, you do want a space to avoid running words
together.)

            regards, tom lane