Thread: Full text search strategy for names
I want to be able to search a list of articles for title words as well as author names. I understand how to do the title words with the full text searching. But I'm not sure the best strategy for the names. The full text parser "parses" the names giving undesirable results.
For example,
select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1
Is there a way to tell the parser to index the words in a column without trying to determine word roots?
Or what is the best way to index names for efficient searching?
For example,
select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1
Is there a way to tell the parser to index the words in a column without trying to determine word roots?
Or what is the best way to index names for efficient searching?
Rick Schumeyer wrote: > I want to be able to search a list of articles for title words as well > as author names.... I'm not sure the best strategy for the names. The > full text parser "parses" the names giving undesirable results. > > For example, > > select to_tsvector('claude Jones'); > to_tsvector > -------------------- > 'jone':2 'claud':1 > > > Is there a way to tell the parser to index the words in a column without > trying to determine word roots? > Or what is the best way to index names for efficient searching? I've got a similar question; but would love a dictionary that could give me an efficient index that considers that Bill/William and Bob/Rob/Robert and Khadaffi/Qaddafi might share the same roots. Ideally it'd return exact matches first, followed by the similar terms. I kludged up some many step queries to try to do this; but wonder if this would work better as a tsearch dictionary, and wonder even more if I'm re-inventing something that's already out there. Anyone know of such a tsearch dictionary that is aware of the roots of names?
Rick Schumeyer <rschumeyer@gmail.com> writes: > I want to be able to search a list of articles for title words as well as > author names. I understand how to do the title words with the full text > searching. But I'm not sure the best strategy for the names. The full text > parser "parses" the names giving undesirable results. > For example, > select to_tsvector('claude Jones'); > to_tsvector > -------------------- > 'jone':2 'claud':1 Er ... why is that a problem? As long as the index stems the words in the same way that the query does, why should you care? regards, tom lane
You can get extra (undesirable) results, depending on the name. For example, if you are searching for the last name of "Ricks", you will also find all authors whose first name is "Rick".
I also noticed that the directions for indexing multiple columns don't seem to be quite right.
In section 12.2.2:
I also noticed that the directions for indexing multiple columns don't seem to be quite right.
In section 12.2.2:
UPDATE pgweb SET textsearchable_index_col =I found that the last word of title is joined with the first word of body, which gives strange results. I ended up added a space which gave better results:
to_tsvector('english', coalesce(title,'') || coalesce(body,''));
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
On Fri, Apr 17, 2009 at 1:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Rick Schumeyer <rschumeyer@gmail.com> writes:Er ... why is that a problem? As long as the index stems the words in
> I want to be able to search a list of articles for title words as well as
> author names. I understand how to do the title words with the full text
> searching. But I'm not sure the best strategy for the names. The full text
> parser "parses" the names giving undesirable results.
> For example,
> select to_tsvector('claude Jones');
> to_tsvector
> --------------------
> 'jone':2 'claud':1
the same way that the query does, why should you care?
regards, tom lane
so it looks to me, like you are trying to use wrong tool for the job. Why not just normalize names to #3 ?
On Apr 17, 2009, at 7:02 AM, Rick Schumeyer wrote: > You can get extra (undesirable) results, depending on the name. For > example, if you are searching for the last name of "Ricks", you will > also find all authors whose first name is "Rick" If you can process the names separately from the rest of the text, try select to_tsvector('simple', 'claude Jones'); to_tsvector ---------------------- 'jones':2 'claude':1 John DeSoi, Ph.D.
Rick Schumeyer <rschumeyer@gmail.com> writes: > I also noticed that the directions for indexing multiple columns don't seem > to be quite right. > In section 12.2.2: > UPDATE pgweb SET textsearchable_index_col = > to_tsvector('english', coalesce(title,'') || coalesce(body,'')); Yeah, this would be better done with a space in the middle. I've corrected the documentation --- thanks for the comment! (Note that you could do it in other ways that don't require this, ie feed the columns to to_tsvector() separately; but in the terms of this particular example, you do want a space to avoid running words together.) regards, tom lane