how to improve this similarity query? - Mailing list pgsql-general

From	hamann.w@t-online.de
Subject	how to improve this similarity query?
Date	September 30, 2011 09:22:37
Msg-id	4E85B483.mail8Z81C29Y9@amadeus3.local
Responses	Re: how to improve this similarity query?
List	pgsql-general

Hi,

I have a table of names, and searches are usually performed as prefix matches.
This could translate nicely into an index search.
Suppose a first name is stored as either 'Jim' or 'Jimmy'; searching with
... where firstname ~* '^jim';
gets the proper result. I had hoped that creating a functional
index on lower(firstname) and using a query like
.... where lower(firstname) ~ '^jim'
would improve the search, but it does not.
I ended up adding a lowercased column for matching.
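
To illustrate why a prefix match can map onto an index search, here is a minimal sketch in Python rather than SQL. It assumes the lowercased names are kept in a list sorted by plain code-point order, so that all entries sharing a prefix are adjacent and can be found with two binary searches. The function name prefix_range is hypothetical, and a real database collation may order strings differently from this.

```python
from bisect import bisect_left

def prefix_range(sorted_names, prefix):
    """Return the entries of a sorted, lowercased list that start with `prefix`."""
    prefix = prefix.lower()
    if not prefix:
        return list(sorted_names)
    # First entry that is >= the prefix itself.
    start = bisect_left(sorted_names, prefix)
    # Smallest string that sorts after every string with this prefix:
    # the prefix with its last character bumped by one code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(sorted_names, upper)
    return sorted_names[start:end]

names = sorted(n.lower() for n in ["Jimmy", "Jim", "James", "Bob", "Jimbo"])
print(prefix_range(names, "Jim"))
```

The lookup touches only the contiguous slice between the two bounds, which is the same idea as a range scan over an ordered index.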

Now a few names (a few percent of the entire lot) have alternates, like 'James'.
These could be nicknames, informal variants, language variants, or alternate spellings.

I have already split off these few percent into a separate table and can query it like
... where 'jim' ~* firstname_pattern;

There are two problems with this approach. First, when I use 'Jimbo' for the plain query,
I do not get any 'Jim' or 'Jimmy', as expected. To achieve the same effect on the pattern
query, I need to "decorate" the pattern somewhat. Actually, when I specify 'J(im|ames)' for
the pattern, it gets preprocessed - and is stored in the database as - 'J(im|am($|e($|s)))$'.
Unfortunately there are regex patterns which the preprocessing script cannot handle, so
I might have to try a different regex.
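
A minimal sketch of this kind of decoration, in Python. It is not the actual preprocessing script, which is not shown here. It assumes the alternates are available as a plain list of names rather than as a regex, and that the first three characters are mandatory (inferred from the 'Jim' / 'Jam' example above). The names expand_name, build_pattern and matches are hypothetical, and Python's re module only approximates PostgreSQL's ~* operator.

```python
import re

def expand_name(name, min_len=3):
    """Regex fragment for `name` in which every character after the first
    `min_len` is optional, provided the input ends at that point."""
    head, tail = name[:min_len], name[min_len:]
    nested = ""
    # Wrap from the last character outwards: "es" -> "($|e($|s))".
    for ch in reversed(tail):
        nested = "($|" + re.escape(ch) + nested + ")"
    return re.escape(head) + nested

def build_pattern(names, min_len=3):
    """Combine the expanded alternates into one anchored pattern."""
    return "^(?:" + "|".join(expand_name(n, min_len) for n in names) + ")$"

def matches(query, pattern):
    """Rough analogue of: query ~* pattern (case-insensitive regex match)."""
    return re.search(pattern, query, re.IGNORECASE) is not None

pattern = build_pattern(["Jim", "James"])
print(pattern)
for query in ["jim", "jam", "james", "jimbo"]:
    print(query, matches(query, pattern))
```

Starting from a list of names instead of a hand-written regex sidesteps the patterns the script cannot parse, at the cost of not factoring out common prefixes such as the leading 'J'.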
The other, bigger, problem: the search cannot make use of an index, and it has to compile
a regex for every entry in the table. I am considering a change to that part: in the Jim/James
case it is obvious that I could speed up the query with
.... where firstname_pattern ~* '^j' and 'jim' ~* firstname_pattern;
If the pattern were 'Bob|Robert' instead, I would have to change the preprocessing so the
'B' and 'R' parts would be separate.
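
One way the separation by first letter could look, as a hedged Python sketch. It assumes the alternates are available as a list of names; split_by_initial is a hypothetical helper, not part of the existing preprocessing. Each resulting group would then be preprocessed and stored on its own, so that every stored pattern starts with a single known letter.

```python
from collections import defaultdict

def split_by_initial(alternates):
    """Group alternate names by their lowercased first letter."""
    groups = defaultdict(list)
    for name in alternates:
        if not name:
            continue  # skip empty entries; they have no first letter
        groups[name[0].lower()].append(name)
    return dict(groups)

print(split_by_initial(["Bob", "Robert"]))
print(split_by_initial(["Jim", "James"]))
```

With 'Bob|Robert' this gives one group under 'b' and one under 'r', while 'Jim' and 'James' stay together under 'j'.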

So, I wonder whether there is any better way of doing this. I have looked into tsquery
and full-text search, but they do not seem to support prefix matches, only exact ones.

Regards
Wolfgang Hamann
