how to improve this similarity query? - Mailing list pgsql-general

From hamann.w@t-online.de
Subject how to improve this similarity query?
Date
Msg-id 4E85B483.mail8Z81C29Y9@amadeus3.local
Whole thread Raw
Responses Re: how to improve this similarity query?  (pasman pasmański <pasman.p@gmail.com>)
List pgsql-general
Hi,

I have a table of names, and searches are usually performed on prefix match.
This could nicely translate into an index search
Suppose first name is stored as either 'Jim' or 'Jimmy', searching
... where firstname ~* '^jim';
gets proper result. I had hoped that creating a functional
index on lower(firstname) and using a query like
.... where lower(firstname) ~ '^jim'
would improve the search, but it does not.
I ended up with adding a lowercased column for matching

Now a few names (a few percent of the entire lot) have alternates, like 'James'.
These could be nicknames, informal variants, language variants, alternate spellings

I have already split off these few percent into a separate table and can query that like
... where 'jim' ~* firstname_pattern;'jim' ~* firstname_pattern;'jim' ~* firstname_pattern;

There are two problems with this approach: when I use 'Jimbo' for the plain query,
I do not get any 'Jim' or 'Jimmy', as expected. To achieve the same effect on the pattern
query, I need to "decorate" the pattern somewhat. Actually, when I specify 'J(im|ames)' for
the pattern, it gets preprocessed - and is stored in the database as - 'J(im|am($|e($|s)))$'
Unfortunately there are regex patterns which the preprocessing script cannot handle, so
I might have to try a different regex.
The other, bigger, problem: the search cannot make use of  an index, and it has to compile
a regex for every entry in the table. I am considering a change to that part: in the Jim/James
case it is obvious that I could speed up the query with
.... where firstname_pattern ~* '^j' and 'jim' ~* firstname_pattern;
If the pattern was 'Bob|Robert' instead, I would have to change the preprocessing so the
'B' and 'R' parts would be separate.

So, I wonder whether there is any better way of doing these. I have looked into tsquery
resp. fulltext, but they do not seem to support prefix matches, only exact ones.

Regards
Wolfgang Hamann

pgsql-general by date:

Previous
From: "Albe Laurenz"
Date:
Subject: Re: Problem dbi_link with postgresql 9.04
Next
From: "Igor Neyman"
Date:
Subject: Re: Rules going away