Re: how to improve this similarity query? - Mailing list pgsql-general

From pasman pasmański
Subject Re: how to improve this similarity query?
Date
Msg-id CAOWY8=bx+t=XHy8T3am5qhK3VaYRxzSZcibL4E1VtHctrxbEUQ@mail.gmail.com
Whole thread Raw
In response to how to improve this similarity query?  (hamann.w@t-online.de)
List pgsql-general
Look at doc chapter II.12

2011/9/30, hamann.w@t-online.de <hamann.w@t-online.de>:
> Hi,
>
> I have a table of names, and searches are usually performed on prefix match.
> This could nicely translate into an index search
> Suppose first name is stored as either 'Jim' or 'Jimmy', searching
> ... where firstname ~* '^jim';
> gets proper result. I had hoped that creating a functional
> index on lower(firstname) and using a query like
> .... where lower(firstname) ~ '^jim'
> would improve the search, but it does not.
> I ended up with adding a lowercased column for matching
>
> Now a few names (a few percent of the entire lot) have alternates, like
> 'James'.
> These could be nicknames, informal variants, language variants, alternate
> spellings
>
> I have already split off these few percent into a separate table and can
> query that like
> ... where 'jim' ~* firstname_pattern;'jim' ~* firstname_pattern;'jim' ~*
> firstname_pattern;
>
> There are two problems with this approach: when I use 'Jimbo' for the plain
> query,
> I do not get any 'Jim' or 'Jimmy', as expected. To achieve the same effect
> on the pattern
> query, I need to "decorate" the pattern somewhat. Actually, when I specify
> 'J(im|ames)' for
> the pattern, it gets preprocessed - and is stored in the database as -
> 'J(im|am($|e($|s)))$'
> Unfortunately there are regex patterns which the preprocessing script cannot
> handle, so
> I might have to try a different regex.
> The other, bigger, problem: the search cannot make use of  an index, and it
> has to compile
> a regex for every entry in the table. I am considering a change to that
> part: in the Jim/James
> case it is obvious that I could speed up the query with
> .... where firstname_pattern ~* '^j' and 'jim' ~* firstname_pattern;
> If the pattern was 'Bob|Robert' instead, I would have to change the
> preprocessing so the
> 'B' and 'R' parts would be separate.
>
> So, I wonder whether there is any better way of doing these. I have looked
> into tsquery
> resp. fulltext, but they do not seem to support prefix matches, only exact
> ones.
>
> Regards
> Wolfgang Hamann
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


--
------------
pasman

pgsql-general by date:

Previous
From: John R Pierce
Date:
Subject: Re: stored procs
Next
From: Cody Caughlan
Date:
Subject: Change server encoding after the fact