Thread: Accent insensitive search
Hi. I have a few databases created with UNICODE encoding, and I would like to be able to search with accent insensitivity. There's something in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do this, but I found nothing in PostgreSQL, just the 'to_ascii' function, which, AFAIK, doesn't work with UNICODE.

Thanks in advance.
> Hi. I have a few databases created with UNICODE encoding, and I would
> like to be able to search with accent insensitivity. There's something
> in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do
> this, but I found nothing in PostgreSQL, just the 'to_ascii' function,
> which, AFAIK, doesn't work with UNICODE.

The easiest way is to create an extra column which will hold a copy of your text, with all accents removed. You can also convert it to lowercase and remove apostrophes, punctuation etc. Said column is kept up to date with a trigger. Python is well suited for this (use unicodedata.normalize).

Keeping a copy of the processed data will speed up searches versus WHERE remove_accents(blah) = 'text', even with a functional index.

Note that this function could also be written in C, using a lookup table over the first 64K Unicode code points, for extra speed. See attached file.
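A rough sketch of the column-plus-trigger idea, assuming plpythonu and plpgsql are installed; the docs table and its body/body_search columns are invented for illustration:

    -- Strip accents by decomposing to NFKD and dropping combining marks.
    -- In a UTF8 database, PL/Python (Python 2) receives text as a
    -- utf-8 encoded byte string, hence the decode/encode dance.
    CREATE OR REPLACE FUNCTION remove_accents(text) RETURNS text AS $$
        import unicodedata
        decomposed = unicodedata.normalize('NFKD', args[0].decode('utf-8'))
        kept = [c for c in decomposed if not unicodedata.combining(c)]
        return ''.join(kept).encode('utf-8')
    $$ LANGUAGE plpythonu IMMUTABLE STRICT;

    -- Keep the search column in sync on every write.
    CREATE OR REPLACE FUNCTION docs_sync_search() RETURNS trigger AS $$
    BEGIN
        NEW.body_search := lower(remove_accents(NEW.body));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER docs_sync_search
        BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW EXECUTE PROCEDURE docs_sync_search();

Queries then compare against the precomputed column, e.g. WHERE body_search = lower(remove_accents('Québec')).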
"PFC" <lists@peufeu.com> writes: >> Hi. I have a few databases created with UNICODE encoding, and I would like to >> be able to search with accent insensitivity. There's something in Oracle >> (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do this, but I found >> nothing in PostgreSQL, just the 'to_ascii' function, which AFAIK, doesn't >> work with UNICODE. Postgres supports localized collation orders but what it doesn't currently support is having multiple collation orders within the same server. So when you initialize the server with initdb it takes the setting of LC_COLLATE (usually from LC_ALL) and stores that permanently. If you initdb with LC_COLLATE set to a locale like en_GB.UTF-8 or something like that you may find one that has the behaviour you want. I think they won't be entirely accent insensitive but they'll consider accents only if the rest of the string is identical. You can test the sort order of a locale by writing a file with sample words and sorting with something like: LC_ALL=en_GB.UTF-8 sort /tmp/data > The easiest way is to create an extra column which will hold a copy of > your text, with all accents removed. You can also convert it to lowercase and > remove apostrophes, punctuation etc. Said column is kept up to date with a > trigger. That's another alternative which is useful if you need multiple collations in your database. This gives you control over which collation is used when and exactly what the rules are. The downside is that you have to reinvent the collation rules which the localized collations already provide. You don't necessarily have to keep a column in your table with the normalized strings. You can normalize "on-the-fly" using an expression index as long as your function always returns the same data given the same inputs (and is therefore marked "immutable"). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
PFC wrote:
>> Hi. I have a few databases created with UNICODE encoding, and I would
>> like to be able to search with accent insensitivity. There's something
>> in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do
>> this, but I found nothing in PostgreSQL, just the 'to_ascii' function,
>> which, AFAIK, doesn't work with UNICODE.
>
> The easiest way is to create an extra column which will hold a copy of
> your text, with all accents removed. You can also convert it to lowercase
> and remove apostrophes, punctuation etc. Said column is kept up to date
> with a trigger.

Creating an extra column which holds a copy of the data is certainly not a very good idea: it creates unnecessary redundancy (breaking normal form), it bloats your table, and the trigger has a certain performance impact.

My suggestion is to write a function that removes the accents in a string for your language; let's call it noaccents(text). Then you can index the column atext with

  CREATE INDEX atable_atext_idx ON atable (noaccents(atext));

and every query of the form

  ... WHERE noaccents(atext) = noaccents('SOMÉTHING')

can use the index.

Yours,
Laurenz Albe
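For reference, one possible noaccents() implementation in plain SQL using translate(); the character lists below are only a sample of Western European accents and would need extending for other languages:

    CREATE OR REPLACE FUNCTION noaccents(text) RETURNS text AS $$
        SELECT translate($1,
                         'áàâäéèêëíìîïóòôöúùûüÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜñÑçÇ',
                         'aaaaeeeeiiiioooouuuuAAAAEEEEIIIIOOOOUUUUnNcC');
    $$ LANGUAGE sql IMMUTABLE STRICT;

As Gregory noted, it must be declared IMMUTABLE, or the CREATE INDEX above will be rejected.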
Diego Manilla Suárez wrote:
> Hi. I have a few databases created with UNICODE encoding, and I would
> like to be able to search with accent insensitivity. There's something
> in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do
> this, but I found nothing in PostgreSQL, just the 'to_ascii' function,
> which, AFAIK, doesn't work with UNICODE.

to_ascii() doesn't work with UTF-8, but you can use convert() to turn the UTF-8 text into Latin-1 and then use to_ascii() to remove the funny bits.

--
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
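For example (assuming a server where convert() still takes and returns text, i.e. pre-8.3; characters with no Latin-1 equivalent will make the conversion fail):

    SELECT to_ascii(convert('Suárez', 'UTF8', 'LATIN1'), 'LATIN1');
    -- returns 'Suarez'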