Thread: Accent insensitive search
Hi. I have a few databases created with UNICODE encoding, and I would like to be able to search with accent insensitivity. There's something in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do this, but I found nothing in PostgreSQL, just the 'to_ascii' function, which, AFAIK, doesn't work with UNICODE.

Thanks in advance.
> Hi. I have a few databases created with UNICODE encoding, and I would
> like to be able to search with accent insensitivity. There's something
> in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do
> this, but I found nothing in PostgreSQL, just the 'to_ascii' function,
> which, AFAIK, doesn't work with UNICODE.

The easiest way is to create an extra column which will hold a copy of your text, with all accents removed. You can also convert it to lowercase and remove apostrophes, punctuation etc. Said column is kept up to date with a trigger. Python is well suited for this (use unicodedata.normalize).

Keeping a copy of the processed data will speed up searches versus WHERE remove_accents(blah) = 'text', even with a functional index.

Note that this function could also be written in C, using a lookup table over the first 64K Unicode code points, for extra speed. See attached file.
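A rough sketch of the column-plus-trigger idea, assuming plpythonu and plpgsql are installed; the docs table and its body/body_search columns are invented for illustration:

    -- Strip accents by decomposing to NFKD and dropping combining marks.
    -- In a UTF8 database, PL/Python (Python 2) receives text as a
    -- utf-8 encoded byte string, hence the decode/encode dance.
    CREATE OR REPLACE FUNCTION remove_accents(text) RETURNS text AS $$
        import unicodedata
        decomposed = unicodedata.normalize('NFKD', args[0].decode('utf-8'))
        kept = [c for c in decomposed if not unicodedata.combining(c)]
        return ''.join(kept).encode('utf-8')
    $$ LANGUAGE plpythonu IMMUTABLE STRICT;

    -- Keep the search column in sync on every write.
    CREATE OR REPLACE FUNCTION docs_sync_search() RETURNS trigger AS $$
    BEGIN
        NEW.body_search := lower(remove_accents(NEW.body));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER docs_sync_search
        BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW EXECUTE PROCEDURE docs_sync_search();

Queries then compare against the precomputed column, e.g. WHERE body_search = lower(remove_accents('Québec')).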
"PFC" <lists@peufeu.com> writes: >> Hi. I have a few databases created with UNICODE encoding, and I would like to >> be able to search with accent insensitivity. There's something in Oracle >> (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do this, but I found >> nothing in PostgreSQL, just the 'to_ascii' function, which AFAIK, doesn't >> work with UNICODE. Postgres supports localized collation orders but what it doesn't currently support is having multiple collation orders within the same server. So when you initialize the server with initdb it takes the setting of LC_COLLATE (usually from LC_ALL) and stores that permanently. If you initdb with LC_COLLATE set to a locale like en_GB.UTF-8 or something like that you may find one that has the behaviour you want. I think they won't be entirely accent insensitive but they'll consider accents only if the rest of the string is identical. You can test the sort order of a locale by writing a file with sample words and sorting with something like: LC_ALL=en_GB.UTF-8 sort /tmp/data > The easiest way is to create an extra column which will hold a copy of > your text, with all accents removed. You can also convert it to lowercase and > remove apostrophes, punctuation etc. Said column is kept up to date with a > trigger. That's another alternative which is useful if you need multiple collations in your database. This gives you control over which collation is used when and exactly what the rules are. The downside is that you have to reinvent the collation rules which the localized collations already provide. You don't necessarily have to keep a column in your table with the normalized strings. You can normalize "on-the-fly" using an expression index as long as your function always returns the same data given the same inputs (and is therefore marked "immutable"). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
PFC wrote:
>> Hi. I have a few databases created with UNICODE encoding, and I would
>> like to be able to search with accent insensitivity. There's something
>> in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do
>> this, but I found nothing in PostgreSQL, just the 'to_ascii' function,
>> which, AFAIK, doesn't work with UNICODE.
>
> The easiest way is to create an extra column which will hold a copy of
> your text, with all accents removed. You can also convert it to lowercase
> and remove apostrophes, punctuation etc. Said column is kept up to date
> with a trigger.

Creating an extra column which holds a copy of the data is certainly not a very good idea: it creates unnecessary redundancy (breaking normal form), it bloats your table, and the trigger has a certain performance impact.

My suggestion is to write a function that removes the accents in a string for your language; let's call it noaccents(text). Then you can index the column atext with

  CREATE INDEX atable_atext_idx ON atable (noaccents(atext));

and every query of the form

  ... WHERE noaccents(atext) = noaccents('SOMÉTHING')

can use the index.

Yours,
Laurenz Albe
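For reference, one possible noaccents() implementation in plain SQL using translate(); the character lists below are only a sample of Western European accents and would need extending for other languages:

    CREATE OR REPLACE FUNCTION noaccents(text) RETURNS text AS $$
        SELECT translate($1,
                         'áàâäéèêëíìîïóòôöúùûüÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜñÑçÇ',
                         'aaaaeeeeiiiioooouuuuAAAAEEEEIIIIOOOOUUUUnNcC');
    $$ LANGUAGE sql IMMUTABLE STRICT;

As Gregory noted, it must be declared IMMUTABLE, or the CREATE INDEX above will be rejected.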
Diego Manilla Suárez wrote:
> Hi. I have a few databases created with UNICODE encoding, and I would
> like to be able to search with accent insensitivity. There's something
> in Oracle (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do
> this, but I found nothing in PostgreSQL, just the 'to_ascii' function,
> which, AFAIK, doesn't work with UNICODE.

to_ascii() doesn't work with UTF-8, but you can use convert() to turn the UTF-8 text into Latin-1 and then use to_ascii() to remove the funny bits.

--
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
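For example (assuming a server where convert() still takes and returns text, i.e. pre-8.3; characters with no Latin-1 equivalent will make the conversion fail):

    SELECT to_ascii(convert('Suárez', 'UTF8', 'LATIN1'), 'LATIN1');
    -- returns 'Suarez'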