Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents) - Mailing list pgsql-sql

From Patrice Hédé
Subject Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)
Date
Msg-id Pine.LNX.3.96.980617124207.2514F-100000@paris.ivo.fr
Whole thread Raw
In response to Internationalisation: SELECT str (ignoring Umlauts/Accents)  (Benedikt Eric Heinen <beh@icemark.ch>)
Responses Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)  (Benedikt Eric Heinen <beh@icemark.ch>)
List pgsql-sql
On Wed, 17 Jun 1998, Benedikt Eric Heinen wrote:

>   has anyone done a good implementation of a search algorithm, that would
> allow searching for a string, with automatic use of alternate forms for
> Umlauts ('ae' in a search string matches the a-Umlaut ), or allowing to
> search without knowing the correct accents ( 'a' in a search string also
> matches 'à', 'e' also matches 'é', ... )?
>
>   This should - of course - work one-way only, i.e. if a string in the
> database contains "ae" and the search string contains the German a-Umlaut,
> then the string containing ae should not be matched, while the other way
> around it should.

I don't know what you exactly looking for : a specific solution, or a
general one. If this is the second case, you have to take care that
different languages have different ways for dealing with crippled texts (
= without accents...).

 For example, in French, you just forget the marks for all characters
(i.e. é <-> e, ç <-> c, but also ü <-> u and not ue, and for foreign words
ß <-> s, ø <-> o, etc. ), and in other languages, it could be different
(Icelandic ð <-> d *and* dh, æ <-> ae, Norwegian å -> aa), etc.  Maybe
this is a point which should be considered in the discussion about i18n
and l10n and collation sequences that is floating around.


For a specific solution, you can try to use regex, with Perl substitutions
first to have a good pattern. For example, if you're looking for Schädel
(word taken randomly in the dictionnary :) ) :

select * from text where wort ~* 'sch(Ä|ä|ae)del';

if you want an insensitive search... I think using [] would be nicer but
you can't do that, because 'ae' is two characters... Moreover, you have to
specify both Ä and ä, because they're not considered upper- and lowercase
of the same letter (which I consider a bug... as with most languages, most
of the accented characters pairs should be matched as upper/lower in an
insensitive search (the only exception is ß/ÿ (0xdf, 0xff) and maybe ×/÷
(0xd7, 0xf7) ;) ).

Something strange : ~* '^sch(Ä|ä|ae)del' doesn't return anything, though I
had "schädel", "schaedel" as words.... Bug ! Bug !

Anyway, re-reading your mail, I see that you considered only the case
where the database has the right spelling, but someone may enter something
misspelled (as far as forgetting accents/umlauts is called
misspelling)....  then, what you want is that someone writing "schaedel"
should match both "schaedel" and "schädel", and writing "idee" should
match "idée"... then you have to work on the original string... if you
want to do that inside a postgres query, you have to do your own function
(does PL/Perl exists somewhere ??), which rewrite the string according to
your needs...

I can think about it further if you need... (I've done it in Perl, for
French words, but in your case, you have to consider two characters
matching one :
  (ae|ä)  -> ä
  ae only -> ae
  a       -> (a|à|á|â|ä)   (but not ae -> (a|à|á|â|ä)(e|é|è|ê|ë) ?? )

All these rules means you really can't do successive global substitutions,
which is what I did since in French, it's always a one to one mapping.

It's not trivial, and should be dealt language by language, but maybe with
a common base. Actually, I would really like that to be dealt in the i18n
discussion, maybe as an extension to regexes, or as a function which is
aware of the language of the field queried (which means that if national
chars are implanted, functions should be able to know what language is in
use for the particular field).

As I will have to do multilingual "wordlists" tables in some time, I am
very interested in this subject (though I don't know what it would imply
with non-iso-8859-1 character sets).

I realize also that I don't know much about the specificities of Postgres
regexes as compared to perl regexes for example, I will have to look about
that (are they fully described in the manuals ?).

Patrice Hédé

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
... Ásólfr hljóp upp á skip Hrúts  ok varð fjögurra manna bani,  áðr
Hrútr varð varr við. Sneri hann þá í móti honum. En er þeir fundust,
lagði Ásólfr í skjöld Hrúts ok í gegnum, en Hrútr hjó til Ásólfs, ok
varð þat banahögg.                                    --- Njáls Saga
----- http://www.idf.net/patrice/ ----------------------------------


pgsql-sql by date:

Previous
From: infotechsys@pivot.net
Date:
Subject: http://www.postgresql.org/supp-mlists.shtml
Next
From: Benedikt Eric Heinen
Date:
Subject: Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)