Thread: Internationalisation: SELECT str (ignoring Umlauts/Accents)

Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Benedikt Eric Heinen
Date:
Hi there,


  has anyone done a good implementation of a search algorithm, that would
allow searching for a string, with automatic use of alternate forms for
Umlauts ('ae' in a search string matches the a-Umlaut ), or allowing to
search without knowing the correct accents ( 'a' in a search string also
matches 'à', 'e' also matches 'é', ... )?

  This should - of course - work one-way only, i.e. if a string in the
database contains "ae" and the search string contains the German a-Umlaut,
then the string containing ae should not be matched, while the other way
around it should.



  Ideas anyone?


      Benedikt

Windows 95: n.
    32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit
    operating system originally coded for a 4-bit microprocessor,  written
         by a 2-bit company that can't stand for 1 bit of competition.


Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Patrice Hédé
Date:
On Wed, 17 Jun 1998, Benedikt Eric Heinen wrote:

>   has anyone done a good implementation of a search algorithm, that would
> allow searching for a string, with automatic use of alternate forms for
> Umlauts ('ae' in a search string matches the a-Umlaut ), or allowing to
> search without knowing the correct accents ( 'a' in a search string also
> matches 'à', 'e' also matches 'é', ... )?
>
>   This should - of course - work one-way only, i.e. if a string in the
> database contains "ae" and the search string contains the German a-Umlaut,
> then the string containing ae should not be matched, while the other way
> around it should.

I don't know what you exactly looking for : a specific solution, or a
general one. If this is the second case, you have to take care that
different languages have different ways for dealing with crippled texts (
= without accents...).

 For example, in French, you just forget the marks for all characters
(i.e. é <-> e, ç <-> c, but also ü <-> u and not ue, and for foreign words
ß <-> s, ø <-> o, etc. ), and in other languages, it could be different
(Icelandic ð <-> d *and* dh, æ <-> ae, Norwegian å -> aa), etc.  Maybe
this is a point which should be considered in the discussion about i18n
and l10n and collation sequences that is floating around.


For a specific solution, you can try to use regex, with Perl substitutions
first to have a good pattern. For example, if you're looking for Schädel
(word taken randomly in the dictionnary :) ) :

select * from text where wort ~* 'sch(Ä|ä|ae)del';

if you want an insensitive search... I think using [] would be nicer but
you can't do that, because 'ae' is two characters... Moreover, you have to
specify both Ä and ä, because they're not considered upper- and lowercase
of the same letter (which I consider a bug... as with most languages, most
of the accented characters pairs should be matched as upper/lower in an
insensitive search (the only exception is ß/ÿ (0xdf, 0xff) and maybe ×/÷
(0xd7, 0xf7) ;) ).

Something strange : ~* '^sch(Ä|ä|ae)del' doesn't return anything, though I
had "schädel", "schaedel" as words.... Bug ! Bug !

Anyway, re-reading your mail, I see that you considered only the case
where the database has the right spelling, but someone may enter something
misspelled (as far as forgetting accents/umlauts is called
misspelling)....  then, what you want is that someone writing "schaedel"
should match both "schaedel" and "schädel", and writing "idee" should
match "idée"... then you have to work on the original string... if you
want to do that inside a postgres query, you have to do your own function
(does PL/Perl exists somewhere ??), which rewrite the string according to
your needs...

I can think about it further if you need... (I've done it in Perl, for
French words, but in your case, you have to consider two characters
matching one :
  (ae|ä)  -> ä
  ae only -> ae
  a       -> (a|à|á|â|ä)   (but not ae -> (a|à|á|â|ä)(e|é|è|ê|ë) ?? )

All these rules means you really can't do successive global substitutions,
which is what I did since in French, it's always a one to one mapping.

It's not trivial, and should be dealt language by language, but maybe with
a common base. Actually, I would really like that to be dealt in the i18n
discussion, maybe as an extension to regexes, or as a function which is
aware of the language of the field queried (which means that if national
chars are implanted, functions should be able to know what language is in
use for the particular field).

As I will have to do multilingual "wordlists" tables in some time, I am
very interested in this subject (though I don't know what it would imply
with non-iso-8859-1 character sets).

I realize also that I don't know much about the specificities of Postgres
regexes as compared to perl regexes for example, I will have to look about
that (are they fully described in the manuals ?).

Patrice Hédé

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
... Ásólfr hljóp upp á skip Hrúts  ok varð fjögurra manna bani,  áðr
Hrútr varð varr við. Sneri hann þá í móti honum. En er þeir fundust,
lagði Ásólfr í skjöld Hrúts ok í gegnum, en Hrútr hjó til Ásólfs, ok
varð þat banahögg.                                    --- Njáls Saga
----- http://www.idf.net/patrice/ ----------------------------------


Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Benedikt Eric Heinen
Date:
> >   has anyone done a good implementation of a search algorithm, that would
> > allow searching for a string, with automatic use of alternate forms for
> > Umlauts ('ae' in a search string matches the a-Umlaut ), or allowing to
> > search without knowing the correct accents ( 'a' in a search string also
> > matches 'à', 'e' also matches 'é', ... )?
> >   This should - of course - work one-way only, i.e. if a string in the
> > database contains "ae" and the search string contains the German a-Umlaut,
> > then the string containing ae should not be matched, while the other way
> > around it should.
> I don't know what you exactly looking for : a specific solution, or a
> general one. If this is the second case, you have to take care that
> different languages have different ways for dealing with crippled texts (
> = without accents...).

Oh well, let me extend the question then, what I am looking for is a
solution that works for Switzerland, e.g. a country with 4 official
languages (one of which basically gets ignored) and a 5th "major"
language. So, I need a search function to look for German, French, Italian
and English names (I am not doing Rumantsch [the 4th official language in
Switzerland], as I don't know anything about the language except for that
only a few thousand people in Switzerland are left actually using it).




    Benedikt

Windows 95: n.
    32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit
    operating system originally coded for a 4-bit microprocessor,  written
         by a 2-bit company that can't stand for 1 bit of competition.


Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Patrice Hédé
Date:
On Wed, 17 Jun 1998, Benedikt Eric Heinen wrote:

> > I don't know what you exactly looking for : a specific solution, or a
> > general one. If this is the second case, you have to take care that
> > different languages have different ways for dealing with crippled texts (
> > = without accents...).
>
> Oh well, let me extend the question then, what I am looking for is a
> solution that works for Switzerland, e.g. a country with 4 official
> languages (one of which basically gets ignored) and a 5th "major"
> language. So, I need a search function to look for German, French, Italian
> and English names (I am not doing Rumantsch [the 4th official language in
> Switzerland], as I don't know anything about the language except for that
> only a few thousand people in Switzerland are left actually using it).

Do you mean you have a field with German *and* French *and* Italian *and*
English words in it, and you want people, be they german-, french-,
italian-, english-speaking, to be able to access this field, without
putting accents and all ?

As I said earlier, you may have problems, since `ae' doesn't mean `ä' for
most of these people (except the german-speaking ones), and they may put
`a' instead. As the rules are different among the languages, it's
difficult to have a single solution. However, you *need* a solution.
Maybe I, or others ;) , may help though. Some questions : what is your
interface language (if it's perl, it can be much easier :) ) ? Can it be a
client-side solution, or do you absolutely need a server-side one (which
would then have to be a C function, I think) ?

And then, what kind of conversions do you need ? For example, for French,
I decided that all a, e, i, o, u, y to be equal, which meant :

any of a,A,à,À,æ,Æ,å,Å,â,Â,á,Á,ä,Ä => a,A,à,À,æ,Æ,å,Å,â,Â,á,Á,ä,Ä
etc.

Obviously, in your case, it will be more complex, since `ae' *may* have a
special meaning... (that's where it's getting difficult :( )...

You need to know the rules you want (and can they be different for
different people, different fields, and all...)....

--------------------------
[HACKERS]
By the way, I've looked at the regex source, which had an interesting
concept : [ä[.ae.]] should match ä and ae equally, meaning "ae" as a
single entity.... however, this does not work... maybe because the server
doesn't use any locale... It would be really helpful if we can do
something about l10n for the next release ! At least have something which
can set locales from psql "set locale to 'xx';" (that wouldn't help
for multi-locales queries, but it would be better than nothing).

Patrice




Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Benedikt Eric Heinen
Date:
> Do you mean you have a field with German *and* French *and* Italian *and*
> English words in it, and you want people, be they german-, french-,
> italian-, english-speaking, to be able to access this field, without
> putting accents and all ?

Right - basically, I am building a web database with addresses of a group
of people all over Switzerland, who are members of the same club. The
problem is just, that for a Mr. "á Porta" I (can't speak French or
Italian) doesn't know what the right spelling with accents is. Which is
much the same way that a French native speaker of the western part of
Switzerland possibly doesn't know which/whether an Umlaut will have to be
used in a German name...



> As I said earlier, you may have problems, since `ae' doesn't mean `ä' for
> most of these people (except the german-speaking ones), and they may put
> `a' instead. As the rules are different among the languages, it's
> difficult to have a single solution. However, you *need* a solution.
> Maybe I, or others ;) , may help though. Some questions : what is your
> interface language (if it's perl, it can be much easier :) ) ? Can it be a
> client-side solution, or do you absolutely need a server-side one (which
> would then have to be a C function, I think) ?

The program is a server-side C++ CGI  (Can't program perl).

I just thought - I am certainly not the first to have had this kind of
problem...



> And then, what kind of conversions do you need ? For example, for French,
> I decided that all a, e, i, o, u, y to be equal, which meant :
>
> any of a,A,à,À,æ,Æ,å,Å,â,Â,á,Á,ä,Ä => a,A,à,À,æ,Æ,å,Å,â,Â,á,Á,ä,Ä
> etc.

Let's say - only just the search string should ever be modified, so an "ä"
in the search string should never match "ae" in a string in the database.

The modifications should be:

part of search string        can match in database side string

    a            a, a umlaut,
                a with acute/grave/circumflex accent
    ae            ae, a umlaut
    c            c, c cedilla
    e            e, e with acute/grave/circumflex accent
    i            i, i with acute/grave/circumflex accent
    o            o, o umlaut,
                o with acute/grave/circumflex accent
    oe            oe, o umlaut
    u            u, u umlaut,
                u with acute/grave/circumflex accent
    ue            ue, u umlaut


    [all searches will be case insensitive]



> Obviously, in your case, it will be more complex, since `ae' *may* have a
> special meaning... (that's where it's getting difficult :( )...

I hope the above description is somewhat useful to you (unfortunately I am
lacking the matching characters on my US keyboard - so I described which
ones should be matched).

I guess, the ideal way would be to try and build a general pluggable
module for postgresql, so that it can handle this somewhat transparently.


   Benedikt

Windows 95: n.
    32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit
    operating system originally coded for a 4-bit microprocessor,  written
         by a 2-bit company that can't stand for 1 bit of competition.


Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Patrice Hédé
Date:
On Wed, 17 Jun 1998, Benedikt Eric Heinen wrote:

> Right - basically, I am building a web database with addresses of a group
> of people all over Switzerland, who are members of the same club. The
> problem is just, that for a Mr. "á Porta" I (can't speak French or
> Italian) doesn't know what the right spelling with accents is. Which is
> much the same way that a French native speaker of the western part of
> Switzerland possibly doesn't know which/whether an Umlaut will have to be
> used in a German name...

Do you need to take care of the fact that people may use a wrong accent...
For example, if you have a french word with a `è', some people (trying to
do their best, but unfamiliar with french conventions) might write it as a
`é' (which is something I have seen quite often, like writing "fin de
siécle" instead of "fin de siècle" in US), etc.

> The program is a server-side C++ CGI  (Can't program perl).

Well, it would be easier to do it in Perl, but I may provide you with a C
function (I don't know C++, never cared to learn it :( ).

The nicer thing would be a backend user function, but in another thread, I
noticed that a function is evaluated for every row, which is bad, since it
could be expensive.... or then, maybe do a subquery :

select * from towns where town_name ~* (select accents( 'xxxxxxx' ) );

> I just thought - I am certainly not the first to have had this kind of
> problem...

No, but I don't think other people would have *exactly* the same problem
(because of languages specificities). Anyway, what I propose is to do a
user function in C, taking a text type string, giving back another string,
formatted as a nice regex :) Then it could be used by others to implement
their own need :) [Maybe as a contrib module then :) ]


> Let's say - only just the search string should ever be modified, so an "ä"
> in the search string should never match "ae" in a string in the database.
>
> The modifications should be:
>
> part of search string        can match in database side string
>
>     a            a, a umlaut,
>                 a with acute/grave/circumflex accent
>     ae            ae, a umlaut
>     c            c, c cedilla
>     e            e, e with acute/grave/circumflex accent
>     i            i, i with acute/grave/circumflex accent
>     o            o, o umlaut,
>                 o with acute/grave/circumflex accent
>     oe            oe, o umlaut
>     u            u, u umlaut,
>                 u with acute/grave/circumflex accent
>     ue            ue, u umlaut
>

I would have to add cases where you want an é,è,ê,ë behaveing like an e.

s -> s, ß
ss -> ss, ß

(but not ss -> ss, ß, ßß ! )

> I hope the above description is somewhat useful to you (unfortunately I am
> lacking the matching characters on my US keyboard - so I described which
> ones should be matched).

Yes, it is, since it defines the rules you need... It would be much easier
in perl, but since it's only meant to be used for simple strings (not too
long strings), we can just iterate over the characters, and output fixed
patterns, as there are just few substitutions...

I will be able to do it this weekend if this is ok with you, but if
someone has a better idea, please tell me :)

> I guess, the ideal way would be to try and build a general pluggable
> module for postgresql, so that it can handle this somewhat transparently.

Yes, this is the best way, I will make two versions : one which matches
only *your* rules, and another one which adds new rules, to cover more
languages (I would like to have input from people speaking other european
languages about this...).

Patrice

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
Nous sommes au monde. [...] La croyance en un esprit absolu ou en un
monde en soi détaché de nous n'est qu'une rationalisation  de  cette
foi primordiale.  --- Merleau-Ponty, Phénoménologie de la Perception
----- http://www.idf.net/patrice/ ----------------------------------


Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Patrice Hédé
Date:
> Oh well, let me extend the question then, what I am looking for is a
> solution that works for Switzerland, e.g. a country with 4 official
> languages (one of which basically gets ignored) and a 5th "major"
> language. So, I need a search function to look for German, French, Italian
> and English names (I am not doing Rumantsch [the 4th official language in
> Switzerland], as I don't know anything about the language except for that
> only a few thousand people in Switzerland are left actually using it).

  Ok, I've done it...

  This is a C function, which takes a text argument, and returns a text
argument. It is created as :

  CREATE FUNCTION accents( text )
    RETURNS text
    AS '/usr/local/pgsql/lib/accents.so'
    LANGUAGE 'c';

  And you can use it this way, for example :

  SELECT * FROM town WHERE town_name ~* (SELECT accents('xxxxx') );

  For example, a typical use gives :

patrice=> select accents( 'zurich' );
accents
------------------------------
z[uúÚùÙûÛüÜ]r[iíÍìÌîÎïÏ][cçÇ]h
(1 row)

patrice=> select accents( 'GENEVE' );
accents
------------------------------------
G[eéÉèÈêÊëË]N[eéÉèÈêÊëË]V[eéÉèÈêÊëË]
(1 row)

patrice=> select accents( 'munchen' );
accents
-------------------------------
m[uúÚùÙûÛüÜ]n[cçÇ]h[eéÉèÈêÊëË]n
(1 row)

patrice=> select accents( 'koeln' );
accents
-------------------
k(oe\|ö\|Ö\|ø\|Ø)ln
(1 row)


  I don't put the function itself in an attachment, since it was
considered bad, some time ago. So you can fetch the tar.gz at this
place later in the day :

  http://www.idf.net/patrice/programmes/accents.tar.gz


  I would like other people to look at it also. If there is interest,
I will make it nicer, maybe to put it under contrib.


Hope it helps

Patrice

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
Nous sommes au monde. [...] La croyance en un esprit absolu ou en un
monde en soi détaché de nous n'est qu'une rationalisation  de  cette
foi primordiale.  --- Merleau-Ponty, Phénoménologie de la Perception
----- http://www.idf.net/patrice/ ----------------------------------


Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Benedikt Eric Heinen
Date:
> patrice=> select accents( 'zurich' );
> accents
> ------------------------------
> z[uúÚùÙûÛüÜ]r[iíÍìÌîÎïÏ][cçÇ]h

This brings up another interesting question -- will
"upper(accentuated-string)" upcase the string properly (including upcasing
accentuated characters)? e.g. will upper("zürich") become "ZüRICH" or
"ZÜRICH"?



I'll have a look at the accents source later today...


   Benedikt

Windows 95: n.
    32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit
    operating system originally coded for a 4-bit microprocessor,  written
         by a 2-bit company that can't stand for 1 bit of competition.


Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)

From
Patrice Hédé
Date:
> This brings up another interesting question -- will
> "upper(accentuated-string)" upcase the string properly (including upcasing
> accentuated characters)? e.g. will upper("zürich") become "ZüRICH" or
> "ZÜRICH"?

No, it won't. The locale support is such that only letters in your
default locale (if it's not broken) will be considered upper/lowercase,
and even letters at all. So upper('zürich') is 'ZüRICH'. This is why you
have to use the case insensitive form of the match operation : ~*

It is very easy to make a function which returns the correct
upper/lowercase of stressed letters, however. Maybe I could do one, and
add it to the package :)


I know this is not locale friendly, but locale support design is flawed
anyway (you can't support foreign words in a said locale easily), and I
don't know of a language using 8859-1, for which a lowercase and uppercase
forms of a letter aren't considered as a pair.

Basically, there is only *one* problem, which is the uppercase «ÿ», which
is not much used anyway (in the languages I know, at least !).

Patrice

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
Nous sommes au monde. [...] La croyance en un esprit absolu ou en un
monde en soi détaché de nous n'est qu'une rationalisation  de  cette
foi primordiale.  --- Merleau-Ponty, Phénoménologie de la Perception
----- http://www.idf.net/patrice/ ----------------------------------