Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents) - Mailing list pgsql-sql

From Patrice Hédé
Subject Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)
Date
Msg-id Pine.LNX.3.96.980619140523.7825A-100000@paris.ivo.fr
Whole thread Raw
In response to Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)  (Benedikt Eric Heinen <beh@icemark.ch>)
List pgsql-sql
On Wed, 17 Jun 1998, Benedikt Eric Heinen wrote:

> Right - basically, I am building a web database with addresses of a group
> of people all over Switzerland, who are members of the same club. The
> problem is just, that for a Mr. "á Porta" I (can't speak French or
> Italian) doesn't know what the right spelling with accents is. Which is
> much the same way that a French native speaker of the western part of
> Switzerland possibly doesn't know which/whether an Umlaut will have to be
> used in a German name...

Do you need to take care of the fact that people may use a wrong accent...
For example, if you have a french word with a `è', some people (trying to
do their best, but unfamiliar with french conventions) might write it as a
`é' (which is something I have seen quite often, like writing "fin de
siécle" instead of "fin de siècle" in US), etc.

> The program is a server-side C++ CGI  (Can't program perl).

Well, it would be easier to do it in Perl, but I may provide you with a C
function (I don't know C++, never cared to learn it :( ).

The nicer thing would be a backend user function, but in another thread, I
noticed that a function is evaluated for every row, which is bad, since it
could be expensive.... or then, maybe do a subquery :

select * from towns where town_name ~* (select accents( 'xxxxxxx' ) );

> I just thought - I am certainly not the first to have had this kind of
> problem...

No, but I don't think other people would have *exactly* the same problem
(because of languages specificities). Anyway, what I propose is to do a
user function in C, taking a text type string, giving back another string,
formatted as a nice regex :) Then it could be used by others to implement
their own need :) [Maybe as a contrib module then :) ]


> Let's say - only just the search string should ever be modified, so an "ä"
> in the search string should never match "ae" in a string in the database.
>
> The modifications should be:
>
> part of search string        can match in database side string
>
>     a            a, a umlaut,
>                 a with acute/grave/circumflex accent
>     ae            ae, a umlaut
>     c            c, c cedilla
>     e            e, e with acute/grave/circumflex accent
>     i            i, i with acute/grave/circumflex accent
>     o            o, o umlaut,
>                 o with acute/grave/circumflex accent
>     oe            oe, o umlaut
>     u            u, u umlaut,
>                 u with acute/grave/circumflex accent
>     ue            ue, u umlaut
>

I would have to add cases where you want an é,è,ê,ë behaveing like an e.

s -> s, ß
ss -> ss, ß

(but not ss -> ss, ß, ßß ! )

> I hope the above description is somewhat useful to you (unfortunately I am
> lacking the matching characters on my US keyboard - so I described which
> ones should be matched).

Yes, it is, since it defines the rules you need... It would be much easier
in perl, but since it's only meant to be used for simple strings (not too
long strings), we can just iterate over the characters, and output fixed
patterns, as there are just few substitutions...

I will be able to do it this weekend if this is ok with you, but if
someone has a better idea, please tell me :)

> I guess, the ideal way would be to try and build a general pluggable
> module for postgresql, so that it can handle this somewhat transparently.

Yes, this is the best way, I will make two versions : one which matches
only *your* rules, and another one which adds new rules, to cover more
languages (I would like to have input from people speaking other european
languages about this...).

Patrice

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
Nous sommes au monde. [...] La croyance en un esprit absolu ou en un
monde en soi détaché de nous n'est qu'une rationalisation  de  cette
foi primordiale.  --- Merleau-Ponty, Phénoménologie de la Perception
----- http://www.idf.net/patrice/ ----------------------------------


pgsql-sql by date:

Previous
From: Marin D
Date:
Subject: Re: [SQL] FOREIGN KEY ...
Next
From: "Jackson, DeJuan"
Date:
Subject: RE: [SQL] 'now - 4 hours ago' ?