Thread: Is there a similarity-function that minds national charsets?

Is there a similarity-function that minds national charsets?

From

Andreas

Date:

20 June 2012, 13:30:55

Hi,

Is there a similarity-function that minds national charsets?

Over here we've got some special cases that screw up the results on 
similarity().

Our characters: ä, ö, ü, ß
could as well be written as:  ae, oe, ue, ss

e.g.

select similarity ( 'Müller', 'Mueller' )
results to:  0.363636

In normal cases everything below 0.5 would be to far apart to be 
considered a match.

As it is, I had to transfer the contents of the table into a temporary 
table where I translate every ambigous char to it's 2 char representation.

Is there a solution so that detour is not necessary?

Re: Is there a similarity-function that minds national charsets?

From

Craig Ringer

Date:

20 June 2012, 23:53:22

On 06/21/2012 12:30 AM, Andreas wrote:
> Hi,
>
> Is there a similarity-function that minds national charsets?
>
> Over here we've got some special cases that screw up the results on 
> similarity().
>
> Our characters: ä, ö, ü, ß
> could as well be written as:  ae, oe, ue, ss
>
> e.g.
>
> select similarity ( 'Müller', 'Mueller' )
> results to:  0.363636
>
> In normal cases everything below 0.5 would be to far apart to be 
> considered a match.

That's not just charset aware, that's looking for awareness of 
language-and-dialect specific transliteration rules for representing 
accented chars in 7-bit ASCII. My understanding was that these rules and 
conventions vary and are specific to each language - or even region.

tsearch2 has big language dictionaries to try to handle some issues like 
this (though I don't know about this issue specifically). It's possible 
you could extend the tsearch2 dictionaries with synonyms, possibly 
algorithmically generated.

If you have what you consider to be an acceptable 1:1 translation rule 
you could build a functional index on it and test against that, eg:

CREATE INDEX blah ON thetable ( (flatten_accent(target_column) );
SELECT similarity( flatten_accent('Müller'), target_column );

Note that the flatten_accent function must be IMMUTABLE and can't access 
or refer to data in other tables, columns, etc nor SET (GUC) variables 
that might change at runtime.
--
Craig Ringer