Thread: Locale and pattern matching

Locale and pattern matching

From
Hélder M. Vieira
Date:
A question about encoding, some days ago, led me to concentrate my attention 
on the subject and perform a few tests on two different 8.1 installations, 
configured as follows:

Machine 1 -  pt_PT(Portuguese_Portugal.28591)/LATIN1
Machine 2 -  pt_PT(Portuguese_Portugal.28605)/LATIN9

In both machines, case insensitive pattern matching designed to match the 
letter 'A' will indeed match 'A' and 'a' but not the accented forms.
Generalizing, an accented or unaccented vowel in the pattern will only match 
that vowel in upper or lowercase forms if the accent is the same as 
specified  in the pattern.
For instance, if the pattern specifies an uppercase 'A' with an accute 
accent, then it will match a lowercase 'A' with an accute accent, but not an 
upper or lower case 'A' with a tilde.
This behaviour seems inconsistent with that of the ORDER BY clause, which 
considers all forms of a vowel as equal (uppercase/lowercase,accented/not 
accented).
Shouldn't case insensitive pattern matching follow the same collation 
weights that ORDER BY uses ?
Can someone please elaborate on this subject ?

Another question, for european users:
Until now, I've been selecting LATIN1 encoding, but after a few tests, I 
came to think that LATIN9 is a better option (the euro sign...).
For those who regularly use LATIN9, what is your opinion ? Is it indeed a 
better option ?


Thank you.

Hélder M. Vieira









LATIN CAPITAL LETTER A WITH GRAVE





Re: Locale and pattern matching

From
Peter Eisentraut
Date:
Hélder M. Vieira wrote:
> For instance, if the pattern specifies an uppercase 'A' with an
> accute accent, then it will match a lowercase 'A' with an accute
> accent, but not an upper or lower case 'A' with a tilde.
> This behaviour seems inconsistent with that of the ORDER BY clause,
> which considers all forms of a vowel as equal

This is not exactly true.  Sorting just goes through the strings in 
multiple passes, looking at the "general" letter first, then at the 
case, then at the accents, until it finds a difference.

> Shouldn't case insensitive pattern matching follow the same collation
> weights that ORDER BY uses ?

Even if it did, it would not help you, as explained above.  Moreover, 
the collation rules of some locales are sufficiently weird that a 
consistent behavior between sorting and pattern matching is impossible 
to achieve.

> Until now, I've been selecting LATIN1 encoding, but after a few
> tests, I came to think that LATIN9 is a better option (the euro
> sign...). For those who regularly use LATIN9, what is your opinion ?
> Is it indeed a better option ?

Yes.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/