Thread: spanish locale question

spanish locale question

From
jbiskofski
Date:
Hello everyone,

I have a lc_collate problem. Im in Mexico and I need the following three lastnames to be sorted this way :

álvarez ( accent on first a )
chavez
cota

Using the default locale on my mac ( en_US ) I end up with :

chavez
cota
álvarez

So I switched to es_ES.ISO8859-15 and that gives me : 

álvarez
cota
chavez

... There was a time when the "Real Academia Española" considered "CH", "LL" and "SH" as letters. They changed that in 1994 :

In 1994, the RAE ruled that the Spanish consonants "CH" (ché) and "LL" (elle) would hence be alphabetized under "C" and under "L", respectively, and not as separate, discrete letters, as in the past. The RAE eliminated monosyllabic accented vowels where the accent did not serve in changing the word's meaning, examples include: "dio" ("gave"), "vio" ("saw"), both had an acutely-accented vowel "ó"; yet the monosyllabic word "sé" ("I know", the first person, singular, present of "saber", "to know"; and the singular imperative of "ser", "to be") retains its acutely-accented vowel in order to differentiate it from the reflexive pronoun "se".



I think thats where the problem comes from.

Anyway, any hints/clues/suicide-method-suggestions would be greatly appreciated!

Thanks!

- Jose Biskofski

Re: spanish locale question

From
"Albe Laurenz"
Date:
jbiskofski wrote:
> I have a lc_collate problem. Im in Mexico and I need the following three lastnames to be sorted this
> way :
>
> álvarez ( accent on first a )
> chavez
> cota
>
> Using the default locale on my mac ( en_US ) I end up with :
>
> chavez
> cota
> álvarez
>
> So I switched to es_ES.ISO8859-15 and that gives me :
>
> álvarez
> cota
> chavez
>
>
> ... There was a time when the "Real Academia Española" considered "CH", "LL" and "SH" as letters. They
> changed that in 1994 :
>
> In 1994, the RAE ruled that the Spanish consonants "CH" (ché) and "LL" (elle) would hence be
> alphabetized under "C" and under "L", respectively, and not as separate, discrete letters, as in the
> past. The RAE eliminated monosyllabic accented vowels where the accent did not serve in changing the
> word's meaning, examples include: "dio" ("gave"), "vio" ("saw"), both had an acutely-accented vowel
> "ó"; yet the monosyllabic word "sé" ("I know", the first person, singular, present of "saber", "to
> know"; and the singular imperative of "ser", "to be") retains its acutely-accented vowel in order to
> differentiate it from the reflexive pronoun "se".
>
> http://en.wikipedia.org/wiki/Real_Academia_Espa%C3%B1ola
>
>
> I think thats where the problem comes from.
>
> Anyway, any hints/clues/suicide-method-suggestions would be greatly appreciated!

PostgreSQL uses the operating system's collations.
Ask your operating system provider.

On my RHEL 3 Linux system it works as you want it to:

CREATE TABLE mexico(id integer PRIMARY KEY, val text NOT NULL COLLATE "es_ES.utf8");

INSERT INTO mexico VALUES (1, 'cota'), (2, 'álvarez'), (3, 'chavez');

SELECT * FROM mexico ORDER BY val;

 id |   val
----+---------
  2 | álvarez
  3 | chavez
  1 | cota
(3 rows)

Yours,
Laurenz Albe

Re: spanish locale question

From
jbiskofski
Date:
Laurenz, thank you so much for your help. I had a hard time getting the es_MX locale installed on my freebsd system, but once I did it worked perfectly. Thanks for your help again.

- Jose

On Thu, Apr 19, 2012 at 2:30 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
jbiskofski wrote:
> I have a lc_collate problem. Im in Mexico and I need the following three lastnames to be sorted this
> way :
>
> álvarez ( accent on first a )
> chavez
> cota
>
> Using the default locale on my mac ( en_US ) I end up with :
>
> chavez
> cota
> álvarez
>
> So I switched to es_ES.ISO8859-15 and that gives me :
>
> álvarez
> cota
> chavez
>
>
> ... There was a time when the "Real Academia Española" considered "CH", "LL" and "SH" as letters. They
> changed that in 1994 :
>
> In 1994, the RAE ruled that the Spanish consonants "CH" (ché) and "LL" (elle) would hence be
> alphabetized under "C" and under "L", respectively, and not as separate, discrete letters, as in the
> past. The RAE eliminated monosyllabic accented vowels where the accent did not serve in changing the
> word's meaning, examples include: "dio" ("gave"), "vio" ("saw"), both had an acutely-accented vowel
> "ó"; yet the monosyllabic word "sé" ("I know", the first person, singular, present of "saber", "to
> know"; and the singular imperative of "ser", "to be") retains its acutely-accented vowel in order to
> differentiate it from the reflexive pronoun "se".
>
> http://en.wikipedia.org/wiki/Real_Academia_Espa%C3%B1ola
>
>
> I think thats where the problem comes from.
>
> Anyway, any hints/clues/suicide-method-suggestions would be greatly appreciated!

PostgreSQL uses the operating system's collations.
Ask your operating system provider.

On my RHEL 3 Linux system it works as you want it to:

CREATE TABLE mexico(id integer PRIMARY KEY, val text NOT NULL COLLATE "es_ES.utf8");

INSERT INTO mexico VALUES (1, 'cota'), (2, 'álvarez'), (3, 'chavez');

SELECT * FROM mexico ORDER BY val;

 id |   val
----+---------
 2 | álvarez
 3 | chavez
 1 | cota
(3 rows)

Yours,
Laurenz Albe

Re: spanish locale question

From
Tulio
Date:
Hi there,

Let me expand the collate situation. I´m from Perú and I have turned
everything in postgresql.conf as 'es_PE.UTF-8' even the
default_text_search_config = 'pg_catalog.spanish'. Even my Ubuntu 12.04
works in English I have es_PE locale too.

In Spanish vowels as "u" is equal as "ú" or even "ü". So let me build an
example:

CREATE TABLE pru(id integer PRIMARY KEY, dad text, mum text, name text);

INSERT INTO pru VALUES (1, 'león','valencia', 'josé'), (2, 'leon',
'mendoza', 'juan'), (3, 'león', 'valárd', 'jose'), (4, 'león','válencia',
'jos'), (5, 'león', 'mendoza', 'jua'), (6, 'leon', 'valencia', 'josie'), (7,
'león', 'valencia', 'josie');

if I do

SELECT * FROM pru order by dad,mum,name;

I get:

 id |  dad |   mum     |  name
----+------+----------+-------
  2 | leon | mendoza  | juan
  6 | leon | valencia   | josie
  5 | león | mendoza  | jua
  3 | león | valárd     | jose
  1 | león | valencia  | josé
  7 | león | valencia  | josie
  4 | león | válencia  | jos
(7 rows)

Which is a wrong order (collation) in Spanish and I don´t understand why.
But, I noticed that if I do:

SELECT * FROM pru order by dad || mum || name;

I get the correct order:

 id |  dad |   mum     |  name
----+------+----------+-------
  5 | león | mendoza  | jua
  2 | leon | mendoza  | juan
  3 | león | valárd     | jose
  4 | león | válencia  | jos
  1 | león | valencia  | josé
  6 | leon | valencia  | josie
  7 | león | valencia  | josie
(7 rows)


Is this the correct way to order in Postgresql and if it´s not Does anyone
have an idea and could please explain it to me?
Thanks in advance.

Tulio


--
View this message in context: http://postgresql.1045698.n5.nabble.com/spanish-locale-question-tp5650043p5681819.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: spanish locale question

From
"Albe Laurenz"
Date:
Tulio wrote:
> Let me expand the collate situation. I´m from Perú and I have turned
> everything in postgresql.conf as 'es_PE.UTF-8' even the
> default_text_search_config = 'pg_catalog.spanish'. Even my Ubuntu 12.04
> works in English I have es_PE locale too.

> if I do
> 
> SELECT * FROM pru order by dad,mum,name;
> 
> I get:
> 
>  id |  dad |   mum     |  name
> ----+------+----------+-------
>   2 | leon | mendoza  | juan
>   6 | leon | valencia   | josie
>   5 | león | mendoza  | jua
>   3 | león | valárd     | jose
>   1 | león | valencia  | josé
>   7 | león | valencia  | josie
>   4 | león | válencia  | jos
> (7 rows)
> 
> Which is a wrong order (collation) in Spanish and I don´t understand why.

Maybe you misunderstood what it means to ORDER BY multiple
columns.  In your query, the rows are ordered by "dad",
then all rows where "dad" is the same are ordered by "mum",
and finally all rows where "dad" and "mum" are the same
are ordered by "name".

It is explained in the documentation:
http://www.postgresql.org/docs/current/static/queries-order.html

"When more than one expression is specified, the later values
 are used to sort rows that are equal according to the earlier values."

> But, I noticed that if I do:
> 
> SELECT * FROM pru order by dad || mum || name;
> 
> I get the correct order:
> 
>  id |  dad |   mum     |  name
> ----+------+----------+-------
>   5 | león | mendoza  | jua
>   2 | leon | mendoza  | juan
>   3 | león | valárd     | jose
>   4 | león | válencia  | jos
>   1 | león | valencia  | josé
>   6 | leon | valencia  | josie
>   7 | león | valencia  | josie
> (7 rows)
> 
> 
> Is this the correct way to order in Postgresql and if it´s not Does anyone
> have an idea and could please explain it to me?

This is not PostgreSQL-specific behaviour, it is defined in the
SQL standard and works like this on all database systems I know.

You can use the ORDER BY clause you propose if you prefer this
ordering.

But would you really order 'leon', 'mendoza', 'juan' before
'leo', 'zara', 'juan'?

Yours,
Laurenz Albe

Re: spanish locale question

From
Al Eridani
Date:
On Fri, May 4, 2012 at 1:22 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

> But would you really order 'leon', 'mendoza', 'juan' before
> 'leo', 'zara', 'juan'?

No, I guess that Tulio would not do that, because (usually) 'leo' goes
before 'leon'.

What Tulio is saying is that 'leon' and 'león' are the same thing from
the point of view of sorting in Spanish, but his PostgreSQL seems to
think that 'leon' goes before 'león'.

Al

Re: spanish locale question

From
Tom Lane
Date:
Al Eridani <al.eridani@gmail.com> writes:
> What Tulio is saying is that 'leon' and 'le�n' are the same thing from
> the point of view of sorting in Spanish, but his PostgreSQL seems to
> think that 'leon' goes before 'le�n'.

Postgres never considers that two distinct strings are "equal".  If the
locale setting considers these equal (which isn't entirely clear from
the given evidence), PG would then sort them on the basis of their
character code values.

A possible workaround if you need to consider them equal is to strip the
accents before sorting (ie, something like "ORDER BY to_ascii(col)") but
this may well throw away more information than you want ...

            regards, tom lane

Re: spanish locale question

From
Andreas Joseph Krogh
Date:
On 05/04/2012 07:31 PM, Tom Lane wrote:
> Al Eridani<al.eridani@gmail.com>  writes:
>> What Tulio is saying is that 'leon' and 'león' are the same thing from
>> the point of view of sorting in Spanish, but his PostgreSQL seems to
>> think that 'leon' goes before 'león'.
> Postgres never considers that two distinct strings are "equal".  If the
> locale setting considers these equal (which isn't entirely clear from
> the given evidence), PG would then sort them on the basis of their
> character code values.
>
> A possible workaround if you need to consider them equal is to strip the
> accents before sorting (ie, something like "ORDER BY to_ascii(col)") but
> this may well throw away more information than you want ...

Note that to_ascii barfs on unicode-input:

ERROR:  encoding conversion from UTF8 to ASCII not supported

Better install unaccent:

cd ./postgresql-9.1.2/contrib/unaccent
make install
psql
CREATE EXTENSION unaccent;
andreak=# select unaccent('león');
  unaccent
----------
  leon
(1 row)

--
Andreas Joseph Krogh<andreak@officenet.no>  - mob: +47 909 56 963
Senior Software Developer / CEO - OfficeNet AS - http://www.officenet.no
Public key: http://home.officenet.no/~andreak/public_key.asc


Re: spanish locale question

From
Tulio
Date:
Thank you Laurenz for your answer.

I think I understand the meaning of ORDER BY  my problem is the
understanding of COLLATE functionality. I thought that if I collate in es.PE
(Spanish) the "o" has the exact value of "ó" so if I

INSERT INTO pru VALUES (8, 'leo','zara', 'juan');

How can I obtain this order:

  id |  dad |   mum     |  name
 ----+------+----------+-------
   8 | leo   | zara       | juan
   5 | león | mendoza  | jua
   2 | leon | mendoza  | juan
   3 | león | valárd     | jose
   4 | león | válencia  | jos
   1 | león | valencia  | josé
   6 | leon | valencia  | josie
   7 | león | valencia  | josie

In the understanding (please forgive I repeat) that "o" and "ó" has no
difference in weight in the Spanish collation. As you can see below in the
first table of your answer the collation doesn´t work. It assumes that leon
is different from león (with accent). I was assuming the collation
eliminates such difference. It isn´t that way it works?

Thanks in advance for your help.

Tulio



Albe Laurenz *EXTERN* wrote
>
> Tulio wrote:
>> Let me expand the collate situation. I´m from Perú and I have turned
>> everything in postgresql.conf as 'es_PE.UTF-8' even the
>> default_text_search_config = 'pg_catalog.spanish'. Even my Ubuntu 12.04
>> works in English I have es_PE locale too.
>
>> if I do
>>
>> SELECT * FROM pru order by dad,mum,name;
>>
>> I get:
>>
>>  id |  dad |   mum     |  name
>> ----+------+----------+-------
>>   2 | leon | mendoza  | juan
>>   6 | leon | valencia   | josie
>>   5 | león | mendoza  | jua
>>   3 | león | valárd     | jose
>>   1 | león | valencia  | josé
>>   7 | león | valencia  | josie
>>   4 | león | válencia  | jos
>> (7 rows)
>>
>> Which is a wrong order (collation) in Spanish and I don´t understand why.
>
> Maybe you misunderstood what it means to ORDER BY multiple
> columns.  In your query, the rows are ordered by "dad",
> then all rows where "dad" is the same are ordered by "mum",
> and finally all rows where "dad" and "mum" are the same
> are ordered by "name".
>
> It is explained in the documentation:
> http://www.postgresql.org/docs/current/static/queries-order.html
>
> "When more than one expression is specified, the later values
>  are used to sort rows that are equal according to the earlier values."
>
>> But, I noticed that if I do:
>>
>> SELECT * FROM pru order by dad || mum || name;
>>
>> I get the correct order:
>>
>>  id |  dad |   mum     |  name
>> ----+------+----------+-------
>>   5 | león | mendoza  | jua
>>   2 | leon | mendoza  | juan
>>   3 | león | valárd     | jose
>>   4 | león | válencia  | jos
>>   1 | león | valencia  | josé
>>   6 | leon | valencia  | josie
>>   7 | león | valencia  | josie
>> (7 rows)
>>
>>
>> Is this the correct way to order in Postgresql and if it´s not Does
>> anyone
>> have an idea and could please explain it to me?
>
> This is not PostgreSQL-specific behaviour, it is defined in the
> SQL standard and works like this on all database systems I know.
>
> You can use the ORDER BY clause you propose if you prefer this
> ordering.
>
> But would you really order 'leon', 'mendoza', 'juan' before
> 'leo', 'zara', 'juan'?
>
> Yours,
> Laurenz Albe
>
> --
> Sent via pgsql-general mailing list (pgsql-general@)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

--
View this message in context: http://postgresql.1045698.n5.nabble.com/spanish-locale-question-tp5650043p5687242.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.