Thread: BUG #6642: Spanish collation fault
The following bug has been logged on the website: Bug reference: 6642 Logged by: T J Carrasco Email address: tulio.carrasco@gmail.com PostgreSQL version: 9.1.3 Operating system: Ubuntu 12.04 LTS Description:=20=20=20=20=20=20=20=20 In Spanish every vowel has the same precedence value even with or without any type of accent, so: a =3D =C3=A1 =3D =C3=A4 =3D A =3D =C3=81 =3D =C3=84 With OS locale =3D lc_collate =3D lc_ctype =3D es_ES.UTF-8 then: CREATE TABLE test(id integer PRIMARY KEY, dad text, mum text, name text); INSERT INTO test VALUES (1, 'le=C3=B3n','valencia', 'jos=C3=A9'), (2, 'leon= ', 'mendoza', 'juan'), (3, 'le=C3=B3n', 'val=C3=A1rd', 'jose'), (4, 'le=C3=B3n= ','v=C3=A1lencia', 'jos'), (5, 'le=C3=B3n', 'mendoza', 'jua'), (6, 'leon', 'valencia', 'josie'= ), (7, 'le=C3=B3n', 'valencia', 'josie'), (8, 'leo','zara', 'juan'), (9, 'le=C3=B3n','Valencia', 'jos'); SELECT * FROM test order by dad, mum, name;=20 We obtain: id | dad | mum | name ---+------+----------+------- 8 | leo | zara | juan 2 | leon | mendoza | juan 6 | leon | valencia | josie 5 | le=C3=B3n | mendoza | jua 3 | le=C3=B3n | val=C3=A1rd | jose 1 | le=C3=B3n | valencia | jos=C3=A9 7 | le=C3=B3n | valencia | josie 9 | le=C3=B3n | Valencia | jos 4 | le=C3=B3n | v=C3=A1lencia | jos (9 rows) Which doesn=C2=B4t accomplish with Spanish collation. It should be somethin= g like this: id | dad | mum | name ---+------+----------+------- 8 | leo | zara | juan 2 | leon | mendoza | juan 5 | le=C3=B3n | mendoza | jua 3 | le=C3=B3n | val=C3=A1rd | jose 4 | le=C3=B3n | v=C3=A1lencia | jos 9 | le=C3=B3n | Valencia | jos 1 | le=C3=B3n | valencia | jos=C3=A9 6 | leon | valencia | josie 7 | le=C3=B3n | valencia | josie (9 rows) In which due dad column has the same value it uses the mum column and if they also have the same value then it should use the name column. Thanks in advance for your help.
Excerpts from tulio.carrasco's message of mar may 15 21:45:00 -0400 2012: > In Spanish every vowel has the same precedence value even with or without > any type of accent, so: a =3D =C3=A1 =3D =C3=A4 =3D A =3D =C3=81 =3D =C3= =84 > In which due dad column has the same value it uses the mum column and if > they also have the same value then it should use the name column. I see your point, and it makes sense, but it just doesn't work that way. The way it works is that it goes over the first column and does the two passes (first ignoring accents, second considering them) over it using the collating rules you describe; and once that is complete, it goes to look at the second column. I don't think it's possible at all to implement what you suggest, which is to make the first pass over all the columns and then the second pass over all the columns. I didn't try, but if you were to do an ORDER BY dad || ' ' || mum ' ' || na= me I think it should do what you want. (Not really sure the spaces are necessary). Because then you're sorting on a single string and it can do the first pass on the whole thing and then the second pass. --=20 =C3=81lvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Thank you very much =C1lvaro, As far as I understood from someone who knows about ;) "...the SQL standard indicates to order according the first column and consider the second column only for those records which first column has identical order values..." The bug I refer to is that first column is FINALLY ordered for character value -not the order value which the collation requests. "Le=F3n", "leon", "leOn" has identical order values: none of them weights more for Spanish collation and PG is not working according it if PG makes a "second pass" that ignores collation and reorders by its character value. Best Regards Tulio On Wed, May 16, 2012 at 2:20 PM, Alvaro Herrera <alvherre@commandprompt.com>wrote: > > Excerpts from tulio.carrasco's message of mar may 15 21:45:00 -0400 2012: > > > In Spanish every vowel has the same precedence value even with or witho= ut > > any type of accent, so: a =3D =E1 =3D =E4 =3D A =3D =C1 =3D =C4 > > > In which due dad column has the same value it uses the mum column and if > > they also have the same value then it should use the name column. > > I see your point, and it makes sense, but it just doesn't work that way. > > The way it works is that it goes over the first column and does the two > passes (first ignoring accents, second considering them) over it using > the collating rules you describe; and once that is complete, it goes to > look at the second column. I don't think it's possible at all to > implement what you suggest, which is to make the first pass over all the > columns and then the second pass over all the columns. > > I didn't try, but if you were to do an ORDER BY dad || ' ' || mum ' ' || > name > I think it should do what you want. (Not really sure the spaces are > necessary). Because then you're sorting on a single string and it can > do the first pass on the whole thing and then the second pass. > > -- > =C1lvaro Herrera <alvherre@commandprompt.com> > The PostgreSQL Company - Command Prompt, Inc. > PostgreSQL Replication, Consulting, Custom Development, 24x7 support >
On ons, 2012-05-16 at 19:55 -0500, Tulio J Carrasco wrote: > "León", "leon", "leOn" has identical order values: none of them > weights more for Spanish collation and PG is not working according it > if PG makes a "second pass" that ignores collation and reorders by its > character value. PostgreSQL uses the operating system's C library for sorting. Since you are using Ubuntu, that is glibc for you. And a look into the glibc sources shows that ó is sorted after o, the way Ãlvaro has explained. If you disagree with that, you need to take it up with the glibc maintainers.