Thread: BUG #6642: Spanish collation fault

BUG #6642: Spanish collation fault

From

tulio.carrasco@gmail.com

Date:

15 May 2012, 22:45:21

The following bug has been logged on the website:

Bug reference:      6642
Logged by:          T J Carrasco
Email address:      tulio.carrasco@gmail.com
PostgreSQL version: 9.1.3
Operating system:   Ubuntu 12.04 LTS
Description:=20=20=20=20=20=20=20=20

In Spanish every vowel has the same precedence value even with or without
any type of accent, so: a =3D =C3=A1 =3D =C3=A4 =3D A =3D =C3=81 =3D =C3=84

With OS locale =3D lc_collate =3D lc_ctype =3D es_ES.UTF-8 then:

CREATE TABLE test(id integer PRIMARY KEY, dad text, mum text, name text);

INSERT INTO test VALUES (1, 'le=C3=B3n','valencia', 'jos=C3=A9'), (2, 'leon=
',
'mendoza', 'juan'), (3, 'le=C3=B3n', 'val=C3=A1rd', 'jose'), (4, 'le=C3=B3n=
','v=C3=A1lencia',
'jos'), (5, 'le=C3=B3n', 'mendoza', 'jua'), (6, 'leon', 'valencia', 'josie'=
), (7,
'le=C3=B3n', 'valencia', 'josie'), (8, 'leo','zara', 'juan'), (9,
'le=C3=B3n','Valencia', 'jos');

SELECT * FROM test order by dad, mum, name;=20

We obtain:

id | dad  | mum      | name
---+------+----------+-------
 8 | leo  | zara     | juan
 2 | leon | mendoza  | juan
 6 | leon | valencia | josie
 5 | le=C3=B3n | mendoza  | jua
 3 | le=C3=B3n | val=C3=A1rd   | jose
 1 | le=C3=B3n | valencia | jos=C3=A9
 7 | le=C3=B3n | valencia | josie
 9 | le=C3=B3n | Valencia | jos
 4 | le=C3=B3n | v=C3=A1lencia | jos
(9 rows)

Which doesn=C2=B4t accomplish with Spanish collation. It should be somethin=
g like
this:

id | dad  | mum      | name
---+------+----------+-------
 8 | leo  | zara     | juan
 2 | leon | mendoza  | juan
 5 | le=C3=B3n | mendoza  | jua
 3 | le=C3=B3n | val=C3=A1rd   | jose
 4 | le=C3=B3n | v=C3=A1lencia | jos
 9 | le=C3=B3n | Valencia | jos
 1 | le=C3=B3n | valencia | jos=C3=A9
 6 | leon | valencia | josie
 7 | le=C3=B3n | valencia | josie
(9 rows)

In which due dad column has the same value it uses the mum column and if
they also have the same value then it should use the name column.

Thanks in advance for your help.

Re: BUG #6642: Spanish collation fault

From

Alvaro Herrera

Date:

16 May 2012, 16:21:15

Excerpts from tulio.carrasco's message of mar may 15 21:45:00 -0400 2012:

> In Spanish every vowel has the same precedence value even with or without
> any type of accent, so: a =3D =C3=A1 =3D =C3=A4 =3D A =3D =C3=81 =3D =C3=
=84

> In which due dad column has the same value it uses the mum column and if
> they also have the same value then it should use the name column.

I see your point, and it makes sense, but it just doesn't work that way.

The way it works is that it goes over the first column and does the two
passes (first ignoring accents, second considering them) over it using
the collating rules you describe; and once that is complete, it goes to
look at the second column.  I don't think it's possible at all to
implement what you suggest, which is to make the first pass over all the
columns and then the second pass over all the columns.

I didn't try, but if you were to do an ORDER BY dad || ' ' || mum ' ' || na=
me
I think it should do what you want.  (Not really sure the spaces are
necessary).  Because then you're sorting on a single string and it can
do the first pass on the whole thing and then the second pass.

--=20
=C3=81lvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: BUG #6642: Spanish collation fault

From

Tulio J Carrasco

Date:

16 May 2012, 21:56:30

Thank you very much =C1lvaro,

As far as I understood from someone who knows about ;)  "...the SQL
standard indicates to order according the first column and consider the
second column only for those records which first column has identical order
values..."

The bug I refer to is that first column is FINALLY ordered for character
value -not the order value which the collation requests.

"Le=F3n", "leon", "leOn" has identical order values: none of them weights
more for Spanish collation and PG is not working according it if PG makes a
"second pass" that ignores collation and reorders by its character value.
Best Regards

Tulio


On Wed, May 16, 2012 at 2:20 PM, Alvaro Herrera
<alvherre@commandprompt.com>wrote:

>
> Excerpts from tulio.carrasco's message of mar may 15 21:45:00 -0400 2012:
>
> > In Spanish every vowel has the same precedence value even with or witho=
ut
> > any type of accent, so: a =3D =E1 =3D =E4 =3D A =3D =C1 =3D =C4
>
> > In which due dad column has the same value it uses the mum column and if
> > they also have the same value then it should use the name column.
>
> I see your point, and it makes sense, but it just doesn't work that way.
>
> The way it works is that it goes over the first column and does the two
> passes (first ignoring accents, second considering them) over it using
> the collating rules you describe; and once that is complete, it goes to
> look at the second column.  I don't think it's possible at all to
> implement what you suggest, which is to make the first pass over all the
> columns and then the second pass over all the columns.
>
> I didn't try, but if you were to do an ORDER BY dad || ' ' || mum ' ' ||
> name
> I think it should do what you want.  (Not really sure the spaces are
> necessary).  Because then you're sorting on a single string and it can
> do the first pass on the whole thing and then the second pass.
>
> --
> =C1lvaro Herrera <alvherre@commandprompt.com>
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>

Re: BUG #6642: Spanish collation fault

From

Peter Eisentraut

Date:

19 May 2012, 20:11:22

On ons, 2012-05-16 at 19:55 -0500, Tulio J Carrasco wrote:
> "LeÃ³n", "leon", "leOn" has identical order values: none of them
> weights more for Spanish collation and PG is not working according it
> if PG makes a "second pass" that ignores collation and reorders by its
> character value.

PostgreSQL uses the operating system's C library for sorting.  Since you
are using Ubuntu, that is glibc for you.  And a look into the glibc
sources shows that Ã³ is sorted after o, the way Ãlvaro has explained.
If you disagree with that, you need to take it up with the glibc
maintainers.