Thread: LC_COLLATE=es_MX in PgSQL 7.3.2

LC_COLLATE=es_MX in PgSQL 7.3.2

From

"Octavio Alvarez"

Date:

11 June 2003, 20:40:52

Ohh! Look at this... I think query #4 does not handle the accents
correctly, and it should behave almost like #5 does, except for the record
with p = 'R' which should be at the top (if they were #4's results).

#5 is only shown as a sample. #5 works correctly.

I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was initialized
with --lc-collate=es_MX.

1. This is the table:
testdb=# \d t
              Table "public.t"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 p      | character varying(20) |
 m      | character varying(20) |

2. This is the data:
testdb=# SELECT * FROM t;
 p  |  m
----+------
 ra | ab
 re | cc
 ri | cd
 ré | gg
 ré | zz
 RE | t3
 R  | tÉ3
 RÉ | t3
 RÉ | t4
 RÍ | am4
 Rí | am 5
(11 rows)

3. I think the following is correct:
testdb=# SELECT * FROM t ORDER BY p;
 p  |  m
----+------
 R  | tÉ3
 ra | ab
 re | cc
 RE | t3
 ré | gg
 ré | zz
 RÉ | t3
 RÉ | t4
 ri | cd
 Rí | am 5
 RÍ | am4
(11 rows)

4. The following is NOT correct. Take a look at #5 for the almost-correct
behavior.
testdb=# SELECT * FROM t ORDER BY p, m;
 p  |  m
----+------
 R  | tÉ3
 ra | ab
 re | cc
 RE | t3
 ré | gg
 ré | zz
 RÉ | t3
 RÉ | t4
 ri | cd
 Rí | am 5
 RÍ | am4
(11 rows)

5. The following shows the result somewhat like they should be. Actually,
the "R" alone should be at the beginning (not in this query, but if these
were the results of the previous one. This one is working ok.)
testdb=# SELECT * FROM t ORDER BY p || m;
 p  |  m
----+------
 ra | ab
 re | cc
 ré | gg
 RE | t3
 RÉ | t3
 RÉ | t4
 ré | zz
 RÍ | am4
 Rí | am 5
 ri | cd
 R  | tÉ3
(11 rows)

--
Octavio Alvarez Piza.
E-mail: alvarezp@octavio.ods.org

Re: LC_COLLATE=es_MX in PgSQL 7.3.2

From

Tom Lane

Date:

12 June 2003, 01:20:41

"Octavio Alvarez" <alvarezp@octavio.ods.org> writes:
> Ohh! Look at this... I think query #4 does not handle the accents
> correctly, and it should behave almost like #5 does, except for the record
> with p = 'R' which should be at the top (if they were #4's results).
> I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was initialized
> with --lc-collate=es_MX.

How about --lc-ctype?  I think that accent handling would be driven by
LC_CTYPE not LC_COLLATE.  In any case, this is not a Postgres bug unless
you can show that other programs using the same LC_foo settings behave
differently.  We punt pretty much all locale-related processing to
subroutines in libc.

            regards, tom lane

Re: LC_COLLATE=es_MX in PgSQL 7.3.2

From

Alvaro Herrera

Date:

12 June 2003, 01:34:48

On Wed, Jun 11, 2003 at 04:40:42PM -0700, Octavio Alvarez wrote:
>
> Ohh! Look at this... I think query #4 does not handle the accents
> correctly, and it should behave almost like #5 does, except for the record
> with p = 'R' which should be at the top (if they were #4's results).

I don't understand.  Are you saying that "ré" should sort the same as
"RÉ", and that the ambiguity should be solved by the next column?  Well
if it's that, I'm afraid it's not the way the locale is defined.  I wish
I could tell you how it is defined, but I don't really understand it.
You can see for yourself in:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/localedata/locales/es_ES?cvsroot=glibc

The rules seem pretty convoluted.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"El destino baraja y nosotros jugamos" (A. Schopenhauer)

Re: LC_COLLATE=es_MX in PgSQL 7.3.2

From

"Octavio Alvarez"

Date:

12 June 2003, 20:10:52

Tom Lane said:
>> I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was
>> initialized
>> with --lc-collate=es_MX.
>
> How about --lc-ctype?  I think that accent handling would be driven by
> LC_CTYPE not LC_COLLATE.

May be it's not the accents after all. I did the following tests without
accents.

Okay. Now, I tried several combinations, including --locale=es_MX and
--lc-collate=es_MX --lc-ctype=es_MX, and got the same result.

I would like to point out something: (still PG 7.3.2)

I tried the following with --locale=es_MX, with --locale=en_US, with
--locale=en_US.UTF-8.

alvarezp=# select * from t order by p asc, m asc;
   p   |   m
-------+-------
 octav | alvar
 OCTAV | ALVAA
 OCTAV | ALVAZ
 octia | alvra
 OCTIa | ALVAa
 OCTIb | ALVZa
 OCTIb | ALVZa
 octic | alvra
 OCTIc | ALVAa
 octvi | alvra
 OCTVI | ALVAa
 OCTVI | ALVZa
(12 rows)

No accents here. I would have expected:
   p   |   m
-------+-------
 OCTAV | ALVAA
 octav | alvar
 OCTAV | ALVAZ
 OCTIa | ALVAa
 octia | alvra
 OCTIb | ALVZa
 OCTIb | ALVZa
 OCTIc | ALVAa
 octic | alvra
 OCTVI | ALVAa
 octvi | alvra
 OCTVI | ALVZa
(12 rows)


--locale=C gives out
   p   |   m
-------+-------
 OCTAV | ALVAA
 OCTAV | ALVAZ
 OCTIa | ALVAa
 OCTIb | ALVZa
 OCTIb | ALVZa
 OCTIc | ALVAa
 OCTVI | ALVAa
 OCTVI | ALVZa
 octav | alvar
 octia | alvra
 octic | alvra
 octvi | alvra
(12 rows)

which I thnk is correct for that locale. Well, whatever.

> In any case, this is not a Postgres bug unless
> you can show that other programs using the same LC_foo settings behave
> differently.  We punt pretty much all locale-related processing to
> subroutines in libc.

How could I test that? I tried the following. Notice how the "octav"
values are correctly sorted, but I don't know if SORT is actually
separating the fields or understanding the whole line as 1 key.

[alvarezp@pgsql alvarezp]$ sort -t : < o
OCTAV:ALVAA
octav:alvar
OCTAV:ALVAZ
OCTIa:ALVAa
octia:alvra
OCTIb:ALVZa
OCTIb:ALVZa
OCTIc:ALVAa
octic:alvra
OCTVI:ALVAa
octvi:alvra
OCTVI:ALVZa

Whatever. Take a look at this one:

[alvarezp@pgsql alvarezp]$ sort -k 1,1 < o
octav alvar
OCTAV ALVAA
OCTAV ALVAZ
octia alvra
OCTIa ALVAa
OCTIb ALVZa
OCTIb ALVZa
octic alvra
OCTIc ALVAa
octvi alvra
OCTVI ALVAa
OCTVI ALVZa

I don't know if detection of which keys are equal (in this case
octav=OCTAV=OCTAV) should be made by PostgreSQL or libc. I also don't know
if I am wrong assuming octav=OCTAV. For alphabetic sorting, it should be
case insensitive.

Octavio.

Re: LC_COLLATE=es_MX in PgSQL 7.3.2

From

Tom Lane

Date:

12 June 2003, 22:10:32

"Octavio Alvarez" <alvarezp@octavio.ods.org> writes:
> I don't know if detection of which keys are equal (in this case
> octav=OCTAV=OCTAV) should be made by PostgreSQL or libc. I also don't know
> if I am wrong assuming octav=OCTAV. For alphabetic sorting, it should be
> case insensitive.

I believe that the usual locale definitions set things up so that
strings of different case are not "equal", but case differences are
the last tiebreaker when sorting.  So "octav" sorts before "OCTAV"
but after everything else that's less than "OCTAV".

I haven't studied your results to see if they are consistent with
that viewpoint, but I'm really not ready to believe that Postgres
is sorting in a way that's not what the active locale says to do.
We have not heard such a report before.  (We have, on the other
hand, heard from lots of people who were surprised at the properties
of their default locale ...)

            regards, tom lane

Re: LC_COLLATE=es_MX in PgSQL 7.3.2

From

Octavio Alvarez

Date:

13 June 2003, 15:11:01

On Wednesday 11 June 2003 21:34, Alvaro Herrera wrote:
> On Wed, Jun 11, 2003 at 04:40:42PM -0700, Octavio Alvarez wrote:
> > Ohh! Look at this... I think query #4 does not handle the accents
> > correctly, and it should behave almost like #5 does, except for the
> > record with p = 'R' which should be at the top (if they were #4's
> > results).
>
> I don't understand.  Are you saying that "ré" should sort the same as
> "RÉ", and that the ambiguity should be solved by the next column?  Well
> if it's that, I'm afraid it's not the way the locale is defined.

Yes. In fact, after doing some tests, I noticed that that doesn't even work
forn non-accented characters! For example, my name, OCTAVIO is different than
octavio. As of strcoll and strcmp, they will of course be different. If I
were only to sort by name, it would be correct to consider them as the
different values, but if sorting by name and last name ("alvarez", "rivera",
for instance), "OCTAVIO rivera" should be sorted AFTER "octavio ALVAREZ"
(which is not being done like that because "OCTAVIO" > "octavio" in the
sorting of the first column).

One approach would be sorting by name || ' ' || lastname, but I don't know how
that could slow down my queries. I will try that today.

Octavio.

Re: LC_COLLATE=es_MX in PgSQL 7.3.2

From

Peter Eisentraut

Date:

16 June 2003, 09:41:28

Octavio Alvarez writes:

> No accents here. I would have expected:
>    p   |   m
> -------+-------
>  OCTAV | ALVAA
>  octav | alvar
>  OCTAV | ALVAZ
>  OCTIa | ALVAa
>  octia | alvra
>  OCTIb | ALVZa
>  OCTIb | ALVZa
>  OCTIc | ALVAa
>  octic | alvra
>  OCTVI | ALVAa
>  octvi | alvra
>  OCTVI | ALVZa
> (12 rows)

That's not the way it works.  First it sorts by p, then by m -- there is
no cross-column magic.  Maybe for your application you will be happier if
you sort by lower(p), lower(m).

--
Peter Eisentraut   peter_e@gmx.net