Thread: LC_COLLATE=es_MX in PgSQL 7.3.2
Ohh! Look at this... I think query #4 does not handle the accents correctly, and it should behave almost like #5 does, except for the record with p = 'R' which should be at the top (if they were #4's results). #5 is only shown as a sample. #5 works correctly. I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was initialized with --lc-collate=es_MX. 1. This is the table: testdb=# \d t Table "public.t" Column | Type | Modifiers --------+-----------------------+----------- p | character varying(20) | m | character varying(20) | 2. This is the data: testdb=# SELECT * FROM t; p | m ----+------ ra | ab re | cc ri | cd ré | gg ré | zz RE | t3 R | tÉ3 RÉ | t3 RÉ | t4 RÍ | am4 Rí | am 5 (11 rows) 3. I think the following is correct: testdb=# SELECT * FROM t ORDER BY p; p | m ----+------ R | tÉ3 ra | ab re | cc RE | t3 ré | gg ré | zz RÉ | t3 RÉ | t4 ri | cd Rí | am 5 RÍ | am4 (11 rows) 4. The following is NOT correct. Take a look at #5 for the almost-correct behavior. testdb=# SELECT * FROM t ORDER BY p, m; p | m ----+------ R | tÉ3 ra | ab re | cc RE | t3 ré | gg ré | zz RÉ | t3 RÉ | t4 ri | cd Rí | am 5 RÍ | am4 (11 rows) 5. The following shows the result somewhat like they should be. Actually, the "R" alone should be at the beginning (not in this query, but if these were the results of the previous one. This one is working ok.) testdb=# SELECT * FROM t ORDER BY p || m; p | m ----+------ ra | ab re | cc ré | gg RE | t3 RÉ | t3 RÉ | t4 ré | zz RÍ | am4 Rí | am 5 ri | cd R | tÉ3 (11 rows) -- Octavio Alvarez Piza. E-mail: alvarezp@octavio.ods.org
"Octavio Alvarez" <alvarezp@octavio.ods.org> writes: > Ohh! Look at this... I think query #4 does not handle the accents > correctly, and it should behave almost like #5 does, except for the record > with p = 'R' which should be at the top (if they were #4's results). > I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was initialized > with --lc-collate=es_MX. How about --lc-ctype? I think that accent handling would be driven by LC_CTYPE not LC_COLLATE. In any case, this is not a Postgres bug unless you can show that other programs using the same LC_foo settings behave differently. We punt pretty much all locale-related processing to subroutines in libc. regards, tom lane
On Wed, Jun 11, 2003 at 04:40:42PM -0700, Octavio Alvarez wrote: > > Ohh! Look at this... I think query #4 does not handle the accents > correctly, and it should behave almost like #5 does, except for the record > with p = 'R' which should be at the top (if they were #4's results). I don't understand. Are you saying that "ré" should sort the same as "RÉ", and that the ambiguity should be solved by the next column? Well if it's that, I'm afraid it's not the way the locale is defined. I wish I could tell you how it is defined, but I don't really understand it. You can see for yourself in: http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/localedata/locales/es_ES?cvsroot=glibc The rules seem pretty convoluted. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "El destino baraja y nosotros jugamos" (A. Schopenhauer)
Tom Lane said: >> I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was >> initialized >> with --lc-collate=es_MX. > > How about --lc-ctype? I think that accent handling would be driven by > LC_CTYPE not LC_COLLATE. May be it's not the accents after all. I did the following tests without accents. Okay. Now, I tried several combinations, including --locale=es_MX and --lc-collate=es_MX --lc-ctype=es_MX, and got the same result. I would like to point out something: (still PG 7.3.2) I tried the following with --locale=es_MX, with --locale=en_US, with --locale=en_US.UTF-8. alvarezp=# select * from t order by p asc, m asc; p | m -------+------- octav | alvar OCTAV | ALVAA OCTAV | ALVAZ octia | alvra OCTIa | ALVAa OCTIb | ALVZa OCTIb | ALVZa octic | alvra OCTIc | ALVAa octvi | alvra OCTVI | ALVAa OCTVI | ALVZa (12 rows) No accents here. I would have expected: p | m -------+------- OCTAV | ALVAA octav | alvar OCTAV | ALVAZ OCTIa | ALVAa octia | alvra OCTIb | ALVZa OCTIb | ALVZa OCTIc | ALVAa octic | alvra OCTVI | ALVAa octvi | alvra OCTVI | ALVZa (12 rows) --locale=C gives out p | m -------+------- OCTAV | ALVAA OCTAV | ALVAZ OCTIa | ALVAa OCTIb | ALVZa OCTIb | ALVZa OCTIc | ALVAa OCTVI | ALVAa OCTVI | ALVZa octav | alvar octia | alvra octic | alvra octvi | alvra (12 rows) which I thnk is correct for that locale. Well, whatever. > In any case, this is not a Postgres bug unless > you can show that other programs using the same LC_foo settings behave > differently. We punt pretty much all locale-related processing to > subroutines in libc. How could I test that? I tried the following. Notice how the "octav" values are correctly sorted, but I don't know if SORT is actually separating the fields or understanding the whole line as 1 key. [alvarezp@pgsql alvarezp]$ sort -t : < o OCTAV:ALVAA octav:alvar OCTAV:ALVAZ OCTIa:ALVAa octia:alvra OCTIb:ALVZa OCTIb:ALVZa OCTIc:ALVAa octic:alvra OCTVI:ALVAa octvi:alvra OCTVI:ALVZa Whatever. Take a look at this one: [alvarezp@pgsql alvarezp]$ sort -k 1,1 < o octav alvar OCTAV ALVAA OCTAV ALVAZ octia alvra OCTIa ALVAa OCTIb ALVZa OCTIb ALVZa octic alvra OCTIc ALVAa octvi alvra OCTVI ALVAa OCTVI ALVZa I don't know if detection of which keys are equal (in this case octav=OCTAV=OCTAV) should be made by PostgreSQL or libc. I also don't know if I am wrong assuming octav=OCTAV. For alphabetic sorting, it should be case insensitive. Octavio.
"Octavio Alvarez" <alvarezp@octavio.ods.org> writes: > I don't know if detection of which keys are equal (in this case > octav=OCTAV=OCTAV) should be made by PostgreSQL or libc. I also don't know > if I am wrong assuming octav=OCTAV. For alphabetic sorting, it should be > case insensitive. I believe that the usual locale definitions set things up so that strings of different case are not "equal", but case differences are the last tiebreaker when sorting. So "octav" sorts before "OCTAV" but after everything else that's less than "OCTAV". I haven't studied your results to see if they are consistent with that viewpoint, but I'm really not ready to believe that Postgres is sorting in a way that's not what the active locale says to do. We have not heard such a report before. (We have, on the other hand, heard from lots of people who were surprised at the properties of their default locale ...) regards, tom lane
On Wednesday 11 June 2003 21:34, Alvaro Herrera wrote: > On Wed, Jun 11, 2003 at 04:40:42PM -0700, Octavio Alvarez wrote: > > Ohh! Look at this... I think query #4 does not handle the accents > > correctly, and it should behave almost like #5 does, except for the > > record with p = 'R' which should be at the top (if they were #4's > > results). > > I don't understand. Are you saying that "ré" should sort the same as > "RÉ", and that the ambiguity should be solved by the next column? Well > if it's that, I'm afraid it's not the way the locale is defined. Yes. In fact, after doing some tests, I noticed that that doesn't even work forn non-accented characters! For example, my name, OCTAVIO is different than octavio. As of strcoll and strcmp, they will of course be different. If I were only to sort by name, it would be correct to consider them as the different values, but if sorting by name and last name ("alvarez", "rivera", for instance), "OCTAVIO rivera" should be sorted AFTER "octavio ALVAREZ" (which is not being done like that because "OCTAVIO" > "octavio" in the sorting of the first column). One approach would be sorting by name || ' ' || lastname, but I don't know how that could slow down my queries. I will try that today. Octavio.
Octavio Alvarez writes: > No accents here. I would have expected: > p | m > -------+------- > OCTAV | ALVAA > octav | alvar > OCTAV | ALVAZ > OCTIa | ALVAa > octia | alvra > OCTIb | ALVZa > OCTIb | ALVZa > OCTIc | ALVAa > octic | alvra > OCTVI | ALVAa > octvi | alvra > OCTVI | ALVZa > (12 rows) That's not the way it works. First it sorts by p, then by m -- there is no cross-column magic. Maybe for your application you will be happier if you sort by lower(p), lower(m). -- Peter Eisentraut peter_e@gmx.net