Thread: illegal sort order

illegal sort order

From

"Andrus"

Date:

12 July 2005, 14:46:56

I insalled Postgres 8 in Windows XP with default settings
I xreated database with encoding unicode.

I noticed that the sort order of accented characters is

B
�
�
C.
�
�
C

this is totally incorrect!  It is interesting that names beginning with C
are not contiguous: between C. and C are accented characters.

The correct sort order (by using character code values) shoud be

B
C
C.
�
�
�
�

How to force the correct sort order or at least move accented characters
���� to end of sorted list ?

Andrus.

Re: illegal sort order

From

Bruno Wolff III

Date:

12 July 2005, 15:29:07

On Tue, Jul 12, 2005 at 20:45:37 +0300,
  Andrus <eetasoft@online.ee> wrote:
>
> How to force the correct sort order or at least move accented characters
> ÕÄÖÜ to end of sorted list ?

Sort order depends on the locale used in initdb. If you want data sorted
by the codes used to represent the data, then you might want to initdb
with a locale of "C". Doing an initdb will require a dump and reload.

Re: illegal sort order

From

"Andrus"

Date:

12 July 2005, 16:16:04

> Sort order depends on the locale used in initdb. If you want data sorted
> by the codes used to represent the data, then you might want to initdb
> with a locale of "C". Doing an initdb will require a dump and reload.

Bruno, thank you.

SHOW ALL command returns the following:

"client_encoding";"UNICODE"
"lc_collate";"Estonian_Estonia.1257"
"lc_ctype";"Estonian_Estonia.1257"
"lc_messages";"Estonian_Estonia.1257"
"lc_monetary";"Estonian_Estonia.1257"
"lc_numeric";"Estonian_Estonia.1257"
"lc_time";"Estonian_Estonia.1257"

Unfortunately, the sort order is incorrect:

"A"
"S"
"B"
"C"
"�"
"�"
"�"
"�"
"D"
"E"
"F"
"G"

accented charactes must be at the end of alphabet.

Why Postgres uses VERY stange sort order ? I do'nt believe that this order
exists in any locale.

Also, UPPER() function causes error

ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.

Is it possible to  fix the sort order and use UPPER() function in this
locale without dump and reload ?

Andrus.

Re: illegal sort order

From

Bruno Wolff III

Date:

12 July 2005, 17:13:47

On Tue, Jul 12, 2005 at 22:09:40 +0300,
  Andrus <eetasoft@online.ee> wrote:
> > Sort order depends on the locale used in initdb. If you want data sorted
> > by the codes used to represent the data, then you might want to initdb
> > with a locale of "C". Doing an initdb will require a dump and reload.
>
> Bruno, thank you.
>
> SHOW ALL command returns the following:
>
> "client_encoding";"UNICODE"
> "lc_collate";"Estonian_Estonia.1257"
> "lc_ctype";"Estonian_Estonia.1257"
> "lc_messages";"Estonian_Estonia.1257"
> "lc_monetary";"Estonian_Estonia.1257"
> "lc_numeric";"Estonian_Estonia.1257"
> "lc_time";"Estonian_Estonia.1257"

Are you running this on Windows? There is a problem with Unicode there.
I don't know what the exact symptoms are, so I don't know whether or not
that explains your problem.

>
> Unfortunately, the sort order is incorrect:
>
> "A"
> "S"
> "B"
> "C"
> "Ü"
> "Ö"
> "Ä"
> "Õ"
> "D"
> "E"
> "F"
> "G"
>
> accented charactes must be at the end of alphabet.
>
> Why Postgres uses VERY stange sort order ? I do'nt believe that this order
> exists in any locale.

Sort order is determined by the locale. If the locale doesn't have the
correct sort order you need to fix the locale definition in your OS or
use another locale.

>
> Also, UPPER() function causes error
>
> ERROR:  invalid multibyte character for locale
> HINT:  The server's LC_CTYPE locale is probably incompatible with the
> database encoding.
>
> Is it possible to  fix the sort order and use UPPER() function in this
> locale without dump and reload ?
>
> Andrus.
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

Re: illegal sort order

From

Tom Lane

Date:

12 July 2005, 17:40:17

Bruno Wolff III <bruno@wolff.to> writes:
>   Andrus <eetasoft@online.ee> wrote:
>> SHOW ALL command returns the following:
>>
>> "client_encoding";"UNICODE"
>> "lc_collate";"Estonian_Estonia.1257"
>> "lc_ctype";"Estonian_Estonia.1257"
>> "lc_messages";"Estonian_Estonia.1257"
>> "lc_monetary";"Estonian_Estonia.1257"
>> "lc_numeric";"Estonian_Estonia.1257"
>> "lc_time";"Estonian_Estonia.1257"

> Are you running this on Windows? There is a problem with Unicode there.

Windows or not, the locale name suggests *very* strongly that it is not
a Unicode-using locale.  If the database encoding doesn't match what the
locale is expecting then strange sort order is exactly what you are
likely to get.

We should defend against these sorts of configuration problems more
thoroughly, but it seems to be hard in general to determine what
character set a locale is actually using :-(

            regards, tom lane

Re: illegal sort order

From

"Magnus Hagander"

Date:

13 July 2005, 05:50:15

> I insalled Postgres 8 in Windows XP with default settings I
> xreated database with encoding unicode.

Unicode is not currently supported when the server runs on Win32, see
http://www.postgresql.org/docs/faqs.FAQ_windows.html#2.6. You'll need to
pick a different encoding if you need locale-aware sorting.

This is something we hope to have fixed in 8.1.

//Magnus

Re: illegal sort order

From

"Andrus"

Date:

13 July 2005, 08:07:05

> Unicode is not currently supported when the server runs on Win32, see
> http://www.postgresql.org/docs/faqs.FAQ_windows.html#2.6. You'll need to
> pick a different encoding if you need locale-aware sorting.

I need to upsize tables from other dbms. This dbms does not support unicode.
I have tables which contain estonian language characters in one column
(using windows-1257 encoding) and russian langugage characters in other
column (using windows-1251 encoding) as single byte characters.
I use Postgres through ODBC.

I tried the following encodings for upsize this table with the following
results:

SQL_ASCII - non ascii characters are converted to question marks. This is
problably by ODBC driver since pgAdmin shows them.

LATIN1, LATIN4 - some characters cause error if I try to upsize my data to
Postgres (I use odbc for upsize).

If unicode is not working, I need fully transparent access through ODBC: all
bytes in range 32 .. 255 must be stored in unmodified form in CHARACTER and
TEXT fields and returned in unmodified form to my ODBC client.
This does not allow UPPER() function to work but at least sorting is
somewhat meaningful (when sorted by unsigned binary values.)

Unfortunately it seems that ODBC driver replaces some charactes to question
marks in case on SQL_ASCII .
 So I need SQL_UNSIGNED_BINARY  or SQL_TRANSPARENT encoding.
I tried to use binary data types insted of char and text types but pgADMIN
show them as octal strings.
I have no idea, meybe it is possible configure or modify odbc driver. I
looked into odbc drivers sources but havent found place which converts
chars>0x80 to question marks.

Any idea ?

When fixed unicode version (8.1?) will be available ?

Andrus.