Thread: Mis-interpreted extended character

Mis-interpreted extended character

From

Andrew Biagioni

Date:

13 December 2003, 11:19:05

Our database ( (PostgreSQL) 7.3.5 ) uses Unicode encoding:

    egreek=# \l
             List of databases
        Name     | Owner   | Encoding
    -------------+----------+----------
     egreek      | andrew   | UNICODE

For some reason, If I try to use an extended character (ASCII code > 127) in a string, I get this peculiar result:

    egreek=# SELECT LOWER ( 'Niño' );
    egreek'#

(note: it fails to see the closing single-quote character; I hit CTRL-C to quit the operation)

and:

    egreek=# SELECT LOWER ( 'Niño ' );
    ERROR: Unicode >= 0x10000 is not supported

(note: I added a space before the closing quote).

It looks to me like the 'ñ' character (spanish n with tilde - ASCII 164, as far as I can tell) is messing with the parser somehow.

Furthermore -- this is data coming from a form in a web site used globally, so I have little control over what is passed to me (me = PostgreSQL DB). I can't (without lotsa hassles) "demand" that higher-ASCII characters be encoded on the way in, and decoded on the way out.

Any suggestions/info/hints?

Thanks,

                 Andrew

Re: Mis-interpreted extended character

From

Chris

Date:

13 December 2003, 11:39:35

> Our database ( (PostgreSQL) 7.3.5 ) uses Unicode encoding:
> [...]

> For some reason, If I try to use an extended character (ASCII code >
> 127) in a string, I get this peculiar result:

> [...]

Probably your terminal is set to ISO-8859-1 ("latin 1") or something
like that, while your database is set to unicode as you showed.

Hence the mismatch. In unicode (for example UTF-8) non-US-ASCII
characters are encoded with two bytes (as opposed to one byte > 127
as happens with ISO-8859-1). Solution is to have everything agree on the
encoding. Terminal + DB or Web Browser + DB.

Btw. you _do_ actually have an influence on what encoding a web browser
uses by setting the "encoding" HTTP header.

According to my experience, if you have to deal with only western
european encodings, you're better off (still) with ISO-8859-1 (or
ISO-8859-15 to have the EUR symbol too).


Short answer: not PostgreSQL's fault.


Bye, Chris.

Re: Mis-interpreted extended character

From

Andrew Biagioni

Date:

15 December 2003, 21:26:15

Chris,

Thanks -- your answer is 90% of what I need! As for the other 10%:

Chris wrote:

Our database ( (PostgreSQL) 7.3.5 ) uses Unicode encoding: 
[...]

For some reason, If I try to use an extended character (ASCII code >
127) in a string, I get this peculiar result:

[...]

Probably your terminal is set to ISO-8859-1 ("latin 1") or something
like that, while your database is set to unicode as you showed.

Hence the mismatch. In unicode (for example UTF-8) non-US-ASCII
characters are encoded with two bytes (as opposed to one byte > 127 
as happens with ISO-8859-1). Solution is to have everything agree on the
encoding. Terminal + DB or Web Browser + DB.

Btw. you _do_ actually have an influence on what encoding a web browser
uses by setting the "encoding" HTTP header.

According to my experience, if you have to deal with only western
european encodings, you're better off (still) with ISO-8859-1 (or
ISO-8859-15 to have the EUR symbol too).


Short answer: not PostgreSQL's fault.

Agreed -- and thanks for the above info.

It ALSO turns out that Java has its own issues with >127 characters, which I'm going to look into -- but it was nice to prove you right (and solve half my problem!) by setting the web page encoding...

Bye, Chris.



---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your     joining column's datatypes do not match

Thanks again,

Andrew