Thread: Mis-interpreted extended character

Mis-interpreted extended character

From: Andrew Biagioni
Our database (PostgreSQL 7.3.5) uses Unicode encoding:

    egreek=# \l
             List of databases
        Name     |  Owner   | Encoding
    -------------+----------+----------
     egreek      | andrew   | UNICODE

For some reason, if I try to use an extended character (character code > 127) in a string, I get this peculiar result:

    egreek=# SELECT  LOWER ( 'Niño' );
    egreek'#

(note:  it fails to see the closing single-quote character;  I hit CTRL-C to quit the operation)

and:

    egreek=# SELECT  LOWER ( 'Niño ' );
    ERROR:  Unicode >= 0x10000 is not supported

(note:  I added a space before the closing quote).

It looks to me like the 'ñ' character (Spanish n with tilde - ASCII 164, as far as I can tell) is messing with the parser somehow.

Furthermore -- this data comes from a form on a web site used globally, so I have little control over what is passed to me (me = the PostgreSQL DB).  I can't (without a lot of hassle) demand that characters above 127 be encoded on the way in and decoded on the way out.

Any suggestions/info/hints?

Thanks,

                 Andrew

Re: Mis-interpreted extended character

From: Chris
> Our database ( (PostgreSQL) 7.3.5 ) uses Unicode encoding:
> [...]

> For some reason, If I try to use an extended character (ASCII code >
> 127) in a string, I get this peculiar result:

> [...]

Probably your terminal is set to ISO-8859-1 ("Latin-1") or something
like that, while your database is set to Unicode, as you showed.

Hence the mismatch. In Unicode encodings such as UTF-8, non-US-ASCII
characters are encoded as two or more bytes (as opposed to a single
byte > 127, as in ISO-8859-1). The solution is to have everything
agree on the encoding: terminal + DB, or web browser + DB.
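The mismatch is visible at the byte level. A minimal Python sketch (an illustration only, not part of the original thread):

```python
# 'ñ' is the single byte 0xF1 in ISO-8859-1, but two bytes in UTF-8.
latin1 = "ñ".encode("iso-8859-1")
utf8 = "ñ".encode("utf-8")
print(latin1)  # b'\xf1'
print(utf8)    # b'\xc3\xb1'

# 0xF1 is also a UTF-8 lead byte announcing a four-byte sequence, so a
# UTF-8 parser fed Latin-1 bytes tries to consume the characters after
# 'ñ' -- consistent with psql appearing to "lose" the closing quote.
try:
    latin1.decode("utf-8")
except UnicodeDecodeError:
    print("0xF1 alone is not valid UTF-8")
```

If the terminal really is Latin-1, telling PostgreSQL so (e.g. `SET client_encoding TO 'LATIN1';`, or `\encoding LATIN1` in psql) should make the server convert between the client's encoding and the database's Unicode encoding.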

Btw. you _do_ actually have an influence on what encoding a web browser
uses: declare the charset in the Content-Type header (and/or the form's
accept-charset attribute).
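As a sketch of that (a hypothetical minimal response builder in Python; the charset value is just an example):

```python
def http_response(body: str, charset: str = "ISO-8859-1") -> str:
    # The charset parameter of the Content-Type header tells the browser
    # how the page is encoded -- and browsers generally submit form data
    # back in that same encoding.
    return (f"Content-Type: text/html; charset={charset}\r\n"
            "\r\n"
            + body)

print(http_response("<form accept-charset='ISO-8859-1'>...</form>"))
```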

In my experience, if you only have to deal with Western European
text, you're (still) better off with ISO-8859-1 (or ISO-8859-15,
which also has the euro sign).


Short answer: not PostgreSQL's fault.


Bye, Chris.



Re: Mis-interpreted extended character

From: Andrew Biagioni
Chris,

Thanks -- your answer is 90% of what I need!  As for the other 10%:

Chris wrote:
> > Our database ( (PostgreSQL) 7.3.5 ) uses Unicode encoding:
> > [...]
> >
> > For some reason, If I try to use an extended character (ASCII code >
> > 127) in a string, I get this peculiar result:
> >
> > [...]
>
> Probably your terminal is set to ISO-8859-1 ("latin 1") or something
> like that, while your database is set to unicode as you showed.
> Hence the mismatch. In unicode (for example UTF-8) non-US-ASCII
> characters are encoded with two bytes (as opposed to one byte > 127
> as happens with ISO-8859-1). Solution is to have everything agree on the
> encoding. Terminal + DB or Web Browser + DB.
>
> Btw. you _do_ actually have an influence on what encoding a web browser
> uses by setting the "encoding" HTTP header.
>
> According to my experience, if you have to deal with only western
> european encodings, you're better off (still) with ISO-8859-1 (or
> ISO-8859-15 to have the EUR symbol too).
>
> Short answer: not PostgreSQL's fault.

Agreed -- and thanks for the above info.

It ALSO turns out that Java has its own issues with >127 characters, which I'm going to look into -- but it was nice to prove you right (and solve half my problem!) by setting the web page encoding...

> Bye, Chris.




Thanks again,

        Andrew