Thread: Character encoding

Character encoding

From
Dennis Björklund
Date:
I've been playing with character encodings and found a problem/bug. I
still use 7.3.2, so it's possible (but I don't think so) that some of this
has been fixed.

When you run psql with a language other than English, the strings are
usually in a character set that is not pure ascii. For example, to
represent Swedish you need either latin1 or unicode. Therefore the po file
for Swedish is in latin1.

Now, these strings are used to create queries that are sent to postgres.
For example, if you perform \d in a Swedish psql you get:

# \d
        Lista med relationer
 Schema |   Namn    |   Typ   | Ägare
--------+-----------+---------+--------
 public | boz       | tabell  | dennis
 public | boz_a_seq | sekvens | dennis

where "Owner" is translated to "Ägare". The problem arises when the
database uses utf-8: psql still seems to create queries with latin1
characters in them, which are invalid in utf-8. So I get this:

# \d
ERROR:  Invalid UNICODE character sequence found (0xe47273)

The strings have to be converted to utf-8 before they are sent to the
backend.

Actually, in the example above it's not the string "Ägare" that gives the
error message but the value that maps relkind 's' to 'särskild' in
Swedish. It seems that column names and column values are treated
differently.
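
The byte sequence in the error message can be reproduced outside psql. The
following Python sketch is only an illustration, not psql's code, and the
variable names are made up. It encodes 'särskild' as latin1, picks out the
three bytes starting at "ä", and checks that the same text converted to
utf-8 is a valid sequence.

```python
# Illustrative only: reproduce the bytes reported in the error message.
translated = "särskild"  # Swedish string mentioned in the message

latin1_bytes = translated.encode("latin-1")
# The three bytes starting at "ä" are the ones the backend complains about.
offending = latin1_bytes[1:4].hex()

# The latin1 bytes do not form a valid utf-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

# Converting to utf-8 before sending gives a sequence that decodes cleanly.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
roundtrip = utf8_bytes.decode("utf-8")

print(offending, valid_as_utf8, utf8_bytes.hex(), roundtrip)
```

Here 0xe4 is "ä" in latin1, but in utf-8 it announces a three-byte
character, and the following bytes 0x72 and 0x73 ("r" and "s") are not valid
continuation bytes.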

My guess is that the backend doesn't care what the column name is and
just sends it back, which is broken if different character encodings
are in play.

I also have another problem with character sets. I have a unicode
database, and when I set the client encoding to unicode I get nice utf-8
strings back. However, my terminal cannot show them, so when I run psql I
get strings like "armbÃ¥ge" (which is what a utf-8 string looks like in
latin1). My client program written using libpq works fine and I get good
utf-8 back.

However, I tried to set the client encoding in psql to latin1 so that it
would show the strings correctly. Then the string above really should be
shown as "armbåge", but it is shown as "armbge".

It should work fine since I know that my strings really are latin1 strings
even when represented as utf-8. Also, the manual says that it should work
even for characters that have no conversion; such a character should then
become its hexadecimal value in parentheses.
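
As a rough illustration of the two display cases described above, the
Python sketch below shows what a latin1 terminal makes of raw utf-8 bytes,
and what a correct utf-8 to latin1 conversion should yield. It says nothing
about how psql or the backend performs the conversion; the variable names
are assumptions made for the example.

```python
# Illustrative only: utf-8 bytes viewed as latin1 versus a real conversion.
word = "armbåge"
utf8_bytes = word.encode("utf-8")

# A latin1 terminal treats every utf-8 byte as one character, so the
# two-byte encoding of "å" shows up as two separate characters.
seen_on_latin1_terminal = utf8_bytes.decode("latin-1")

# A proper conversion decodes the utf-8 first and re-encodes as latin1,
# which is what client encoding latin1 is expected to deliver.
latin1_bytes = utf8_bytes.decode("utf-8").encode("latin-1")
shown_after_conversion = latin1_bytes.decode("latin-1")

print(seen_on_latin1_terminal)
print(shown_after_conversion)
```

Since "å" exists in latin1, nothing should be lost in the conversion; a
result of "armbge" means the character was dropped rather than converted.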

-- 
/Dennis



Re: Character encoding

From
Peter Eisentraut
Date:
Dennis Björklund writes:

> When you run psql with a language other than English, the strings are
> usually in a character set that is not pure ascii. For example, to
> represent Swedish you need either latin1 or unicode. Therefore the po file
> for Swedish is in latin1.

Yes, you need to set your client encoding to match the PO files.  Maybe we
should try to keep the (translated) column headers within the client, to
side-step this issue.  Do you want to investigate that?

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Character encoding

From
Dennis Björklund
Date:
On Tue, 10 Jun 2003, Peter Eisentraut wrote:

> we should try to keep the (translated) column headers within the client,
> to side-step this issue.  Do you want to investigate that?

That is the obvious solution; there is no real need to send the strings to
the server in the first place.

The problem is not just with column headers, though, as translated strings
are also used in the data in the table (which of course has the same
solution: examine the data when it gets to the client and substitute the
translated string).
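
A minimal sketch of that client-side substitution, in Python rather than
psql's C and with entirely hypothetical names (TRANSLATIONS, translate,
localize): the query sent to the server contains only untranslated ascii
strings, and the client swaps in the translations for the headers and for
the columns known to hold translatable values.

```python
# Hypothetical sketch: translate headers and selected values on the client,
# so that no translated string is ever sent to the server.

# Assumed translation table, a stand-in for the client's gettext lookups.
TRANSLATIONS = {
    "Schema": "Schema",
    "Name": "Namn",
    "Type": "Typ",
    "Owner": "Ägare",
    "table": "tabell",
    "sequence": "sekvens",
}

def translate(text):
    """Return the translated string, or the input unchanged if unknown."""
    return TRANSLATIONS.get(text, text)

def localize(headers, rows, translated_columns):
    """Translate all headers and the values of the named columns only."""
    new_headers = [translate(h) for h in headers]
    new_rows = [
        [translate(v) if h in translated_columns else v
         for h, v in zip(headers, row)]
        for row in rows
    ]
    return new_headers, new_rows

# Made-up result of an untranslated query, mirroring the \d listing.
headers = ["Schema", "Name", "Type", "Owner"]
rows = [
    ["public", "boz", "table", "dennis"],
    ["public", "boz_a_seq", "sequence", "dennis"],
]

loc_headers, loc_rows = localize(headers, rows, translated_columns={"Type"})
print(loc_headers)
print(loc_rows)
```

Only the columns listed in translated_columns are substituted, so a user
table that happens to be named "table" keeps its name.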

I'll take a look at it and probably fix it by the weekend (if not sooner).

-- 
/Dennis