Thread: Locale/encoding problem/question

Locale/encoding problem/question

From
henka@cityweb.co.za
Date:
Hello all,

I somehow managed to stuff up the encoding (or locale or something) in a
transfer of a database from one machine to another (also different linux
distribution).

The problem is this:  the origional database was created and populated
with data using whatever default locale/encoding was installed on the
first machine.

Words would appear correctly, as in:  Hoë

After dumping the database and restoring on a different one, the word
incorrectly appears as:  Hoë

Before I do any more damage or waste time, I'd appreciate any ideas or
pointers on correctly performing the initdb -E<encoding>
--locale=<locale>... (or any other ideas on resolving this issue) , before
I try and restore the data again.

Thanks
henk


Re: Locale/encoding problem/question

From
Martijn van Oosterhout
Date:
On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote:
> Hello all,
>
> I somehow managed to stuff up the encoding (or locale or something) in a
> transfer of a database from one machine to another (also different linux
> distribution).
>
> The problem is this:  the origional database was created and populated
> with data using whatever default locale/encoding was installed on the
> first machine.

Two big questions:

1. What encoding are the two database (\l will tell you)?
2. What encoding are the clients expecting?

It is entirely possible that the databases have got the right encoding,
but the client you're accessing it doesn't understand. For example, if
you've got both your databases on UTF-8, then the transfer went fine,
but perhaps the client on the new machine doesn't display UTF-8.

This happens because psql and other clients default to the same
encoding as the server and don't check to see if the terminal actually
supports that. So what you're decribing could also be a result of the
server sending you UTF-8, the client displaying that, but the xterm or
whatever you're using thinking it's Latin1.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Locale/encoding problem/question

From
henka@cityweb.co.za
Date:

> On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote:
>> Hello all,
>>
>> I somehow managed to stuff up the encoding (or locale or something) in a
>> transfer of a database from one machine to another (also different linux
>> distribution).
>>
>> The problem is this:  the origional database was created and populated
>> with data using whatever default locale/encoding was installed on the
>> first machine.
>
> Two big questions:
>
> 1. What encoding are the two database (\l will tell you)?
> 2. What encoding are the clients expecting?


Thanks for the response, Martijn.

I *think* the client_encoding origionally in the db was UTF-8 (but I could
be wrong, it might have been LATIN1).  I would imagine that LATIN1 would
be the right one, since it needs to display standard english, plus some
others (such as é ä ë è etc).

The multibyte chars show up in xterm (putty) -and- when the data is
displayed using php in a browser - both incorrectly.

I've even tried using LATIN1 (ie, explicitly setting it to latin1 using
initdb, and then restoring the database after changing the 'utf-8' strings
in the dump data to 'latin1').  This still yields the funny chars.

To be honest, I have no idea what the origional encoding was.

Can you suggest any other approaches I can try to restore the database so
that those chars display correctly?

All comments are welcome.

Regards
Henk




Re: Locale/encoding problem/question

From
Martijn van Oosterhout
Date:
On Fri, Aug 04, 2006 at 11:58:22AM +0200, henka@cityweb.co.za wrote:
> > On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote:
> > Two big questions:
> >
> > 1. What encoding are the two database (\l will tell you)?
> > 2. What encoding are the clients expecting?

> I've even tried using LATIN1 (ie, explicitly setting it to latin1 using
> initdb, and then restoring the database after changing the 'utf-8' strings
> in the dump data to 'latin1').  This still yields the funny chars.

Wait, so the dump is in utf-8? You shouldn't need to edit the dump,
postgresql will convert the encodings on the fly while loading.

> To be honest, I have no idea what the origional encoding was.

It should be in the dump file, almost the first line. Locale is of no
interest to pg_dump, you'll have to decide how you want it.

> Can you suggest any other approaches I can try to restore the database so
> that those chars display correctly?

Well, at the very least, does it go away if you type:

set client_encoding=latin1;

Please provide more details about your setup too, your client is on
windows? The server is ...?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Locale/encoding problem/question

From
henka@cityweb.co.za
Date:

> On Fri, Aug 04, 2006 at 11:58:22AM +0200, henka@cityweb.co.za wrote:
>> > On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote:
>> > Two big questions:
>> >
>> > 1. What encoding are the two database (\l will tell you)?
>> > 2. What encoding are the clients expecting?
>
>> I've even tried using LATIN1 (ie, explicitly setting it to latin1 using
>> initdb, and then restoring the database after changing the 'utf-8'
>> strings
>> in the dump data to 'latin1').  This still yields the funny chars.
>
> Wait, so the dump is in utf-8? You shouldn't need to edit the dump,
> postgresql will convert the encodings on the fly while loading.

I've actually found two versions - one with UTF-8, and the other LATIN1.

>> To be honest, I have no idea what the origional encoding was.
>
> It should be in the dump file, almost the first line. Locale is of no
> interest to pg_dump, you'll have to decide how you want it.

Yes:  UTF-8 and the other is LATIN1

> Well, at the very least, does it go away if you type:
>
> set client_encoding=latin1;

No it doesn't.  That was one of the first things I tried after reading the
docs.

> Please provide more details about your setup too, your client is on
> windows? The server is ...?

Server:
Linux debian sarge
PG:  8.1.4
show all:
lc_collate                      C
lc_ctype                        C
lc_messages                  C
lc_monetary                   C
lc_numeric                     C
lc_time                         C
client_encoding              LATIN1  (or UTF-8)


Clients:
Windows using PuTTY (ie, for psql), and dynamic web content with PHP/Pg
(on any browser).




Re: Locale/encoding problem/question

From
Tom Lane
Date:
henka@cityweb.co.za writes:
>> It should be in the dump file, almost the first line. Locale is of no
>> interest to pg_dump, you'll have to decide how you want it.

> Yes:  UTF-8 and the other is LATIN1

Note that this represents what the original server *thought* the
encoding was.  But it's not at all impossible that the server thought
the data was LATIN1 when it was really UTF8.  (The other way around is
less plausible because the server would have been able to detect
encoding errors.)  If you were using clients that treated the data
as UTF8 without paying attention to what the server thought, you'd
not have realized you were mislabeling the data.

But, if you tried to load data marked as LATIN1 into a server using
UTF8, it'd have applied a LATIN1 to UTF8 conversion, and then
everything's hosed.

I'd suggest actually inspecting the data in the dump file: it's not that
hard to tell UTF8 from LATIN1 if you look at the byte sequences.

Or you could just take the file marked LATIN1, edit it to change the
client_encoding setting to say the data is UTF8, and see if you can
load it.  If it's not UTF8, 8.1.4 will almost certainly detect a ton of
encoding errors.

            regards, tom lane

Re: Locale/encoding problem/question

From
henka@cityweb.co.za
Date:

> henka@cityweb.co.za writes:
>>> It should be in the dump file, almost the first line. Locale is of no
>>> interest to pg_dump, you'll have to decide how you want it.
>
>> Yes:  UTF-8 and the other is LATIN1
>
> Note that this represents what the original server *thought* the
> encoding was.  But it's not at all impossible that the server thought
> the data was LATIN1 when it was really UTF8.  (The other way around is
> less plausible because the server would have been able to detect
> encoding errors.)  If you were using clients that treated the data
> as UTF8 without paying attention to what the server thought, you'd
> not have realized you were mislabeling the data.
>
> But, if you tried to load data marked as LATIN1 into a server using
> UTF8, it'd have applied a LATIN1 to UTF8 conversion, and then
> everything's hosed.
>
> I'd suggest actually inspecting the data in the dump file: it's not that
> hard to tell UTF8 from LATIN1 if you look at the byte sequences.
>
> Or you could just take the file marked LATIN1, edit it to change the
> client_encoding setting to say the data is UTF8, and see if you can
> load it.  If it's not UTF8, 8.1.4 will almost certainly detect a ton of
> encoding errors.


Thanks Tom, your suggestion worked.

Just to document this for others, this is what I did:

-  created a new empty DB:  initdb -ELATIN1 -D data.
-  edited dump file with UTF8 encoding and changed to LATIN1 (doing the
reverse resulted in encoding errors during restore).
-  restored database

So, it looks like it was the reverse:  the db thought it was UTF8, when in
fact it was LATIN1.

Regards