Thread: Chars problem restoring to ps 8.4 (utf8) a dumped db from ps 8.1 (latin9)

Chars problem restoring to ps 8.4 (utf8) a dumped db from ps 8.1 (latin9)

From
Bianchi Quota Leonardo
Date:
Hi, I'm trying to move a db from postgres 8.1 encoded LATIN9 from a debian 4.0 box to postgres 8.4 encoded UTF8 on a
rh6.6(the whole job is to dismiss the old server, migrate and upgrade bugzilla application) 
I would like to restore dumped data in the new utf8 db solving the problem of chars like "è,à,é,ò,ù" which are not seen
bythe application correctly. About the application I can say it is set for using utf8. 
About my skills, I have very little experience with db and postgres.

I "SOLVED" it doing this way but don't know what I did and I don't know which consequences would have in future, then I
needto know if it's ok... 

Starting on BOX1
$pg_dump --no-privileges --no-owner -h localhost -U bugs -f DB.sql   (dump in latin9)

$vi DB.sql and changed the first string with the last.
>SET client_encoding = 'LATIN9';
<SET client_encoding = 'utf8';

moved via scp to box2
$scp DB.sql 172.25.98.141:/home/deploy

on BOX2
via psql create new db "bugzilla-database"
$psql bugzilla-database < DB.sql

Then browsing the restored data via the web application everything is ok...
-
Here below I put few notes, it may can help:

DB is not big (bugzilla bugs table contains ~5000records) and dump file not compressed (no -Fc) is about 450MB
BOX1 debian variables LANG=it_IT@euro      LC_CTYPE="it_IT@euro" (which means LATIN9)
BOX2 redhat variables LANG=en_US.UTF-8   LC_CTYPE="en_US.UTF-8"
---------------------------------------------
example of words seen via "vi" on debian  (the words are: attività = activity ; è = è ; l'attuale = the current)
LATIN9 dump file: attivitÃ|            è           lâ~@~Yattuale
UTF8   dump file:  attivitÃ~CÂ|     Ã~CÅ¡       lâÂ~@Â~Yattuale
UTF8 ...and on RH ie "è" is            ÃÅ¡
----------------------------------------------
Other notes:
- I tryed pg_dump using --encoding UTF8 and -Fc and -f (to avoid redirect) but it wasn't successful.
- I successfully dumped in Latin9, create a db in latin9 and restore dumped file (meaning I could see correctly the
charsvia the application) 

I tried to be concise (hope not too much).

Thank you in advance for any help!

Bye,
Leonardo

AVVISO DI RISERVATEZZA Informazioni riservate possono essere contenute nel messaggio o nei suoi allegati. Se non siete
idestinatari indicati nel messaggio, o responsabili per la sua consegna alla persona, o se avete ricevuto il messaggio
pererrore, siete pregati di non trascriverlo, copiarlo o inviarlo a nessuno. In tal caso vi invitiamo a cancellare il
messaggioed i suoi allegati. Grazie. CONFIDENTIALITY NOTICE Confidential information may be contained in this message
orin its attachments. If you are not the addressee indicated in this message, or responsible for message delivering to
thatperson, or if you have received this message in error, you may not transcribe, copy or deliver this message to
anyone.In that case, you should delete this message and its attachments. Thank you. 


Re: Chars problem restoring to ps 8.4 (utf8) a dumped db from ps 8.1 (latin9)

From
Adrian Klaver
Date:
On 08/12/2015 06:46 AM, Bianchi Quota Leonardo wrote:
> Hi, I'm trying to move a db from postgres 8.1 encoded LATIN9 from a debian 4.0 box to postgres 8.4 encoded UTF8 on a
rh6.6(the whole job is to dismiss the old server, migrate and upgrade bugzilla application) 

FYI, 8.4 is no longer community supported. The oldest supported version
is 9.0 and its support will in September. See here for more details:

http://www.postgresql.org/support/versioning/


> I would like to restore dumped data in the new utf8 db solving the problem of chars like "è,à,é,ò,ù" which are not
seenby the application correctly. About the application I can say it is set for using utf8. 
> About my skills, I have very little experience with db and postgres.
>
> I "SOLVED" it doing this way but don't know what I did and I don't know which consequences would have in future, then
Ineed to know if it's ok... 
>
> Starting on BOX1
> $pg_dump --no-privileges --no-owner -h localhost -U bugs -f DB.sql   (dump in latin9)
>
> $vi DB.sql and changed the first string with the last.
>> SET client_encoding = 'LATIN9';
> <SET client_encoding = 'utf8';
>
> moved via scp to box2
> $scp DB.sql 172.25.98.141:/home/deploy
>
> on BOX2
> via psql create new db "bugzilla-database"
> $psql bugzilla-database < DB.sql
>
> Then browsing the restored data via the web application everything is ok...
> -
> Here below I put few notes, it may can help:
>
> DB is not big (bugzilla bugs table contains ~5000records) and dump file not compressed (no -Fc) is about 450MB
> BOX1 debian variables LANG=it_IT@euro      LC_CTYPE="it_IT@euro" (which means LATIN9)
> BOX2 redhat variables LANG=en_US.UTF-8   LC_CTYPE="en_US.UTF-8"
> ---------------------------------------------
> example of words seen via "vi" on debian  (the words are: attività = activity ; è = è ; l'attuale = the current)
> LATIN9 dump file: attivitÃ|            è           lâ~@~Yattuale
> UTF8   dump file:  attivitÃ~CÂ|     Ã~CÅ¡       lâÂ~@Â~Yattuale
> UTF8 ...and on RH ie "è" is            ÃÅ¡
> ----------------------------------------------
> Other notes:
> - I tryed pg_dump using --encoding UTF8 and -Fc and -f (to avoid redirect) but it wasn't successful.
> - I successfully dumped in Latin9, create a db in latin9 and restore dumped file (meaning I could see correctly the
charsvia the application) 
>
> I tried to be concise (hope not too much).
>
> Thank you in advance for any help!

The reason this:

SET client_encoding = 'LATIN9';

worked can be found here:

http://www.postgresql.org/docs/8.4/interactive/multibyte.html

Postgres can convert latin9 to UTF8, if it knows that is what it getting.

>
> Bye,
> Leonardo
>
> AVVISO DI RISERVATEZZA Informazioni riservate possono essere contenute nel messaggio o nei suoi allegati. Se non
sietei destinatari indicati nel messaggio, o responsabili per la sua consegna alla persona, o se avete ricevuto il
messaggioper errore, siete pregati di non trascriverlo, copiarlo o inviarlo a nessuno. In tal caso vi invitiamo a
cancellareil messaggio ed i suoi allegati. Grazie. CONFIDENTIALITY NOTICE Confidential information may be contained in
thismessage or in its attachments. If you are not the addressee indicated in this message, or responsible for message
deliveringto that person, or if you have received this message in error, you may not transcribe, copy or deliver this
messageto anyone. In that case, you should delete this message and its attachments. Thank you. 
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Adrian Klaver <adrian.klaver@aklaver.com> writes:
> On 08/12/2015 06:46 AM, Bianchi Quota Leonardo wrote:
>> I "SOLVED" it doing this way but don't know what I did and I don't know which consequences would have in future,
thenI need to know if it's ok... 
>>
>> Starting on BOX1
>> $pg_dump --no-privileges --no-owner -h localhost -U bugs -f DB.sql   (dump in latin9)
>>
>> $vi DB.sql and changed the first string with the last.
>> >SET client_encoding = 'LATIN9';
>> <SET client_encoding = 'utf8';

It does not seem likely to me that this would work at all.  You're taking
a dump file that is full of LATIN9 data and simply asserting that it's
UTF8 data.  That doesn't make it so.  If it seemed to work, maybe that's
because your editor changed the encoding?  Not to be relied on, for sure.

The right way to do something like this is not to change the dump file at
all, just create a new database with the encoding you want and restore
the dump file into it.  PG will handle the encoding conversion
automatically, as long as you don't misinform it about what encoding the
incoming data is in.

            regards, tom lane


Re: Chars problem restoring to ps 8.4 (utf8) a dumped db from ps 8.1 (latin9)

From
"Martín Marqués"
Date:
El 12/08/15 a las 11:12, Tom Lane escribió:
> Adrian Klaver <adrian.klaver@aklaver.com> writes:
>> On 08/12/2015 06:46 AM, Bianchi Quota Leonardo wrote:
>> Hi, I'm trying to move a db from postgres 8.1 encoded LATIN9 from a
>> debian 4.0 box to postgres 8.4 encoded UTF8 on a rh6.6 (the whole job
>> is to dismiss the old server, migrate and upgrade bugzilla application)
>
> FYI, 8.4 is no longer community supported. The oldest supported version
> is 9.0 and its support will in September. See here for more details:
>
> http://www.postgresql.org/support/versioning/

I think you should try moving even further ahead from 9.0 or 9.1, as to
avoid the trouble of having to plan a new upgrade any time soon.

>>> I "SOLVED" it doing this way but don't know what I did and I don't know which consequences would have in future,
thenI need to know if it's ok... 
>>>
>>> Starting on BOX1
>>> $pg_dump --no-privileges --no-owner -h localhost -U bugs -f DB.sql   (dump in latin9)
>>>
>>> $vi DB.sql and changed the first string with the last.
>>>> SET client_encoding = 'LATIN9';
>>> <SET client_encoding = 'utf8';

client_encoding tells the postgres server which encoding the data he
will be receiving is in.

If the data was correctly inserted in the old database with LATIN9, and
you didn't change the client_encoding, then pg_dump will output data
with LATIN9 encoding.

> It does not seem likely to me that this would work at all.  You're taking
> a dump file that is full of LATIN9 data and simply asserting that it's
> UTF8 data.  That doesn't make it so.  If it seemed to work, maybe that's
> because your editor changed the encoding?  Not to be relied on, for sure.

Well, IIRC a LATIN9 encoding char which is interpreted as UTF8 will get
inserted with no error on a UTF8 server (although the final data will be
bogus).

IMO Leonardo is confused with the meaning of client_encoding, and should
maybe take a look here before continuing:

http://www.postgresql.org/docs/9.4/static/multibyte.html

And while reading that, they can switch to 9.4. ;)

Regadrs,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


"=?UTF-8?Q?Mart=c3=adn_Marqu=c3=a9s?=" <martin.marques@gmail.com> writes:
> El 12/08/15 a las 11:12, Tom Lane escribi�:
>> It does not seem likely to me that this would work at all.  You're taking
>> a dump file that is full of LATIN9 data and simply asserting that it's
>> UTF8 data.  That doesn't make it so.  If it seemed to work, maybe that's
>> because your editor changed the encoding?  Not to be relied on, for sure.

> Well, IIRC a LATIN9 encoding char which is interpreted as UTF8 will get
> inserted with no error on a UTF8 server (although the final data will be
> bogus).

I'd believe the other way around: if you tell the database that you're
using LATIN9, but what you send is really UTF8, it will not reject it
because the individual bytes are perfectly valid LATIN9 characters and
there are no cross-byte checks to make in LATIN9.  But it seems highly
unlikely that LATIN9-encoded data would get past the UTF8 validity
checker with any consistency.

It's possible that the problem is one of mislabeling, ie the database
was claimed to use LATIN9 but what was actually sent was always UTF8.
If that was *always* the case then the OP's fix of changing the label
in the dump file was actually the right thing to do.  But we haven't
been given enough information to be sure of that --- and if that's
what was happening, then some client software fixes would be in order
anyway, because the client code was using the wrong client_encoding.

            regards, tom lane