Thread: problem converting database to UTF-8

problem converting database to UTF-8

From
David Goodenough
Date:
I have a database which was created as LATIN1 (the machine has the
wrong locales installed when I set up  PG).  It is running 8.3.

So I found various places which said the way to do this was to do
a pg_dumpall -f dump_file, get rid of the entire database, init_db -E UTF-8,
and then psql -f dumpfile.

But the psql fails saying:-

psql:dumpfile:49: ERROR:  encoding LATIN1 does not match server's locale
en_GB.UTF-8
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.

I have en_GB.UTF-8 now as my primary locale, and en_GB.ISO8859-1 is
also generated.

So I looked around again and found people saying I needed to use iconv,
but that does not help, I get the same error.

Is there a definative HOWTO that I can follow, if not does someone
have a set of instructions that will work?

If it matters I am running under Debian.

David

Re: problem converting database to UTF-8

From
Vladimir Konrad
Date:
> Is there a definative HOWTO that I can follow, if not does someone
> have a set of instructions that will work?

What about running "iconv" command on the dumped .sql file and transform
it to the utf8?

Vlad

PS: man iconv for manual

Re: problem converting database to UTF-8

From
David Goodenough
Date:
On Thursday 22 January 2009, Vladimir Konrad wrote:
> > Is there a definative HOWTO that I can follow, if not does someone
> > have a set of instructions that will work?
>
> What about running "iconv" command on the dumped .sql file and transform
> it to the utf8?
>
> Vlad
>
> PS: man iconv for manual

iconv does not change the database encodings embedded in the file
(and it is quite large).

Is there no automated procedure.

David

Re: problem converting database to UTF-8

From
Vladimir Konrad
Date:
> iconv does not change the database encodings embedded in the file
> (and it is quite large).

Have you read the manual?

       file   A pathname of an input file. If no file operands are
       specified, or if a file operand is '-', the standard input shall
       be used.


cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql

Size should not matter in this case...

V

Re: problem converting database to UTF-8

From
Alan Hodgson
Date:
On Thursday 22 January 2009, Vladimir Konrad <vk@dsl.pipex.com> wrote:
> > iconv does not change the database encodings embedded in the file
> > (and it is quite large).
>
> Have you read the manual?
>
>        file   A pathname of an input file. If no file operands are
>        specified, or if a file operand is '-', the standard input shall
>        be used.
>
>
> cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql
>
> Size should not matter in this case...

Yeah it does. iconv buffers everything in memory, as I recall.

However, you can "split" the file into manageable pieces, run each through
iconv, and recombine afterwards.


--
Current Peeve: The mindset that the Internet is some sort of school for
novice sysadmins and that everyone -not- doing stupid dangerous things
should act like patient teachers with the ones who are. -- Bill Cole, NANAE

Re: problem converting database to UTF-8

From
David Goodenough
Date:
On Thursday 22 January 2009, Vladimir Konrad wrote:
> > iconv does not change the database encodings embedded in the file
> > (and it is quite large).
>
> Have you read the manual?
>
>        file   A pathname of an input file. If no file operands are
>        specified, or if a file operand is '-', the standard input shall
>        be used.
>
>
> cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql
>
> Size should not matter in this case...
>
> V

You have not understood what I said.  I ran iconv, and it changes the
encoding of the data, but not the ENCODING= statements that are
embedded in the datastream.  Yes I can change those with sed, but
I do not know what else I need to change.  There must be an easier
way.

David

Re: problem converting database to UTF-8

From
Vladimir Konrad
Date:
> You have not understood what I said.  I ran iconv, and it changes the
> encoding of the data, but not the ENCODING= statements that are
> embedded in the datastream.  Yes I can change those with sed, but
> I do not know what else I need to change.  There must be an easier
> way.

Oops, please accept my apologies... Also, it looks that iconv tries to
buffer everything in RAM :-(.

Vlad

Re: problem converting database to UTF-8

From
Alan Hodgson
Date:
On Thursday 22 January 2009, David Goodenough
<david.goodenough@btconnect.com> wrote:
>
> You have not understood what I said.  I ran iconv, and it changes the
> encoding of the data, but not the ENCODING= statements that are
> embedded in the datastream.  Yes I can change those with sed, but
> I do not know what else I need to change.  There must be an easier
> way.

There isn't. I'm pretty sure that iconv plus changing the ENCODING
statement(s) is all I did when we converted, though.


--
Current Peeve: The mindset that the Internet is some sort of school for
novice sysadmins and that everyone -not- doing stupid dangerous things
should act like patient teachers with the ones who are. -- Bill Cole, NANAE

Re: problem converting database to UTF-8

From
"Daniel Verite"
Date:
    Alan Hodgson wrote:

> Yeah it does. iconv buffers everything in memory, as I recall.
>
> However, you can "split" the file into manageable pieces, run each
through
> iconv, and recombine afterwards.

Another way is to just use GNU recode for large files instead of iconv.
It's slower but doesn't need much memory, contrary to iconv that indeed
seems to want the entire stream in memory before proceeding.

 Best regards,
--
 Daniel
 PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org

Re: problem converting database to UTF-8

From
Vladimir Konrad
Date:
> > cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql
> >
> > Size should not matter in this case...
>
> Yeah it does. iconv buffers everything in memory, as I recall.

Just found an alternative - "uconv" command (part of ICU project):

http://www.icu-project.org/userguide/intro.html

(not sure if it is in debian, but it does not buffer entire file when
used in a pipe).

Vlad

Re: problem converting database to UTF-8

From
Jasen Betts
Date:
On 2009-01-22, David Goodenough <david.goodenough@btconnect.com> wrote:
> I have a database which was created as LATIN1 (the machine has the
> wrong locales installed when I set up  PG).  It is running 8.3.
>
> So I found various places which said the way to do this was to do
> a pg_dumpall -f dump_file, get rid of the entire database, init_db -E UTF-8,
> and then psql -f dumpfile.
>
> But the psql fails saying:-
>
> psql:dumpfile:49: ERROR:  encoding LATIN1 does not match server's locale
> en_GB.UTF-8
> DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
>
> I have en_GB.UTF-8 now as my primary locale, and en_GB.ISO8859-1 is
> also generated.
>
> So I looked around again and found people saying I needed to use iconv,
> but that does not help, I get the same error.

> Is there a definative HOWTO that I can follow, if not does someone
> have a set of instructions that will work?

simplest best way is to dump the individual databases and then restore
them individually.

else you need to go theough the dump file and doctor the
CREATE DATABASE lines to use UTF-8 (or not specify an encoding)
whilst making sure that CLIENT_ENCODING remains set to LATIN1
(so that the data from the dump file remains intelligible)

alternately you may be able to use iconv to convert the whole file to
UTF8 and also all ocurrances of LATIN1 in commands with UTF8

The frst thing I'd do is look in the dump file and see what's on line
49.

Re: problem converting database to UTF-8

From
Schwaighofer Clemens
Date:
On Fri, Jan 23, 2009 at 02:18, David Goodenough
<david.goodenough@btconnect.com> wrote:
>
> Is there a definative HOWTO that I can follow, if not does someone
> have a set of instructions that will work?
>
> If it matters I am running under Debian.

I did it once for a very large db (large for me was 5GB) and converted
it from EUC to UTF8.

1) dumped all the data (pg_dump_all) on the source system so it was all EUC
2) split the file into manage able chunks (per LINE not per BYTE if
you work with multibyte things).
3) iconv -f EUC -t UTF8 -c (yes -c because there might be some strange
thing inside and so it doesn't stop)
4) put the files together again
5) sed to replace all EUC to UTF8
6) import into new created db on target system with all set to UTF8

--
[ Clemens Schwaighofer                      -----=====:::::~ ]
[ IT Engineer/Manager                                        ]
[ E-Graphics Communications, TEQUILA\ Japan IT Group         ]
[                6-17-2 Ginza Chuo-ku, Tokyo 104-8167, JAPAN ]
[ Tel: +81-(0)3-3545-7703            Fax: +81-(0)3-3545-7343 ]
[ http://www.tequila.jp                                      ]

Advertising Age Global Agency of the Year 2008
Adweek Global Agency of the Year 2008

This e-mail is intended only for the named person or entity to which
it is addressed and contains valuable business information that is
privileged, confidential and/or otherwise protected from disclosure.
Dissemination, distribution or copying of this e-mail or the
information herein by anyone other than the intended recipient, or
an employee or agent responsible for delivering the message to the
intended recipient, is strictly prohibited.  All contents are the
copyright property of TBWA Worldwide, its agencies or a client of
such agencies. If you are not the intended recipient, you are
nevertheless bound to respect the worldwide legal rights of TBWA
Worldwide, its agencies and its clients. We require that unintended
recipients delete the e-mail and destroy all electronic copies in
their system, retaining no copies in any media.If you have received
this e-mail in error, please immediately notify us via e-mail to
disclaimer@tbwaworld.com.  We appreciate your cooperation.

We make no warranties as to the accuracy or completeness of this
e-mail and accept no liability for its content or use.  Any opinions
expressed in this e-mail are those of the author and do not
necessarily reflect the opinions of TBWA Worldwide or any of its
agencies or affiliates.