Thread: problem converting database to UTF-8
I have a database which was created as LATIN1 (the machine has the wrong locales installed when I set up PG). It is running 8.3. So I found various places which said the way to do this was to do a pg_dumpall -f dump_file, get rid of the entire database, init_db -E UTF-8, and then psql -f dumpfile. But the psql fails saying:- psql:dumpfile:49: ERROR: encoding LATIN1 does not match server's locale en_GB.UTF-8 DETAIL: The server's LC_CTYPE setting requires encoding UTF8. I have en_GB.UTF-8 now as my primary locale, and en_GB.ISO8859-1 is also generated. So I looked around again and found people saying I needed to use iconv, but that does not help, I get the same error. Is there a definative HOWTO that I can follow, if not does someone have a set of instructions that will work? If it matters I am running under Debian. David
> Is there a definative HOWTO that I can follow, if not does someone > have a set of instructions that will work? What about running "iconv" command on the dumped .sql file and transform it to the utf8? Vlad PS: man iconv for manual
On Thursday 22 January 2009, Vladimir Konrad wrote: > > Is there a definative HOWTO that I can follow, if not does someone > > have a set of instructions that will work? > > What about running "iconv" command on the dumped .sql file and transform > it to the utf8? > > Vlad > > PS: man iconv for manual iconv does not change the database encodings embedded in the file (and it is quite large). Is there no automated procedure. David
> iconv does not change the database encodings embedded in the file > (and it is quite large). Have you read the manual? file A pathname of an input file. If no file operands are specified, or if a file operand is '-', the standard input shall be used. cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql Size should not matter in this case... V
On Thursday 22 January 2009, Vladimir Konrad <vk@dsl.pipex.com> wrote: > > iconv does not change the database encodings embedded in the file > > (and it is quite large). > > Have you read the manual? > > file A pathname of an input file. If no file operands are > specified, or if a file operand is '-', the standard input shall > be used. > > > cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql > > Size should not matter in this case... Yeah it does. iconv buffers everything in memory, as I recall. However, you can "split" the file into manageable pieces, run each through iconv, and recombine afterwards. -- Current Peeve: The mindset that the Internet is some sort of school for novice sysadmins and that everyone -not- doing stupid dangerous things should act like patient teachers with the ones who are. -- Bill Cole, NANAE
On Thursday 22 January 2009, Vladimir Konrad wrote: > > iconv does not change the database encodings embedded in the file > > (and it is quite large). > > Have you read the manual? > > file A pathname of an input file. If no file operands are > specified, or if a file operand is '-', the standard input shall > be used. > > > cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql > > Size should not matter in this case... > > V You have not understood what I said. I ran iconv, and it changes the encoding of the data, but not the ENCODING= statements that are embedded in the datastream. Yes I can change those with sed, but I do not know what else I need to change. There must be an easier way. David
> You have not understood what I said. I ran iconv, and it changes the > encoding of the data, but not the ENCODING= statements that are > embedded in the datastream. Yes I can change those with sed, but > I do not know what else I need to change. There must be an easier > way. Oops, please accept my apologies... Also, it looks that iconv tries to buffer everything in RAM :-(. Vlad
On Thursday 22 January 2009, David Goodenough <david.goodenough@btconnect.com> wrote: > > You have not understood what I said. I ran iconv, and it changes the > encoding of the data, but not the ENCODING= statements that are > embedded in the datastream. Yes I can change those with sed, but > I do not know what else I need to change. There must be an easier > way. There isn't. I'm pretty sure that iconv plus changing the ENCODING statement(s) is all I did when we converted, though. -- Current Peeve: The mindset that the Internet is some sort of school for novice sysadmins and that everyone -not- doing stupid dangerous things should act like patient teachers with the ones who are. -- Bill Cole, NANAE
Alan Hodgson wrote: > Yeah it does. iconv buffers everything in memory, as I recall. > > However, you can "split" the file into manageable pieces, run each through > iconv, and recombine afterwards. Another way is to just use GNU recode for large files instead of iconv. It's slower but doesn't need much memory, contrary to iconv that indeed seems to want the entire stream in memory before proceeding. Best regards, -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
> > cat the-source-dump.sql | iconv -t utf8 - > my-converted.sql > > > > Size should not matter in this case... > > Yeah it does. iconv buffers everything in memory, as I recall. Just found an alternative - "uconv" command (part of ICU project): http://www.icu-project.org/userguide/intro.html (not sure if it is in debian, but it does not buffer entire file when used in a pipe). Vlad
On 2009-01-22, David Goodenough <david.goodenough@btconnect.com> wrote: > I have a database which was created as LATIN1 (the machine has the > wrong locales installed when I set up PG). It is running 8.3. > > So I found various places which said the way to do this was to do > a pg_dumpall -f dump_file, get rid of the entire database, init_db -E UTF-8, > and then psql -f dumpfile. > > But the psql fails saying:- > > psql:dumpfile:49: ERROR: encoding LATIN1 does not match server's locale > en_GB.UTF-8 > DETAIL: The server's LC_CTYPE setting requires encoding UTF8. > > I have en_GB.UTF-8 now as my primary locale, and en_GB.ISO8859-1 is > also generated. > > So I looked around again and found people saying I needed to use iconv, > but that does not help, I get the same error. > Is there a definative HOWTO that I can follow, if not does someone > have a set of instructions that will work? simplest best way is to dump the individual databases and then restore them individually. else you need to go theough the dump file and doctor the CREATE DATABASE lines to use UTF-8 (or not specify an encoding) whilst making sure that CLIENT_ENCODING remains set to LATIN1 (so that the data from the dump file remains intelligible) alternately you may be able to use iconv to convert the whole file to UTF8 and also all ocurrances of LATIN1 in commands with UTF8 The frst thing I'd do is look in the dump file and see what's on line 49.
On Fri, Jan 23, 2009 at 02:18, David Goodenough <david.goodenough@btconnect.com> wrote: > > Is there a definative HOWTO that I can follow, if not does someone > have a set of instructions that will work? > > If it matters I am running under Debian. I did it once for a very large db (large for me was 5GB) and converted it from EUC to UTF8. 1) dumped all the data (pg_dump_all) on the source system so it was all EUC 2) split the file into manage able chunks (per LINE not per BYTE if you work with multibyte things). 3) iconv -f EUC -t UTF8 -c (yes -c because there might be some strange thing inside and so it doesn't stop) 4) put the files together again 5) sed to replace all EUC to UTF8 6) import into new created db on target system with all set to UTF8 -- [ Clemens Schwaighofer -----=====:::::~ ] [ IT Engineer/Manager ] [ E-Graphics Communications, TEQUILA\ Japan IT Group ] [ 6-17-2 Ginza Chuo-ku, Tokyo 104-8167, JAPAN ] [ Tel: +81-(0)3-3545-7703 Fax: +81-(0)3-3545-7343 ] [ http://www.tequila.jp ] Advertising Age Global Agency of the Year 2008 Adweek Global Agency of the Year 2008 This e-mail is intended only for the named person or entity to which it is addressed and contains valuable business information that is privileged, confidential and/or otherwise protected from disclosure. Dissemination, distribution or copying of this e-mail or the information herein by anyone other than the intended recipient, or an employee or agent responsible for delivering the message to the intended recipient, is strictly prohibited. All contents are the copyright property of TBWA Worldwide, its agencies or a client of such agencies. If you are not the intended recipient, you are nevertheless bound to respect the worldwide legal rights of TBWA Worldwide, its agencies and its clients. We require that unintended recipients delete the e-mail and destroy all electronic copies in their system, retaining no copies in any media.If you have received this e-mail in error, please immediately notify us via e-mail to disclaimer@tbwaworld.com. We appreciate your cooperation. We make no warranties as to the accuracy or completeness of this e-mail and accept no liability for its content or use. Any opinions expressed in this e-mail are those of the author and do not necessarily reflect the opinions of TBWA Worldwide or any of its agencies or affiliates.