Re: Mixed UTF8 / Latin1 database - Mailing list pgsql-general

From	Frank Finner
Subject	Re: Mixed UTF8 / Latin1 database
Date	April 18, 2004 16:41:48
Msg-id	20040418184134.16c8931e.postgresql@finner.de
In response to	Mixed UTF8 / Latin1 database (Claudio Cicali <c.cicali@mclink.it>)
List	pgsql-general

On Fri, 16 Apr 2004 14:38:33 +0200 Claudio Cicali <c.cicali@mclink.it> sat down, thought long and
then wrote:

> Hi,
> 
> I'm trying to restore a pg_dump-backed up database from one
> server to another. The problem is that the db is "mixed encoded"
> in UTF-8 and LATIN1... (weird but, yes it is ! It was ported once
> from a hypersonic db... that screwed up something and now I'm
> fighting with that...).
> 
> So, trying to restore that db into a UTF-8 encoded new one, gives
> me errors ("invalid unicode character..."), but importing it
> into a LATIN1 encoded one, gives me weird characters (of course).

Hi,

I had a similar problem some months ago. I did it like this (all on one line):

PGUSER=postgres ssh -C source_server 'PGUSER=postgres pg_dump -c -t table database' | recode latin1..utf8 | psql -a database postgres

I used the well-known UNIX program "recode", which does the job very well. The really nasty thing
about this method is that if you run it on a table that already contains UTF-8 encoded characters,
they get encoded a second time and come out as garbage (for example "Ã¤" instead of "ä"). So I
first tried the copy without recoding to find out which tables caused errors, then recoded only
those tables while copying the others as they were. This was quite successful: all errors were
gone afterwards.
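
The per-table decision can also be made before loading anything. This is a minimal sketch, assuming each table was dumped to its own file and that iconv is installed (it stands in here for recode). The function names is_valid_utf8 and dump_to_utf8 are hypothetical:

```shell
# Hypothetical helper: succeeds only if the file is already valid UTF-8.
# iconv exits non-zero when it meets a byte sequence that is not valid UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: write a UTF-8 version of a per-table dump to stdout.
# Dumps that are already valid UTF-8 pass through untouched; anything else
# is ASSUMED to be pure Latin1 and is recoded as a whole.
dump_to_utf8() {
  if is_valid_utf8 "$1"; then
    cat "$1"
  else
    iconv -f ISO-8859-1 -t UTF-8 "$1"
  fi
}
```

Something like "dump_to_utf8 table.sql | psql -a database postgres" would then load one table. A table that mixes both encodings fails the check and would be double-encoded, so this only covers tables that are entirely in one encoding.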

If you have mixed tables (tables with Latin1 AND UTF-8), I am afraid you have to do the dirty work
by hand, for example with a Perl script that reads the dump and does something like this for every
line:

open(my $in,  '<', '/path/to/input/file')  or die "Cannot read input: $!";   # your pg_dump'ed mixed-up file
open(my $out, '>', '/path/to/output/file') or die "Cannot write output: $!"; # becomes a clean UTF-8 dump
while (my $line = <$in>)
{
  # Replace the Latin1 byte for "ä" (\xE4) with its UTF-8 encoding (\xC3\xA4).
  # Existing UTF-8 "ä" sequences do not contain \xE4, so they are left alone.
  $line =~ s/\xE4/\xC3\xA4/g;
  print $out $line;
}
close $in;
close $out or die "Cannot close output: $!";

That is, replace each Latin1 character (only "ä" in this example) with its UTF-8 encoding. In
German there are only 7 of them (äöüÄÖÜß), so it's not too hard, but I am afraid your mileage may
vary. You need one substitution line ($line =~ ...) for every Latin1 character that might occur in
your dump. After the substitution you can read the dump into the UTF-8 database.
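
If every line of the dump is entirely in one encoding or the other, the per-character substitutions can be avoided. This is a rough shell sketch, assuming iconv is installed (it is not the recode tool used above). The function name fix_mixed_lines is hypothetical. Lines that already validate as UTF-8 pass through unchanged; all other lines are treated as Latin1 and converted:

```shell
# Hypothetical helper: read a dump on stdin, write a UTF-8 dump to stdout.
# The body runs in a subshell so the locale change does not leak out.
fix_mixed_lines() (
  export LC_ALL=C   # treat each line as raw bytes
  while IFS= read -r line || [ -n "$line" ]; do
    if printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      # Already valid UTF-8 (plain ASCII counts too): keep as is.
      printf '%s\n' "$line"
    else
      # Not valid UTF-8: ASSUME the whole line is Latin1 and convert it.
      printf '%s\n' "$line" | iconv -f ISO-8859-1 -t UTF-8
    fi
  done
)
```

It could be run as "fix_mixed_lines < /path/to/input/file > /path/to/output/file". Starting iconv for every line is slow on large dumps, and a single line that contains both encodings would still be double-encoded, so this does not replace the manual approach in that case.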

Before using the result in production, test whether it is really clean! If you don't get any more
"invalid unicode character..." errors, it should be OK.

> 
> I'm wondering if anyone could have a script or something to help me
> with this situation... :(
> 
> thanks.

Hope I could help.

> 
> 
> 
> -- 
> Claudio Cicali
> c.cicali@mclink.it
> http://www.flexer.it
> GPG Key Fingerprint = 2E12 64D5 E5F5 2883 0472 4CFF 3682 E786 555D 25CE
> 

Regards, Frank.
