Re: Mixed UTF8 / Latin1 database - Mailing list pgsql-general
From | Frank Finner |
---|---|
Subject | Re: Mixed UTF8 / Latin1 database |
Date | |
Msg-id | 20040418184134.16c8931e.postgresql@finner.de |
In response to | Mixed UTF8 / Latin1 database (Claudio Cicali <c.cicali@mclink.it>) |
List | pgsql-general |
On Fri, 16 Apr 2004 14:38:33 +0200 Claudio Cicali <c.cicali@mclink.it> sat down, thought long and then wrote:

> Hi,
>
> I'm trying to restore a pg_dump-backed up database from one
> server to another. The problem is that the db is "mixed encoded"
> in UTF-8 and LATIN1... (weird but, yes it is! It was ported once
> from a hypersonic db... that screwed up something and now I'm
> fighting with that...).
>
> So, trying to restore that db into a UTF-8 encoded new one, gives
> me errors ("invalid unicode character..."), but importing it
> into a LATIN1 encoded one, gives me weird characters (of course).

Hi,

I had a similar problem some months ago. I did it like this (all in one line):

    PGUSER=postgres ssh -C source_server 'PGUSER=postgres pg_dump -c -t table database' | recode latin1..utf8 | psql -a database postgres

I used the well-known UNIX program "recode", which does the job very well. The really nasty thing about this method, though, is that if you run it on a table that already contains UTF-8 encoded characters, they get encoded a second time and come out as garbage (for example "Ã¤" instead of "ä"). So I first tried the restore without recoding to find out which tables caused errors, then recoded only those tables while copying and copied the others as they were. That worked well: all errors were gone afterwards.

If you have mixed tables (tables with Latin1 AND UTF-8), I am afraid you have to do the dirty work by hand, for example with a Perl script that reads the dump and does something like this for every line:

    use strict;
    use warnings;

    # This would be your pg_dump'ed mixed-up file
    open(my $in, '<:raw', '/path/to/input/file') or die "Cannot open input: $!";
    # This should become a clean dump with UTF-8
    open(my $out, '>:raw', '/path/to/output/file') or die "Cannot open output: $!";
    while (my $line = <$in>) {
        # Replace every Latin1 "ä" (byte \xE4) with "\xC3\xA4",
        # the UTF-8 encoding of "ä"
        $line =~ s/\xE4/\xC3\xA4/g;
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close output: $!";

That is, substitute Latin1 characters (only "ä" in this example) with their UTF-8 encodings.
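The decision described above (leave text that is already UTF-8 alone, recode only what is not) can also be automated instead of being found by trial and error. The following is a minimal sketch, not the method from the message: it assumes iconv is installed and uses it in place of recode, the function names is_utf8, table_to_utf8 and mixed_to_utf8 are made up for illustration, and anything that is not valid UTF-8 is assumed to be Latin1.

```shell
#!/bin/sh
# Sketch only: iconv stands in for recode, and all three function names
# are hypothetical helpers, not part of the original message.

# is_utf8 FILE: succeed if FILE is already valid UTF-8 (plain ASCII counts).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# table_to_utf8 FILE: print the dump of one table as UTF-8.
# Valid UTF-8 is passed through untouched, so it is never encoded twice;
# anything else is assumed to be pure Latin1 and is converted.
table_to_utf8() {
    if is_utf8 "$1"; then
        cat "$1"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}

# mixed_to_utf8: the same decision, made per line, reading stdin.
mixed_to_utf8() {
    while IFS= read -r line || [ -n "$line" ]; do
        if fixed=$(printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 2>/dev/null); then
            # The line is already valid UTF-8: keep it as it is.
            printf '%s\n' "$fixed"
        else
            # Not valid UTF-8: treat the whole line as Latin1.
            printf '%s\n' "$line" | iconv -f ISO-8859-1 -t UTF-8
        fi
    done
}
```

For example, "table_to_utf8 table.sql | psql database" would load one per-table dump, and "mixed_to_utf8 < mixed.sql > clean.sql" would repair a dump line by line; the file names are placeholders. The per-line variant starts iconv once or twice per line, so it is slow on large dumps, and a single line that contains both encodings is still converted wholesale as Latin1, so such lines need the manual substitution.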
In German there are only seven non-ASCII characters to replace (äöüÄÖÜß), so it's not too hard, but I am afraid your mileage may vary. You need one substitution line ($line =~ ...) for every Latin1 character that might occur in your dump. After the substitution you can load the dump into the UTF-8 database. Before using the result in production, test whether it is really clean! Well, if you don't get any more "invalid unicode character..." errors, it should be OK.

> I'm wondering if anyone could have a script or something to help me
> with this situation... :(
>
> thanks.

Hope I could help.

> --
> Claudio Cicali
> c.cicali@mclink.it
> http://www.flexer.it
> GPG Key Fingerprint = 2E12 64D5 E5F5 2883 0472 4CFF 3682 E786 555D 25CE
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

Regards, Frank.