Thread: Mixed UTF8 / Latin1 database

Mixed UTF8 / Latin1 database

From
Claudio Cicali
Date:
Hi,

I'm trying to restore a pg_dump-backed up database from one
server to another. The problem is that the db is "mixed encoded"
in UTF-8 and LATIN1... (weird but, yes it is ! It was ported once
from a hypersonic db... that screwed up something and now I'm
fighting with that...).

So, trying to restore that db into a UTF-8 encoded new one, gives
me errors ("invalid unicode character..."), but importing it
into a LATIN1 econcoded one, gives me weird characters (of course).

I'm wondering if anyone could have a script or something to help me
with this situation... :(

thanks.



--
Claudio Cicali
c.cicali@mclink.it
http://www.flexer.it
GPG Key Fingerprint = 2E12 64D5 E5F5 2883 0472 4CFF 3682 E786 555D 25CE

Re: Mixed UTF8 / Latin1 database

From
Jean-Michel POURE
Date:
> I'm wondering if anyone could have a script or something to help me
> with this situation... :(

Knowing that Unicode is composed of plain ASCII characters,
you may perform a conversion test using PHP "recode" function.

You may test each record as follows:

$test = recode ("latin1..u8", $record);

If the $test value differs from the $record one, then it is a Latin1 string.

On the converse, it is Unicode UT-8. There is no garantee. Make a test on a
few values, I am not coding right now and only have access to email.

Cheers,
Jean-Michel


Re: Mixed UTF8 / Latin1 database

From
Markus Bertheau
Date:
В Птн, 16.04.2004, в 16:26, Jean-Michel POURE пишет:
> > I'm wondering if anyone could have a script or something to help me
> > with this situation... :(
>
> Knowing that Unicode is composed of plain ASCII characters,

It's not, Unicode is just a number to glyph mapping, and UTF-8, what you
probably mean, uses the eighth bit too, which cannot be said of ASCII -
ASCII is a 7 bit character set.
> you may perform a conversion test using PHP "recode" function.
>
> You may test each record as follows:
>
> $test = recode ("latin1..u8", $record);
>
> If the $test value differs from the $record one, then it is a Latin1 string.

Unfortunately unless $record is all ASCII $test and $record will always
differ, because every byte stream is valid latin1 and can thus be
converted to UTF-8.

What you can do is to manually replace ä ü ö § and all other non-ASCII
byte values with their UTF-8 equivalent and then go through the data
manually to check it for correctness.

Another hint you can get at in your program is to try to check if your
input is valid UTF-8 and only convert from latin1 if it is not. You can
probably check for valid UTF-8 by seeing if the conversion from UTF-8 to
UTF-8 succeeds.

--
Markus Bertheau <twanger@bluetwanger.de>


Re: Mixed UTF8 / Latin1 database

From
Frank Finner
Date:
On Fri, 16 Apr 2004 14:38:33 +0200 Claudio Cicali <c.cicali@mclink.it> sat down, thought long and
then wrote:

> Hi,
> 
> I'm trying to restore a pg_dump-backed up database from one
> server to another. The problem is that the db is "mixed encoded"
> in UTF-8 and LATIN1... (weird but, yes it is ! It was ported once
> from a hypersonic db... that screwed up something and now I'm
> fighting with that...).
> 
> So, trying to restore that db into a UTF-8 encoded new one, gives
> me errors ("invalid unicode character..."), but importing it
> into a LATIN1 econcoded one, gives me weird characters (of course).

Hi,

I had a similiar problem some months ago. I did it like this (all in one line):

PGUSER=postgres ssh -C source_server 'PGUSER=postgres pg_dump -c -t table database'|recode
latin1..utf8|psql -a database postgres

I used the well known UNIX program "recode", which does the job very well. But, the really nasty
thing about this method is, that, if you treat a table that contains already UTF-8 encoded
characters, they will be encoded again to something that is no valid encoding at all. So I first
tried it without recoding, finding out which tables caused errors, then did the job with recoding
only these tables while copying and copying the others like they were. I was quite successful, all
errors had been extinguished afterwards.

If you have mixed tables (tables with Latin1 AND UTF8), I am afraid you have to do the dirty work by
hand, for example, use a Perl script, that reads the dump and does for every line something like

open (INFILE, "< /path/to/input/file"); # This would be your pg_dump´ed mixed up file
open (OUTFILE, "> /path/to/output/file"); # This should become a clean dump with UTF-8
while (<INFILE>)
{
  $line=$_;
  $line =~ s/ä/\x84/g; # substitutes every "ä" by "\x84" with "\x84" as UTF-8 encoding of "ä"
  print OUTFILE "$line";
}
close INFILE;
close OUTFILE;

this means, substitute Latin1 characters (only "ä" in this example) by UTF-8 characters. In German,
there are only 7 of them(äöüÄÖÜß), so it´s not too hard, but I am afraid, your mileage may vary. You
should use a substitution line ($line =~ ...) for every Latin1 character which might occur in your
dump. After substitution you can read in the dump into the UTF-8 database.

Before using the result in production, test, if it is really clean! Well, if you don´t get any more
"invalid unicode character...", it should be OK.

> 
> I'm wondering if anyone could have a script or something to help me
> with this situation... :(
> 
> thanks.

Hope I could help.

> 
> 
> 
> -- 
> Claudio Cicali
> c.cicali@mclink.it
> http://www.flexer.it
> GPG Key Fingerprint = 2E12 64D5 E5F5 2883 0472 4CFF 3682 E786 555D 25CE
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

Regards, Frank.