Trouble with Mixed UTF-8 and Latin1 data - Mailing list pgsql-general

From valgog
Subject Trouble with Mixed UTF-8 and Latin1 data
Date
Msg-id 1e1eb9e0-69cf-4783-9c65-911c46f543e9@s12g2000prg.googlegroups.com
Whole thread Raw
Responses Re: Trouble with Mixed UTF-8 and Latin1 data
List pgsql-general
Hi,

we have a 8.2.1 database that has a Latin1 encoding.

We managed to write there UTF-8 data (not Latin1) and, as this version
of the database, was actually allowing everything to be written to the
Latin1 database, by now we have a problem of having data in different
encoding on that database. I know it looks like a mess, but we could
not do anything against that by now.

To normalize the strings on the side of our Java frontend I have
written a special function in Java, that takes a string (normally
fetched from the database with JDBC connection), that can be Latin1 or
several times converted into UTF-8 and brings it to the native java
encoding (UTF-16) (trying to  convert back to Latin1 several times
until the conversion fails and taking a valid string).

Now, as we want to migrate the database to 8.3, all that text with
mixed encoding cannon be imported into the new database (as far as I
remember starting from version 8.2.3 or so). and my task is to
efficiently update the database so, that it contains only correctly
UTF-8 encoded texts (though being Latin1 database), and then dump it
and import into the new 8.3 instance.

Unfortunately using iconv on the dump file will not work, as it will
either corrupt my Latin1 data that is not compatible with UTF-8 or
convert texts that are already in UTF-8 twice. None of these
possibilities are good, as Latin1 encoded data is about 20% of total
amount of text data in that database.

One idea is to write the function, that will normalize the data to
UTF-8 in PL/pgSQL (that I could not do from the first try) or to write
such a procedure in C (that I have never done yet for Postgres) and
then to update the fields, that have different values.... in that case
I have another question:

Is it possible to directly access and update TOAST data for all the
tables?

Maybe there is a possibility to switch the checking of the correctness
of the UTF-8 encoding off and enable restoration of the corrupted
data, and deal with it later on the new database already (having
better update performance I hope).

pgsql-general by date:

Previous
From: "Bryan Murphy"
Date:
Subject: full text index and most frequently used words
Next
From: luca.ciciriello@email.it
Date:
Subject: Re: hyperthreading and pqlib