Re: [HACKERS] Invalid unicode in COPY problem - Mailing list pgsql-general

From John Hansen
Subject Re: [HACKERS] Invalid unicode in COPY problem
Date
Msg-id 5066E5A966339E42AA04BA10BA706AE50A9310@rodrick.geeknet.com.au
Whole thread Raw
List pgsql-general
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 3:31 PM
> To: John Hansen
> Cc: linux@alteeve.com; pgsql-general@postgresql.org;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
>
> > Tatsuo Ishii wrote:
> > > Sent: Sunday, May 08, 2005 12:01 PM
> > > To: linux@alteeve.com
> > > Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> > >
> > > We have developed patches which relaxes the character
> validation so
> > > that PostgreSQL accepts invalid characters. It works like this:
> >
> > That is just plain 100% wrong!!
> >
> > Under no circumstances should there be invalid data in a database.
> > And if you're trying to make a database of invalid data,
> then at least
> > encode it using a valid encoding.
> >
> > In fact, I've proposed strengthening the validation
> routines for UTF-8.
>
> Actually I myself thought as you are before. Later I found
> that it was not so good idea. People already have invalid
> encoded data in their precious database and have very hard
> time to migrate to newer version of PostgreSQL because of
> encoding validation.
>
> Think about this kind of situation:
>
> There is a table t1(member_id integer primary key,
> member_name text, address text, phone text, email text). I
> have to reach each member by either adress, phone or email.
> Unfortunately some of address field have wrong encoded data.
> In this case I will use phone or email to reach them.
>
> Now I need to upgrade to newer PostgreSQL within 1 day. I
> know I have to fix wrong encoded field but it will take more
> than 1 day. So I would like to import the data first then fix
> wrong encoded field on running database since I can reach
> members by phone or email even with wrong encoded address field...

Actually would be very simple, create function isvalidutf8(text) in your
preferred language.
C source is available from unicode.org.
Create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII -> UTF8 for instance) to
utf-8.
Update table set field=converttoutf8(field) where !isvalidutf8(field);

Now sit back and relax while your invalid data is converted to utf-8.

When done, pg_dump the database, upgrade, and reload.

This should take less than a day.

> I saw this kind of situation in the real world and that's why
> we developed the patches.
> --
> Tatsuo Ishii
>
>

pgsql-general by date:

Previous
From: Tino Wildenhain
Date:
Subject: Re: [HACKERS] Invalid unicode in COPY problem
Next
From: Karsten Hilbert
Date:
Subject: Re: [HACKERS] Invalid unicode in COPY problem