Thread: Differences in UTF8 between 8.0 and 8.1

Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
I've been doing some test imports of UNICODE databases into Postgres
8.1beta3.  The only problem I've seen is that some data from 8.0
databases will not import.

I've generated dumps using pg_dump from 8.0 and 8.1.  Attempting to
restore these results in
Invalid UNICODE byte sequence detected near byte ...

Question:

Does the 8.1 Unicode sanity code accept the full set of characters
accepted by the 8.0 Unicode sanity code?

If not, we'll see a lot of problems like the one above.


I believe this patch is the one causing the problem I see:
 http://www.mail-archive.com/pgsql-patches@postgresql.org/msg08198/unicode.diff


Is there any solution other than scrubbing the entire dataset to
conform to the new (8.1) encoding rules?

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com

Re: Differences in UTF8 between 8.0 and 8.1

From
Andrew - Supernews
Date:
On 2005-10-22, Paul Lindner <lindner@inuus.com> wrote:
> I've generated dumps using pg_dump from 8.0 and 8.1.  Attempting to
> restore these results in
>
>  Invalid UNICODE byte sequence detected near byte ...

What were the exact offending bytes?

> Question:
>
> Does the 8.1 Unicode sanity code accept the full set of characters
> accepted by the 8.0 Unicode sanity code?

No. 8.0 and before accepted a lot of stuff that it should never have, and
failed to accept stuff that it should have.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
On Sun, Oct 23, 2005 at 05:56:50AM -0000, Andrew - Supernews wrote:
> On 2005-10-22, Paul Lindner <lindner@inuus.com> wrote:
> > I've generated dumps using pg_dump from 8.0 and 8.1.  Attempting to
> > restore these results in
> >
> >  Invalid UNICODE byte sequence detected near byte ...
>
> What were the exact offending bytes?

Here's a cut and paste from emacs hexl-mode:

00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  .....', .---.> V
00000030: 414c 5545 5320 2832 3032 3535 332c 2027  ALUES (202553, '
00000040: d2b9 272c 200a 3136 3939 3432 6331 3639  ..', .169942c169
00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833  942.< VALUES (83
00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a  015, '.....H', .
00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833  ---.> VALUES (83
00000080: 3031 352c 2027 c6a4 4827 2c20 0a         015, '..H', .

This is from a minimal diff between a UTF8-scrubbed file and the
original dump.

It appears the offending bytes are:
 C1 F9 C2 D0 C7

and
 B7 ED A8
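
For what it's worth, 0xC1 can never begin a well-formed UTF-8 sequence
(any C0/C1 lead byte would produce an overlong two-byte encoding), so a
strict decoder must reject it outright.  A quick sanity check with
iconv (made-up two-byte input; GNU iconv assumed):

```shell
# 0xC1 0xF9 (octal \301 \371): 0xC1 is an illegal UTF-8 lead byte,
# so strict conversion fails; the || branch just makes that visible.
printf '\301\371' | iconv -f UTF8 -t UTF8 || echo "rejected"
```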


> > Question:
> >
> > Does the 8.1 Unicode sanity code accept the full set of characters
> > accepted by the 8.0 Unicode sanity code?
>
> No. 8.0 and before accepted a lot of stuff that it should never have, and
> failed to accept stuff that it should have.

Thanks go out to John Hansen, who recommended running the dump through iconv:
 iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

This seems to strip out invalid UTF8 and will allow for a clean
import.  Someone should add this to the Release Notes/FAQ.
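
A quick way to see what -c does (hypothetical input bytes; GNU iconv
assumed):

```shell
# 'abc', then two bytes that cannot appear in valid UTF-8 (0xC1 0xF9),
# then 'def'.  With -c ("omit invalid characters from output"), iconv
# silently drops the bad bytes and passes the rest through.
printf 'abc\301\371def\n' | iconv -c -f UTF8 -t UTF8
# prints: abcdef
```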


--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com

Re: Differences in UTF8 between 8.0 and 8.1

From
Christopher Kings-Lynne
Date:
> Thanks go out to John Hansen, who recommended running the dump through iconv:
> 
>   iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
> 
> This seems to strip out invalid UTF8 and will allow for a clean
> import.  Someone should add this to the Release Notes/FAQ.

Yes, I think that's extremely important to put in the 'upgrading/backward 
compatibility' section of these release notes.

Chris



Re: Differences in UTF8 between 8.0 and 8.1

From
Andrew - Supernews
Date:
On 2005-10-24, Paul Lindner <lindner@inuus.com> wrote:
> Here's a cut and paste from emacs hexl-mode:
>
> 00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
> 00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
> 00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  .....', .---.> V
> 00000030: 414c 5545 5320 2832 3032 3535 332c 2027  ALUES (202553, '
> 00000040: d2b9 272c 200a 3136 3939 3432 6331 3639  ..', .169942c169
> 00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833  942.< VALUES (83
> 00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a  015, '.....H', .
> 00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833  ---.> VALUES (83
> 00000080: 3031 352c 2027 c6a4 4827 2c20 0a         015, '..H', .
>
> This is from a minimal diff between a UTF8-scrubbed file and the
> original dump.
>
> It appears the offending bytes are:
>
>   C1 F9 C2 D0 C7

I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
was never actually a valid utf-8 string, and that the d2 b9 is only valid
by coincidence (it's a Cyrillic letter from Azerbaijani).  I know the 8.0
utf-8 check was broken, but I didn't realize it was quite so bad.

> and
>
>   B7 ED A8

Likewise, that whole sequence b7 ed a8 c6 a4 was probably never valid;
c6 a4 also isn't a character you'd expect to find in common use.

My guess is that this was data in some non-utf-8 charset that managed to
get past the defective checks in 8.0.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani).  I know the 8.0
> utf-8 check was broken, but I didn't realize it was quite so bad.

Looking at the data, it appears to be a sequence of latin1
characters.  They all have the eighth bit set and all seem to pass the
check.

In a million rows I found 2 examples of this.

However I'm running into another problem now.  The command:
 iconv -c -f UTF8 -t UTF8

does strip out the invalid characters.  However, iconv reads the
entire file into memory before it writes out any data.  This is not so
good for multi-gigabyte dump files and doesn't allow for it to be used
in a pipe between pg_dump and psql.

Anyone have any other recommendations?  GNU recode might do it, but
I'm a bit stymied by the syntax.  A quick perl script using
Text::Iconv didn't work either.  I'm off to look at some other perl
modules and will try to create a script so I can strip out the invalid
characters.

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com

Re: Differences in UTF8 between 8.0 and 8.1

From
Andrej Ricnik-Bay
Date:
> does strip out the invalid characters.  However, iconv reads the
> entire file into memory before it writes out any data.  This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
>
> Anyone have any other recommendations?  GNU recode might do it, but
> I'm a bit stymied by the syntax.  A quick perl script using
> Text::Iconv didn't work either.  I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.
How about an ugly kludge  ...

split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
for i in `ls -1 dumpfile*`; do   iconv -c -f UTF8 -t UTF8 $i;done
cat dumpfile* > new_dump


Cheers,
Andrej


Re: Differences in UTF8 between 8.0 and 8.1

From
Christopher Kings-Lynne
Date:
> However I'm running into another problem now.  The command:
> 
>   iconv -c -f UTF8 -t UTF8 
> 
> does strip out the invalid characters.  However, iconv reads the
> entire file into memory before it writes out any data.  This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
> 
> Anyone have any other recommendations?  GNU recode might do it, but
> I'm a bit stymied by the syntax.  A quick perl script using
> Text::Iconv didn't work either.  I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.

recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

Chris



Re: Differences in UTF8 between 8.0 and 8.1

From
jtv@xs4all.nl
Date:
Andrej Ricnik-Bay wrote:

> How about an ugly kludge  ...
>
> split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
> for i in `ls -1 dumpfile*`; do   iconv -c -f UTF8 -t UTF8 $i;done
> cat dumpfile* > new_dump

Not with UTF-8...  You might break in the middle of a multibyte character.
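
Splitting on line boundaries avoids that, since newline is ASCII and no
multibyte sequence can cross it.  A corrected sketch (it also captures
iconv's output, which the loop above sends to stdout and then discards
when the original chunks are re-catted; the sample dump here is made up):

```shell
# Hypothetical stand-in for the real dump: one clean line, one line
# containing bytes (0xC1 0xF9) that are not valid UTF-8.
printf 'good line\nb\301\371ad line\n' > dumpfile

# -l splits on line counts, so no multibyte character is ever cut in
# half; each scrubbed chunk goes to its own .fixed file, and those
# (not the unscrubbed originals) are concatenated at the end.
split -a 3 -d -l 100000 dumpfile chunk.
for f in chunk.*; do
  iconv -c -f UTF8 -t UTF8 "$f" > "$f.fixed"
done
cat chunk.*.fixed > new_dump
```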


Jeroen




Re: Differences in UTF8 between 8.0 and 8.1

From
Andrew - Supernews
Date:
On 2005-10-27, Paul Lindner <lindner@inuus.com> wrote:
> On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
>> was never actually a valid utf-8 string, and that the d2 b9 is only valid
>> by coincidence (it's a Cyrillic letter from Azerbaijani).  I know the 8.0
>> utf-8 check was broken, but I didn't realize it was quite so bad.
>
> Looking at the data it appears that it is a sequence of latin1
> characters.  They all have the eighth bit set and all seem to pass the
> check.

In latin1 it comes out as total gibberish, so I think you'll find it is
actually in something else. Some googling suggests it is most likely in a
Chinese double-byte charset (GB2312).
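
That theory is easy to spot-check: every pair in the offending run
(c1 f9, d4 c2, d0 c7, d2 b9) is a well-formed GB2312 code.  For example
(GNU iconv assumed; no claim about what the text actually says):

```shell
# The offending byte run from the hex dump, decoded as GB2312 (EUC-CN).
# It converts cleanly, which is consistent with the data having started
# life in a Chinese double-byte charset rather than latin1.
printf '\301\371\324\302\320\307\322\271' | iconv -f GB2312 -t UTF8
```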

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Differences in UTF8 between 8.0 and 8.1

From
Gregory Maxwell
Date:
On 10/26/05, Christopher Kings-Lynne <chriskl@familyhealth.com.au> wrote:
> >   iconv -c -f UTF8 -t UTF8
> recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

I've got a file with characters that pg won't accept; recode does not
fix them, but iconv does.  Iconv is fine for my application, so I'm just
posting to the list so that anyone wondering why recode didn't work
for them will find the suggestion to use iconv.


Re: Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
On Sun, Oct 30, 2005 at 11:49:41AM -0500, Gregory Maxwell wrote:
> On 10/26/05, Christopher Kings-Lynne <chriskl@familyhealth.com.au> wrote:
> > >   iconv -c -f UTF8 -t UTF8
> > recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql
>
> I've got a file with characters that pg won't accept; recode does not
> fix them, but iconv does.  Iconv is fine for my application, so I'm just
> posting to the list so that anyone wondering why recode didn't work
> for them will find the suggestion to use iconv.

recode did not work for my sample data; it passed the problem
character sequences through untouched.  I'm still looking for an iconv
that doesn't read the entire file into memory.

At this point I'm looking at using the split command to process input
in 10000-line chunks.  Sadly, that can't be used in a pipe.
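
Though a line-at-a-time loop can be.  Since newline (0x0A) is plain
ASCII, no UTF-8 sequence can span a line boundary, so handing iconv one
line at a time is safe.  A sketch (database names made up; slow, since
it forks one iconv per line):

```shell
# Scrub invalid UTF-8 inside a pipeline, so a multi-gigabyte dump
# never has to sit in memory.
utf8_scrub() {
  while IFS= read -r line; do
    printf '%s\n' "$line" | iconv -c -f UTF8 -t UTF8
  done
}

# usage sketch (hypothetical databases):
#   pg_dump mydb | utf8_scrub | psql newdb
```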

BTW, how will sites that use Slony deal with this issue?

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com