Thread: Differences in UTF8 between 8.0 and 8.1

Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
I've been doing some test imports of UNICODE databases into Postgres
8.1beta3.  The only problem I've seen is that some data from 8.0
databases will not import.

I've generated dumps using pg_dump from 8.0 and 8.1.  Attempting to
restore these results in
Invalid UNICODE byte sequence detected near byte ...

Question:

Does the 8.1 Unicode sanity code accept the full set of characters
accepted by the 8.0 Unicode sanity code?

If not, we'll see a lot of problems like the one above.


I believe this patch is the one causing the problem I see:
 http://www.mail-archive.com/pgsql-patches@postgresql.org/msg08198/unicode.diff


Is there any solution other than scrubbing the entire dataset to
conform to the new (8.1) encoding rules?

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com

Re: Differences in UTF8 between 8.0 and 8.1

From
Andrew - Supernews
Date:
On 2005-10-22, Paul Lindner <lindner@inuus.com> wrote:
> I've generated dumps using pg_dump from 8.0 and 8.1.  Attempting to
> restore these results in
>
>  Invalid UNICODE byte sequence detected near byte ...

What were the exact offending bytes?

> Question:
>
> Does the 8.1 Unicode sanity code accept the full set of characters
> accepted by the 8.0 Unicode sanity code?

No. 8.0 and before accepted a lot of stuff that it should never have, and
failed to accept stuff that it should have.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
On Sun, Oct 23, 2005 at 05:56:50AM -0000, Andrew - Supernews wrote:
> On 2005-10-22, Paul Lindner <lindner@inuus.com> wrote:
> > I've generated dumps using pg_dump from 8.0 and 8.1.  Attempting to
> > restore these results in
> >
> >  Invalid UNICODE byte sequence detected near byte ...
>
> What were the exact offending bytes?

Here's a cut and paste from emacs hexl-mode:

00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  .....', .---.> V
00000030: 414c 5545 5320 2832 3032 3535 332c 2027  ALUES (202553, '
00000040: d2b9 272c 200a 3136 3939 3432 6331 3639  ..', .169942c169
00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833  942.< VALUES (83
00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a  015, '.....H', .
00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833  ---.> VALUES (83
00000080: 3031 352c 2027 c6a4 4827 2c20 0a         015, '..H', .

This is from a minimal diff between a UTF8-scrubbed file and the
original dump.

It appears the offending bytes are:
 C1 F9 C2 D0 C7

and
 B7 ED A8
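
For what it's worth, 0xC1 can never begin a well-formed UTF-8 sequence
(any C0/C1 lead byte would produce an overlong two-byte encoding), so a
strict decoder must reject it outright.  A quick sanity check with
iconv (made-up two-byte input; GNU iconv assumed):

```shell
# 0xC1 0xF9 (octal \301 \371): 0xC1 is an illegal UTF-8 lead byte,
# so strict conversion fails; the || branch just makes that visible.
printf '\301\371' | iconv -f UTF8 -t UTF8 || echo "rejected"
```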


> > Question:
> >
> > Does the 8.1 Unicode sanity code accept the full set of characters
> > accepted by the 8.0 Unicode sanity code?
>
> No. 8.0 and before accepted a lot of stuff that it should never have, and
> failed to accept stuff that it should have.

Thanks go out to John Hansen, who recommended running the dump through iconv:
 iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

This seems to strip out invalid UTF8 and will allow for a clean
import.  Someone should add this to the Release Notes/FAQ.
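
A quick way to see what -c does (hypothetical input bytes; GNU iconv
assumed):

```shell
# 'abc', then two bytes that cannot appear in valid UTF-8 (0xC1 0xF9),
# then 'def'.  With -c ("omit invalid characters from output"), iconv
# silently drops the bad bytes and passes the rest through.
printf 'abc\301\371def\n' | iconv -c -f UTF8 -t UTF8
# prints: abcdef
```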


--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com

Re: Differences in UTF8 between 8.0 and 8.1

From
Christopher Kings-Lynne
Date:
> Thanks go out to John Hansen, who recommended running the dump through iconv:
> 
>   iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
> 
> This seems to strip out invalid UTF8 and will allow for a clean
> import.  Someone should add this to the Release Notes/FAQ.

Yes, I think that's extremely important to put in the 'upgrading/backward 
compatibility' section of these release notes.

Chris



Re: Differences in UTF8 between 8.0 and 8.1

From
Andrew - Supernews
Date:
On 2005-10-24, Paul Lindner <lindner@inuus.com> wrote:
> Here's a cut and paste from emacs hexl-mode:
>
> 00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
> 00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
> 00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  .....', .---.> V
> 00000030: 414c 5545 5320 2832 3032 3535 332c 2027  ALUES (202553, '
> 00000040: d2b9 272c 200a 3136 3939 3432 6331 3639  ..', .169942c169
> 00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833  942.< VALUES (83
> 00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a  015, '.....H', .
> 00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833  ---.> VALUES (83
> 00000080: 3031 352c 2027 c6a4 4827 2c20 0a         015, '..H', .
>
> This is from a minimal diff between a UTF8-scrubbed file and the
> original dump.
>
> It appears the offending bytes are:
>
>   C1 F9 C2 D0 C7

I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
was never actually a valid utf-8 string, and that the d2 b9 is only valid
by coincidence (it's a Cyrillic letter from Azerbaijani).  I know the 8.0
utf-8 check was broken, but I didn't realize it was quite so bad.

> and
>
>   B7 ED A8

Likewise, that whole sequence b7 ed a8 c6 a4 was probably never valid;
c6 a4 also isn't a character you'd expect to find in common use.

My guess is that this was data in some non-utf-8 charset that managed to
get past the defective checks in 8.0.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani).  I know the 8.0
> utf-8 check was broken, but I didn't realize it was quite so bad.

Looking at the data, it appears to be a sequence of latin1
characters.  They all have the eighth bit set and all seem to pass the
check.

In a million rows I found 2 examples of this.

However I'm running into another problem now.  The command:
 iconv -c -f UTF8 -t UTF8

does strip out the invalid characters.  However, iconv reads the
entire file into memory before it writes out any data.  This is not so
good for multi-gigabyte dump files and doesn't allow for it to be used
in a pipe between pg_dump and psql.

Anyone have any other recommendations?  GNU recode might do it, but
I'm a bit stymied by the syntax.  A quick perl script using
Text::Iconv didn't work either.  I'm off to look at some other perl
modules and will try to create a script so I can strip out the invalid
characters.

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com

Re: Differences in UTF8 between 8.0 and 8.1

From
Andrej Ricnik-Bay
Date:
> does strip out the invalid characters.  However, iconv reads the
> entire file into memory before it writes out any data.  This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
>
> Anyone have any other recommendations?  GNU recode might do it, but
> I'm a bit stymied by the syntax.  A quick perl script using
> Text::Iconv didn't work either.  I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.
How about an ugly kludge  ...

split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
for i in `ls -1 dumpfile*`; do   iconv -c -f UTF8 -t UTF8 $i;done
cat dumpfile* > new_dump


Cheers,
Andrej


Re: Differences in UTF8 between 8.0 and 8.1

From
Christopher Kings-Lynne
Date:
> However I'm running into another problem now.  The command:
> 
>   iconv -c -f UTF8 -t UTF8 
> 
> does strip out the invalid characters.  However, iconv reads the
> entire file into memory before it writes out any data.  This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
> 
> Anyone have any other recommendations?  GNU recode might do it, but
> I'm a bit stymied by the syntax.  A quick perl script using
> Text::Iconv didn't work either.  I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.

recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

Chris



Re: Differences in UTF8 between 8.0 and 8.1

From
jtv@xs4all.nl
Date:
Andrej Ricnik-Bay wrote:

> How about an ugly kludge  ...
>
> split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
> for i in `ls -1 dumpfile*`; do   iconv -c -f UTF8 -t UTF8 $i;done
> cat dumpfile* > new_dump

Not with UTF-8...  You might break in the middle of a multibyte character.
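
Splitting on line boundaries avoids that, since newline is ASCII and no
multibyte sequence can cross it.  A corrected sketch (it also captures
iconv's output, which the loop above sends to stdout and then discards
when the original chunks are re-catted; the sample dump here is made up):

```shell
# Hypothetical stand-in for the real dump: one clean line, one line
# containing bytes (0xC1 0xF9) that are not valid UTF-8.
printf 'good line\nb\301\371ad line\n' > dumpfile

# -l splits on line counts, so no multibyte character is ever cut in
# half; each scrubbed chunk goes to its own .fixed file, and those
# (not the unscrubbed originals) are concatenated at the end.
split -a 3 -d -l 100000 dumpfile chunk.
for f in chunk.*; do
  iconv -c -f UTF8 -t UTF8 "$f" > "$f.fixed"
done
cat chunk.*.fixed > new_dump
```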


Jeroen




Re: Differences in UTF8 between 8.0 and 8.1

From
Andrew - Supernews
Date:
On 2005-10-27, Paul Lindner <lindner@inuus.com> wrote:
> On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
>> was never actually a valid utf-8 string, and that the d2 b9 is only valid
>> by coincidence (it's a Cyrillic letter from Azerbaijani).  I know the 8.0
>> utf-8 check was broken, but I didn't realize it was quite so bad.
>
> Looking at the data it appears that it is a sequence of latin1
> characters.  They all have the eighth bit set and all seem to pass the
> check.

In latin1 it comes out as total gibberish, so I think you'll find it is
actually in something else. Some googling suggests it is most likely in a
Chinese double-byte charset (GB2312).
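
That theory is easy to spot-check: every pair in the offending run
(c1 f9, d4 c2, d0 c7, d2 b9) is a well-formed GB2312 code.  For example
(GNU iconv assumed; no claim about what the text actually says):

```shell
# The offending byte run from the hex dump, decoded as GB2312 (EUC-CN).
# It converts cleanly, which is consistent with the data having started
# life in a Chinese double-byte charset rather than latin1.
printf '\301\371\324\302\320\307\322\271' | iconv -f GB2312 -t UTF8
```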

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Differences in UTF8 between 8.0 and 8.1

From
Gregory Maxwell
Date:
On 10/26/05, Christopher Kings-Lynne <chriskl@familyhealth.com.au> wrote:
> >   iconv -c -f UTF8 -t UTF8
> recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

I've got a file with characters that pg won't accept; recode does not
fix them, but iconv does.  Iconv is fine for my application, so I'm just
posting to the list so that anyone wondering why recode didn't work
for them will find the suggestion to use iconv.


Re: Differences in UTF8 between 8.0 and 8.1

From
Paul Lindner
Date:
On Sun, Oct 30, 2005 at 11:49:41AM -0500, Gregory Maxwell wrote:
> On 10/26/05, Christopher Kings-Lynne <chriskl@familyhealth.com.au> wrote:
> > >   iconv -c -f UTF8 -t UTF8
> > recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql
>
> I've got a file with characters that pg won't accept; recode does not
> fix them, but iconv does.  Iconv is fine for my application, so I'm just
> posting to the list so that anyone wondering why recode didn't work
> for them will find the suggestion to use iconv.

recode did not work for my sample data; it passed the problem
character sequences through untouched.  I'm still looking for an iconv
that doesn't read the entire file into memory.

At this point I'm looking at using the split command to process input
in 10000-line chunks.  Sadly, that can't be used in a pipe.
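
Though a line-at-a-time loop can be.  Since newline (0x0A) is plain
ASCII, no UTF-8 sequence can span a line boundary, so handing iconv one
line at a time is safe.  A sketch (database names made up; slow, since
it forks one iconv per line):

```shell
# Scrub invalid UTF-8 inside a pipeline, so a multi-gigabyte dump
# never has to sit in memory.
utf8_scrub() {
  while IFS= read -r line; do
    printf '%s\n' "$line" | iconv -c -f UTF8 -t UTF8
  done
}

# usage sketch (hypothetical databases):
#   pg_dump mydb | utf8_scrub | psql newdb
```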

BTW, how will sites that use Slony deal with this issue?

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com