Thread: Differences in UTF8 between 8.0 and 8.1
I've been doing some test imports of UNICODE databases into Postgres 8.1beta3. The only problem I've seen is that some data from 8.0 databases will not import.

I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to restore these results in:

    Invalid UNICODE byte sequence detected near byte ...

Question:

Does the 8.1 Unicode sanity code accept the full set of characters accepted by the 8.0 Unicode sanity code? If not, we'll see a lot of problems like the one above.

I believe this patch is the one causing the problem I see:

    http://www.mail-archive.com/pgsql-patches@postgresql.org/msg08198/unicode.diff

Is there any solution other than scrubbing the entire dataset to conform to the new (8.1) encoding rules?

-- 
Paul Lindner ||||| | | | | | | | | | lindner@inuus.com
On 2005-10-22, Paul Lindner <lindner@inuus.com> wrote:
> I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> restore these results in:
>
>     Invalid UNICODE byte sequence detected near byte ...

What were the exact offending bytes?

> Question:
>
> Does the 8.1 Unicode sanity code accept the full set of characters
> accepted by the 8.0 Unicode sanity code?

No. 8.0 and before accepted a lot of stuff that it should never have, and failed to accept stuff that it should have.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
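[Editorial aside: Andrew's point can be illustrated with a minimal Python sketch. This is not PostgreSQL's validation code; it simply uses the standard library's strict UTF-8 decoder, which applies the same well-formedness rules the tightened 8.1 check enforces. A byte like 0xC1 can only begin an overlong two-byte encoding, which strict UTF-8 forbids, whereas 0xD2 0xB9 is a legitimate two-byte character.]

```python
# Minimal sketch of strict UTF-8 validation via Python's standard
# decoder -- not PostgreSQL's actual code, but the same rules apply.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xC1 can only start an overlong encoding, so strict UTF-8 rejects it:
print(is_valid_utf8(b"\xc1\xf9"))  # False
# 0xD2 0xB9 is a well-formed two-byte sequence (U+04B9):
print(is_valid_utf8(b"\xd2\xb9"))  # True
```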
On Sun, Oct 23, 2005 at 05:56:50AM -0000, Andrew - Supernews wrote:
> On 2005-10-22, Paul Lindner <lindner@inuus.com> wrote:
> > I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> > restore these results in:
> >
> >     Invalid UNICODE byte sequence detected near byte ...
>
> What were the exact offending bytes?

Here's a cut and paste from emacs hexl-mode:

    00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
    00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
    00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  .....', .---.> V
    00000030: 414c 5545 5320 2832 3032 3535 332c 2027  ALUES (202553, '
    00000040: d2b9 272c 200a 3136 3939 3432 6331 3639  ..', .169942c169
    00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833  942.< VALUES (83
    00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a  015, '.....H', .
    00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833  ---.> VALUES (83
    00000080: 3031 352c 2027 c6a4 4827 2c20 0a         015, '..H', .

This is from a minimal diff between a UTF8-scrubbed file and the original dump. It appears the offending bytes are:

    C1 F9 D4 C2 D0 C7

and

    B7 ED A8

> > Question:
> >
> > Does the 8.1 Unicode sanity code accept the full set of characters
> > accepted by the 8.0 Unicode sanity code?
>
> No. 8.0 and before accepted a lot of stuff that it should never have,
> and failed to accept stuff that it should have.

Thanks go out to John Hansen, who recommended running the dump through iconv:

    iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

This seems to strip out invalid UTF8 and will allow for a clean import. Someone should add this to the Release Notes/FAQ.

-- 
Paul Lindner <lindner@inuus.com>
> Thanks go out to John Hansen, who recommended running the dump through
> iconv:
>
>     iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
>
> This seems to strip out invalid UTF8 and will allow for a clean
> import. Someone should add this to the Release Notes/FAQ.

Yes, I think that's extremely important to put in the 'upgrading/backwards compatibility' section for these release notes.

Chris
On 2005-10-24, Paul Lindner <lindner@inuus.com> wrote:
> Here's a cut and paste from emacs hexl-mode:
>
>     00000000: 3530 3833 6335 3038 330a 3c20 5641 4c55  5083c5083.< VALU
>     00000010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4  ES (202553, '...
>     00000020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056  .....', .---.> V
>     00000030: 414c 5545 5320 2832 3032 3535 332c 2027  ALUES (202553, '
>     00000040: d2b9 272c 200a 3136 3939 3432 6331 3639  ..', .169942c169
>     00000050: 3934 320a 3c20 5641 4c55 4553 2028 3833  942.< VALUES (83
>     00000060: 3031 352c 2027 b7ed a8c6 a448 272c 200a  015, '.....H', .
>     00000070: 2d2d 2d0a 3e20 5641 4c55 4553 2028 3833  ---.> VALUES (83
>     00000080: 3031 352c 2027 c6a4 4827 2c20 0a         015, '..H', .
>
> This is from a minimal diff between a UTF8-scrubbed file and the
> original dump.
>
> It appears the offending bytes are:
>
>     C1 F9 D4 C2 D0 C7

I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9 was never actually a valid utf-8 string, and that the d2 b9 is only valid by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0 utf-8 check was broken, but I didn't realize it was quite so bad.

> and
>
>     B7 ED A8

Likewise, that whole sequence b7 ed a8 c6 a4 was probably never valid; c6 a4 also isn't a character you'd expect to find in common use. My guess is that this was data in some non-utf-8 charset that managed to get past the defective checks in 8.0.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
> utf-8 check was broken, but I didn't realize it was quite so bad.

Looking at the data, it appears that it is a sequence of latin1 characters. They all have the eighth bit set and all seem to pass the check. In a million rows I found 2 examples of this.

However, I'm running into another problem now. The command:

    iconv -c -f UTF8 -t UTF8

does strip out the invalid characters. However, iconv reads the entire file into memory before it writes out any data. This is not so good for multi-gigabyte dump files and doesn't allow for it to be used in a pipe between pg_dump and psql.

Anyone have any other recommendations? GNU recode might do it, but I'm a bit stymied by the syntax. A quick perl script using Text::Iconv didn't work either. I'm off to look at some other perl modules and will try to create a script so I can strip out the invalid characters.

-- 
Paul Lindner <lindner@inuus.com>
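[Editorial aside: here is one possible shape for such a streaming filter, sketched in Python rather than Perl. This is a hypothetical illustration, not a drop-in replacement for iconv: an incremental decoder drops invalid bytes chunk by chunk, so it never needs the whole dump in memory and can sit in a pipe between pg_dump and psql.]

```python
import codecs
import io

# Hypothetical streaming scrubber: decode incrementally with
# errors="ignore" so invalid bytes are dropped chunk by chunk instead
# of buffering the whole dump. In a real pipe you would call
# scrub_stream(sys.stdin.buffer, sys.stdout.buffer).
def scrub_stream(src, dst, chunk_size=65536):
    dec = codecs.getincrementaldecoder("utf-8")("ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            # flush any dangling partial character at EOF
            dst.write(dec.decode(b"", final=True).encode("utf-8"))
            return
        dst.write(dec.decode(chunk).encode("utf-8"))

# The invalid bytes from the dump are stripped; the well-formed
# 0xD2 0xB9 sequence survives:
out = io.BytesIO()
scrub_stream(io.BytesIO(b"'\xc1\xf9\xd4\xc2\xd0\xc7\xd2\xb9'"), out)
print(out.getvalue())
```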
> does strip out the invalid characters. However, iconv reads the
> entire file into memory before it writes out any data. This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
>
> Anyone have any other recommendations? GNU recode might do it, but
> I'm a bit stymied by the syntax. A quick perl script using
> Text::Iconv didn't work either. I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.

How about an ugly kludge ...

    split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
    for i in dumpfile*; do iconv -c -f UTF8 -t UTF8 "$i" > "$i.new"; done
    cat dumpfile*.new > new_dump

Cheers,
Andrej
> However, I'm running into another problem now. The command:
>
>     iconv -c -f UTF8 -t UTF8
>
> does strip out the invalid characters. However, iconv reads the
> entire file into memory before it writes out any data. This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
>
> Anyone have any other recommendations? GNU recode might do it, but
> I'm a bit stymied by the syntax. A quick perl script using
> Text::Iconv didn't work either. I'm off to look at some other perl
> modules and will try to create a script so I can strip out the invalid
> characters.

    recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

Chris
Andrej Ricnik-Bay wrote:
> How about an ugly kludge ...
>
>     split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
>     for i in dumpfile*; do iconv -c -f UTF8 -t UTF8 "$i" > "$i.new"; done
>     cat dumpfile*.new > new_dump

Not with UTF-8... You might break in the middle of a multibyte character.

Jeroen
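[Editorial aside: Jeroen's objection applies specifically to byte-count splits. Splitting on line boundaries instead (e.g. `split -l`) cannot cut a character in half, because every byte of a UTF-8 multibyte sequence has its high bit set and so can never collide with the newline byte 0x0A. A small Python illustration of the difference:]

```python
# "café\n日本語\n" encoded as UTF-8; 'é' occupies two bytes.
data = "caf\u00e9\n\u65e5\u672c\u8a9e\n".encode("utf-8")

# A cut at an arbitrary byte offset (what split -b does) can land
# mid-character: byte 4 is inside the two-byte encoding of 'é'.
try:
    data[:4].decode("utf-8")
except UnicodeDecodeError:
    print("broken mid-character, as Jeroen warns")

# Cuts at newline boundaries (what split -l does) always decode:
for line in data.split(b"\n"):
    line.decode("utf-8")  # never raises here
print("line-based pieces are intact")
```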
On 2005-10-27, Paul Lindner <lindner@inuus.com> wrote:
> On Mon, Oct 24, 2005 at 05:07:40AM -0000, Andrew - Supernews wrote:
>> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
>> was never actually a valid utf-8 string, and that the d2 b9 is only valid
>> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
>> utf-8 check was broken, but I didn't realize it was quite so bad.
>
> Looking at the data, it appears that it is a sequence of latin1
> characters. They all have the eighth bit set and all seem to pass the
> check.

In latin1 it comes out as total gibberish, so I think you'll find it is actually in something else. Some googling suggests it is most likely in a Chinese double-byte charset (GB2312).

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
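[Editorial aside: Andrew's guess is easy to sanity-check, since Python's standard codecs include GB2312. The first offending sequence is rejected outright by strict UTF-8, yet its four byte pairs all fall in GB2312's assigned level-1 hanzi range (0xB0A1-0xD7F9) and decode cleanly as four Chinese characters. A sketch, assuming only the bytes quoted above:]

```python
raw = b"\xc1\xf9\xd4\xc2\xd0\xc7\xd2\xb9"

# Strict UTF-8 rejects the sequence outright (0xC1 is never a valid
# lead byte):
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# ...but it decodes cleanly as four GB2312 double-byte characters,
# consistent with the data being mislabeled Chinese text:
hanzi = raw.decode("gb2312")
print(len(hanzi))  # 4
```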
On 10/26/05, Christopher Kings-Lynne <chriskl@familyhealth.com.au> wrote:
> >     iconv -c -f UTF8 -t UTF8
>
>     recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql

I've got a file with characters that pg won't accept that recode does not fix but iconv does. Iconv is fine for my application, so I'm just posting to the list so that anyone looking for why recode didn't work for them will find the suggestion to use iconv.
On Sun, Oct 30, 2005 at 11:49:41AM -0500, Gregory Maxwell wrote:
> On 10/26/05, Christopher Kings-Lynne <chriskl@familyhealth.com.au> wrote:
> > >     iconv -c -f UTF8 -t UTF8
> >
> >     recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql
>
> I've got a file with characters that pg won't accept that recode does
> not fix but iconv does. Iconv is fine for my application, so I'm just
> posting to the list so that anyone looking for why recode didn't work
> for them will find the suggestion to use iconv.

recode did not work for my sample data. It passed through the problem character sequences.

I'm still looking for an iconv that doesn't read the entire file into memory. At this point I'm looking to use the split command to process input in 10000-line chunks. Sadly that can't be used in a pipe.

BTW, how will sites that use Slony deal with this issue?

-- 
Paul Lindner <lindner@inuus.com>