COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence - Mailing list pgsql-bugs

From Steven Schlansker
Subject COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Date
Msg-id 8F72262C-5694-4626-A87F-00604FB5E1D6@trumpet.io
Whole thread Raw
Responses Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
List pgsql-bugs
Hello fine PostgreSQL bug-busters,

I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte,
causingdata corruption in backups and transferred data. 

First, the environment -

 PostgreSQL 8.4.4 on i386-apple-darwin10.3.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc.
build5646) (dot 1), 64-bit 

Mac OS X 10.6.4

[steven@xxx:~]% psql --version
psql (PostgreSQL) 8.4.4
contains support for command-line editing

Now, the setup:
          Name           |       Owner        | Encoding |  Collation  |    Ctype    |   Access privileges   |  Size
|Tablespace |        Description       
baddb                    | xxxxxxx_production | UTF8     | en_US.utf-8 | en_US.utf-8 |                       | 207 MB
|pg_default |  

baddb=> create table badtable (a int, b int, c character varying, d character varying, e character varying, f character
varying[],g text, h character varying[],i character varying[], j character varying[], k character varying[], l
charactervarying[], m character varying[], n character varying[],o character varying, p character varying); 

baddb=> \copy badtable from '/tmp/data.copy'
baddb=> \copy badtable to '/tmp/badness.copy'
baddb=> \copy badtable from '/tmp/badness.copy'
ERROR:  invalid byte sequence for encoding "UTF8": 0xcf2c
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is
controlledby "client_encoding". 
CONTEXT:  COPY badtable, line 1

Obviously, this wouldn't be too helpful without the datafile in question:

1       2377510 FOURSQUARE      1403504 Pizza Hut       {}      \N      {}      {}      {}      {Pizza}
{πίτσα,hut,food,ζωγράφου,pizza,eat,zografou}   {}      \N      \N  \N 

Since this is likely to be eaten by various mail clients or lost in translation, please find attached a TGZ of the data
filein question. 



The hexdump shows exactly what goes wrong:

[steven@xxx:/tmp]% hexdump -C data.copy
00000000  31 09 32 33 37 37 35 31  30 09 46 4f 55 52 53 51  |1.2377510.FOURSQ|
00000010  55 41 52 45 09 31 34 30  33 35 30 34 09 50 69 7a  |UARE.1403504.Piz|
00000020  7a 61 20 48 75 74 09 7b  7d 09 5c 4e 09 7b 7d 09  |za Hut.{}.\N.{}.|
00000030  7b 7d 09 7b 7d 09 7b 50  69 7a 7a 61 7d 09 7b cf  |{}.{}.{Pizza}.{.|
00000040  80 ce af cf 84 cf 83 ce  b1 2c 68 75 74 2c 66 6f  |.........,hut,fo|
00000050  6f 64 2c ce b6 cf 89 ce  b3 cf 81 ce ac cf 86 ce  |od,.............|
00000060  bf cf 85 2c 70 69 7a 7a  61 2c 65 61 74 2c 7a 6f  |...,pizza,eat,zo|
00000070  67 72 61 66 6f 75 7d 09  7b 7d 09 5c 4e 09 5c 4e  |grafou}.{}.\N.\N|
00000080  09 5c 4e 0a                                       |.\N.|
00000084
[steven@xxx:/tmp]% hexdump -C badness.out
00000000  31 09 32 33 37 37 35 31  30 09 46 4f 55 52 53 51  |1.2377510.FOURSQ|
00000010  55 41 52 45 09 31 34 30  33 35 30 34 09 50 69 7a  |UARE.1403504.Piz|
00000020  7a 61 20 48 75 74 09 7b  7d 09 5c 4e 09 7b 7d 09  |za Hut.{}.\N.{}.|
00000030  7b 7d 09 7b 7d 09 7b 50  69 7a 7a 61 7d 09 7b cf  |{}.{}.{Pizza}.{.|
00000040  80 ce af cf 84 cf 83 ce  b1 2c 68 75 74 2c 66 6f  |.........,hut,fo|
00000050  6f 64 2c ce b6 cf 89 ce  b3 cf 81 ce ac cf 86 ce  |od,.............|
00000060  bf cf 2c 70 69 7a 7a 61  2c 65 61 74 2c 7a 6f 67  |..,pizza,eat,zog|
00000070  72 61 66 6f 75 7d 09 7b  7d 09 5c 4e 09 5c 4e 09  |rafou}.{}.\N.\N.|
00000080  5c 4e 0a                                          |\N.|


Note offset 0x62:
00000060  bf cf 85 2c 70 69 7a 7a  61 2c 65 61 74 2c 7a 6f  |...,pizza,eat,zo|
00000060  bf cf 2c 70 69 7a 7a 61  2c 65 61 74 2c 7a 6f 67  |..,pizza,eat,zog|

The 0xCF85 multibyte UTF-8 character was truncated to 0xCF!  When I try to reimport it, it goes in as CF2C (the 2C is
thefollowing comma) which is not valid and matches the error printed by the client. 

I can reproduce this problem using the JDBC driver as well, so I do not think it is a PSQL bug but instead a PostgreSQL
backendproblem.  I could be wrong, of course. 

Interestingly enough, this problem *does not* happen on a different machine -
 PostgreSQL 8.4.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit
(stock Ubuntu package)

I can reliably reproduce it on two different Macs, though.

As with all random email posters, I'm not actually subscribed to the list - please keep me on the CC for replies.

I would very much appreciate help tracking this down!  Thanks for your time :)

Steven Schlansker


Attachment

pgsql-bugs by date:

Previous
From: "Albert Ullrich"
Date:
Subject: BUG #5626: Parallel pg_restore fails with "tuple concurrently updated"
Next
From: Jens Wilke
Date:
Subject: Re: BUG #5623: uuid-ossp contribs fail to compile