COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence - Mailing list pgsql-bugs
From | Steven Schlansker |
---|---|
Subject | COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence |
Date | |
Msg-id | 8F72262C-5694-4626-A87F-00604FB5E1D6@trumpet.io Whole thread Raw |
Responses |
Re: COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
|
List | pgsql-bugs |
Hello fine PostgreSQL bug-busters, I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte, causingdata corruption in backups and transferred data. First, the environment - PostgreSQL 8.4.4 on i386-apple-darwin10.3.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build5646) (dot 1), 64-bit Mac OS X 10.6.4 [steven@xxx:~]% psql --version psql (PostgreSQL) 8.4.4 contains support for command-line editing Now, the setup: Name | Owner | Encoding | Collation | Ctype | Access privileges | Size |Tablespace | Description baddb | xxxxxxx_production | UTF8 | en_US.utf-8 | en_US.utf-8 | | 207 MB |pg_default | baddb=> create table badtable (a int, b int, c character varying, d character varying, e character varying, f character varying[],g text, h character varying[],i character varying[], j character varying[], k character varying[], l charactervarying[], m character varying[], n character varying[],o character varying, p character varying); baddb=> \copy badtable from '/tmp/data.copy' baddb=> \copy badtable to '/tmp/badness.copy' baddb=> \copy badtable from '/tmp/badness.copy' ERROR: invalid byte sequence for encoding "UTF8": 0xcf2c HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlledby "client_encoding". CONTEXT: COPY badtable, line 1 Obviously, this wouldn't be too helpful without the datafile in question: 1 2377510 FOURSQUARE 1403504 Pizza Hut {} \N {} {} {} {Pizza} {πίτσα,hut,food,ζωγράφου,pizza,eat,zografou} {} \N \N \N Since this is likely to be eaten by various mail clients or lost in translation, please find attached a TGZ of the data filein question. The hexdump shows exactly what goes wrong: [steven@xxx:/tmp]% hexdump -C data.copy 00000000 31 09 32 33 37 37 35 31 30 09 46 4f 55 52 53 51 |1.2377510.FOURSQ| 00000010 55 41 52 45 09 31 34 30 33 35 30 34 09 50 69 7a |UARE.1403504.Piz| 00000020 7a 61 20 48 75 74 09 7b 7d 09 5c 4e 09 7b 7d 09 |za Hut.{}.\N.{}.| 00000030 7b 7d 09 7b 7d 09 7b 50 69 7a 7a 61 7d 09 7b cf |{}.{}.{Pizza}.{.| 00000040 80 ce af cf 84 cf 83 ce b1 2c 68 75 74 2c 66 6f |.........,hut,fo| 00000050 6f 64 2c ce b6 cf 89 ce b3 cf 81 ce ac cf 86 ce |od,.............| 00000060 bf cf 85 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f |...,pizza,eat,zo| 00000070 67 72 61 66 6f 75 7d 09 7b 7d 09 5c 4e 09 5c 4e |grafou}.{}.\N.\N| 00000080 09 5c 4e 0a |.\N.| 00000084 [steven@xxx:/tmp]% hexdump -C badness.out 00000000 31 09 32 33 37 37 35 31 30 09 46 4f 55 52 53 51 |1.2377510.FOURSQ| 00000010 55 41 52 45 09 31 34 30 33 35 30 34 09 50 69 7a |UARE.1403504.Piz| 00000020 7a 61 20 48 75 74 09 7b 7d 09 5c 4e 09 7b 7d 09 |za Hut.{}.\N.{}.| 00000030 7b 7d 09 7b 7d 09 7b 50 69 7a 7a 61 7d 09 7b cf |{}.{}.{Pizza}.{.| 00000040 80 ce af cf 84 cf 83 ce b1 2c 68 75 74 2c 66 6f |.........,hut,fo| 00000050 6f 64 2c ce b6 cf 89 ce b3 cf 81 ce ac cf 86 ce |od,.............| 00000060 bf cf 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f 67 |..,pizza,eat,zog| 00000070 72 61 66 6f 75 7d 09 7b 7d 09 5c 4e 09 5c 4e 09 |rafou}.{}.\N.\N.| 00000080 5c 4e 0a |\N.| Note offset 0x62: 00000060 bf cf 85 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f |...,pizza,eat,zo| 00000060 bf cf 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f 67 |..,pizza,eat,zog| The 0xCF85 multibyte UTF-8 character was truncated to 0xCF! When I try to reimport it, it goes in as CF2C (the 2C is thefollowing comma) which is not valid and matches the error printed by the client. I can reproduce this problem using the JDBC driver as well, so I do not think it is a PSQL bug but instead a PostgreSQL backendproblem. I could be wrong, of course. Interestingly enough, this problem *does not* happen on a different machine - PostgreSQL 8.4.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit (stock Ubuntu package) I can reliably reproduce it on two different Macs, though. As with all random email posters, I'm not actually subscribed to the list - please keep me on the CC for replies. I would very much appreciate help tracking this down! Thanks for your time :) Steven Schlansker
Attachment
pgsql-bugs by date: