Thread: pg_dump, pg_restore and UTF8: invalid byte sequence

pg_dump, pg_restore and UTF8: invalid byte sequence

From
Date:
hi list
 
while trying to get pgsql 8.2b1 running, i'm already thumbling at the first steps...
source pgslq is 8.1.5 (win32, utf8) on localhost:5432, target is 8.2b1 (win32, utf8) on localhost:5433
 
 
to get the data from one instance to the other, i'm trying to use pg_dump & pg_restore:
 
C:\pgsql8.2beta1\bin\pg_dump.exe -Fc -Z 5 -U postgres -p 5432 -f d:\mydb.backup mydb
C:\pgsql8.2beta1\bin\pg_restore -d mydb -p 5433 -U postgres d:\mydb.backup
 
first of all, pg_restore does not seem to like if the destination database does not yet exist.
if you omit "-d mydb", pg_restore just dumps the mydb.backup content to the command line instead of restoring the data.
the parameter "-C" doesn't do its job either. the only way around this seems to manually create the target table before running pg_restore, despite CREATE DATABASE... is present in the mydb.backup. shouldn't pg_restore be able to create the db for me, with the exact same settings as in the dump?
 
when creating the db manually beforehand, i'm getting over 130 errors (mostly due to TSearch2 & co functions already present). the manual suggests to create the db from template0, but doing so will break restoration of tsearch2 (missing function errors appear) objects.
 
 
the other, more severe problem is not all UTF8-data being properly encoded/decoded. during the restore i'm getting this error:
 
pg_restore: [archiver (db)] Error from TOC entry 3430; 0 4131599 TABLE DATA posts postgres
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe3b66e
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY posts, line 1
 
as a result, the entire table is empty! shouldn't pg_dump encode the utf8 bytesequences?
 
 
also, regarding pg_restore, its quite troubling it has the same parameter-set as pg_dump. even if its probably the same tool in the background, shouldn't pg_restore hide the parameters that are just for dumping data and only showing the ones relevant for data restoring?
 
thanks,
thomas

Re: pg_dump, pg_restore and UTF8: invalid byte sequence

From
Date:
> shouldn't pg_dump encode the utf8 bytesequences?

at least i found out why the invalid unicode sequences appear in the first
place: tsearch2 in 8.1 doesn't properly handle utf8 characters: the
character's 2-byte representation is converted to lowercase byte for byte.
for example: "ä" which is encoded as "ä" is written to the db by tsearch2
as "ã¤" which is an invalid utf8 byte sequence.

striping the ts2 index columb before dumping fixes the encoding problems. i
guess the 8.2 -> 8.1.5 backport should fix it as well, i'll try asap.

> also, regarding pg_restore, its quite troubling it has the same
> parameter-set as pg_dump

never mind this, it is too late in the evening 8-)

- thomas