invalid UTF-8 byte sequence detected - Mailing list pgsql-general
From | Markus Wollny |
---|---|
Subject | invalid UTF-8 byte sequence detected |
Date | |
Msg-id | 28011CD60FB1724DBA4442E38277F6262857D2@hermes.computec.de Whole thread Raw |
Responses |
Re: invalid UTF-8 byte sequence detected
|
List | pgsql-general |
Hello! I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple oftimes over certain errors in text-fields that lead to error-messages during import of the dump like these: <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0xb4 <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT: COPY board_message, line 1125662, column text: "HI Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon▒s viel..." <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT: COPY board_message (board_id, thread_id, father_id,message_id, user_id, title, signat ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status, user_rank, user_rank_description,user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login, user_created, poll_id,idxfti) FROM stdin; <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0x98 <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT: COPY kidszone_tournament2005_user, line 427, columnphone: "02302▒74" <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT: COPY kidszone_tournament2005_user (id, first_name,last_name, adress, birthday, phone, email, place, permission, ude, ude_box, invited) FROM stdin; There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it has meworried and leaves me with some questions: 1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8 encoded databases,data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but still we cannotreally make 100% sure that all clients behave as expected; on the other hand, it would be extremely inconvenient ifwe had to check each and every text input for character set conformance in the application, so is there a way to ascertain"sane" data via some database-setting? pg_restore does throw this error and indeed terminates after that (I usedcustom dump format for pg_dump), psql on the other hand just continues with the import (using a pgdumpall-output thatgenerates a standard SQL-script), although it too throws the error. 2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the tablewhere this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error. Soa plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is theresome kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import, soI can accomplish a full restore? Kind regards, Markus
pgsql-general by date: