Re: invalid UTF-8 byte sequence detected - Mailing list pgsql-general
From | Bruce Momjian |
---|---|
Subject | Re: invalid UTF-8 byte sequence detected |
Date | |
Msg-id | 200511151846.jAFIk2F19608@candle.pha.pa.us Whole thread Raw |
In response to | invalid UTF-8 byte sequence detected ("Markus Wollny" <Markus.Wollny@computec.de>) |
List | pgsql-general |
I am also confused how invalid UTF8 sequences got into your database. It shouldn't have been possible. --------------------------------------------------------------------------- Markus Wollny wrote: > Hello! > > I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple oftimes over certain errors in text-fields that lead to error-messages during import of the dump like these: > > <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0xb4 > <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT: COPY board_message, line 1125662, column text: "HI > > Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon?s viel..." > <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT: COPY board_message (board_id, thread_id, father_id,message_id, user_id, title, signat > ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status, user_rank,user_rank_description, user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login, user_created,poll_id, idxfti) FROM stdin; > > <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0x98 > <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT: COPY kidszone_tournament2005_user, line 427, columnphone: "02302?74" > <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT: COPY kidszone_tournament2005_user (id, first_name,last_name, adress, birthday, phone, > email, place, permission, ude, ude_box, invited) FROM stdin; > > There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it hasme worried and leaves me with some questions: > > 1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8 encodeddatabases, data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but stillwe cannot really make 100% sure that all clients behave as expected; on the other hand, it would be extremely inconvenientif we had to check each and every text input for character set conformance in the application, so is there away to ascertain "sane" data via some database-setting? pg_restore does throw this error and indeed terminates after that(I used custom dump format for pg_dump), psql on the other hand just continues with the import (using a pgdumpall-outputthat generates a standard SQL-script), although it too throws the error. > > 2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the tablewhere this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error. Soa plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is theresome kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import, soI can accomplish a full restore? > > Kind regards, > > Markus > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
pgsql-general by date: