Re: invalid UTF-8 byte sequence detected - Mailing list pgsql-general

From Bruce Momjian
Subject Re: invalid UTF-8 byte sequence detected
Date
Msg-id 200511151846.jAFIk2F19608@candle.pha.pa.us
Whole thread Raw
In response to invalid UTF-8 byte sequence detected  ("Markus Wollny" <Markus.Wollny@computec.de>)
List pgsql-general
I am also confused how invalid UTF8 sequences got into your database.
It shouldn't have been possible.

---------------------------------------------------------------------------

Markus Wollny wrote:
> Hello!
>
> I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple
oftimes over certain errors in text-fields that lead to error-messages during import of the dump like these: 
>
> <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte
0xb4
> <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT:  COPY board_message, line 1125662, column text:
"HI
>
>         Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon?s viel..."
> <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT:  COPY board_message (board_id, thread_id,
father_id,message_id, user_id, title, signat 
> ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status,
user_rank,user_rank_description, user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login,
user_created,poll_id, idxfti) FROM stdin; 
>
> <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte
0x98
> <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT:  COPY kidszone_tournament2005_user, line 427,
columnphone: "02302?74" 
> <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT:  COPY kidszone_tournament2005_user (id,
first_name,last_name, adress, birthday, phone, 
>  email, place, permission, ude, ude_box, invited) FROM stdin;
>
> There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it
hasme worried and leaves me with some questions: 
>
> 1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8
encodeddatabases, data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but
stillwe cannot really make 100% sure that all clients behave as expected; on the other hand, it would be extremely
inconvenientif we had to check each and every text input for character set conformance in the application, so is there
away to ascertain "sane" data via some database-setting? pg_restore does throw this error and indeed terminates after
that(I used custom dump format for pg_dump), psql on the other hand just continues with the import (using a
pgdumpall-outputthat generates a standard SQL-script), although it too throws the error. 
>
> 2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the
tablewhere this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error.
Soa plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is
theresome kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import,
soI can accomplish a full restore? 
>
> Kind regards,
>
>    Markus
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

pgsql-general by date:

Previous
From: Jerry LeVan
Date:
Subject: Cursors or Offset, Limit?
Next
From:
Date:
Subject: Customizing the Windows installer