Re: COPY fails on 8.1 with invalid byte sequences in text - Mailing list pgsql-bugs

From Thomas H.
Subject Re: COPY fails on 8.1 with invalid byte sequences in text
Date
Msg-id 0e2801c6fa1a$5dfa5b60$6501a8c0@iwing
Whole thread Raw
In response to COPY fails on 8.1 with invalid byte sequences in text types  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: COPY fails on 8.1 with invalid byte sequences in text
List pgsql-bugs
FYI, prior to 8.2, there is another source of bad UTF8 byte sequences:

when using tsearch2 on utf8 content in <8.2, tsearch2 was generating bad
utf8 sequences. as tsearch2 does lowercase each char in the text its
indexing, it did also do so with multibyte-characters... unfortunately
taking each byte separately, so it seems. the unicode-representation of
german umlauts (äöü) are some examples of charcodes, that where turned into
invalid sequences.

this data could be successfully pg_dump'ed, but not pg_restore'd. in 8.2,
this looks fixed. to upgrade from 8.1.5 to 8.2b1 we had to remove all
tsearch2 index data, dump the db, restore the db in 8.2 and recreate the
indices.

- thomas



----- Original Message -----
From: "Jeff Davis" <pgsql@j-davis.com>
To: <pgsql-bugs@postgresql.org>
Sent: Saturday, October 28, 2006 12:38 AM
Subject: Re: [BUGS] COPY fails on 8.1 with invalid byte sequences in text


> On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote:
>> It seems to be essentially a data corruption issue if applications
>> insert binary data in text fields using escape sequences. Shouldn't
>> PostgreSQL reject an invalid UTF8 sequence in any text type?
>>
>
> Another note: PostgreSQL rejects invalid UTF8 sequences in other
> contexts. For instance, if you use PQexecParams() and insert using type
> text and any format (text or binary), it will reject invalid sequences.
> It will of course allow anything to be sent when the type is bytea.
>
> Also, I thought I'd publish the workaround that I'm using.
>
> I created a function that seems to work for validating text data as
> being valid UTF8.
>
> CREATE OR REPLACE FUNCTION valid_utf8(TEXT) returns BOOLEAN
> LANGUAGE plperlu AS
> $valid_utf8$
> use utf8;
> return utf8::decode($_[0]) ? 1 : 0;
> $valid_utf8$;
>
> I just add a check constraint on all of my text attributes in all of my
> tables. Not fun, but it works.
>
> Regards,
> Jeff Davis
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

pgsql-bugs by date:

Previous
From: Jeff Davis
Date:
Subject: Re: COPY fails on 8.1 with invalid byte sequences in text
Next
From: "Birju Prajapati"
Date:
Subject: Re: COPY fails on 8.1 with invalid byte sequences in text