Thread: COPY fails on 8.1 with invalid byte sequences in text types

COPY fails on 8.1 with invalid byte sequences in text types

From
Jeff Davis
Date:
You can insert invalid UTF8 bytes sequences into a TEXT type on an 8.1
installation by doing something like:

INSERT INTO foo(t) VALUES('\xFF');

Then, you can do a:

COPY foo TO '/some/file';

but if you try to do a:

COPY foo FROM '/some/file';

That will fail because /some/file contains invalid UTF8 sequences, even
though it's the same file you copied out.

It seems to be essentially a data corruption issue if applications
insert binary data in text fields using escape sequences. Shouldn't
PostgreSQL reject an invalid UTF8 sequence in any text type?

Regards,
    Jeff Davis

Re: COPY fails on 8.1 with invalid byte sequences in text

From
Jeff Davis
Date:
On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote:
> It seems to be essentially a data corruption issue if applications
> insert binary data in text fields using escape sequences. Shouldn't
> PostgreSQL reject an invalid UTF8 sequence in any text type?
>

Another note: PostgreSQL rejects invalid UTF8 sequences in other
contexts. For instance, if you use PQexecParams() and insert using type
text and any format (text or binary), it will reject invalid sequences.
It will of course allow anything to be sent when the type is bytea.

Also, I thought I'd publish the workaround that I'm using.

I created a function that seems to work for validating text data as
being valid UTF8.

CREATE OR REPLACE FUNCTION valid_utf8(TEXT) returns BOOLEAN
LANGUAGE plperlu AS
$valid_utf8$
use utf8;
return utf8::decode($_[0]) ? 1 : 0;
$valid_utf8$;

I just add a check constraint on all of my text attributes in all of my
tables. Not fun, but it works.

Regards,
    Jeff Davis

Re: COPY fails on 8.1 with invalid byte sequences in text

From
"Thomas H."
Date:
FYI, prior to 8.2, there is another source of bad UTF8 byte sequences:

when using tsearch2 on utf8 content in <8.2, tsearch2 was generating bad
utf8 sequences. as tsearch2 does lowercase each char in the text its
indexing, it did also do so with multibyte-characters... unfortunately
taking each byte separately, so it seems. the unicode-representation of
german umlauts (äöü) are some examples of charcodes, that where turned into
invalid sequences.

this data could be successfully pg_dump'ed, but not pg_restore'd. in 8.2,
this looks fixed. to upgrade from 8.1.5 to 8.2b1 we had to remove all
tsearch2 index data, dump the db, restore the db in 8.2 and recreate the
indices.

- thomas



----- Original Message -----
From: "Jeff Davis" <pgsql@j-davis.com>
To: <pgsql-bugs@postgresql.org>
Sent: Saturday, October 28, 2006 12:38 AM
Subject: Re: [BUGS] COPY fails on 8.1 with invalid byte sequences in text


> On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote:
>> It seems to be essentially a data corruption issue if applications
>> insert binary data in text fields using escape sequences. Shouldn't
>> PostgreSQL reject an invalid UTF8 sequence in any text type?
>>
>
> Another note: PostgreSQL rejects invalid UTF8 sequences in other
> contexts. For instance, if you use PQexecParams() and insert using type
> text and any format (text or binary), it will reject invalid sequences.
> It will of course allow anything to be sent when the type is bytea.
>
> Also, I thought I'd publish the workaround that I'm using.
>
> I created a function that seems to work for validating text data as
> being valid UTF8.
>
> CREATE OR REPLACE FUNCTION valid_utf8(TEXT) returns BOOLEAN
> LANGUAGE plperlu AS
> $valid_utf8$
> use utf8;
> return utf8::decode($_[0]) ? 1 : 0;
> $valid_utf8$;
>
> I just add a check constraint on all of my text attributes in all of my
> tables. Not fun, but it works.
>
> Regards,
> Jeff Davis
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

Re: COPY fails on 8.1 with invalid byte sequences in text

From
"Birju Prajapati"
Date:
On 27/10/06, Thomas H. <me@alternize.com> wrote:
> FYI, prior to 8.2, there is another source of bad UTF8 byte sequences:
>
> when using tsearch2 on utf8 content in <8.2, tsearch2 was generating bad
> utf8 sequences. as tsearch2 does lowercase each char in the text its
> indexing, it did also do so with multibyte-characters... unfortunately
> taking each byte separately, so it seems. the unicode-representation of
> german umlauts (=E4=F6=FC) are some examples of charcodes, that where tur=
ned into
> invalid sequences.
>
> this data could be successfully pg_dump'ed, but not pg_restore'd. in 8.2,
> this looks fixed. to upgrade from 8.1.5 to 8.2b1 we had to remove all
> tsearch2 index data, dump the db, restore the db in 8.2 and recreate the
> indices.

You need to initdb with utf8 and then install tsearch2 with utf8. Both
need utf8. I had a similar problem. Perhaps your 8.1 postgres cluster
wasn't utf8?

>
> - thomas
>
>
>
> ----- Original Message -----
> From: "Jeff Davis" <pgsql@j-davis.com>
> To: <pgsql-bugs@postgresql.org>
> Sent: Saturday, October 28, 2006 12:38 AM
> Subject: Re: [BUGS] COPY fails on 8.1 with invalid byte sequences in text
>
>
> > On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote:
> >> It seems to be essentially a data corruption issue if applications
> >> insert binary data in text fields using escape sequences. Shouldn't
> >> PostgreSQL reject an invalid UTF8 sequence in any text type?
> >>
> >
> > Another note: PostgreSQL rejects invalid UTF8 sequences in other
> > contexts. For instance, if you use PQexecParams() and insert using type
> > text and any format (text or binary), it will reject invalid sequences.
> > It will of course allow anything to be sent when the type is bytea.
> >
> > Also, I thought I'd publish the workaround that I'm using.
> >
> > I created a function that seems to work for validating text data as
> > being valid UTF8.
> >
> > CREATE OR REPLACE FUNCTION valid_utf8(TEXT) returns BOOLEAN
> > LANGUAGE plperlu AS
> > $valid_utf8$
> > use utf8;
> > return utf8::decode($_[0]) ? 1 : 0;
> > $valid_utf8$;
> >
> > I just add a check constraint on all of my text attributes in all of my
> > tables. Not fun, but it works.
> >
> > Regards,
> > Jeff Davis
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 6: explain analyze is your friend
> >
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

Re: COPY fails on 8.1 with invalid byte sequences in text

From
Jeff Davis
Date:
On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote:
> You can insert invalid UTF8 bytes sequences into a TEXT type on an 8.1
> installation by doing something like:
>

I created a patch that appears to fix the problem, and does not appear
to break anything else.

Is this acceptable?

Regards,
    Jeff Davis

Attachment

Re: COPY fails on 8.1 with invalid byte sequences in text

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> I created a patch that appears to fix the problem, and does not appear
> to break anything else.

... except maybe bytea ...

            regards, tom lane

Re: COPY fails on 8.1 with invalid byte sequences in text

From
Jeff Davis
Date:
On Tue, 2006-10-31 at 16:13 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I created a patch that appears to fix the problem, and does not appear
> > to break anything else.
>
> ... except maybe bytea ...
>

Ok. So then it seems that the only possible places to fix it are in
textin and all the other input functions for all the character types*,
or if we change COPY to use the send/recv functions rather than the
out/in functions.

I don't think we want to change the format for COPY, so is it reasonable
to change the input functions to reject invalid byte sequences?

COPY isn't just an issue for backups. Slony-I uses COPY to transfer
data, and if there are any invalid byte sequences than replication will
fail. The COPY doc page makes every implication that something COPY'd
out can be COPY'd back in in the same way.

Is this not a bug? Is there a discussion on -hackers about this that I
missed?

Regards,
    Jeff Davis

* I don't immediately know what we'd do about "char". I think people
expect it to accept 256 values, but clearly that would violate a lot of
encodings. However, the current behavior creates the same problem with
COPY.

Re: COPY fails on 8.1 with invalid byte sequences in text

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Is this not a bug?

I don't actually see that it is.  The documentation is perfectly clear
on the point:

    (It is your responsibility that the byte sequences you create
    are valid characters in the server character set encoding.)

(This is in 4.1.2.1. String Constants)  If you don't want to deal with
this, don't use octal escapes to construct multibyte characters.

            regards, tom lane

Re: COPY fails on 8.1 with invalid byte sequences in text

From
Dennis Bjorklund
Date:
Tom Lane skrev:

>> Is this not a bug?
>
> I don't actually see that it is.  The documentation is perfectly clear
> on the point:
>
>     (It is your responsibility that the byte sequences you create
>     are valid characters in the server character set encoding.)
>
> (This is in 4.1.2.1. String Constants)  If you don't want to deal with
> this, don't use octal escapes to construct multibyte characters.

The database also has a responsibility to protect itself. If you can
insert data into the database so that the stored value is invalid for
the datatype of the column, then it's broken imho.

Having a statement in the doc saying "please don't do this because you
will corrupt the data" doesn't really make the situation better.

/Dennis

Re: COPY fails on 8.1 with invalid byte sequences in text

From
Jeff Davis
Date:
On Tue, 2006-10-31 at 23:18 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Is this not a bug?
>
> I don't actually see that it is.  The documentation is perfectly clear
> on the point:
>
>     (It is your responsibility that the byte sequences you create
>     are valid characters in the server character set encoding.)
>
> (This is in 4.1.2.1. String Constants)  If you don't want to deal with
> this, don't use octal escapes to construct multibyte characters.
>

I have thought about this some more, and I still disagree on a few
points.

First, there is no way, as a DBA, to _not_ use octal escapes, because
there is no option to turn off. I can either inspect the application,
or just trust it, but I can't control it from the database (short of
adding a CHECK to every text column). Since it interferes with the
proper use of COPY, which in turn interferes with the use of pg_dump and
applications such as Slony, I think that it is something that should be
controlled at database.

Second, you pointed out that my patch breaks BYTEA. I'd like to point
out that (as I understand it) the patch only breaks BYTEA if you rely on
the cstring escaping to pass binary data to byteain, like:

=> SELECT E'\377'::bytea;

In my opinion, that's wrong anyway, because it will fail if you do
something like:

=> SELECT E'\377\000\377';

Because the NULL in the middle terminates the cstring, which passes a 1-
byte string to byteain, resulting in a 1-byte value, instead of a 3-byte
value which it should be.

However, if you pass ASCII data to byteain that represents binary data,
like:

=> SELECT E'\\377\\000\\377'::bytea;

It seems to work just fine, and that's what PQescapeByteaConn() does.
There may be a backwards-compatibility issue, but I don't think my patch
is 100% broken (unless, of course, you found some other way that it's
broken).

This is a practical concern because some applications use escaping that
protects against SQL injection, but does not protect against invalid
byte sequences. It's not obvious to an application programmer that
protecting against SQL injection is not enough, particularly if the
programmer is trying to be "database agnostic" or doesn't test against
different encodings. Ideally, they should use PQescapeStringConn() and
not convert anything to an octal escape, but that's not always the case.

Lastly, if we retain the current behavior, I think a note should be
added to the COPY docs. The current docs imply that if you COPY it out,
you can COPY it back in. Even if I read the above documentation note
directly after reading the COPY documentation, it would not make me
think that COPY could fail for built-in types.

Regards,
    Jeff Davis