Thread: COPY fails on 8.1 with invalid byte sequences in text types
You can insert invalid UTF8 bytes sequences into a TEXT type on an 8.1 installation by doing something like: INSERT INTO foo(t) VALUES('\xFF'); Then, you can do a: COPY foo TO '/some/file'; but if you try to do a: COPY foo FROM '/some/file'; That will fail because /some/file contains invalid UTF8 sequences, even though it's the same file you copied out. It seems to be essentially a data corruption issue if applications insert binary data in text fields using escape sequences. Shouldn't PostgreSQL reject an invalid UTF8 sequence in any text type? Regards, Jeff Davis
On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote: > It seems to be essentially a data corruption issue if applications > insert binary data in text fields using escape sequences. Shouldn't > PostgreSQL reject an invalid UTF8 sequence in any text type? > Another note: PostgreSQL rejects invalid UTF8 sequences in other contexts. For instance, if you use PQexecParams() and insert using type text and any format (text or binary), it will reject invalid sequences. It will of course allow anything to be sent when the type is bytea. Also, I thought I'd publish the workaround that I'm using. I created a function that seems to work for validating text data as being valid UTF8. CREATE OR REPLACE FUNCTION valid_utf8(TEXT) returns BOOLEAN LANGUAGE plperlu AS $valid_utf8$ use utf8; return utf8::decode($_[0]) ? 1 : 0; $valid_utf8$; I just add a check constraint on all of my text attributes in all of my tables. Not fun, but it works. Regards, Jeff Davis
FYI, prior to 8.2, there is another source of bad UTF8 byte sequences: when using tsearch2 on utf8 content in <8.2, tsearch2 was generating bad utf8 sequences. as tsearch2 does lowercase each char in the text its indexing, it did also do so with multibyte-characters... unfortunately taking each byte separately, so it seems. the unicode-representation of german umlauts (äöü) are some examples of charcodes, that where turned into invalid sequences. this data could be successfully pg_dump'ed, but not pg_restore'd. in 8.2, this looks fixed. to upgrade from 8.1.5 to 8.2b1 we had to remove all tsearch2 index data, dump the db, restore the db in 8.2 and recreate the indices. - thomas ----- Original Message ----- From: "Jeff Davis" <pgsql@j-davis.com> To: <pgsql-bugs@postgresql.org> Sent: Saturday, October 28, 2006 12:38 AM Subject: Re: [BUGS] COPY fails on 8.1 with invalid byte sequences in text > On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote: >> It seems to be essentially a data corruption issue if applications >> insert binary data in text fields using escape sequences. Shouldn't >> PostgreSQL reject an invalid UTF8 sequence in any text type? >> > > Another note: PostgreSQL rejects invalid UTF8 sequences in other > contexts. For instance, if you use PQexecParams() and insert using type > text and any format (text or binary), it will reject invalid sequences. > It will of course allow anything to be sent when the type is bytea. > > Also, I thought I'd publish the workaround that I'm using. > > I created a function that seems to work for validating text data as > being valid UTF8. > > CREATE OR REPLACE FUNCTION valid_utf8(TEXT) returns BOOLEAN > LANGUAGE plperlu AS > $valid_utf8$ > use utf8; > return utf8::decode($_[0]) ? 1 : 0; > $valid_utf8$; > > I just add a check constraint on all of my text attributes in all of my > tables. Not fun, but it works. > > Regards, > Jeff Davis > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend >
On 27/10/06, Thomas H. <me@alternize.com> wrote: > FYI, prior to 8.2, there is another source of bad UTF8 byte sequences: > > when using tsearch2 on utf8 content in <8.2, tsearch2 was generating bad > utf8 sequences. as tsearch2 does lowercase each char in the text its > indexing, it did also do so with multibyte-characters... unfortunately > taking each byte separately, so it seems. the unicode-representation of > german umlauts (=E4=F6=FC) are some examples of charcodes, that where tur= ned into > invalid sequences. > > this data could be successfully pg_dump'ed, but not pg_restore'd. in 8.2, > this looks fixed. to upgrade from 8.1.5 to 8.2b1 we had to remove all > tsearch2 index data, dump the db, restore the db in 8.2 and recreate the > indices. You need to initdb with utf8 and then install tsearch2 with utf8. Both need utf8. I had a similar problem. Perhaps your 8.1 postgres cluster wasn't utf8? > > - thomas > > > > ----- Original Message ----- > From: "Jeff Davis" <pgsql@j-davis.com> > To: <pgsql-bugs@postgresql.org> > Sent: Saturday, October 28, 2006 12:38 AM > Subject: Re: [BUGS] COPY fails on 8.1 with invalid byte sequences in text > > > > On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote: > >> It seems to be essentially a data corruption issue if applications > >> insert binary data in text fields using escape sequences. Shouldn't > >> PostgreSQL reject an invalid UTF8 sequence in any text type? > >> > > > > Another note: PostgreSQL rejects invalid UTF8 sequences in other > > contexts. For instance, if you use PQexecParams() and insert using type > > text and any format (text or binary), it will reject invalid sequences. > > It will of course allow anything to be sent when the type is bytea. > > > > Also, I thought I'd publish the workaround that I'm using. > > > > I created a function that seems to work for validating text data as > > being valid UTF8. > > > > CREATE OR REPLACE FUNCTION valid_utf8(TEXT) returns BOOLEAN > > LANGUAGE plperlu AS > > $valid_utf8$ > > use utf8; > > return utf8::decode($_[0]) ? 1 : 0; > > $valid_utf8$; > > > > I just add a check constraint on all of my text attributes in all of my > > tables. Not fun, but it works. > > > > Regards, > > Jeff Davis > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 6: explain analyze is your friend > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org >
On Fri, 2006-10-27 at 14:42 -0700, Jeff Davis wrote: > You can insert invalid UTF8 bytes sequences into a TEXT type on an 8.1 > installation by doing something like: > I created a patch that appears to fix the problem, and does not appear to break anything else. Is this acceptable? Regards, Jeff Davis
Attachment
Jeff Davis <pgsql@j-davis.com> writes: > I created a patch that appears to fix the problem, and does not appear > to break anything else. ... except maybe bytea ... regards, tom lane
On Tue, 2006-10-31 at 16:13 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > I created a patch that appears to fix the problem, and does not appear > > to break anything else. > > ... except maybe bytea ... > Ok. So then it seems that the only possible places to fix it are in textin and all the other input functions for all the character types*, or if we change COPY to use the send/recv functions rather than the out/in functions. I don't think we want to change the format for COPY, so is it reasonable to change the input functions to reject invalid byte sequences? COPY isn't just an issue for backups. Slony-I uses COPY to transfer data, and if there are any invalid byte sequences than replication will fail. The COPY doc page makes every implication that something COPY'd out can be COPY'd back in in the same way. Is this not a bug? Is there a discussion on -hackers about this that I missed? Regards, Jeff Davis * I don't immediately know what we'd do about "char". I think people expect it to accept 256 values, but clearly that would violate a lot of encodings. However, the current behavior creates the same problem with COPY.
Jeff Davis <pgsql@j-davis.com> writes: > Is this not a bug? I don't actually see that it is. The documentation is perfectly clear on the point: (It is your responsibility that the byte sequences you create are valid characters in the server character set encoding.) (This is in 4.1.2.1. String Constants) If you don't want to deal with this, don't use octal escapes to construct multibyte characters. regards, tom lane
Tom Lane skrev: >> Is this not a bug? > > I don't actually see that it is. The documentation is perfectly clear > on the point: > > (It is your responsibility that the byte sequences you create > are valid characters in the server character set encoding.) > > (This is in 4.1.2.1. String Constants) If you don't want to deal with > this, don't use octal escapes to construct multibyte characters. The database also has a responsibility to protect itself. If you can insert data into the database so that the stored value is invalid for the datatype of the column, then it's broken imho. Having a statement in the doc saying "please don't do this because you will corrupt the data" doesn't really make the situation better. /Dennis
On Tue, 2006-10-31 at 23:18 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Is this not a bug? > > I don't actually see that it is. The documentation is perfectly clear > on the point: > > (It is your responsibility that the byte sequences you create > are valid characters in the server character set encoding.) > > (This is in 4.1.2.1. String Constants) If you don't want to deal with > this, don't use octal escapes to construct multibyte characters. > I have thought about this some more, and I still disagree on a few points. First, there is no way, as a DBA, to _not_ use octal escapes, because there is no option to turn off. I can either inspect the application, or just trust it, but I can't control it from the database (short of adding a CHECK to every text column). Since it interferes with the proper use of COPY, which in turn interferes with the use of pg_dump and applications such as Slony, I think that it is something that should be controlled at database. Second, you pointed out that my patch breaks BYTEA. I'd like to point out that (as I understand it) the patch only breaks BYTEA if you rely on the cstring escaping to pass binary data to byteain, like: => SELECT E'\377'::bytea; In my opinion, that's wrong anyway, because it will fail if you do something like: => SELECT E'\377\000\377'; Because the NULL in the middle terminates the cstring, which passes a 1- byte string to byteain, resulting in a 1-byte value, instead of a 3-byte value which it should be. However, if you pass ASCII data to byteain that represents binary data, like: => SELECT E'\\377\\000\\377'::bytea; It seems to work just fine, and that's what PQescapeByteaConn() does. There may be a backwards-compatibility issue, but I don't think my patch is 100% broken (unless, of course, you found some other way that it's broken). This is a practical concern because some applications use escaping that protects against SQL injection, but does not protect against invalid byte sequences. It's not obvious to an application programmer that protecting against SQL injection is not enough, particularly if the programmer is trying to be "database agnostic" or doesn't test against different encodings. Ideally, they should use PQescapeStringConn() and not convert anything to an octal escape, but that's not always the case. Lastly, if we retain the current behavior, I think a note should be added to the COPY docs. The current docs imply that if you COPY it out, you can COPY it back in. Even if I read the above documentation note directly after reading the COPY documentation, it would not make me think that COPY could fail for built-in types. Regards, Jeff Davis