Thread: XML with invalid chars
I came across this today, while helping a customer. The following will happily create a piece of XML with an embedded ^A: select xmlelement(name foo, null, E'abc\x01def'); Now, a ^A is totally forbidden in XML version 1.0, and allowed but only as "" or equivalent in XML version 1.1, and not as a 0x01 byte (see <http://en.wikipedia.org/wiki/XML#Valid_characters>) ISTM this is something we should definitely try to fix ASAP, even if we probably can't backpatch the fix. (Interestingly, the software than runs my PostgreSQL blog, Serendipity, appears to have a similar bug, at least in the version Devrim is using, having cheerfully embedded a ^L in its RSS feed a few days ago, thus causing planet.postgresql.org to blow up.) cheers andrew
On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote: > I came across this today, while helping a customer. The following will > happily create a piece of XML with an embedded ^A: > > select xmlelement(name foo, null, E'abc\x01def'); > > Now, a ^A is totally forbidden in XML version 1.0, and allowed but only > as "" or equivalent in XML version 1.1, and not as a 0x01 byte > (see <http://en.wikipedia.org/wiki/XML#Valid_characters>) > > ISTM this is something we should definitely try to fix ASAP, even if we > probably can't backpatch the fix. +1. Given that such a datum breaks dump+reload, it seems risky to do nothing at all in the back branches.
On mån, 2011-04-25 at 19:25 -0400, Andrew Dunstan wrote: > I came across this today, while helping a customer. The following > will > happily create a piece of XML with an embedded ^A: > > select xmlelement(name foo, null, E'abc\x01def'); > > Now, a ^A is totally forbidden in XML version 1.0, and allowed but > only > as "" or equivalent in XML version 1.1, and not as a 0x01 byte > (see <http://en.wikipedia.org/wiki/XML#Valid_characters>) The best place to fix this might be in escape_xml() in xml.c. Since we don't support XML 1.1 yet, just reject all invalid characters there according to XML 1.0. Relevant bits from the SQL standard: i) Let CS be the character set of SQLT. Let XMLVRAW be the result of mapping SQLV to Unicode using the implementation-defined mapping of character strings of CS to Unicode. If any Unicode code point in XMLVRAW does not represent a valid XML character, then an exception condition is raised: SQL/XML mapping error — invalid XML character. [This is what you'd add.] ii) Let XMLV be XMLVRAW, with each instance of “&” (U+0026) replaced by “&”, each instance of “<” (U+003C) replaced by “<”, each instance of “>” (U+003E) replaced by “>”, and each instance of Carriage Return (U+000D) replaced by “ ”. [This is what it already does.]
On 04/26/2011 05:11 PM, Noah Misch wrote: > On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote: >> I came across this today, while helping a customer. The following will >> happily create a piece of XML with an embedded ^A: >> >> select xmlelement(name foo, null, E'abc\x01def'); >> >> Now, a ^A is totally forbidden in XML version 1.0, and allowed but only >> as "" or equivalent in XML version 1.1, and not as a 0x01 byte >> (see<http://en.wikipedia.org/wiki/XML#Valid_characters>) >> >> ISTM this is something we should definitely try to fix ASAP, even if we >> probably can't backpatch the fix. > +1. Given that such a datum breaks dump+reload, it seems risky to do nothing at > all in the back branches. Here's a patch along the lines suggested by Peter. I'm not sure what to do about the back branches and cases where data is already in databases. This is fairly ugly. Suggestions welcome. cheers andrew
Attachment
On Wed, Apr 27, 2011 at 03:05:30PM -0400, Andrew Dunstan wrote: > On 04/26/2011 05:11 PM, Noah Misch wrote: >> On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote: >>> I came across this today, while helping a customer. The following will >>> happily create a piece of XML with an embedded ^A: >>> >>> select xmlelement(name foo, null, E'abc\x01def'); >>> >>> Now, a ^A is totally forbidden in XML version 1.0, and allowed but only >>> as "" or equivalent in XML version 1.1, and not as a 0x01 byte >>> (see<http://en.wikipedia.org/wiki/XML#Valid_characters>) >>> >>> ISTM this is something we should definitely try to fix ASAP, even if we >>> probably can't backpatch the fix. >> +1. Given that such a datum breaks dump+reload, it seems risky to do nothing at >> all in the back branches. > > Here's a patch along the lines suggested by Peter. > > I'm not sure what to do about the back branches and cases where data is > already in databases. This is fairly ugly. Suggestions welcome. We could provide a script in (or linked from) the release notes for testing the data in all your xml columns. To make things worse, the dump/reload problems seems to depend on your version of libxml2, or something. With git master, a CentOS 5 system with 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with 2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a lenient libxml2 will be liable to store XML data that won't restore on a system with a strict libxml2. Perhaps we should emit a build-time warning if the local libxml2 is lenient? > *** a/src/backend/utils/adt/xml.c > --- b/src/backend/utils/adt/xml.c > *************** > *** 1844,1850 **** escape_xml(const char *str) > --- 1844,1865 ---- > case '\r': > appendStringInfoString(&buf, " "); > break; > + case '\n': > + case '\t': > + appendStringInfoCharMacro(&buf, *p); > + break; > default: > + /* > + * Any control char we haven't already explicitly handled > + * (i.e. TAB, NL and CR)is an error. > + * If we ever support XML 1.1 they will be allowed, > + * but will have to be escaped. > + */ > + if (*p < '\x20') This needs to be an unsigned comparison. On my system, "char" is signed, so "SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly. The XML character set forbids more than just control characters; see http://www.w3.org/TR/xml/#charsets. We also ought to reject, for example, "SELECT xmlelement(name foo, null, E'\ufffe')". Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment" and "xmlpi" still let the invalid byte through. You can also still inject the byte into an attribute value via "xmlelement". I wonder if it wouldn't make more sense to just pass any XML that we generate from scratch through libxml2. There are a lot of holes to plug, otherwise. > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > + errmsg("character out of range"), > + errdetail("XML does not support control characters."))); > appendStringInfoCharMacro(&buf, *p); > break; > }
On 04/27/2011 05:30 PM, Noah Misch wrote: > >> I'm not sure what to do about the back branches and cases where data is >> already in databases. This is fairly ugly. Suggestions welcome. > We could provide a script in (or linked from) the release notes for testing the > data in all your xml columns. Yeah, we'll have to do something like that. What a blasted mess, > To make things worse, the dump/reload problems seems to depend on your version > of libxml2, or something. With git master, a CentOS 5 system with > 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with > 2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a > lenient libxml2 will be liable to store XML data that won't restore on a system > with a strict libxml2. Perhaps we should emit a build-time warning if the local > libxml2 is lenient? No, I think we need to be strict ourselves. >> + if (*p< '\x20') > This needs to be an unsigned comparison. On my system, "char" is signed, so > "SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly. Good point. Perhaps we'd be better off using iscntrl(*p). > The XML character set forbids more than just control characters; see > http://www.w3.org/TR/xml/#charsets. We also ought to reject, for example, > "SELECT xmlelement(name foo, null, E'\ufffe')". > > Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment" > and "xmlpi" still let the invalid byte through. You can also still inject the > byte into an attribute value via "xmlelement". I wonder if it wouldn't make > more sense to just pass any XML that we generate from scratch through libxml2. > There are a lot of holes to plug, otherwise. > Maybe there are, but I'd want lots of convincing that we should do that at this stage. Maybe for 9.2. I think we can plug the holes fairly simply for xmlpi and xmlcomment, and catch the attributes by moving this check up into map_sql_value_to_xml_value(). This is a significant data integrity bug, much along the same lines as the invalidly encoded data holes we plugged a release or two back. I'm amazed we haven't hit it till now, but we're sure to see more of it - XML use with Postgres is growing substantially, I believe. cheers andrew
On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote: > On 04/27/2011 05:30 PM, Noah Misch wrote: >> To make things worse, the dump/reload problems seems to depend on your version >> of libxml2, or something. With git master, a CentOS 5 system with >> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with >> 2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a >> lenient libxml2 will be liable to store XML data that won't restore on a system >> with a strict libxml2. Perhaps we should emit a build-time warning if the local >> libxml2 is lenient? > > No, I think we need to be strict ourselves. Then I suppose we'd also scan for invalid characters in xml_parse()? Or, at least, do so when linked to a libxml2 that neglects to do so itself? >> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment" >> and "xmlpi" still let the invalid byte through. You can also still inject the >> byte into an attribute value via "xmlelement". I wonder if it wouldn't make >> more sense to just pass any XML that we generate from scratch through libxml2. >> There are a lot of holes to plug, otherwise. > > Maybe there are, but I'd want lots of convincing that we should do that > at this stage. Maybe for 9.2. I think we can plug the holes fairly > simply for xmlpi and xmlcomment, and catch the attributes by moving this > check up into map_sql_value_to_xml_value(). I don't have much convincing to offer -- hunting down the holes seem fine, too. Thanks, nm
On 04/27/2011 05:30 PM, Noah Misch wrote: > >> I'm not sure what to do about the back branches and cases where data is >> already in databases. This is fairly ugly. Suggestions welcome. > We could provide a script in (or linked from) the release notes for testing the > data in all your xml columns. > Here's a draft. We'd need to come up with slightly modified versions for older versions of Postgres that don't sport array_agg() and unnest() cheers andrew create function cleanup_xml_table (schema_name text,table_name text, columns text[]) returns void languageplpgsql as $func$ declare cmd text; cond text; sep text := ''; alt text := ''; col text; forbiddentext := $$[\x1-\x8\xB\xC\xE-\x1F]$$; begin cmd := 'update ' || quote_ident(schema_name) || '.' || quote_ident(table_name) || ' set '; for col in select unnest(columns) loop cmd:= cmd || sep; cond := cond || alt; sep := ', '; alt := ' or '; cmd := cmd|| quote_ident(col) || '=' || 'regexp_replace(' || quote_ident(col) , || '::text, ' || quote_literal(forbiden) || ', $$$$, $$g$$)::xml'; cond := cond || quote_ident(col) || '::text ~ ' || quote_literal(forbidden); end loop; cmd := cmd || ' where ' || cond; execute cmd; return; end; $func$; select cleanup_xml_table(table_schema,table_name, cols) from (select table_schema::text, table_name::text, array_agg(column_name::text) as cols from information_schema.columns where data_type = 'xml' and is_updatable = 'yes' group by table_schema, table_name) xmltabs;
On 04/27/2011 11:41 PM, Noah Misch wrote: > On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote: >> On 04/27/2011 05:30 PM, Noah Misch wrote: >>> To make things worse, the dump/reload problems seems to depend on your version >>> of libxml2, or something. With git master, a CentOS 5 system with >>> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with >>> 2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a >>> lenient libxml2 will be liable to store XML data that won't restore on a system >>> with a strict libxml2. Perhaps we should emit a build-time warning if the local >>> libxml2 is lenient? >> No, I think we need to be strict ourselves. > Then I suppose we'd also scan for invalid characters in xml_parse()? Or, at > least, do so when linked to a libxml2 that neglects to do so itself? Yep. >>> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment" >>> and "xmlpi" still let the invalid byte through. You can also still inject the >>> byte into an attribute value via "xmlelement". I wonder if it wouldn't make >>> more sense to just pass any XML that we generate from scratch through libxml2. >>> There are a lot of holes to plug, otherwise. >> Maybe there are, but I'd want lots of convincing that we should do that >> at this stage. Maybe for 9.2. I think we can plug the holes fairly >> simply for xmlpi and xmlcomment, and catch the attributes by moving this >> check up into map_sql_value_to_xml_value(). > I don't have much convincing to offer -- hunting down the holes seem fine, too. > > I think I've done that. Here's the patch I have now. It looks like we can catch pretty much everything by putting checks in four places, which isn't too bad. Please review and try to break. cheers andrew
Attachment
On Sun, May 08, 2011 at 06:25:27PM -0400, Andrew Dunstan wrote: > On 04/27/2011 11:41 PM, Noah Misch wrote: >> On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote: >>> On 04/27/2011 05:30 PM, Noah Misch wrote: >>>> To make things worse, the dump/reload problems seems to depend on your version >>>> of libxml2, or something. With git master, a CentOS 5 system with >>>> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with >>>> 2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a >>>> lenient libxml2 will be liable to store XML data that won't restore on a system >>>> with a strict libxml2. Perhaps we should emit a build-time warning if the local >>>> libxml2 is lenient? >>> No, I think we need to be strict ourselves. >> Then I suppose we'd also scan for invalid characters in xml_parse()? Or, at >> least, do so when linked to a libxml2 that neglects to do so itself? > > Yep. I see you've gone with doing it unconditionally. I'd lean toward testing the library in pg_xml_init and setting a flag indicating whether we need the extra pass. However, a later patch can always optimize that. >>>> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment" >>>> and "xmlpi" still let the invalid byte through. You can also still inject the >>>> byte into an attribute value via "xmlelement". I wonder if it wouldn't make >>>> more sense to just pass any XML that we generate from scratch through libxml2. >>>> There are a lot of holes to plug, otherwise. >>> Maybe there are, but I'd want lots of convincing that we should do that >>> at this stage. Maybe for 9.2. I think we can plug the holes fairly >>> simply for xmlpi and xmlcomment, and catch the attributes by moving this >>> check up into map_sql_value_to_xml_value(). >> I don't have much convincing to offer -- hunting down the holes seem fine, too. > > I think I've done that. Here's the patch I have now. It looks like we > can catch pretty much everything by putting checks in four places, which > isn't too bad. > > Please review and try to break. Here are the test cases I tried: -- caught successfully SELECT E'\x01'::xml; SELECT xmlcomment(E'\x01'); SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), ''); SELECT xmlelement(name foo, NULL, E'\x01'); SELECT xmlforest(E'\x01' AS foo); SELECT xmlpi(name foo, E'\x01'); SELECT query_to_xml($$SELECT E'\x01'$$, true, false, ''); -- not caught SELECT xmlroot('<root/>', version E'\x01'); SELECT xmlcomment(E'\ufffe'); -- not directly related, but also wrongly accepted SELECT xmlroot('<root/>', version ' '); SELECT xmlroot('<root/>', version 'foo'); Offhand, I don't find libxml2's handling of XML declarations particularly consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption = document) accepts '<?xml version="foo"?>' but rejects '<?xml version=" "?>'. Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content) accepts anything, even control characters. The XML 1.0 standard is stricter: the version must match ^1\.[0-9]+$. We might want to tighten this at the same time. > diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c > index ee82d46..12cfd56 100644 > --- a/src/backend/utils/adt/xml.c > +++ b/src/backend/utils/adt/xml.c > @@ -142,6 +142,20 @@ static void SPI_sql_row_to_xmlelement(int rownum, StringInfo result, > #define NAMESPACE_XSI "http://www.w3.org/2001/XMLSchema-instance" > #define NAMESPACE_SQLXML "http://standards.iso.org/iso/9075/2003/sqlxml" > > +/* forbidden C0 control chars */ > +#define FORBIDDEN_C0 \ > + "\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C\x0E\x0F\x10\x11" \ > + "\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F" > + > +static inline void > +check_forbidden_c0(char * str) > +{ > + if (strpbrk(str,FORBIDDEN_C0) != NULL) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > + errmsg("character out of range"), > + errdetail("XML does not support control characters."))); This would be an errhint, I think. However, the message seems to emphasize the wrong thing. XML 1.0 defines a lexical production called Char that includes various Unicode character ranges. Control characters as we know them happen to not fall in any of those ranges. The characters aren't unsupported in the sense of being missing features; the language simply forbids them. libxml2's error message for this case is "PCDATA invalid Char value 1" (assuming \x01). Mentioning PCDATA seems redundant, since no other context offers greater freedom. How about: ereport(ERROR, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("invalid XML 1.0 Char \\U%08x", char_val))); nm
On 05/09/2011 11:25 PM, Noah Misch wrote: > > I see you've gone with doing it unconditionally. I'd lean toward testing the > library in pg_xml_init and setting a flag indicating whether we need the extra > pass. However, a later patch can always optimize that. > I wasn't terribly keen on the idea, but we can look at it again later. >> >> Please review and try to break. > Here are the test cases I tried: > > -- caught successfully > SELECT E'\x01'::xml; > SELECT xmlcomment(E'\x01'); > SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), ''); > SELECT xmlelement(name foo, NULL, E'\x01'); > SELECT xmlforest(E'\x01' AS foo); > SELECT xmlpi(name foo, E'\x01'); > SELECT query_to_xml($$SELECT E'\x01'$$, true, false, ''); > > -- not caught > SELECT xmlroot('<root/>', version E'\x01'); That's an easy fix. > SELECT xmlcomment(E'\ufffe'); That's a bit harder. Do we want to extend these checks to cover surrogates and end of plane characters, which are the remaining forbidden chars? It certainly seems likely to be a somewhat slower test since I think we'd need to process the input strings a Unicode char at a time, but we do that in other places and it seems acceptable. What do people think? > -- not directly related, but also wrongly accepted > SELECT xmlroot('<root/>', version ' '); > SELECT xmlroot('<root/>', version 'foo'); > > Offhand, I don't find libxml2's handling of XML declarations particularly > consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption = > document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'. > Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content) > accepts anything, even control characters. The XML 1.0 standard is stricter: > the version must match ^1\.[0-9]+$. We might want to tighten this at the same > time. We can add some stuff to check the version strings. Doesn't seem terribly difficult. > libxml2's error message for this case is "PCDATA invalid Char value 1" > (assuming \x01). Mentioning PCDATA seems redundant, since no other context > offers greater freedom. How about: > > ereport(ERROR, > (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > errmsg("invalid XML 1.0 Char \\U%08x", char_val))); > > That would also mean processing the string a unicode char at a time. So maybe that's what we need to do. Thanks for the input. cheers andrew
On Wed, May 11, 2011 at 06:17:07PM -0400, Andrew Dunstan wrote: > On 05/09/2011 11:25 PM, Noah Misch wrote: >> SELECT xmlcomment(E'\ufffe'); > > That's a bit harder. Do we want to extend these checks to cover > surrogates and end of plane characters, which are the remaining > forbidden chars? It certainly seems likely to be a somewhat slower test > since I think we'd need to process the input strings a Unicode char at a > time, but we do that in other places and it seems acceptable. What do > people think? My thinking was that we should only make this flag day for xml-type users if we're going to fix it all the way.
On 05/11/2011 07:00 PM, Noah Misch wrote: > On Wed, May 11, 2011 at 06:17:07PM -0400, Andrew Dunstan wrote: >> On 05/09/2011 11:25 PM, Noah Misch wrote: >>> SELECT xmlcomment(E'\ufffe'); >> That's a bit harder. Do we want to extend these checks to cover >> surrogates and end of plane characters, which are the remaining >> forbidden chars? It certainly seems likely to be a somewhat slower test >> since I think we'd need to process the input strings a Unicode char at a >> time, but we do that in other places and it seems acceptable. What do >> people think? > My thinking was that we should only make this flag day for xml-type users if > we're going to fix it all the way. Fair enough. cheers andrew