Thread: XML with invalid chars

XML with invalid chars

From

Andrew Dunstan

Date:

25 April 2011, 20:25:15

I came across this today, while helping a customer. The following will 
happily create a piece of XML with an embedded ^A:
   select xmlelement(name foo, null, E'abc\x01def');

Now, a ^A is totally forbidden in XML version 1.0, and allowed but only 
as "" or equivalent in XML version 1.1, and not as a 0x01 byte 
(see <http://en.wikipedia.org/wiki/XML#Valid_characters>)

ISTM this is something we should definitely try to fix ASAP, even if we 
probably can't backpatch the fix.

(Interestingly, the software than runs my PostgreSQL blog, Serendipity, 
appears to have a similar bug, at least in the version Devrim is using, 
having cheerfully embedded a ^L in its RSS feed a few days ago, thus 
causing planet.postgresql.org to blow up.)

cheers

andrew

Re: XML with invalid chars

From

Noah Misch

Date:

26 April 2011, 18:11:32

On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:
> I came across this today, while helping a customer. The following will  
> happily create a piece of XML with an embedded ^A:
>
>    select xmlelement(name foo, null, E'abc\x01def');
>
> Now, a ^A is totally forbidden in XML version 1.0, and allowed but only  
> as "" or equivalent in XML version 1.1, and not as a 0x01 byte  
> (see <http://en.wikipedia.org/wiki/XML#Valid_characters>)
>
> ISTM this is something we should definitely try to fix ASAP, even if we  
> probably can't backpatch the fix.

+1.  Given that such a datum breaks dump+reload, it seems risky to do nothing at
all in the back branches.

Re: XML with invalid chars

From

Peter Eisentraut

Date:

26 April 2011, 19:03:23

On mån, 2011-04-25 at 19:25 -0400, Andrew Dunstan wrote:
> I came across this today, while helping a customer. The following
> will 
> happily create a piece of XML with an embedded ^A:
> 
>     select xmlelement(name foo, null, E'abc\x01def');
> 
> Now, a ^A is totally forbidden in XML version 1.0, and allowed but
> only 
> as "" or equivalent in XML version 1.1, and not as a 0x01 byte 
> (see <http://en.wikipedia.org/wiki/XML#Valid_characters>)

The best place to fix this might be in escape_xml() in xml.c.  Since we
don't support XML 1.1 yet, just reject all invalid characters there
according to XML 1.0.

Relevant bits from the SQL standard:

i)
Let CS be the character set of SQLT. Let XMLVRAW be the result of mapping SQLV to Unicode
using the implementation-defined mapping of character strings of CS to Unicode. If any Unicode
code point in XMLVRAW does not represent a valid XML character, then an exception condition
is raised: SQL/XML mapping error — invalid XML character.

[This is what you'd add.]

ii)
Let XMLV be XMLVRAW, with each instance of “&” (U+0026) replaced by “&”, each
instance of “<” (U+003C) replaced by “<”, each instance of “>” (U+003E) replaced by “>”,
and each instance of Carriage Return (U+000D) replaced by “”.

[This is what it already does.]

Re: XML with invalid chars

From

Andrew Dunstan

Date:

27 April 2011, 16:06:20


On 04/26/2011 05:11 PM, Noah Misch wrote:
> On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:
>> I came across this today, while helping a customer. The following will
>> happily create a piece of XML with an embedded ^A:
>>
>>     select xmlelement(name foo, null, E'abc\x01def');
>>
>> Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
>> as "" or equivalent in XML version 1.1, and not as a 0x01 byte
>> (see<http://en.wikipedia.org/wiki/XML#Valid_characters>)
>>
>> ISTM this is something we should definitely try to fix ASAP, even if we
>> probably can't backpatch the fix.
> +1.  Given that such a datum breaks dump+reload, it seems risky to do nothing at
> all in the back branches.



Here's a patch along the lines suggested by Peter.

I'm not sure what to do about the back branches and cases where data is
already in databases. This is fairly ugly. Suggestions welcome.

cheers

andrew

Attachment

xmlchars.patch

Re: XML with invalid chars

From

Noah Misch

Date:

27 April 2011, 18:31:12

On Wed, Apr 27, 2011 at 03:05:30PM -0400, Andrew Dunstan wrote:
> On 04/26/2011 05:11 PM, Noah Misch wrote:
>> On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:
>>> I came across this today, while helping a customer. The following will
>>> happily create a piece of XML with an embedded ^A:
>>>
>>>     select xmlelement(name foo, null, E'abc\x01def');
>>>
>>> Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
>>> as "" or equivalent in XML version 1.1, and not as a 0x01 byte
>>> (see<http://en.wikipedia.org/wiki/XML#Valid_characters>)
>>>
>>> ISTM this is something we should definitely try to fix ASAP, even if we
>>> probably can't backpatch the fix.
>> +1.  Given that such a datum breaks dump+reload, it seems risky to do nothing at
>> all in the back branches.
>
> Here's a patch along the lines suggested by Peter.
>
> I'm not sure what to do about the back branches and cases where data is  
> already in databases. This is fairly ugly. Suggestions welcome.

We could provide a script in (or linked from) the release notes for testing the
data in all your xml columns.

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something.  With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it.  Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2.  Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

> *** a/src/backend/utils/adt/xml.c
> --- b/src/backend/utils/adt/xml.c
> ***************
> *** 1844,1850 **** escape_xml(const char *str)
> --- 1844,1865 ----
>               case '\r':
>                   appendStringInfoString(&buf, "");
>                   break;
> +             case '\n':
> +             case '\t':
> +                 appendStringInfoCharMacro(&buf, *p);
> +                 break;
>               default:
> +                 /* 
> +                  * Any control char we haven't already explicitly handled
> +                  * (i.e. TAB, NL and CR)is an error. 
> +                  * If we ever support XML 1.1 they will be allowed,
> +                  * but will have to be escaped.
> +                  */
> +                 if (*p < '\x20')

This needs to be an unsigned comparison.  On my system, "char" is signed, so
"SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly.

The XML character set forbids more than just control characters; see
http://www.w3.org/TR/xml/#charsets.  We also ought to reject, for example,
"SELECT xmlelement(name foo, null, E'\ufffe')".

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through.  You can also still inject the
byte into an attribute value via "xmlelement".  I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

> +                     ereport(ERROR,
> +                             (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> +                              errmsg("character out of range"),
> +                              errdetail("XML does not support control characters.")));
>                   appendStringInfoCharMacro(&buf, *p);
>                   break;
>           }

Re: XML with invalid chars

From

Andrew Dunstan

Date:

28 April 2011, 00:22:48

On 04/27/2011 05:30 PM, Noah Misch wrote:
>
>> I'm not sure what to do about the back branches and cases where data is
>> already in databases. This is fairly ugly. Suggestions welcome.
> We could provide a script in (or linked from) the release notes for testing the
> data in all your xml columns.

Yeah, we'll have to do something like that. What a blasted mess,

> To make things worse, the dump/reload problems seems to depend on your version
> of libxml2, or something.  With git master, a CentOS 5 system with
> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
> 2.6.31.dfsg-2ubuntu rejects it.  Even with a patch like this, systems with a
> lenient libxml2 will be liable to store XML data that won't restore on a system
> with a strict libxml2.  Perhaps we should emit a build-time warning if the local
> libxml2 is lenient?

No, I think we need to be strict ourselves.

>> +                 if (*p<  '\x20')
> This needs to be an unsigned comparison.  On my system, "char" is signed, so
> "SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly.

Good point. Perhaps we'd be better off using iscntrl(*p).

> The XML character set forbids more than just control characters; see
> http://www.w3.org/TR/xml/#charsets.  We also ought to reject, for example,
> "SELECT xmlelement(name foo, null, E'\ufffe')".
>
> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
> and "xmlpi" still let the invalid byte through.  You can also still inject the
> byte into an attribute value via "xmlelement".  I wonder if it wouldn't make
> more sense to just pass any XML that we generate from scratch through libxml2.
> There are a lot of holes to plug, otherwise.
>

Maybe there are, but I'd want lots of convincing that we should do that 
at this stage. Maybe for 9.2. I think we can plug the holes fairly 
simply for xmlpi and xmlcomment, and catch the attributes by moving this 
check up into map_sql_value_to_xml_value().

This is a significant data integrity bug, much along the same lines as 
the invalidly encoded data holes we plugged a release or two back. I'm 
amazed we haven't hit it till now, but we're sure to see more of it - 
XML use with Postgres is growing substantially, I believe.

cheers

andrew

Re: XML with invalid chars

From

Noah Misch

Date:

28 April 2011, 00:41:42

On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote:
> On 04/27/2011 05:30 PM, Noah Misch wrote:
>> To make things worse, the dump/reload problems seems to depend on your version
>> of libxml2, or something.  With git master, a CentOS 5 system with
>> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
>> 2.6.31.dfsg-2ubuntu rejects it.  Even with a patch like this, systems with a
>> lenient libxml2 will be liable to store XML data that won't restore on a system
>> with a strict libxml2.  Perhaps we should emit a build-time warning if the local
>> libxml2 is lenient?
>
> No, I think we need to be strict ourselves.

Then I suppose we'd also scan for invalid characters in xml_parse()?  Or, at
least, do so when linked to a libxml2 that neglects to do so itself?

>> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
>> and "xmlpi" still let the invalid byte through.  You can also still inject the
>> byte into an attribute value via "xmlelement".  I wonder if it wouldn't make
>> more sense to just pass any XML that we generate from scratch through libxml2.
>> There are a lot of holes to plug, otherwise.
>
> Maybe there are, but I'd want lots of convincing that we should do that  
> at this stage. Maybe for 9.2. I think we can plug the holes fairly  
> simply for xmlpi and xmlcomment, and catch the attributes by moving this  
> check up into map_sql_value_to_xml_value().

I don't have much convincing to offer -- hunting down the holes seem fine, too.

Thanks,
nm

Re: XML with invalid chars

From

Andrew Dunstan

Date:

28 April 2011, 14:51:32


On 04/27/2011 05:30 PM, Noah Misch wrote:
>
>> I'm not sure what to do about the back branches and cases where data is
>> already in databases. This is fairly ugly. Suggestions welcome.
> We could provide a script in (or linked from) the release notes for testing the
> data in all your xml columns.
>

Here's a draft. We'd need to come up with slightly modified versions for 
older versions of Postgres that don't sport array_agg() and unnest()

cheers

andrew
   create function cleanup_xml_table           (schema_name text,table_name text, columns text[])   returns void
languageplpgsql as   $func$
 
   declare        cmd text;        cond text;        sep text := '';        alt text := '';        col text;
forbiddentext := $$[\x1-\x8\xB\xC\xE-\x1F]$$;   begin        cmd := 'update ' || quote_ident(schema_name) || '.' ||
                 quote_ident(table_name) || ' set ';        for col in select unnest(columns)        loop
cmd:= cmd || sep;            cond := cond || alt;            sep := ', ';            alt := ' or ';            cmd :=
cmd|| quote_ident(col) || '=' ||                'regexp_replace(' || quote_ident(col) , || '::text,  ' ||
quote_literal(forbiden) || ', $$$$, $$g$$)::xml';            cond := cond ||  quote_ident(col) || '::text ~ ' ||
        quote_literal(forbidden);        end loop;        cmd := cmd || ' where ' || cond;        execute cmd;
return;  end;
 
   $func$;
   select cleanup_xml_table(table_schema,table_name, cols)       from          (select table_schema::text,
   table_name::text,                  array_agg(column_name::text) as cols           from information_schema.columns
      where data_type = 'xml'                 and is_updatable = 'yes'           group by table_schema, table_name)
xmltabs;

Re: XML with invalid chars

From

Andrew Dunstan

Date:

08 May 2011, 19:25:37


On 04/27/2011 11:41 PM, Noah Misch wrote:
> On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote:
>> On 04/27/2011 05:30 PM, Noah Misch wrote:
>>> To make things worse, the dump/reload problems seems to depend on your version
>>> of libxml2, or something.  With git master, a CentOS 5 system with
>>> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
>>> 2.6.31.dfsg-2ubuntu rejects it.  Even with a patch like this, systems with a
>>> lenient libxml2 will be liable to store XML data that won't restore on a system
>>> with a strict libxml2.  Perhaps we should emit a build-time warning if the local
>>> libxml2 is lenient?
>> No, I think we need to be strict ourselves.
> Then I suppose we'd also scan for invalid characters in xml_parse()?  Or, at
> least, do so when linked to a libxml2 that neglects to do so itself?

Yep.

>>> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
>>> and "xmlpi" still let the invalid byte through.  You can also still inject the
>>> byte into an attribute value via "xmlelement".  I wonder if it wouldn't make
>>> more sense to just pass any XML that we generate from scratch through libxml2.
>>> There are a lot of holes to plug, otherwise.
>> Maybe there are, but I'd want lots of convincing that we should do that
>> at this stage. Maybe for 9.2. I think we can plug the holes fairly
>> simply for xmlpi and xmlcomment, and catch the attributes by moving this
>> check up into map_sql_value_to_xml_value().
> I don't have much convincing to offer -- hunting down the holes seem fine, too.
>
>

I think I've done that. Here's the patch I have now. It looks like we
can catch pretty much everything by putting checks in four places, which
isn't too bad.

Please review and try to break.

cheers

andrew

Attachment

xmlchars2.patch

Re: XML with invalid chars

From

Noah Misch

Date:

10 May 2011, 00:25:59

On Sun, May 08, 2011 at 06:25:27PM -0400, Andrew Dunstan wrote:
> On 04/27/2011 11:41 PM, Noah Misch wrote:
>> On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote:
>>> On 04/27/2011 05:30 PM, Noah Misch wrote:
>>>> To make things worse, the dump/reload problems seems to depend on your version
>>>> of libxml2, or something.  With git master, a CentOS 5 system with
>>>> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
>>>> 2.6.31.dfsg-2ubuntu rejects it.  Even with a patch like this, systems with a
>>>> lenient libxml2 will be liable to store XML data that won't restore on a system
>>>> with a strict libxml2.  Perhaps we should emit a build-time warning if the local
>>>> libxml2 is lenient?
>>> No, I think we need to be strict ourselves.
>> Then I suppose we'd also scan for invalid characters in xml_parse()?  Or, at
>> least, do so when linked to a libxml2 that neglects to do so itself?
>
> Yep.

I see you've gone with doing it unconditionally.  I'd lean toward testing the
library in pg_xml_init and setting a flag indicating whether we need the extra
pass.  However, a later patch can always optimize that.

>>>> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
>>>> and "xmlpi" still let the invalid byte through.  You can also still inject the
>>>> byte into an attribute value via "xmlelement".  I wonder if it wouldn't make
>>>> more sense to just pass any XML that we generate from scratch through libxml2.
>>>> There are a lot of holes to plug, otherwise.
>>> Maybe there are, but I'd want lots of convincing that we should do that
>>> at this stage. Maybe for 9.2. I think we can plug the holes fairly
>>> simply for xmlpi and xmlcomment, and catch the attributes by moving this
>>> check up into map_sql_value_to_xml_value().
>> I don't have much convincing to offer -- hunting down the holes seem fine, too.
>
> I think I've done that. Here's the patch I have now. It looks like we  
> can catch pretty much everything by putting checks in four places, which  
> isn't too bad.
>
> Please review and try to break.

Here are the test cases I tried:

-- caught successfully
SELECT E'\x01'::xml;
SELECT xmlcomment(E'\x01');
SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
SELECT xmlelement(name foo, NULL, E'\x01');
SELECT xmlforest(E'\x01' AS foo);
SELECT xmlpi(name foo, E'\x01');
SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');

-- not caught
SELECT xmlroot('<root/>', version E'\x01');
SELECT xmlcomment(E'\ufffe');

-- not directly related, but also wrongly accepted
SELECT xmlroot('<root/>', version ' ');
SELECT xmlroot('<root/>', version 'foo');

Offhand, I don't find libxml2's handling of XML declarations particularly
consistent.  My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
document) accepts '<?xml version="foo"?>' but rejects '<?xml version=" "?>'.
Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
accepts anything, even control characters.  The XML 1.0 standard is stricter:
the version must match ^1\.[0-9]+$.  We might want to tighten this at the same
time.

> diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
> index ee82d46..12cfd56 100644
> --- a/src/backend/utils/adt/xml.c
> +++ b/src/backend/utils/adt/xml.c
> @@ -142,6 +142,20 @@ static void SPI_sql_row_to_xmlelement(int rownum, StringInfo result,
>  #define NAMESPACE_XSI "http://www.w3.org/2001/XMLSchema-instance"
>  #define NAMESPACE_SQLXML "http://standards.iso.org/iso/9075/2003/sqlxml"
>  
> +/* forbidden C0 control chars */
> +#define FORBIDDEN_C0  \
> +    "\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C\x0E\x0F\x10\x11" \
> +    "\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F"
> +
> +static inline void
> +check_forbidden_c0(char * str)
> +{
> +    if (strpbrk(str,FORBIDDEN_C0) != NULL)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> +                 errmsg("character out of range"),
> +                 errdetail("XML does not support control characters.")));

This would be an errhint, I think.  However, the message seems to emphasize
the wrong thing.  XML 1.0 defines a lexical production called Char that
includes various Unicode character ranges.  Control characters as we know them
happen to not fall in any of those ranges.  The characters aren't unsupported
in the sense of being missing features; the language simply forbids them.

libxml2's error message for this case is "PCDATA invalid Char value 1"
(assuming \x01).  Mentioning PCDATA seems redundant, since no other context
offers greater freedom.  How about:

ereport(ERROR,    (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),     errmsg("invalid XML 1.0 Char \\U%08x",
char_val)));

nm

Re: XML with invalid chars

From

Andrew Dunstan

Date:

11 May 2011, 19:17:33


On 05/09/2011 11:25 PM, Noah Misch wrote:
>
> I see you've gone with doing it unconditionally.  I'd lean toward testing the
> library in pg_xml_init and setting a flag indicating whether we need the extra
> pass.  However, a later patch can always optimize that.
>


I wasn't terribly keen on the idea, but we can look at it again later.

>>
>> Please review and try to break.
> Here are the test cases I tried:
>
> -- caught successfully
> SELECT E'\x01'::xml;
> SELECT xmlcomment(E'\x01');
> SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
> SELECT xmlelement(name foo, NULL, E'\x01');
> SELECT xmlforest(E'\x01' AS foo);
> SELECT xmlpi(name foo, E'\x01');
> SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');
>
> -- not caught
> SELECT xmlroot('<root/>', version E'\x01');


That's an easy fix.


> SELECT xmlcomment(E'\ufffe');


That's a bit harder. Do we want to extend these checks to cover 
surrogates and end of plane characters, which are the remaining 
forbidden chars? It certainly seems likely to be a somewhat slower test 
since I think we'd need to process the input strings a Unicode char at a 
time, but we do that in other places and it seems acceptable. What do 
people think?


> -- not directly related, but also wrongly accepted
> SELECT xmlroot('<root/>', version ' ');
> SELECT xmlroot('<root/>', version 'foo');
>
> Offhand, I don't find libxml2's handling of XML declarations particularly
> consistent.  My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
> document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'.
> Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
> accepts anything, even control characters.  The XML 1.0 standard is stricter:
> the version must match ^1\.[0-9]+$.  We might want to tighten this at the same
> time.


We can add some stuff to check the version strings. Doesn't seem 
terribly difficult.

> libxml2's error message for this case is "PCDATA invalid Char value 1"
> (assuming \x01).  Mentioning PCDATA seems redundant, since no other context
> offers greater freedom.  How about:
>
> ereport(ERROR,
>         (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>          errmsg("invalid XML 1.0 Char \\U%08x", char_val)));
>
>

That would also mean processing the string a unicode char at a time. So 
maybe that's what we need to do.

Thanks for the input.

cheers

andrew

Re: XML with invalid chars

From

Noah Misch

Date:

11 May 2011, 20:00:39

On Wed, May 11, 2011 at 06:17:07PM -0400, Andrew Dunstan wrote:
> On 05/09/2011 11:25 PM, Noah Misch wrote:
>> SELECT xmlcomment(E'\ufffe');
>
> That's a bit harder. Do we want to extend these checks to cover  
> surrogates and end of plane characters, which are the remaining  
> forbidden chars? It certainly seems likely to be a somewhat slower test  
> since I think we'd need to process the input strings a Unicode char at a  
> time, but we do that in other places and it seems acceptable. What do  
> people think?

My thinking was that we should only make this flag day for xml-type users if
we're going to fix it all the way.

Re: XML with invalid chars

From

Andrew Dunstan

Date:

11 May 2011, 20:39:30


On 05/11/2011 07:00 PM, Noah Misch wrote:
> On Wed, May 11, 2011 at 06:17:07PM -0400, Andrew Dunstan wrote:
>> On 05/09/2011 11:25 PM, Noah Misch wrote:
>>> SELECT xmlcomment(E'\ufffe');
>> That's a bit harder. Do we want to extend these checks to cover
>> surrogates and end of plane characters, which are the remaining
>> forbidden chars? It certainly seems likely to be a somewhat slower test
>> since I think we'd need to process the input strings a Unicode char at a
>> time, but we do that in other places and it seems acceptable. What do
>> people think?
> My thinking was that we should only make this flag day for xml-type users if
> we're going to fix it all the way.

Fair enough.

cheers

andrew