Re: XML with invalid chars - Mailing list pgsql-hackers

From Noah Misch
Subject Re: XML with invalid chars
Date
Msg-id 20110510032546.GB5617@tornado.gateway.2wire.net
Whole thread Raw
In response to Re: XML with invalid chars  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: XML with invalid chars  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
On Sun, May 08, 2011 at 06:25:27PM -0400, Andrew Dunstan wrote:
> On 04/27/2011 11:41 PM, Noah Misch wrote:
>> On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote:
>>> On 04/27/2011 05:30 PM, Noah Misch wrote:
>>>> To make things worse, the dump/reload problems seems to depend on your version
>>>> of libxml2, or something.  With git master, a CentOS 5 system with
>>>> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
>>>> 2.6.31.dfsg-2ubuntu rejects it.  Even with a patch like this, systems with a
>>>> lenient libxml2 will be liable to store XML data that won't restore on a system
>>>> with a strict libxml2.  Perhaps we should emit a build-time warning if the local
>>>> libxml2 is lenient?
>>> No, I think we need to be strict ourselves.
>> Then I suppose we'd also scan for invalid characters in xml_parse()?  Or, at
>> least, do so when linked to a libxml2 that neglects to do so itself?
>
> Yep.

I see you've gone with doing it unconditionally.  I'd lean toward testing the
library in pg_xml_init and setting a flag indicating whether we need the extra
pass.  However, a later patch can always optimize that.

>>>> Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
>>>> and "xmlpi" still let the invalid byte through.  You can also still inject the
>>>> byte into an attribute value via "xmlelement".  I wonder if it wouldn't make
>>>> more sense to just pass any XML that we generate from scratch through libxml2.
>>>> There are a lot of holes to plug, otherwise.
>>> Maybe there are, but I'd want lots of convincing that we should do that
>>> at this stage. Maybe for 9.2. I think we can plug the holes fairly
>>> simply for xmlpi and xmlcomment, and catch the attributes by moving this
>>> check up into map_sql_value_to_xml_value().
>> I don't have much convincing to offer -- hunting down the holes seem fine, too.
>
> I think I've done that. Here's the patch I have now. It looks like we  
> can catch pretty much everything by putting checks in four places, which  
> isn't too bad.
>
> Please review and try to break.

Here are the test cases I tried:

-- caught successfully
SELECT E'\x01'::xml;
SELECT xmlcomment(E'\x01');
SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
SELECT xmlelement(name foo, NULL, E'\x01');
SELECT xmlforest(E'\x01' AS foo);
SELECT xmlpi(name foo, E'\x01');
SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');

-- not caught
SELECT xmlroot('<root/>', version E'\x01');
SELECT xmlcomment(E'\ufffe');

-- not directly related, but also wrongly accepted
SELECT xmlroot('<root/>', version ' ');
SELECT xmlroot('<root/>', version 'foo');

Offhand, I don't find libxml2's handling of XML declarations particularly
consistent.  My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
document) accepts '<?xml version="foo"?>' but rejects '<?xml version=" "?>'.
Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
accepts anything, even control characters.  The XML 1.0 standard is stricter:
the version must match ^1\.[0-9]+$.  We might want to tighten this at the same
time.

> diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
> index ee82d46..12cfd56 100644
> --- a/src/backend/utils/adt/xml.c
> +++ b/src/backend/utils/adt/xml.c
> @@ -142,6 +142,20 @@ static void SPI_sql_row_to_xmlelement(int rownum, StringInfo result,
>  #define NAMESPACE_XSI "http://www.w3.org/2001/XMLSchema-instance"
>  #define NAMESPACE_SQLXML "http://standards.iso.org/iso/9075/2003/sqlxml"
>  
> +/* forbidden C0 control chars */
> +#define FORBIDDEN_C0  \
> +    "\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C\x0E\x0F\x10\x11" \
> +    "\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F"
> +
> +static inline void
> +check_forbidden_c0(char * str)
> +{
> +    if (strpbrk(str,FORBIDDEN_C0) != NULL)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> +                 errmsg("character out of range"),
> +                 errdetail("XML does not support control characters.")));

This would be an errhint, I think.  However, the message seems to emphasize
the wrong thing.  XML 1.0 defines a lexical production called Char that
includes various Unicode character ranges.  Control characters as we know them
happen to not fall in any of those ranges.  The characters aren't unsupported
in the sense of being missing features; the language simply forbids them.

libxml2's error message for this case is "PCDATA invalid Char value 1"
(assuming \x01).  Mentioning PCDATA seems redundant, since no other context
offers greater freedom.  How about:

ereport(ERROR,    (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),     errmsg("invalid XML 1.0 Char \\U%08x",
char_val)));

nm


pgsql-hackers by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: the big picture for index-only scans
Next
From: Bruce Momjian
Date:
Subject: Re: "stored procedures" - use cases?