Re: XML with invalid chars - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: XML with invalid chars
Date
Msg-id 4DCB0AE3.5020503@dunslane.net
Whole thread Raw
In response to Re: XML with invalid chars  (Noah Misch <noah@leadboat.com>)
Responses Re: XML with invalid chars  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers

On 05/09/2011 11:25 PM, Noah Misch wrote:
>
> I see you've gone with doing it unconditionally.  I'd lean toward testing the
> library in pg_xml_init and setting a flag indicating whether we need the extra
> pass.  However, a later patch can always optimize that.
>


I wasn't terribly keen on the idea, but we can look at it again later.

>>
>> Please review and try to break.
> Here are the test cases I tried:
>
> -- caught successfully
> SELECT E'\x01'::xml;
> SELECT xmlcomment(E'\x01');
> SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
> SELECT xmlelement(name foo, NULL, E'\x01');
> SELECT xmlforest(E'\x01' AS foo);
> SELECT xmlpi(name foo, E'\x01');
> SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');
>
> -- not caught
> SELECT xmlroot('<root/>', version E'\x01');


That's an easy fix.


> SELECT xmlcomment(E'\ufffe');


That's a bit harder. Do we want to extend these checks to cover 
surrogates and end of plane characters, which are the remaining 
forbidden chars? It certainly seems likely to be a somewhat slower test 
since I think we'd need to process the input strings a Unicode char at a 
time, but we do that in other places and it seems acceptable. What do 
people think?


> -- not directly related, but also wrongly accepted
> SELECT xmlroot('<root/>', version ' ');
> SELECT xmlroot('<root/>', version 'foo');
>
> Offhand, I don't find libxml2's handling of XML declarations particularly
> consistent.  My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
> document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'.
> Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
> accepts anything, even control characters.  The XML 1.0 standard is stricter:
> the version must match ^1\.[0-9]+$.  We might want to tighten this at the same
> time.


We can add some stuff to check the version strings. Doesn't seem 
terribly difficult.

> libxml2's error message for this case is "PCDATA invalid Char value 1"
> (assuming \x01).  Mentioning PCDATA seems redundant, since no other context
> offers greater freedom.  How about:
>
> ereport(ERROR,
>         (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>          errmsg("invalid XML 1.0 Char \\U%08x", char_val)));
>
>

That would also mean processing the string a unicode char at a time. So 
maybe that's what we need to do.

Thanks for the input.

cheers

andrew


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Prefered Types
Next
From: "Kevin Grittner"
Date:
Subject: Re: performance-test farm