Thread: XML Issue with DTDs

XML Issue with DTDs

From
Florian Pflug
Date:
Hi,

While looking into ways to implement an XMLSTRIP function which extracts the textual contents of an XML value and
de-escapes them (i.e. replaces entity references by their text equivalent), I've run into another issue with the XML
type.

XML values can either contain a DOCUMENT or CONTENT. In the first case, the value is well-formed XML according to the
XML specification. In the latter case, the value is a collection of nodes, each of which may contain children. Without
DTDs in the mix, CONTENT is thus a generalization of DOCUMENT, i.e. a DOCUMENT may contain only a single root node while
a CONTENT may contain multiple. That guarantees that a concatenation of two XML values is always at least valid CONTENT.
That, however, is no longer true once DTDs enter the picture. A DOCUMENT may contain a DTD as long as it precedes the
root node (processing instructions and comments may precede the DTD, though). Yet CONTENT may not include a DTD at all.
A concatenation of a DOCUMENT with a DTD and CONTENT thus yields something that is neither a DOCUMENT nor a CONTENT, yet
XMLCONCAT fails to complain. The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to
CONTENT.

  select xmlconcat(
    xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
    xmlparse(content '<test/>')
  )::text::xml;
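
To illustrate the same point outside PostgreSQL, here is a minimal Python sketch using the standard library's expat
bindings. The helper names is_document and is_content are hypothetical, and the CONTENT check is a simplification
(wrapping the nodes in a dummy root element, and assuming the value carries no XML declaration), not how the server
implements XMLOPTION.

```python
import xml.parsers.expat


def is_document(text: str) -> bool:
    """True if text is a single well-formed XML document (a DTD is allowed)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


def is_content(text: str) -> bool:
    """Simplified CONTENT check: wrap the nodes in a dummy root element.

    A DOCTYPE inside an element is a syntax error, so this rejects DTDs.
    Assumption: the value carries no XML declaration.
    """
    return is_document("<wrapper>" + text + "</wrapper>")


doc = "<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>"
content = "<test/>"
combined = doc + content  # what a naive concatenation produces

for label, value in [("doc", doc), ("content", content), ("combined", combined)]:
    print(label, is_document(value), is_content(value))
```

Under these assumptions the DTD-bearing value passes only the document check, the plain element passes both, and the
concatenation passes neither.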

Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset
of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is
CONTENT, or inventing a third XMLOPTION, say ANY.

We then need to ensure that combining XML values yields something that is valid according to the most general XMLOPTION
setting. That means either

(1) Removing the DTD from all but the first argument to XMLCONCAT, and similarly all but the first value passed to
XMLAGG

or

(2) Complaining if these values contain a DTD.

or

(3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

I'm not in favour of (3), since clients are unlikely to be able to process such a value. (1) matches how we currently
handle XML declarations (<?xml …?>), so I'm slightly in favour of that.
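
As a rough illustration of option (2), the following Python sketch rejects a DTD in any argument after the first. It
uses the standard library's expat bindings; has_dtd and concat_strict are hypothetical names, and the check only looks
for a DOCTYPE in the prolog of otherwise well-formed values. It says nothing about how XMLCONCAT or XMLAGG would do
this with libxml.

```python
import xml.parsers.expat


def has_dtd(text: str) -> bool:
    """Report whether text starts with a prolog that contains a DOCTYPE."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat calls this handler when it reaches a DOCTYPE declaration.
    parser.StartDoctypeDeclHandler = lambda *args: found.append(True)
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        # CONTENT with several root nodes is not a document; a DOCTYPE,
        # if any, was already seen in the prolog before the error.
        pass
    return bool(found)


def concat_strict(*values: str) -> str:
    """Hypothetical helper for option (2): reject a DTD after the first value."""
    for position, value in enumerate(values[1:], start=2):
        if has_dtd(value):
            raise ValueError(f"argument {position} contains a DTD")
    return "".join(values)
```

A DTD in the first argument is still passed through, so the result is only acceptable if the most general XMLOPTION
setting allows a leading DTD.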

Thoughts?

best regards,
Florian Pflug




Re: XML Issue with DTDs

From
Robert Haas
Date:
On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
> While looking into ways to implement a XMLSTRIP function which extracts the textual contents of an XML value and
> de-escapes them (i.e. [...]
>
> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION
> value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing
> DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.

Or we can just decide that it was a bug that this was ever allowed,
and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
This is roughly what we did with encoding checks.

> We then need to ensure that combining XML values yields something that is valid according to the most general
> XMLOPTION setting. That means either
>
> (1) Removing the DTD from all but the first argument to XMLCONCAT, and similarly all but the first value passed to
> XMLAGG
>
> or
>
> (2) Complaining if these values contain a DTD.
>
> or
>
> (3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.
>
> I'm not in favour of (3), since clients are unlikely to be able to process such a value. (1) matches how we currently
> handle XML declarations (<?xml …?>), so I'm slightly in favour of that.

I don't like #3, mostly because I don't like XMLOPTION ANY in the
first place.  Either #1 or #2 sounds OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: XML Issue with DTDs

From
Florian Pflug
Date:
On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a
>> superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION
>> is CONTENT, or inventing a third XMLOPTION, say ANY.
>
> Or we can just decide that it was a bug that this was ever allowed,
> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
> This is roughly what we did with encoding checks.

What exactly do you suggest we outlaw? If there are XML values which
are CONTENT but not a DOCUMENT, and other values which are a DOCUMENT
but not CONTENT, then what is pg_restore supposed to set XMLOPTION
to?

best regards,
Florian Pflug




Re: XML Issue with DTDs

From
Robert Haas
Date:
On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
>>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a
>>> superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION
>>> is CONTENT, or inventing a third XMLOPTION, say ANY.
>>
>> Or we can just decide that it was a bug that this was ever allowed,
>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
>> This is roughly what we did with encoding checks.
>
> What exactly do you suggest we outlaw?

<!DOCTYPE> anywhere but at the beginning.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: XML Issue with DTDs

From
Peter Eisentraut
Date:
On 12/19/13, 6:40 PM, Florian Pflug wrote:
> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
> 
>   select xmlconcat(
>     xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>     xmlparse(content '<test/>')
>   )::text::xml;

The SQL standard specifies that DTDs are dropped by xmlconcat.  It's
just not implemented.



Re: XML Issue with DTDs

From
Florian Pflug
Date:
On Dec23, 2013, at 03:45 , Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <fgp@phlo.org> wrote:
>> On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
>>>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a
>>>> superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION
>>>> is CONTENT, or inventing a third XMLOPTION, say ANY.
>>>
>>> Or we can just decide that it was a bug that this was ever allowed,
>>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
>>> This is roughly what we did with encoding checks.
>>
>> What exactly do you suggest we outlaw?
>
> <!DOCTYPE> anywhere but at the beginning.

I think we're talking past one another here. Fixing XMLCONCAT/XMLAGG
to not produce XML values which are neither valid DOCUMENTS nor valid
CONTENT fixes *one* part of the problem.

The other part of the problem is that since not every DOCUMENT
is valid CONTENT (because CONTENT forbids DTDs) and not every CONTENT
is a valid DOCUMENT (because DOCUMENT forbids multiple root nodes), it's
impossible to set XMLOPTION to a value which accepts *all* valid XML
values. That breaks pg_dump/pg_restore. To fix this, we must provide
a way to insert XML data which accepts both DOCUMENTS and CONTENT, and
not only one or the other. Due to the way COPY works, we cannot call
a special conversion function, so we must modify the input functions.

My initial thought was to simply allow XML values which are CONTENT,
not DOCUMENTS, to contain a DTD (at the beginning), thus making CONTENT
a superset of DOCUMENT. But I've since then realized that the 2003
standard explicitly constrains CONTENT to *not* contain a DTD. The
only other option that I can see is to invent a third, non-standard
XMLOPTION value, ANY. ANY would accept anything accepted by either
DOCUMENT or CONTENT, but no more than that.

best regards,
Florian Pflug







Re: XML Issue with DTDs

From
Florian Pflug
Date:
On Dec23, 2013, at 18:39 , Peter Eisentraut <peter_e@gmx.net> wrote:
> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>>
>>  select xmlconcat(
>>    xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>>    xmlparse(content '<test/>')
>>  )::text::xml;
>
> The SQL standard specifies that DTDs are dropped by xmlconcat.  It's
> just not implemented.

OK, cool, I'll try to figure out how to do that with libxml

best regards,
Florian Pflug




Re: XML Issue with DTDs

From
Florian Pflug
Date:
On Dec26, 2013, at 21:30 , Florian Pflug <fgp@phlo.org> wrote:
> On Dec23, 2013, at 18:39 , Peter Eisentraut <peter_e@gmx.net> wrote:
>> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>>> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>>>
>>> select xmlconcat(
>>>   xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>>>   xmlparse(content '<test/>')
>>> )::text::xml;
>>
>> The SQL standard specifies that DTDs are dropped by xmlconcat.  It's
>> just not implemented.
>
> OK, cool, I'll try to figure out how to do that with libxml

Hm, I've read through the (draft) SQL/XML 2003 standard, and it seems that
it mandates more processing of DTDs than we currently do. In particular, it
says that attribute default values and custom entities are to be expanded
by xmlparse(). Without doing that, stripping the DTD can change the meaning
of an XML document, or make it not well-formed (in the case of custom
entity definitions). So I think that unless we implement that, we have
to raise an error, not silently strip the DTD.
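
The custom-entity case can be reproduced with any non-validating parser. The Python sketch below uses the standard
library's expat bindings rather than libxml, and the element and entity names are made up for the illustration; it
only shows that removing the DTD leaves a dangling entity reference.

```python
import xml.parsers.expat


def is_well_formed(text: str) -> bool:
    """True if expat can parse text as a complete XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# The internal DTD subset declares the entity that the element body uses.
with_dtd = '<!DOCTYPE note [<!ENTITY who "world">]><note>hello &who;</note>'
# Same element with the DTD removed by hand; the entity is now undefined.
without_dtd = "<note>hello &who;</note>"

print(is_well_formed(with_dtd))
print(is_well_formed(without_dtd))
```

The first value parses, while the second is rejected because &who; no longer has a declaration.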

best regards,
Florian Pflug