Thread: XML Issue with DTDs
Hi,

While looking into ways to implement an XMLSTRIP function which extracts the textual contents of an XML value and de-escapes them (i.e. replaces entity references by their text equivalent), I've run into another issue with the XML type.

XML values can contain either a DOCUMENT or CONTENT. In the first case, the value is well-formed XML according to the XML specification. In the latter case, the value is a collection of nodes, each of which may contain children. Without DTDs in the mix, CONTENT is thus a generalization of DOCUMENT, i.e. a DOCUMENT may contain only a single root node while a CONTENT may contain multiple. That guarantees that a concatenation of two XML values is always at least valid CONTENT. That, however, is no longer true once DTDs enter the picture.

A DOCUMENT may contain a DTD as long as it precedes the root node (processing instructions and comments may precede the DTD, though). Yet CONTENT may not include a DTD at all. A concatenation of a DOCUMENT with a DTD and CONTENT thus yields something that is neither a DOCUMENT nor a CONTENT, yet XMLCONCAT fails to complain. The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.

    select xmlconcat(
      xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
      xmlparse(content '<test/>')
    )::text::xml;

Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.

We then need to ensure that combining XML values yields something that is valid according to the most general XMLOPTION setting. That means either

(1) Removing the DTD from all but the first argument to XMLCONCAT, and similarly all but the first value passed to XMLAGG

or

(2) Complaining if these values contain a DTD.

or

(3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

I'm not in favour of (3), since clients are unlikely to be able to process such a value. (1) matches how we currently handle XML declarations (<?xml …?>), so I'm slightly in favour of that.

Thoughts?

best regards,
Florian Pflug

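The problem described above can be reproduced outside the database. The following is only a rough sketch, assuming Python's bundled expat parser rather than the libxml2 parser PostgreSQL uses; `is_document` and `is_content` are hypothetical helper names, and wrapping the value in a dummy root element is merely an approximation of CONTENT as the standard defines it (no DTD allowed, multiple top-level nodes allowed).

```python
import xml.parsers.expat


def is_document(text: str) -> bool:
    """True if text parses as one well-formed XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


def is_content(text: str) -> bool:
    """Approximate CONTENT: the value must be legal inside a dummy root element.

    Assumption: this stand-in rejects a DTD (a DOCTYPE may not appear inside
    an element) and accepts several sibling nodes. It also rejects an XML
    declaration, which real CONTENT handling may treat differently.
    """
    return is_document("<wrapper>" + text + "</wrapper>")


doc_with_dtd = "<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>"
content = "<test/>"
combined = doc_with_dtd + content  # textual stand-in for what xmlconcat yields

for label, value in [("document", doc_with_dtd),
                     ("content", content),
                     ("combined", combined)]:
    print(label, "DOCUMENT:", is_document(value), "CONTENT:", is_content(value))
```

Under these assumptions the combined value should be rejected by both checks, which is the situation the report describes: the DTD rules it out as CONTENT and the second root node rules it out as a DOCUMENT.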
On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
> While looking into ways to implement an XMLSTRIP function which extracts the textual contents of an XML value and de-escapes them (i.e.
> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.

Or we can just decide that it was a bug that this was ever allowed, and if you upgrade to $FIXEDVERSION you'll need to sanitize your data. This is roughly what we did with encoding checks.

> We then need to ensure that combining XML values yields something that is valid according to the most general XMLOPTION setting. That means either
>
> (1) Removing the DTD from all but the first argument to XMLCONCAT, and similarly all but the first value passed to XMLAGG
>
> or
>
> (2) Complaining if these values contain a DTD.
>
> or
>
> (3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.
>
> I'm not in favour of (3), since clients are unlikely to be able to process such a value. (1) matches how we currently handle XML declarations (<?xml …?>), so I'm slightly in favour of that.

I don't like #3, mostly because I don't like XMLOPTION ANY in the first place. Either #1 or #2 sounds OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
>
> Or we can just decide that it was a bug that this was ever allowed,
> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
> This is roughly what we did with encoding checks.

What exactly do you suggest we outlaw? If there are XML values which are CONTENT but not a DOCUMENT, and other values which are a DOCUMENT but not CONTENT, then what is pg_restore supposed to set XMLOPTION to?

best regards,
Florian Pflug

On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
>>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
>>
>> Or we can just decide that it was a bug that this was ever allowed,
>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
>> This is roughly what we did with encoding checks.
>
> What exactly do you suggest we outlaw?

<!DOCTYPE> anywhere but at the beginning.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On 12/19/13, 6:40 PM, Florian Pflug wrote:
> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>
> select xmlconcat(
>   xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>   xmlparse(content '<test/>')
> )::text::xml;

The SQL standard specifies that DTDs are dropped by xmlconcat. It's just not implemented.

On Dec23, 2013, at 03:45 , Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <fgp@phlo.org> wrote:
>> On Dec20, 2013, at 18:52 , Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <fgp@phlo.org> wrote:
>>>> Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump & restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
>>>
>>> Or we can just decide that it was a bug that this was ever allowed,
>>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
>>> This is roughly what we did with encoding checks.
>>
>> What exactly do you suggest we outlaw?
>
> <!DOCTYPE> anywhere but at the beginning.

I think we're talking past one another here. Fixing XMLCONCAT/XMLAGG to not produce XML values which are neither valid DOCUMENTS nor valid CONTENT fixes *one* part of the problem.

The other part of the problem is that since not every DOCUMENT is valid CONTENT (because CONTENT forbids DTDs) and not every CONTENT is a valid DOCUMENT (because DOCUMENT forbids multiple root nodes), it's impossible to set XMLOPTION to a value which accepts *all* valid XML values. That breaks pg_dump/pg_restore.

To fix this, we must provide a way to insert XML data which accepts both DOCUMENTS and CONTENT, and not only one or the other. Due to the way COPY works, we cannot call a special conversion function, so we must modify the input functions.

My initial thought was to simply allow XML values which are CONTENT, not DOCUMENTS, to contain a DTD (at the beginning), thus making CONTENT a superset of DOCUMENT. But I've since then realized that the 2003 standard explicitly constrains CONTENT to *not* contain a DTD.

The only other option that I can see is to invent a third, non-standard XMLOPTION value, ANY. ANY would accept anything accepted by either DOCUMENT or CONTENT, but no more than that.

best regards,
Florian Pflug

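The proposed ANY setting is simply the union of the two existing ones. As a rough illustration only, not a sketch of how the input functions would be changed, the acceptance rule could be modelled like this. It assumes Python's bundled expat parser in place of libxml2; `parses` and `accepted_by` are hypothetical names, and CONTENT is approximated by requiring the value to be legal inside a dummy root element.

```python
import xml.parsers.expat


def parses(text: str) -> bool:
    """True if text is one well-formed XML document according to expat."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


def accepted_by(option: str, text: str) -> bool:
    """Model of which values each (real or proposed) XMLOPTION would accept."""
    if option == "DOCUMENT":
        return parses(text)
    if option == "CONTENT":
        # Approximation: no DTD allowed, several top-level nodes allowed.
        return parses("<wrapper>" + text + "</wrapper>")
    if option == "ANY":
        # Hypothetical third setting: the union of the other two, no more.
        return accepted_by("DOCUMENT", text) or accepted_by("CONTENT", text)
    raise ValueError("unknown option: " + option)


doc_with_dtd = "<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>"
two_roots = "<a/><b/>"
broken = doc_with_dtd + "<test/>"  # neither DOCUMENT nor CONTENT

for value in (doc_with_dtd, two_roots, broken):
    print({opt: accepted_by(opt, value) for opt in ("DOCUMENT", "CONTENT", "ANY")})
```

In this model neither DOCUMENT nor CONTENT accepts both of the first two values, which is the dump & restore problem, while ANY accepts both and still rejects the concatenated value.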
On Dec23, 2013, at 18:39 , Peter Eisentraut <peter_e@gmx.net> wrote:
> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>>
>> select xmlconcat(
>>   xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>>   xmlparse(content '<test/>')
>> )::text::xml;
>
> The SQL standard specifies that DTDs are dropped by xmlconcat. It's
> just not implemented.

OK, cool, I'll try to figure out how to do that with libxml.

best regards,
Florian Pflug

On Dec26, 2013, at 21:30 , Florian Pflug <fgp@phlo.org> wrote:
> On Dec23, 2013, at 18:39 , Peter Eisentraut <peter_e@gmx.net> wrote:
>> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>>> The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.
>>>
>>> select xmlconcat(
>>>   xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>>>   xmlparse(content '<test/>')
>>> )::text::xml;
>>
>> The SQL standard specifies that DTDs are dropped by xmlconcat. It's
>> just not implemented.
>
> OK, cool, I'll try to figure out how to do that with libxml

Hm, I've read through the (draft) SQL/XML 2003 standard, and it seems that it mandates more processing of DTDs than we currently do. In particular, it says that attribute default values and custom entities are to be expanded by xmlparse(). Without doing that, stripping the DTD can change the meaning of an XML document, or make it not well-formed (in the case of custom entity definitions).

So I think that unless we implement that, we have to raise an error, not silently strip the DTD.

best regards,
Florian Pflug