Thread: PostgreSQL 8.3 XML parser seems not to recognize the DOCTYPE element in XML files

From: "Lawrence Oluyede"

As specified in the W3C Recommendation for XML, a DOCTYPE declaration is
perfectly valid in a document.
I have a bunch of XML files generated by the Boost library that
contain a doctype like this:

<!DOCTYPE boost_serialization>

which lies within the bounds of the recommendation
(http://www.w3.org/TR/xml/#sec-prolog-dtd):

"Note that it is possible to construct a well-formed document
containing a doctypedecl that neither points to an external subset nor
contains an internal subset."

PostgreSQL 8.3, however, does not allow XML with a doctype to be inserted
into its new native xml data type, and returns this error message:

"""
ERROR:  invalid XML content
DETAIL:  Entity: line 2: parser error : StartTag: invalid element name
<!DOCTYPE foo>
 ^

********** Error **********

ERROR: invalid XML content
SQL state: 2200N
Detail: Entity: line 2: parser error : StartTag: invalid element name
<!DOCTYPE foo>
"""

This kind of behavior surprises me because pgsql has been compiled
with the following flags on the development machine:
 ./configure --with-python --with-openssl --with-pam --with-libxml
--with-libxslt --enable-thread-safety --enable-debug

During the configuration stage it creates a Makefile that binds to the
system version of the libxml2 library, 2.6.30, the same library I use
through Python (which correctly parses the XML file with the doctype).
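
For reference, the well-formedness claim can be checked outside the
database. This is a minimal sketch, assuming Python 3 and its
standard-library xml.etree.ElementTree parser (expat-based and
non-validating), not the libxml2 bindings mentioned above. The SAMPLE
string and the is_well_formed helper are hypothetical stand-ins for the
real files:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the boost-generated files described above:
# a DOCTYPE with neither an external nor an internal subset.
SAMPLE = "<!DOCTYPE boost_serialization>\n<boost_serialization/>"


def is_well_formed(text):
    """Return True if a non-validating parser accepts text as a whole document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(SAMPLE))
```

A parser that treats the input as a complete document should accept the
bare DOCTYPE, which matches what the Recommendation says.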

Any hints?

Added to TODO:

* Allow XML to accept more liberal DOCTYPE specifications

  http://archives.postgresql.org/pgsql-general/2008-02/msg00347.php


--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Bruce Momjian wrote:
> Added to TODO:
>
> * Allow XML to accept more liberal DOCTYPE specifications

Is any form of DOCTYPE accepted?

We're getting errors on the second line like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DOT_OFFICER_CITATION SYSTEM "http://host.domain/dtd/dotdisposition0_02.dtd">

The actual host.domain value is resolved by DNS,
and wget of the url works on the machine.
Attempts to cast the document to type xml give:

ERROR:  invalid XML content
DETAIL:  Entity: line 2: parser error : StartTag: invalid element name
<!DOCTYPE DOT_OFFICER_CITATION SYSTEM "http://host.domain/dtd/dot
^

It would be nice to use the xml type, but we always have DOCTYPE....

-Kevin

Bruce Momjian wrote:
> Added to TODO:
>
> * Allow XML to accept more liberal DOCTYPE specifications

Is any form of DOCTYPE accepted?

We're getting errors on the second line of an XML document that
starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DOT_OFFICER_CITATION SYSTEM "http://host.domain/dtd/dotdisposition0_02.dtd">

The actual host.domain value is resolved by DNS,
and wget of the url works on the server running PostgreSQL.
Attempts to cast the document to type xml give:

ERROR:  invalid XML content
DETAIL:  Entity: line 2: parser error : StartTag: invalid element name
<!DOCTYPE DOT_OFFICER_CITATION SYSTEM "http://host.domain/dtd/dot
^

It would be nice to use the xml type, but we always have DOCTYPE.
I understand that PostgreSQL won't validate against the specified
DOCTYPE, but it shouldn't error out on it, either.

-Kevin

On Thursday, 7 February 2008, Lawrence Oluyede wrote:
> PostgreSQL 8.3 instead doesn't allow the insertion of XML with doctype
> in its new native data type returning this error message:
>
> """
> ERROR:  invalid XML content
> DETAIL:  Entity: line 2: parser error : StartTag: invalid element name
> <!DOCTYPE foo>
>  ^

It turns out that this behavior is entirely correct.  It depends on the XML
option.  If you set the XML option to DOCUMENT, you can parse documents
including DOCTYPE declarations.  If you set the XML option to CONTENT, then
what you can parse is defined by the production

XMLDecl? content

which does not allow for a DOCTYPE.

The default XML option is CONTENT, which explains the behavior.

Now, the supercorrect way to parse XML values would be using the XMLPARSE()
function, which requires you to specify the XML option inline.  That way,
everything works.
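
The difference between the two options can be illustrated outside
PostgreSQL. The following is only a rough sketch, assuming Python 3 and
its standard-library parser; it is not how PostgreSQL implements either
option. DOCUMENT parsing is modelled as parsing a complete document, and
CONTENT parsing is approximated by wrapping the text in a dummy root
element, which ignores the optional XMLDecl. The helper names and the
wrapper element are hypothetical:

```python
import xml.etree.ElementTree as ET


def parses_as_document(text):
    """DOCUMENT-style parse: text must be one complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parses_as_content(text):
    """CONTENT-style parse, approximated with a dummy root element.

    Inside an element only content (elements, text, comments, ...) is
    legal, so a DOCTYPE declaration is rejected there.
    """
    try:
        ET.fromstring("<wrapper>" + text + "</wrapper>")
        return True
    except ET.ParseError:
        return False


with_doctype = "<!DOCTYPE foo><foo/>"   # a document with a prolog
fragment = "<a/>text<b/>"               # content with no single root element

print(parses_as_document(with_doctype), parses_as_content(with_doctype))
print(parses_as_document(fragment), parses_as_content(fragment))
```

With these inputs the DOCTYPE text parses only as a document, while the
two-element fragment parses only as content, which is the same split the
XML option selects between.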