Thread: XML validation of whitespace values

XML validation of whitespace values

From
Tim Kane
Date:
HI all,

I’ve had an interesting problem trying to perform an UPDATE based on the results of processing an xpath against a field of type xml.
Specifically, my query would fail with:

ERROR:  could not parse XML document
DETAIL:  line 1: Start tag expected, '<' not found


I thought this strange, since the xml type is (I believe) supposed to guarantee that the value is well formed (wether it’s a document or content).

After some amount of digging to locate the rogue entry, it turns out the field value is actually a single whitespace character – which apparently is well formed..

Thus I’m left with the following situation…



clone=# select xml_is_well_formed(' ');
 xml_is_well_formed
--------------------
 t
(1 row)


clone=# select xpath_exists (‘//test', ' ');
ERROR:  could not parse XML document
DETAIL:  line 1: Start tag expected, '<' not found



clone=# create temp table xml_test (document xml);
CREATE TABLE
clone=# insert into xml_test values (' ');
INSERT 0 1
clone=# select xml_is_well_formed(document::text) from xml_test;
 xml_is_well_formed
--------------------
 t
(1 row)

clone=# select xpath_exists ('test', document) from xml_test;
ERROR:  could not parse XML document
DETAIL:  line 1: Start tag expected, '<' not found



This is confusing, since… if the xml is well formed – it probably shouldn’t cause the parser to error.  And if it isn’t well formed, then the instantiation of the value as an xml type should fail.

Re: XML validation of whitespace values

From
Peter Eisentraut
Date:
On 3/14/14, 11:12 AM, Tim Kane wrote:
> clone=# select xml_is_well_formed(' ');
>  xml_is_well_formed
> --------------------
>  t
> (1 row)
>
>
> clone=# select xpath_exists (‘//test', ' ');
> ERROR:  could not parse XML document
> DETAIL:  line 1: Start tag expected, '<' not found

There are several issues at work here:

- contrib/xml2 has a slightly different notion of what is an OK xml
value than the built-in xml type.

- A string consisting of whitespace is well-formed XML content, but not
a well-formed XML document.  Compare xmlparse(document ' ') vs
xmlparse(content ' ').  contrib/xml2 (apparently) follows the latter
interpretation.

- xpath queries only work properly on XML documents.

It might be better if you wrote your code without using contrib/xml2,
and instead caught any parse exceptions in, say, plpgsql code.



Re: XML validation of whitespace values

From
David G Johnston
Date:
Tim Kane wrote
> clone=# create temp table xml_test (document xml);
> CREATE TABLE

If you know you need to use xpath on this content then you should do one of
the following:

SELECT CASE WHEN document IS DOCUMENT THEN xpath(...) ELSE
default_value_for_missing_data END;

CREATE TABLE xml_test (
document xml NOT NULL CHECK (document IS DOCUMENT)
);

For better and worse in the name of "simplicity" both xml and json(b?) types
are usable for both the embedded content and the entire document.  Unless
you think you have a good reason for a single column to represent both
content and documents you should probably constrain stored data to be
whichever you are expecting.  This is the root of your confusion since, as
Peter said, xpathing only operates on documents - not content fragments.

"The second argument must be a well formed XML document. In particular, it
must have a single root node element."

Unfortunately the decision to treat documents and content as the same types
means this cannot be discovered at parse-time but is data dependent and thus
a run-time error.  By putting in the constraint you ensure that any use of
said column in an xpath query will either always succeed or always fail.

Furthermore:

"xml_is_well_formed does the former if the xmloption configuration parameter
is set to DOCUMENT, or the latter if it is set to CONTENT" - the default for
xmloption is CONTENT...

Both quotes above come from:

http://www.postgresql.org/docs/9.3/interactive/functions-xml.html

David J.







--
View this message in context:
http://postgresql.1045698.n5.nabble.com/XML-validation-of-whitespace-values-tp5796092p5803594.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.