Re: BUG #7844: xpath missing entity decoding - bug or feature - Mailing list pgsql-bugs

From Dan Scott
Subject Re: BUG #7844: xpath missing entity decoding - bug or feature
Date
Msg-id CAAY5AM3CjMcq2qMUX6gBPPcT+H9xDCjJb0tAGbrW3ra9SytJfA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #7844: xpath missing entity decoding - bug or feature  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Mon, Nov 4, 2013 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dan Scott <denials@gmail.com> writes:
>> On Sept 09, 2013 Bruce Momjian wrote:
>>> On Fri, Feb  1, 2013 at 12:02:41PM +0000, info(at)fduerr(dot)de wrote:
>>>> The following bug has been logged on the website:
>>>>
>>>> Bug reference:      7844
>>>> Logged by:          fduerr
>>>> Email address:      info(at)fduerr(dot)de
>>>> PostgreSQL version: 9.2.2
>>>> Operating system:   Debian
>>>> Description:
>>>>
>>>> Up until 9.1
>>>>
>>>> select (xpath('/z/text()', ('<z>' || 'AT&T' || '</z>')::xml))[1];
>>>>
>>>> returned 'AT&T'
>>>> 9.2 returns 'AT&T'
>>>>
>>>> Is it a bug or a feature?
>>>> Is there a function to decode xml-entities?
>
>>> Does anyone have a comment on this?
>
>> Yes, the Evergreen project just ran into this change of behaviour and consider
>> it a bug.
>
>> https://bugs.launchpad.net/evergreen/+bug/1243023 tells the tale, but in short
>> the XPath spec states in "5.2 Element Nodes":
>
>> "Entity references to both internal and external entities are expanded.
>> Character references are resolved." (http://www.w3.org/TR/xpath/)
>
>> So we believe that the extracted text node children of element nodes should be
>> resolved when we retrieve them, as they were in 9.1 and before.
>
> The change in behavior was entirely intentional, see
> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=aaf15e5c1cf8d2c27d2f9841343f00027762cb4e
> which was extensively discussed beforehand:
> http://www.postgresql.org/message-id/flat/201106291934.23089.rsmogura@softperience.eu
>
> Before we'd consider reverting this, you'd have to explain why it would be
> okay for xpath() to not return valid XML.  I don't see that the bit of
> spec you mention has anything to do with that consideration --- it's
> talking about some internal processing steps to be done by xpath(),
> not the representation of the final result.
>
> It does seem that there should be a way to convert the result to text with
> character escaping undone.  I'm not seeing anything built-in for that,
> but maybe I'm missing it.

Thanks for the quick response!

XPATH() does not really return valid XML, not even well-formed XML, if
you're talking about the XML standard. The example from the 9.2 docs
at http://www.postgresql.org/docs/9.2/static/functions-xml.html
demonstrates this:

SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
             ARRAY[ARRAY['my', 'http://example.com']]);

 xpath
--------
 {test}
(1 row)

When you say that xpath() returns valid XML, I think what you're
really saying is that XPATH() needs to return results that are
consistent with XML-the-PostgreSQL-type, not XML-the-W3-standard, and
that this is why you have changed XPATH()-the-PostgreSQL-function to
no longer act like XPath-the-W3-specification when it comes to
returning text nodes (which are supposed to have their entities
expanded), because XPATH()-the-PostgreSQL-function has no way of
returning an array of results that can contain both TEXT and XML
types.

So some compromise had to be made. The compromise with behaviour that
hews the closest to the XPath spec--if returning an XML text node,
resolve the entity; if returning an XML element, escape any
entities--was the one that was followed by PostgreSQL until 9.2. With
the 9.2 behaviour, applications like Evergreen that have relied on
that reasonable compromise of the past now face the prospect of having
to sniff the results of each member of the returned array; they need
to determine if the member in question is in fact a well-formed XML
element or text; and in the case of text, then resolving the entities
that might be contained in the text.

The problem with having the application resolve the entities is that
the application is not in the right position to actually resolve the
entities, beyond the built-in entities such as & and <. If the
XML document in the database contains inline entity definitions, then
the database is the only actor capable of resolving those entities.
For example, consider the following valid, well-formed XML document:

<?xml version="1.0"?>
<!DOCTYPE test [<!ENTITY booyah "groovy">]>
<foo>&booyah; dude</foo>

XPath expressions that return text nodes will resolve the entity and
produce the text node "groovy dude". This isn't far-fetched: in a
previous life I worked as a technical writer for a team that wrote
SGML & XML sourced documents, and inline entity definitions were quite
common.

If the entities are _not_ resolved for text nodes on output, then the
only way for the application to figure out what the content should be
is to grab the original XML document and parse it for itself. Which
would defeat one of the purposes of the XPATH() function. Here's an
example of how that looks today in 9.2:

postgres=# CREATE TABLE testy (mytest XML);
CREATE TABLE
postgres=# INSERT INTO testy (mytest) VALUES (XMLPARSE(DOCUMENT '<?xml
version="1.0"?><!DOCTYPE test [<!ENTITY booyah
"groovy">]><foo>&booyah; dude & others</foo>'));
INSERT 0 1

postgres=# SELECT XPATH('//.', mytest) FROM testy;
 {"groovy dude & others","<foo>&booyah; dude & others</foo>","
dude & others"}

In 8.4, this results in:
 {"groovy dude & others","<foo>&booyah; dude & others</foo>","
dude & others"}

This actually turns up an apparent bug in 8.4 through 9.2 inclusive,
as the result of the XPath text() operator is... unexpected (it's not
clear where "groovy" went!):

9.2:
postgres=# SELECT XPATH('//text()', mytest) FROM testy;
 {" dude & others"}

8.4:
 {" dude & others"}

That bug aside, just to confirm that I'm not entirely crazy, I wrote
quick test scripts in Python3, Perl, and PHP to see if they interpret
the XPath spec for text nodes in the same way that I do (that is:
requiring text nodes to have their entities resolved) at
http://stuff.coffeecode.net/xpath-tests/ - if you run these, they all
produce "groovy dude & others" as their output for '//text()' against
the sample document.

I think the ideal way to handle this for 9.2+ would have been to
create a new XPATH_ESCAPE_TEXT() function that implements the
behaviour now seen in 9.2 (to enable the round-tripping that was
expressed as a goal in the lengthy review thread), and to keep XPATH()
working in the way it did prior to 9.2 so that applications that
depend on the historical behaviour would not break.

We didn't notice this change of behaviour earlier because it is
subtle, and hits a corner case of our data. So I apologize that we're
very late to the discussion.

pgsql-bugs by date:

Previous
From: Francisco Olarte
Date:
Subject: Re: BUG #8583: I can't install this product
Next
From: Tom Lane
Date:
Subject: Re: BUG #8582: field serial getted incorrect value from automaticaly created its sequence