Re: BUG #7844: xpath missing entity decoding - bug or feature - Mailing list pgsql-bugs
From | Dan Scott |
---|---|
Subject | Re: BUG #7844: xpath missing entity decoding - bug or feature |
Date | |
Msg-id | CAAY5AM3CjMcq2qMUX6gBPPcT+H9xDCjJb0tAGbrW3ra9SytJfA@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #7844: xpath missing entity decoding - bug or feature (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-bugs |
On Mon, Nov 4, 2013 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Dan Scott <denials@gmail.com> writes: >> On Sept 09, 2013 Bruce Momjian wrote: >>> On Fri, Feb 1, 2013 at 12:02:41PM +0000, info(at)fduerr(dot)de wrote: >>>> The following bug has been logged on the website: >>>> >>>> Bug reference: 7844 >>>> Logged by: fduerr >>>> Email address: info(at)fduerr(dot)de >>>> PostgreSQL version: 9.2.2 >>>> Operating system: Debian >>>> Description: >>>> >>>> Up until 9.1 >>>> >>>> select (xpath('/z/text()', ('<z>' || 'AT&T' || '</z>')::xml))[1]; >>>> >>>> returned 'AT&T' >>>> 9.2 returns 'AT&T' >>>> >>>> Is it a bug or a feature? >>>> Is there a function to decode xml-entities? > >>> Does anyone have a comment on this? > >> Yes, the Evergreen project just ran into this change of behaviour and consider >> it a bug. > >> https://bugs.launchpad.net/evergreen/+bug/1243023 tells the tale, but in short >> the XPath spec states in "5.2 Element Nodes": > >> "Entity references to both internal and external entities are expanded. >> Character references are resolved." (http://www.w3.org/TR/xpath/) > >> So we believe that the extracted text node children of element nodes should be >> resolved when we retrieve them, as they were in 9.1 and before. > > The change in behavior was entirely intentional, see > http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=aaf15e5c1cf8d2c27d2f9841343f00027762cb4e > which was extensively discussed beforehand: > http://www.postgresql.org/message-id/flat/201106291934.23089.rsmogura@softperience.eu > > Before we'd consider reverting this, you'd have to explain why it would be > okay for xpath() to not return valid XML. I don't see that the bit of > spec you mention has anything to do with that consideration --- it's > talking about some internal processing steps to be done by xpath(), > not the representation of the final result. > > It does seem that there should be a way to convert the result to text with > character escaping undone. I'm not seeing anything built-in for that, > but maybe I'm missing it. Thanks for the quick response! XPATH() does not really return valid XML, not even well-formed XML, if you're talking about the XML standard. The example from the 9.2 docs at http://www.postgresql.org/docs/9.2/static/functions-xml.html demonstrates this: SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>', ARRAY[ARRAY['my', 'http://example.com']]); xpath -------- {test} (1 row) When you say that xpath() returns valid XML, I think what you're really saying is that XPATH() needs to return results that are consistent with XML-the-PostgreSQL-type, not XML-the-W3-standard, and that this is why you have changed XPATH()-the-PostgreSQL-function to no longer act like XPath-the-W3-specification when it comes to returning text nodes (which are supposed to have their entities expanded), because XPATH()-the-PostgreSQL-function has no way of returning an array of results that can contain both TEXT and XML types. So some compromise had to be made. The compromise with behaviour that hews the closest to the XPath spec--if returning an XML text node, resolve the entity; if returning an XML element, escape any entities--was the one that was followed by PostgreSQL until 9.2. With the 9.2 behaviour, applications like Evergreen that have relied on that reasonable compromise of the past now face the prospect of having to sniff the results of each member of the returned array; they need to determine if the member in question is in fact a well-formed XML element or text; and in the case of text, then resolving the entities that might be contained in the text. The problem with having the application resolve the entities is that the application is not in the right position to actually resolve the entities, beyond the built-in entities such as & and <. If the XML document in the database contains inline entity definitions, then the database is the only actor capable of resolving those entities. For example, consider the following valid, well-formed XML document: <?xml version="1.0"?> <!DOCTYPE test [<!ENTITY booyah "groovy">]> <foo>&booyah; dude</foo> XPath expressions that return text nodes will resolve the entity and produce the text node "groovy dude". This isn't far-fetched: in a previous life I worked as a technical writer for a team that wrote SGML & XML sourced documents, and inline entity definitions were quite common. If the entities are _not_ resolved for text nodes on output, then the only way for the application to figure out what the content should be is to grab the original XML document and parse it for itself. Which would defeat one of the purposes of the XPATH() function. Here's an example of how that looks today in 9.2: postgres=# CREATE TABLE testy (mytest XML); CREATE TABLE postgres=# INSERT INTO testy (mytest) VALUES (XMLPARSE(DOCUMENT '<?xml version="1.0"?><!DOCTYPE test [<!ENTITY booyah "groovy">]><foo>&booyah; dude & others</foo>')); INSERT 0 1 postgres=# SELECT XPATH('//.', mytest) FROM testy; {"groovy dude & others","<foo>&booyah; dude & others</foo>"," dude & others"} In 8.4, this results in: {"groovy dude & others","<foo>&booyah; dude & others</foo>"," dude & others"} This actually turns up an apparent bug in 8.4 through 9.2 inclusive, as the result of the XPath text() operator is... unexpected (it's not clear where "groovy" went!): 9.2: postgres=# SELECT XPATH('//text()', mytest) FROM testy; {" dude & others"} 8.4: {" dude & others"} That bug aside, just to confirm that I'm not entirely crazy, I wrote quick test scripts in Python3, Perl, and PHP to see if they interpret the XPath spec for text nodes in the same way that I do (that is: requiring text nodes to have their entities resolved) at http://stuff.coffeecode.net/xpath-tests/ - if you run these, they all produce "groovy dude & others" as their output for '//text()' against the sample document. I think the ideal way to handle this for 9.2+ would have been to create a new XPATH_ESCAPE_TEXT() function that implements the behaviour now seen in 9.2 (to enable the round-tripping that was expressed as a goal in the lengthy review thread), and to keep XPATH() working in the way it did prior to 9.2 so that applications that depend on the historical behaviour would not break. We didn't notice this change of behaviour earlier because it is subtle, and hits a corner case of our data. So I apologize that we're very late to the discussion.
pgsql-bugs by date: