Thread: BUG #8469: Xpath behaviour unintuitive / arguably wrong
The following bug has been logged on the website: Bug reference: 8469 Logged by: Dennis Email address: dennis.noordsij@helsinki.fi PostgreSQL version: 9.3.0 Operating system: FreeBSD 9.2-RC4 Description: Hi, After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields containing "&" where they are populated from XML. This may be a coincidence and the problem may have existed earlier, in any case, now I noticed. I extract the text content of XML nodes using xpath, from something like: <name>Jones & Smith</name> The reason I end up with "&" is the IMHO rather odd xpath behaviour: # select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))); xpath --------------- {"A & B"} The canonical contents of "a" is "A & B". At first search I've found some rather heated debates about this with bits of name calling; I certainly do not want to get into that and I apologize in advance to those who feel very strongly about this. I've seen one "fix" describe the problem as: ""DESCRIPTION: Submitter invokes following statement: SELECT (XPATH('/*/text()', '<root><</root>'))[1]. He expect (escaped) result "<", but gets "<" """ With respect, this "bug" makes no sense as this produces in fact the right result. The actual value of <root> is "<", it's just escaped when serialized to XML. If <root> were to actually contain "<", it'd be serialized as "<". It should not be possible to be blindly cast to a text type, but explicitly serialized as such. At least the reviewer at: http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu agrees, but I don't know what happened with that. The python lxml implementation based on libxml2 seems to also agree: >>> from lxml import etree >>> a = etree.XML("<a/>") >>> a.text = "A & B"; >>> a <Element a at 8019eb470> >>> etree.tostring(a); '<a>A & B</a>' >>> a.text 'A & B' >>> a.xpath('/a') [<Element a at 8019eb470>] >>> a.xpath('/a/text()') ['A & B'] and similarly for a simple test using xsltproc when set to output text. If this really is the intended behaviour or something which can or will not be changed, then it invites double (un)escaping bugs and so on, and I would like to ask how you are supposed to sanely extract the intended text from a node in an XML document without risking double (un)escaping, and whether everybody else is doing it wrong? I get that xpath(..text()) apparently wants to return a type XML, that this is on purpose and that there are certain use cases where you want to treat the result as a type XML which you could not do if it returned an unescaped text value, like here: select xmlelement(name "b", (select (select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))))[1])); xmlelement ------------------ <b>A & B</b> which does not double escape the contents, but where if you cast, it does: select xmlelement(name "b", (select (select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))))[1]::text)); xmlelement ---------------------- <b>A & B</b> (1 row) I personally don't believe this is very helpful. The escaping is only a serialization artifact, a text node does not actually contain any &s and so on. My first thought is then that casting between text and xml should not even be possible, and always an explicit (de)serialization to/from text using a chosen encoding (with a shortcut to a PostgreSQL unicode text type), i.e. treated similarly to the difference between a unicode string and utf-8 encoded representation , and not the equivalent of blindly casting a byte sequence to a string and back and hoping for the best. If xpath(..text()) then absolutely has to return a type XML I would be happy to explicitly serialize it to a type text, if PostgreSQL would forbid me from (accidentally) storing a result in my text field I almost certainly did not intend (the escaped value containing &). Of course my first preference would be that it would return a type TEXT. I appreciate any thoughts and workarounds. I don't really want to add xml unescapes everywhere, that feels like that php method of unescaping a string until it stops changing. If the user did intend the literal text "&" I of course want to preserve that. Many thanks!
On Tue, Sep 24, 2013 at 06:43:19PM +0000, dennis.noordsij@helsinki.fi wrote: > The following bug has been logged on the website: > > Bug reference: 8469 > Logged by: Dennis > Email address: dennis.noordsij@helsinki.fi > PostgreSQL version: 9.3.0 > Operating system: FreeBSD 9.2-RC4 > Description: > > Hi, > > > After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields > containing "&" where they are populated from XML. This may be a > coincidence and the problem may have existed earlier, in any case, now I > noticed. > > > I extract the text content of XML nodes using xpath, from something like: > > > <name>Jones & Smith</name> > > > The reason I end up with "&" is the IMHO rather odd xpath behaviour: > > > # select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))); > > > xpath > --------------- > {"A & B"} > > > The canonical contents of "a" is "A & B". At first search I've found some > rather heated debates about this with bits of name calling; I certainly do > not want to get into that and I apologize in advance to those who feel very > strongly about this. > > > I've seen one "fix" describe the problem as: > > > ""DESCRIPTION: Submitter invokes following statement: > SELECT (XPATH('/*/text()', '<root><</root>'))[1]. > He expect (escaped) result "<", but gets "<" > """ > > > With respect, this "bug" makes no sense as this produces in fact the right > result. The actual value of <root> is "<", it's just escaped when serialized > to XML. If <root> were to actually contain "<", it'd be serialized as > "<". It should not be possible to be blindly cast to a text type, but > explicitly serialized as such. > > > At least the reviewer at: > > > http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu There are two other similar bug reports on this from February and March of this year: http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgresql.org http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql.org Someone who knows XML needs to take leadership on this and propose a patch. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 10/02/2013 06:19 PM, Bruce Momjian wrote: > On Tue, Sep 24, 2013 at 06:43:19PM +0000, dennis.noordsij@helsinki.fi wrote: >> The following bug has been logged on the website: >> >> Bug reference: 8469 >> Logged by: Dennis >> Email address: dennis.noordsij@helsinki.fi >> PostgreSQL version: 9.3.0 >> Operating system: FreeBSD 9.2-RC4 >> Description: >> >> Hi, >> >> >> After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields >> containing "&" where they are populated from XML. This may be a >> coincidence and the problem may have existed earlier, in any case, now I >> noticed. >> >> >> I extract the text content of XML nodes using xpath, from something like: >> >> >> <name>Jones & Smith</name> >> >> >> The reason I end up with "&" is the IMHO rather odd xpath behaviour: >> >> >> # select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))); >> >> >> xpath >> --------------- >> {"A & B"} >> >> >> The canonical contents of "a" is "A & B". At first search I've found some >> rather heated debates about this with bits of name calling; I certainly do >> not want to get into that and I apologize in advance to those who feel very >> strongly about this. >> >> >> I've seen one "fix" describe the problem as: >> >> >> ""DESCRIPTION: Submitter invokes following statement: >> SELECT (XPATH('/*/text()', '<root><</root>'))[1]. >> He expect (escaped) result "<", but gets "<" >> """ >> >> >> With respect, this "bug" makes no sense as this produces in fact the right >> result. The actual value of <root> is "<", it's just escaped when serialized >> to XML. If <root> were to actually contain "<", it'd be serialized as >> "<". It should not be possible to be blindly cast to a text type, but >> explicitly serialized as such. >> >> >> At least the reviewer at: >> >> >> http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu > > There are two other similar bug reports on this from February and March > of this year: > > http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgresql.org I think that should be: http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrigleys.postgresql.org > http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql.org > > Someone who knows XML needs to take leadership on this and propose a > patch. agreed Stefan
On Fri, Oct 4, 2013 at 10:20:46PM +0200, Stefan Kaltenbrunner wrote: > >> http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu > > > > There are two other similar bug reports on this from February and March > > of this year: > > > > http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgresql.org > > I think that should be: > http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrigleys.postgresql.org > > > > http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql.org > > > > Someone who knows XML needs to take leadership on this and propose a > > patch. Added to TODO. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 2013-10-05 Broce Monjian wrote: > On Fri, Oct 4, 2013 at 10:20:46PM +0200, Stefan Kaltenbrunner wrote: > > > > = http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperie= nce.eu > > >=20 > > > There are two other similar bug reports on this from February and = March > > > of this year: > > >=20 > > > = http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgres= ql.org > > I think that should be: > > = http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrigleys.postgresql= .org > > > = http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql= .org > > > Someone who knows XML needs to take leadership on this and propose = a > > > patch. > Added to TODO. I'm the one who submitted the patch which changed XPATH's behaviour. = While I agree that some might find it counter-intuitive for an XPATH expression like = //node/text() to return quoted output, there's really no way around that given the way = the postgres type system works. A general XPATH function needs to return XML[], not TEXT[], since an = XPATH can not only select text and attribute nodes, but whole XML subtrees. But if it = returns XML, it NEEDS to return something that's actually valid content for a value = of type XML, not plain test masquerading as XML. Otherwise, basic guarantees are = violated, like for example that xmlin(xmlout(value)) yields value (and not an error). So unfortunate as it may be, I really see no way around the current = behaviour. We could (and should) however * Provide an way to actually unescape text contained in XML fields. I = planned to post a followup patch doing that when I submitted the patch for = XPATH, but then stuff came up and=85 oh well=85 :-( * Add a big, fat warning to the docs about this.=20 best regards, Florian Pflug PS: I'm not subscribed to pgsql-bugs, so please keep me on the CC list
<div dir="ltr"><font color="#000000" face="Verdana,sans-serif" style="">Any news on this bug? 9.4 still behaves wrong andthis actively stops us from using Postgresql as a storage for our xml docs.<br id="FontBreak" /></font><br /><br />--<br/>Вилен Тамбовцев </div>
Вилен Тамбовцев <v.tambovtsev@outlook.com> writes: > Any news on this bug? 9.4 still behaves wrong and this actively stops us from using Postgresql as a storage for our xmldocs. AFAICS the conclusion in that thread was that the current behavior is correct; in particular xpath()'s output is still XML and so it must not de-escape anything. http://www.postgresql.org/message-id/72DA66D9-0222-4888-AF55-61D3337CAC7A@phlo.org regards, tom lane
Actually, xpath()'s output is not original XML, at least there must be a function like XML_UNESCAPE() for current behavior otherwise there is no way to get proper xml with xpath -- Вилен Тамбовцев ---------------------------------------- > From: tgl@sss.pgh.pa.us > To: v.tambovtsev@outlook.com > CC: pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #8469: Xpath behaviour unintuitive / arguably wrong > Date: Tue, 3 Feb 2015 17:09:34 -0500 > > Вилен Тамбовцев <v.tambovtsev@outlook.com> writes: >> Any news on this bug? 9.4 still behaves wrong and this actively stops us from using Postgresql as a storage for our xmldocs. > > AFAICS the conclusion in that thread was that the current behavior is > correct; in particular xpath()'s output is still XML and so it must not > de-escape anything. > > http://www.postgresql.org/message-id/72DA66D9-0222-4888-AF55-61D3337CAC7A@phlo.org > > regards, tom lane > > > -- > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs
=D0=92=D0=B8=D0=BB=D0=B5=D0=BD =D0=A2=D0=B0=D0=BC=D0=B1=D0=BE=D0=B2=D1=86= =D0=B5=D0=B2-2 wrote > Actually, xpath()'s output is not original XML, at least there must be a > function like XML_UNESCAPE() for current behavior > otherwise there is no way to get proper xml with xpath Based upon the link provided this has been acknowledged. My personal opinion is that this is a bug (via omission) that the patch introduced and that said patch author or committer should work to rectify.= =20 Our XML/xpath handling is a large brick shy of a load if we do not provide = a way to make Unescaped text extraction possible. David J. -- View this message in context: http://postgresql.nabble.com/Re-BUG-8469-Xpat= h-behaviour-unintuitive-arguably-wrong-tp5836629p5836676.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.
<div dir="ltr">Well, I had a conversation with Bruce Momjian he said that they added this bug to TODO list. Almost 2 yearshas passed since and the bug is still here. IMO this bug is actually a blocker for XML code part since xpath is corefunctionality. -- Вилен Тамбовцев > Date: Tue, 3 Feb 2015 23:22:54 -0700 > From: david.g.johnston@gmail.com >To: pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #8469: Xpath behaviour unintuitive / arguably wrong > >Вилен Тамбовцев-2 wrote >> Actually, xpath()'s output is not original XML, at least there must be a >> functionlike XML_UNESCAPE() for current behavior >> otherwise there is no way to get proper xml with xpath > >Based upon the link provided this has been acknowledged. > > My personal opinion is that this is a bug (via omission)that the patch > introduced and that said patch author or committer should work to rectify. > Our XML/xpathhandling is a large brick shy of a load if we do not provide a > way to make Unescaped text extraction possible.> > David J. > > > > -- > View this message in context: http://postgresql.nabble.com/Re-BUG-8469-Xpath-behaviour-unintuitive-arguably-wrong-tp5836629p5836676.html> Sent fromthe PostgreSQL - bugs mailing list archive at Nabble.com. > > > -- > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)> To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs</div>