Thread: BUG #8469: Xpath behaviour unintuitive / arguably wrong

BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
dennis.noordsij@helsinki.fi
Date:
The following bug has been logged on the website:

Bug reference:      8469
Logged by:          Dennis
Email address:      dennis.noordsij@helsinki.fi
PostgreSQL version: 9.3.0
Operating system:   FreeBSD 9.2-RC4
Description:

Hi,


After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields
containing "&" where they are populated from XML. This may be a
coincidence and the problem may have existed earlier, in any case, now I
noticed.


I extract the text content of XML nodes using xpath, from something like:


<name>Jones & Smith</name>


The reason I end up with "&" is the IMHO rather odd xpath behaviour:


# select xpath('/a/text()', (select xmlelement(name "a", 'A & B')));


     xpath
---------------
 {"A & B"}


The canonical contents of "a" is "A & B". At first search I've found some
rather heated debates about this with bits of name calling; I certainly do
not want to get into that and I apologize in advance to those who feel very
strongly about this.


I've seen one "fix" describe the problem as:


""DESCRIPTION: Submitter invokes following statement:
SELECT (XPATH('/*/text()', '<root><</root>'))[1].
He expect (escaped) result "<", but gets "<"
"""


With respect, this "bug" makes no sense as this produces in fact the right
result. The actual value of <root> is "<", it's just escaped when serialized
to XML. If <root> were to actually contain "<", it'd be serialized as
"&lt;". It should not be possible to be blindly cast to a text type, but
explicitly serialized as such.


At least the reviewer at:


http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu


agrees, but I don't know what happened with that.


The python lxml implementation based on libxml2 seems to also agree:


>>> from lxml import etree
>>> a = etree.XML("<a/>")
>>> a.text = "A & B";
>>> a
<Element a at 8019eb470>
>>> etree.tostring(a);
'<a>A & B</a>'
>>> a.text
'A & B'
>>> a.xpath('/a')
[<Element a at 8019eb470>]
>>> a.xpath('/a/text()')
['A & B']


and similarly for a simple test using xsltproc when set to output text.


If this really is the intended behaviour or something which can or will not
be changed, then it invites double (un)escaping bugs and so on, and I would
like to ask how you are supposed to sanely extract the intended text from a
node in an XML document without risking double (un)escaping, and whether
everybody else is doing it wrong?


I get that xpath(..text()) apparently wants to return a type XML, that this
is on purpose and that there are certain use cases where you want to treat
the result as a type XML which you could not do if it returned an unescaped
text value, like here:

select xmlelement(name "b", (select (select xpath('/a/text()', (select
xmlelement(name "a", 'A & B'))))[1]));
    xmlelement
------------------
 <b>A & B</b>


which does not double escape the contents, but where if you cast, it does:


select xmlelement(name "b", (select (select xpath('/a/text()', (select
xmlelement(name "a", 'A & B'))))[1]::text));
      xmlelement
----------------------
 <b>A &amp; B</b>
(1 row)


I personally don't believe this is very helpful. The escaping is only a
serialization artifact, a text node does not actually contain any &s and
so on.


My first thought is then that casting between text and xml should not even
be possible, and always an explicit (de)serialization to/from text using a
chosen encoding (with a shortcut to a PostgreSQL unicode text type), i.e.
treated similarly to the difference between a unicode string and utf-8
encoded representation , and not the equivalent of blindly casting a byte
sequence to a string and back and hoping for the best.


If xpath(..text()) then absolutely has to return a type XML I would be happy
to explicitly serialize it to a type text, if PostgreSQL would forbid me
from (accidentally) storing a result in my text field I almost certainly did
not intend (the escaped value containing &). Of course my first
preference would be that it would return a type TEXT.


I appreciate any thoughts and workarounds. I don't really want to add xml
unescapes everywhere, that feels like that php method of unescaping a string
until it stops changing. If the user did intend the literal text "&" I
of course want to preserve that.


Many thanks!

Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Bruce Momjian
Date:
On Tue, Sep 24, 2013 at 06:43:19PM +0000, dennis.noordsij@helsinki.fi wrote:
> The following bug has been logged on the website:
>
> Bug reference:      8469
> Logged by:          Dennis
> Email address:      dennis.noordsij@helsinki.fi
> PostgreSQL version: 9.3.0
> Operating system:   FreeBSD 9.2-RC4
> Description:
>
> Hi,
>
>
> After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields
> containing "&" where they are populated from XML. This may be a
> coincidence and the problem may have existed earlier, in any case, now I
> noticed.
>
>
> I extract the text content of XML nodes using xpath, from something like:
>
>
> <name>Jones & Smith</name>
>
>
> The reason I end up with "&" is the IMHO rather odd xpath behaviour:
>
>
> # select xpath('/a/text()', (select xmlelement(name "a", 'A & B')));
>
>
>      xpath
> ---------------
>  {"A & B"}
>
>
> The canonical contents of "a" is "A & B". At first search I've found some
> rather heated debates about this with bits of name calling; I certainly do
> not want to get into that and I apologize in advance to those who feel very
> strongly about this.
>
>
> I've seen one "fix" describe the problem as:
>
>
> ""DESCRIPTION: Submitter invokes following statement:
> SELECT (XPATH('/*/text()', '<root><</root>'))[1].
> He expect (escaped) result "<", but gets "<"
> """
>
>
> With respect, this "bug" makes no sense as this produces in fact the right
> result. The actual value of <root> is "<", it's just escaped when serialized
> to XML. If <root> were to actually contain "<", it'd be serialized as
> "&lt;". It should not be possible to be blindly cast to a text type, but
> explicitly serialized as such.
>
>
> At least the reviewer at:
>
>
> http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu

There are two other similar bug reports on this from February and March
of this year:

    http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgresql.org
    http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql.org

Someone who knows XML needs to take leadership on this and propose a
patch.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Stefan Kaltenbrunner
Date:
On 10/02/2013 06:19 PM, Bruce Momjian wrote:
> On Tue, Sep 24, 2013 at 06:43:19PM +0000, dennis.noordsij@helsinki.fi wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      8469
>> Logged by:          Dennis
>> Email address:      dennis.noordsij@helsinki.fi
>> PostgreSQL version: 9.3.0
>> Operating system:   FreeBSD 9.2-RC4
>> Description:
>>
>> Hi,
>>
>>
>> After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields
>> containing "&" where they are populated from XML. This may be a
>> coincidence and the problem may have existed earlier, in any case, now I
>> noticed.
>>
>>
>> I extract the text content of XML nodes using xpath, from something like:
>>
>>
>> <name>Jones & Smith</name>
>>
>>
>> The reason I end up with "&" is the IMHO rather odd xpath behaviour:
>>
>>
>> # select xpath('/a/text()', (select xmlelement(name "a", 'A & B')));
>>
>>
>>      xpath
>> ---------------
>>  {"A & B"}
>>
>>
>> The canonical contents of "a" is "A & B". At first search I've found some
>> rather heated debates about this with bits of name calling; I certainly do
>> not want to get into that and I apologize in advance to those who feel very
>> strongly about this.
>>
>>
>> I've seen one "fix" describe the problem as:
>>
>>
>> ""DESCRIPTION: Submitter invokes following statement:
>> SELECT (XPATH('/*/text()', '<root><</root>'))[1].
>> He expect (escaped) result "<", but gets "<"
>> """
>>
>>
>> With respect, this "bug" makes no sense as this produces in fact the right
>> result. The actual value of <root> is "<", it's just escaped when serialized
>> to XML. If <root> were to actually contain "<", it'd be serialized as
>> "&lt;". It should not be possible to be blindly cast to a text type, but
>> explicitly serialized as such.
>>
>>
>> At least the reviewer at:
>>
>>
>> http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu
>
> There are two other similar bug reports on this from February and March
> of this year:
>
>     http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgresql.org

I think that should be:
http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrigleys.postgresql.org


>     http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql.org
>
> Someone who knows XML needs to take leadership on this and propose a
> patch.

agreed


Stefan

Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Bruce Momjian
Date:
On Fri, Oct  4, 2013 at 10:20:46PM +0200, Stefan Kaltenbrunner wrote:
> >> http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu
> >
> > There are two other similar bug reports on this from February and March
> > of this year:
> >
> >     http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgresql.org
>
> I think that should be:
> http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrigleys.postgresql.org
>
>
> >     http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql.org
> >
> > Someone who knows XML needs to take leadership on this and propose a
> > patch.

Added to TODO.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Florian Pflug
Date:
On 2013-10-05 Broce Monjian wrote:
> On Fri, Oct  4, 2013 at 10:20:46PM +0200, Stefan Kaltenbrunner wrote:
> > > > =
http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperie=
nce.eu
> > >=20
> > > There are two other similar bug reports on this from February and =
March
> > > of this year:
> > >=20
> > > =
http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrihigleys.postgres=
ql.org
> > I think that should be:
> > =
http://www.postgresql.org/message-id/E1U1FKL-0002rD-RO@wrigleys.postgresql=
.org

> > > =
http://www.postgresql.org/message-id/E1UHyUw-0001oj-HE@wrigleys.postgresql=
.org

> > > Someone who knows XML needs to take leadership on this and propose =
a
> > > patch.

>  Added to TODO.

I'm the one who submitted the patch which changed XPATH's behaviour. =
While I agree
that some might find it counter-intuitive for an XPATH expression like =
//node/text()
to return quoted output, there's really no way around that given the way =
the postgres
type system works.

A general XPATH function needs to return XML[], not TEXT[], since an =
XPATH can not
only select text and attribute nodes, but whole XML subtrees. But if it =
returns XML,
it NEEDS to return something that's actually valid content for a value =
of type XML,
not plain test masquerading as XML. Otherwise, basic guarantees are =
violated, like
for example that xmlin(xmlout(value)) yields value (and not an error).

So unfortunate as it may be, I really see no way around the current =
behaviour. We
could (and should) however

  * Provide an way to actually unescape text contained in XML fields. I =
planned to
    post a followup patch doing that when I submitted the patch for =
XPATH, but then
    stuff came up and=85 oh well=85 :-(

  * Add a big, fat warning to the docs about this.=20

best regards,
Florian Pflug

PS: I'm not subscribed to pgsql-bugs, so please keep me on the CC list

Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Вилен Тамбовцев
Date:
<div dir="ltr"><font color="#000000" face="Verdana,sans-serif" style="">Any news on this bug? 9.4 still behaves wrong
andthis actively stops us from using Postgresql as a storage for our xml docs.<br id="FontBreak" /></font><br /><br
/>--<br/>Вилен Тамбовцев </div> 

Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Tom Lane
Date:
Вилен Тамбовцев <v.tambovtsev@outlook.com> writes:
> Any news on this bug? 9.4 still behaves wrong and this actively stops us from using Postgresql as a storage for our
xmldocs. 

AFAICS the conclusion in that thread was that the current behavior is
correct; in particular xpath()'s output is still XML and so it must not
de-escape anything.

http://www.postgresql.org/message-id/72DA66D9-0222-4888-AF55-61D3337CAC7A@phlo.org
        regards, tom lane



Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Вилен Тамбовцев
Date:
Actually,  xpath()'s output is not original XML, at least there must be a function like XML_UNESCAPE() for current
behavior
otherwise there is no way to get proper xml with xpath


--
Вилен Тамбовцев


----------------------------------------
> From: tgl@sss.pgh.pa.us
> To: v.tambovtsev@outlook.com
> CC: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #8469: Xpath behaviour unintuitive / arguably wrong
> Date: Tue, 3 Feb 2015 17:09:34 -0500
>
> Вилен Тамбовцев <v.tambovtsev@outlook.com> writes:
>> Any news on this bug? 9.4 still behaves wrong and this actively stops us from using Postgresql as a storage for our
xmldocs. 
>
> AFAICS the conclusion in that thread was that the current behavior is
> correct; in particular xpath()'s output is still XML and so it must not
> de-escape anything.
>
> http://www.postgresql.org/message-id/72DA66D9-0222-4888-AF55-61D3337CAC7A@phlo.org
>
> regards, tom lane
>
>
> --
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs


Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
David G Johnston
Date:
=D0=92=D0=B8=D0=BB=D0=B5=D0=BD =D0=A2=D0=B0=D0=BC=D0=B1=D0=BE=D0=B2=D1=86=
=D0=B5=D0=B2-2 wrote
> Actually,  xpath()'s output is not original XML, at least there must be a
> function like XML_UNESCAPE() for current behavior
> otherwise there is no way to get proper xml with xpath

Based upon the link provided this has been acknowledged.

My personal opinion is that this is a bug (via omission) that the patch
introduced and that said patch author or committer should work to rectify.=
=20
Our XML/xpath handling is a large brick shy of a load if we do not provide =
a
way to make Unescaped text extraction possible.

David J.



--
View this message in context: http://postgresql.nabble.com/Re-BUG-8469-Xpat=
h-behaviour-unintuitive-arguably-wrong-tp5836629p5836676.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

Re: BUG #8469: Xpath behaviour unintuitive / arguably wrong

From
Вилен Тамбовцев
Date:
<div dir="ltr">Well, I had a conversation with Bruce Momjian he said that they added this bug to TODO list. Almost 2
yearshas passed since and the bug is still here. IMO this bug is actually a blocker for XML code part since xpath is
corefunctionality. -- Вилен Тамбовцев > Date: Tue, 3 Feb 2015 23:22:54 -0700 > From: david.g.johnston@gmail.com
>To: pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #8469: Xpath behaviour unintuitive / arguably wrong >
>Вилен Тамбовцев-2 wrote >> Actually, xpath()'s output is not original XML, at least there must be a >>
functionlike XML_UNESCAPE() for current behavior >> otherwise there is no way to get proper xml with xpath >
>Based upon the link provided this has been acknowledged. > > My personal opinion is that this is a bug (via
omission)that the patch > introduced and that said patch author or committer should work to rectify. > Our
XML/xpathhandling is a large brick shy of a load if we do not provide a > way to make Unescaped text extraction
possible.> > David J. > > > > -- > View this message in context:
http://postgresql.nabble.com/Re-BUG-8469-Xpath-behaviour-unintuitive-arguably-wrong-tp5836629p5836676.html> Sent
fromthe PostgreSQL - bugs mailing list archive at Nabble.com. > > > -- > Sent via pgsql-bugs mailing list
(pgsql-bugs@postgresql.org)> To make changes to your subscription: >
http://www.postgresql.org/mailpref/pgsql-bugs</div>