Thread: [PATCH] Bug in XPATH() if expression returns a scalar value

[PATCH] Bug in XPATH() if expression returns a scalar value

From
Florian Pflug
Date:
Hi

The in-core XPATH() function currently only handles XPath expressions which
return a node set correctly. For XPath expressions which return boolean,
numeric or string values, XPATH() returns an empty array. (This is the case
for XPath expressions whose top-level syntactic construct isn't a path
expression but rather one of the built-in functions like name() or count()).

The XPath expression 'name(/*)', for example, is supposed to return 'root'
when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres,
however, currently returns an empty array.

A patch which fixes that is attached. It makes XPATH() return a single-element
array containing a textual representation of the string, numeric value or boolean
for XPath expressions which don't return a node set. For numeric and boolean
results, the textual representation is obtained by calling float8out() respectively
boolout(). These function's results are assumed to always be well-form XML. For
string results, xml_in() is used to verify that the string returned by the XPath
expression is a well-formed XML fragment. This is required because XPATH() could
otherwise be used to insert invalid XML fragments into columns of type XML.

The patch add a few tests of non-nodeset returning XPath expressions to the
XML regression tests.

I'm not 100% clear on whether I should add this to the next commit fest or not -
after all, it's more a bug-fix rather than a new feature. But to prevent this
from getting lost, I'll add it unless someone tells me not to.

best regards,
Florian Pflug

Attachment

Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
Florian Pflug
Date:
Sorry for the self-reply but I figured it'd be helpful to add information
that I discovered only after my initial post.

On May30, 2011, at 15:17 , Florian Pflug wrote:
> The XPath expression 'name(/*)', for example, is supposed to return 'root'
> when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres,
> however, currently returns an empty array.

In the mean while, I've discovered that this was discussed previously about
a year ago here: http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php

The basic confusion seems to be whether XPATH() is supposed to work on
everything that http://www.w3.org/TR/xpath/ consideres to be an "Expression",
or only on what that document calls a "Location Path".

The difference is basically that "Location Paths" server purely as
predicates, i.e. *select* a subset of nodes from an XML fragment, while
"Expressions" can produce node sets *or* arbitrary scalar values
(boolean, numeric or string).

According to the thread from last summer, XLST handles this by defining
*two* constructs which evaluate XPath expressions, one for those which
return node sets (<xsl:template match="...">) and one for those which
return scalar values (<xsl:value-of select="...">).

My patch makes XPATH() work for both nodset-returning
*and* scalar-value-returning expressions. This has the advantage
of being simpler, but it does force the scalar values produced
by an XPath expression to be valid XML fragments. For boolean and
numeric values this isn't a problem, but it does limit what you
can do with string-returning XPath expressions.

If people deem this to be a problem, we could instead add a separate
function XPATH_VALUE() that returns VARCHAR, and make people use that
for scalar-value-returning expressions. However, to avoid confusion,
XPATH() should then be taught to raise an error if used for scalar-value
returning expressions, instead of silently returning an empty array as
it does now.

Thoughts, anyone?

best regards,
Florian Pflug



Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
"Ross J. Reedstrom"
Date:
On Tue, May 31, 2011 at 04:19:29PM +0200, Florian Pflug wrote:
> Sorry for the self-reply but I figured it'd be helpful to add information
> that I discovered only after my initial post.
> 
> On May30, 2011, at 15:17 , Florian Pflug wrote:
> > The XPath expression 'name(/*)', for example, is supposed to return 'root'
> > when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres,
> > however, currently returns an empty array.
> 
> In the mean while, I've discovered that this was discussed previously about
> a year ago here:
>   http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php
> 
> The basic confusion seems to be whether XPATH() is supposed to work on
> everything that http://www.w3.org/TR/xpath/ consideres to be an "Expression",
> or only on what that document calls a "Location Path".
> 
> The difference is basically that "Location Paths" server purely as
> predicates, i.e. *select* a subset of nodes from an XML fragment, while
> "Expressions" can produce node sets *or* arbitrary scalar values
> (boolean, numeric or string).
> 
> According to the thread from last summer, XLST handles this by defining
> *two* constructs which evaluate XPath expressions, one for those which
> return node sets (<xsl:template match="...">) and one for those which
> return scalar values (<xsl:value-of select="...">).
>
> My patch makes XPATH() work for both nodset-returning
> *and* scalar-value-returning expressions. This has the advantage
> of being simpler, but it does force the scalar values produced
> by an XPath expression to be valid XML fragments. For boolean and
> numeric values this isn't a problem, but it does limit what you
> can do with string-returning XPath expressions.
> 
> If people deem this to be a problem, we could instead add a separate
> function XPATH_VALUE() that returns VARCHAR, and make people use that
> for scalar-value-returning expressions. However, to avoid confusion,
> XPATH() should then be taught to raise an error if used for scalar-value
> returning expressions, instead of silently returning an empty array as
> it does now.
> 
> Thoughts, anyone?

I think it's important to note here that the nodeset returning nature of
XPATH in XSLT is a context setting functionality: these nodes are then
further processed by the template. In the postgresql case, the
distinction between returning a value and doing further processing isn't
so clear. My one use-cases tend toward processing a table full of
records with an XML field, using the XPATH to select out fragments and
records ids into a secondary table for further processing/analysis.

What you describe, making XPATH return something for the scalar
functions, is sorely needed. Constraining the return values to be valid
XML fragments is the sort of wart that makes XML processing in
postgresql seem odd to those familiar with other tools, though. How
about naming the other function XPATH_VALUE_OF, just to make it the XSLT
connection clear?

Ross
-- 
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Systems Engineer & Admin, Research Scientist        phone: 713-348-6166
Connexions                  http://cnx.org            fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE


Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
Florian Pflug
Date:
On May31, 2011, at 19:15 , Ross J. Reedstrom wrote:
> What you describe, making XPATH return something for the scalar
> functions, is sorely needed. Constraining the return values to be valid
> XML fragments is the sort of wart that makes XML processing in
> postgresql seem odd to those familiar with other tools, though.

I've now changes things so that the results of scalar-value
returning XPath expressions are correctly entity-encoded (i.e.,
a literal "<" gets translated to "<").

After realizing that this is necessary even for node-set
returning XPath expressions (see my other mail from today),
because they may very well select text nodes, I came to the
conclusion that doing this unconditionally (well, except for
element nodes obviously) is the least surprising behaviour.

The following subsumes the behavior with this and the patch
from my other e-mail applied.

SELECT
        (XPATH('namespace-uri(/*)', x))[1] AS namespace,
        (XPATH('/*/@value', x))[1] AS value,
        (XPATH('/*/text()', x))[1] AS text
FROM (VALUES (XMLELEMENT(name "root",
        XMLATTRIBUTES('<n' AS xmlns, '<v' AS value),
        '<t'
))) v(x);

 namespace | value | text
-----------+-------+-------
 <n     | <v | <t

Without the patch from the other mail, the "namespace" result
stays the same, but "value" and "text" are "<v" and "<t"
respectively.

Updated patch is attached

best regards,
Florian Pflug

PS: Btw, while trying this I think I found another problem. If you
do

SELECT (XPATH(
        '/*',
        XMLELEMENT(NAME "root",
                   XMLATTRIBUTES('<n' AS xmlns,
                                 '<v' AS value))
))[1];

you get

              xpath
----------------------------------
 <root xmlns="<n" value="<v"/>

i.e. the "<" in the namespace URI isn't quoted properly.
Trying to cast that value to text and back to xml fails.
Funnily enough, if you skip the XPATH() call, things work properly

SELECT XMLELEMENT(NAME "root",
                  XMLATTRIBUTES('<n' AS xmlns,
                                '<v' AS value))

gives

             xmlelement
-------------------------------------
 <root xmlns="<n" value="<v"/>

I'll start a new thread for this issue...

Attachment

Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
Peter Eisentraut
Date:
On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote:
> If people deem this to be a problem, we could instead add a separate
> function XPATH_VALUE() that returns VARCHAR, and make people use that
> for scalar-value-returning expressions.

Why not replicate what contrib/xml2 provides, namely

xpath_string()
xpath_number()
xpath_bool()

That way, types are preserved.

> However, to avoid confusion,
> XPATH() should then be taught to raise an error if used for
> scalar-value
> returning expressions, instead of silently returning an empty array as
> it does now.

Sounds reasonable.



Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
Florian Pflug
Date:
On Jun6, 2011, at 14:56 , Peter Eisentraut wrote:
> On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote:
>> If people deem this to be a problem, we could instead add a separate
>> function XPATH_VALUE() that returns VARCHAR, and make people use that
>> for scalar-value-returning expressions.
> 
> Why not replicate what contrib/xml2 provides, namely
> 
> xpath_string()
> xpath_number()
> xpath_bool()
> 
> That way, types are preserved.


But then you lose the ability to evaluate user-supplied
XPath expressions, because there's no way of telling which of these
function to use.

Since XPATH_BOOL() can be emulated by doing XPATH(...)::text::boolean
if XPATH() implements the more lenient behaviour I proposed, that
seems like a bad tradeoff overall.

best regards,
Florian Pflug



Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
Florian Pflug
Date:
On Jun8, 2011, at 10:14 , Florian Pflug wrote:
> On Jun6, 2011, at 14:56 , Peter Eisentraut wrote:
>> On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote:
>>> If people deem this to be a problem, we could instead add a separate
>>> function XPATH_VALUE() that returns VARCHAR, and make people use that
>>> for scalar-value-returning expressions.
>>
>> Why not replicate what contrib/xml2 provides, namely
>>
>> xpath_string()
>> xpath_number()
>> xpath_bool()
>>
>> That way, types are preserved.
>
> But then you lose the ability to evaluate user-supplied
> XPath expressions, because there's no way of telling which of these
> function to use.
>
> Since XPATH_BOOL() can be emulated by doing XPATH(...)::text::boolean
> if XPATH() implements the more lenient behaviour I proposed, that
> seems like a bad tradeoff overall.


Patch rebased onto HEAD.

best regards,
Florian Pflug

Attachment

Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
Peter Eisentraut
Date:
On ons, 2011-06-08 at 10:14 +0200, Florian Pflug wrote:
> But then you lose the ability to evaluate user-supplied
> XPath expressions, because there's no way of telling which of these
> function to use.

Perhaps having both variants, one type-safe and one not, would work.  I
don't agree with doing away with type-safety completely for the sake of
convenience.




Re: [PATCH] Bug in XPATH() if expression returns a scalar value

From
Florian Pflug
Date:
On Jun13, 2011, at 21:24 , Peter Eisentraut wrote:
> On ons, 2011-06-08 at 10:14 +0200, Florian Pflug wrote:
>> But then you lose the ability to evaluate user-supplied
>> XPath expressions, because there's no way of telling which of these
>> function to use.
>
> Perhaps having both variants, one type-safe and one not, would work.  I
> don't agree with doing away with type-safety completely for the sake of
> convenience.

In theory, I agree.

In practice, however, XPath 1.0 isn't very strongly typed itself. The
built-in types are all auto-converted into one another (As if
string(), number(), boolean() had been called). Also, only three of
the functions defined by XPath 1.0 seem interesting. Here's the break-down

The functions returning "string" are string(): Converts arbitrary values to strings local-name(): Name of node-set's
firsttop-level w/o namespace prefix namespace-uri(): Namespace of node-set's first top-level name(): Namespace prefix
andname of node-set's first top-level node concat() starts-with() contains() substring-before() substring-after()
substring()string-length() translate()  
For all of these function postgres provides corresponding SQL functions,
which the exception of local-name() namespace-uri() name()

In fact, these three functions are the raison d'être for my patch and this thread.
I needed to find the name of a tag returned by an XPath expression, and to my
dismay discovered that XPATH('local-name(...)', ...) returns an empty array. The
only reason I added support for boolean and numeric XPath expressions at all was
for the sake of completeness.

Here's the rest of the functions defined by XPath 1.0. I'm convinces that none
of them are particularly useful as top-level functions, and therefore believe
that adding XPATH_BOOLEAN() and XPATH_NUMBER() is largely overkill.

The functions returning "number" are number(): Converts arbitrary values to numbers last() position() count() sum():
Sumover a node-set after implicit conversion of nodes to numbers floor() ceiling() round() operators +, -, *, div, mod 

The functions returning "boolean" are boolean(): Converts arbitrary to boolean not() true() false() operators or, and,
=,!=, <=, <, >=, > 

best regards,
Florian Pflug