Thread: [PATCH] Bug in XPATH() if expression returns a scalar value
Hi The in-core XPATH() function currently only handles XPath expressions which return a node set correctly. For XPath expressions which return boolean, numeric or string values, XPATH() returns an empty array. (This is the case for XPath expressions whose top-level syntactic construct isn't a path expression but rather one of the built-in functions like name() or count()). The XPath expression 'name(/*)', for example, is supposed to return 'root' when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres, however, currently returns an empty array. A patch which fixes that is attached. It makes XPATH() return a single-element array containing a textual representation of the string, numeric value or boolean for XPath expressions which don't return a node set. For numeric and boolean results, the textual representation is obtained by calling float8out() respectively boolout(). These function's results are assumed to always be well-form XML. For string results, xml_in() is used to verify that the string returned by the XPath expression is a well-formed XML fragment. This is required because XPATH() could otherwise be used to insert invalid XML fragments into columns of type XML. The patch add a few tests of non-nodeset returning XPath expressions to the XML regression tests. I'm not 100% clear on whether I should add this to the next commit fest or not - after all, it's more a bug-fix rather than a new feature. But to prevent this from getting lost, I'll add it unless someone tells me not to. best regards, Florian Pflug
Attachment
Sorry for the self-reply but I figured it'd be helpful to add information that I discovered only after my initial post. On May30, 2011, at 15:17 , Florian Pflug wrote: > The XPath expression 'name(/*)', for example, is supposed to return 'root' > when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres, > however, currently returns an empty array. In the mean while, I've discovered that this was discussed previously about a year ago here: http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php The basic confusion seems to be whether XPATH() is supposed to work on everything that http://www.w3.org/TR/xpath/ consideres to be an "Expression", or only on what that document calls a "Location Path". The difference is basically that "Location Paths" server purely as predicates, i.e. *select* a subset of nodes from an XML fragment, while "Expressions" can produce node sets *or* arbitrary scalar values (boolean, numeric or string). According to the thread from last summer, XLST handles this by defining *two* constructs which evaluate XPath expressions, one for those which return node sets (<xsl:template match="...">) and one for those which return scalar values (<xsl:value-of select="...">). My patch makes XPATH() work for both nodset-returning *and* scalar-value-returning expressions. This has the advantage of being simpler, but it does force the scalar values produced by an XPath expression to be valid XML fragments. For boolean and numeric values this isn't a problem, but it does limit what you can do with string-returning XPath expressions. If people deem this to be a problem, we could instead add a separate function XPATH_VALUE() that returns VARCHAR, and make people use that for scalar-value-returning expressions. However, to avoid confusion, XPATH() should then be taught to raise an error if used for scalar-value returning expressions, instead of silently returning an empty array as it does now. Thoughts, anyone? best regards, Florian Pflug
On Tue, May 31, 2011 at 04:19:29PM +0200, Florian Pflug wrote: > Sorry for the self-reply but I figured it'd be helpful to add information > that I discovered only after my initial post. > > On May30, 2011, at 15:17 , Florian Pflug wrote: > > The XPath expression 'name(/*)', for example, is supposed to return 'root' > > when applied to the XML fragment '<root><nested/><nested/></root>'. Postgres, > > however, currently returns an empty array. > > In the mean while, I've discovered that this was discussed previously about > a year ago here: > http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php > > The basic confusion seems to be whether XPATH() is supposed to work on > everything that http://www.w3.org/TR/xpath/ consideres to be an "Expression", > or only on what that document calls a "Location Path". > > The difference is basically that "Location Paths" server purely as > predicates, i.e. *select* a subset of nodes from an XML fragment, while > "Expressions" can produce node sets *or* arbitrary scalar values > (boolean, numeric or string). > > According to the thread from last summer, XLST handles this by defining > *two* constructs which evaluate XPath expressions, one for those which > return node sets (<xsl:template match="...">) and one for those which > return scalar values (<xsl:value-of select="...">). > > My patch makes XPATH() work for both nodset-returning > *and* scalar-value-returning expressions. This has the advantage > of being simpler, but it does force the scalar values produced > by an XPath expression to be valid XML fragments. For boolean and > numeric values this isn't a problem, but it does limit what you > can do with string-returning XPath expressions. > > If people deem this to be a problem, we could instead add a separate > function XPATH_VALUE() that returns VARCHAR, and make people use that > for scalar-value-returning expressions. However, to avoid confusion, > XPATH() should then be taught to raise an error if used for scalar-value > returning expressions, instead of silently returning an empty array as > it does now. > > Thoughts, anyone? I think it's important to note here that the nodeset returning nature of XPATH in XSLT is a context setting functionality: these nodes are then further processed by the template. In the postgresql case, the distinction between returning a value and doing further processing isn't so clear. My one use-cases tend toward processing a table full of records with an XML field, using the XPATH to select out fragments and records ids into a secondary table for further processing/analysis. What you describe, making XPATH return something for the scalar functions, is sorely needed. Constraining the return values to be valid XML fragments is the sort of wart that makes XML processing in postgresql seem odd to those familiar with other tools, though. How about naming the other function XPATH_VALUE_OF, just to make it the XSLT connection clear? Ross -- Ross Reedstrom, Ph.D. reedstrm@rice.edu Systems Engineer & Admin, Research Scientist phone: 713-348-6166 Connexions http://cnx.org fax: 713-348-3665 Rice University MS-375, Houston, TX 77005 GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
On May31, 2011, at 19:15 , Ross J. Reedstrom wrote: > What you describe, making XPATH return something for the scalar > functions, is sorely needed. Constraining the return values to be valid > XML fragments is the sort of wart that makes XML processing in > postgresql seem odd to those familiar with other tools, though. I've now changes things so that the results of scalar-value returning XPath expressions are correctly entity-encoded (i.e., a literal "<" gets translated to "<"). After realizing that this is necessary even for node-set returning XPath expressions (see my other mail from today), because they may very well select text nodes, I came to the conclusion that doing this unconditionally (well, except for element nodes obviously) is the least surprising behaviour. The following subsumes the behavior with this and the patch from my other e-mail applied. SELECT (XPATH('namespace-uri(/*)', x))[1] AS namespace, (XPATH('/*/@value', x))[1] AS value, (XPATH('/*/text()', x))[1] AS text FROM (VALUES (XMLELEMENT(name "root", XMLATTRIBUTES('<n' AS xmlns, '<v' AS value), '<t' ))) v(x); namespace | value | text -----------+-------+------- <n | <v | <t Without the patch from the other mail, the "namespace" result stays the same, but "value" and "text" are "<v" and "<t" respectively. Updated patch is attached best regards, Florian Pflug PS: Btw, while trying this I think I found another problem. If you do SELECT (XPATH( '/*', XMLELEMENT(NAME "root", XMLATTRIBUTES('<n' AS xmlns, '<v' AS value)) ))[1]; you get xpath ---------------------------------- <root xmlns="<n" value="<v"/> i.e. the "<" in the namespace URI isn't quoted properly. Trying to cast that value to text and back to xml fails. Funnily enough, if you skip the XPATH() call, things work properly SELECT XMLELEMENT(NAME "root", XMLATTRIBUTES('<n' AS xmlns, '<v' AS value)) gives xmlelement ------------------------------------- <root xmlns="<n" value="<v"/> I'll start a new thread for this issue...
Attachment
On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote: > If people deem this to be a problem, we could instead add a separate > function XPATH_VALUE() that returns VARCHAR, and make people use that > for scalar-value-returning expressions. Why not replicate what contrib/xml2 provides, namely xpath_string() xpath_number() xpath_bool() That way, types are preserved. > However, to avoid confusion, > XPATH() should then be taught to raise an error if used for > scalar-value > returning expressions, instead of silently returning an empty array as > it does now. Sounds reasonable.
On Jun6, 2011, at 14:56 , Peter Eisentraut wrote: > On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote: >> If people deem this to be a problem, we could instead add a separate >> function XPATH_VALUE() that returns VARCHAR, and make people use that >> for scalar-value-returning expressions. > > Why not replicate what contrib/xml2 provides, namely > > xpath_string() > xpath_number() > xpath_bool() > > That way, types are preserved. But then you lose the ability to evaluate user-supplied XPath expressions, because there's no way of telling which of these function to use. Since XPATH_BOOL() can be emulated by doing XPATH(...)::text::boolean if XPATH() implements the more lenient behaviour I proposed, that seems like a bad tradeoff overall. best regards, Florian Pflug
On Jun8, 2011, at 10:14 , Florian Pflug wrote: > On Jun6, 2011, at 14:56 , Peter Eisentraut wrote: >> On tis, 2011-05-31 at 16:19 +0200, Florian Pflug wrote: >>> If people deem this to be a problem, we could instead add a separate >>> function XPATH_VALUE() that returns VARCHAR, and make people use that >>> for scalar-value-returning expressions. >> >> Why not replicate what contrib/xml2 provides, namely >> >> xpath_string() >> xpath_number() >> xpath_bool() >> >> That way, types are preserved. > > But then you lose the ability to evaluate user-supplied > XPath expressions, because there's no way of telling which of these > function to use. > > Since XPATH_BOOL() can be emulated by doing XPATH(...)::text::boolean > if XPATH() implements the more lenient behaviour I proposed, that > seems like a bad tradeoff overall. Patch rebased onto HEAD. best regards, Florian Pflug
Attachment
On ons, 2011-06-08 at 10:14 +0200, Florian Pflug wrote: > But then you lose the ability to evaluate user-supplied > XPath expressions, because there's no way of telling which of these > function to use. Perhaps having both variants, one type-safe and one not, would work. I don't agree with doing away with type-safety completely for the sake of convenience.
On Jun13, 2011, at 21:24 , Peter Eisentraut wrote: > On ons, 2011-06-08 at 10:14 +0200, Florian Pflug wrote: >> But then you lose the ability to evaluate user-supplied >> XPath expressions, because there's no way of telling which of these >> function to use. > > Perhaps having both variants, one type-safe and one not, would work. I > don't agree with doing away with type-safety completely for the sake of > convenience. In theory, I agree. In practice, however, XPath 1.0 isn't very strongly typed itself. The built-in types are all auto-converted into one another (As if string(), number(), boolean() had been called). Also, only three of the functions defined by XPath 1.0 seem interesting. Here's the break-down The functions returning "string" are string(): Converts arbitrary values to strings local-name(): Name of node-set's firsttop-level w/o namespace prefix namespace-uri(): Namespace of node-set's first top-level name(): Namespace prefix andname of node-set's first top-level node concat() starts-with() contains() substring-before() substring-after() substring()string-length() translate() For all of these function postgres provides corresponding SQL functions, which the exception of local-name() namespace-uri() name() In fact, these three functions are the raison d'être for my patch and this thread. I needed to find the name of a tag returned by an XPath expression, and to my dismay discovered that XPATH('local-name(...)', ...) returns an empty array. The only reason I added support for boolean and numeric XPath expressions at all was for the sake of completeness. Here's the rest of the functions defined by XPath 1.0. I'm convinces that none of them are particularly useful as top-level functions, and therefore believe that adding XPATH_BOOLEAN() and XPATH_NUMBER() is largely overkill. The functions returning "number" are number(): Converts arbitrary values to numbers last() position() count() sum(): Sumover a node-set after implicit conversion of nodes to numbers floor() ceiling() round() operators +, -, *, div, mod The functions returning "boolean" are boolean(): Converts arbitrary to boolean not() true() false() operators or, and, =,!=, <=, <, >=, > best regards, Florian Pflug