Thread: xpath does not seem to escape HTML correctly

xpath does not seem to escape HTML correctly

From

John Lamb

Date:

04 September 2014, 00:09:01

I am parsing XML into a table using xpath and having issues with regards to the HTML encoded characters, it seems that some escaped HTML codes are being converted but others are not.

Prior to injection into XML the raw data is : !"£$%^&*()<>

The raw XML is as follows : <PLAN>!"£$%^&*()<></PLAN>

When I run this through xpath as suggested in the post below I find that some codes are converted and some aren't

select (xpath('/PLAN/text()', ('<PLAN>!"£$%^&*()<></PLAN>')::xml))[1]::text

The result is this : !"£$%^&*()<>

It seems the &, < and > chars are not being handled but quote and GBP symbols are OK, when I run the query in the post below the "magic" and "toaster" works but I still get "s&witch"

http://www.postgresql.org/message-id/jm6hla$bld$1@reversiblemaps.ath.cx

I am using PostgreSQL 9.3 on windows but see this behaivour on 9.1 on linux.

Does anyone have any ideas what the problem is here or do you think it's some kind of bug ?

Re: xpath does not seem to escape HTML correctly

From

David G Johnston

Date:

04 September 2014, 01:50:09

John Lamb wrote
> I am parsing XML into a table using xpath and having issues with regards
> to
> the HTML encoded characters, it seems that some escaped HTML codes are
> being converted but others are not.
>
> Prior to injection into XML the raw data is : !"£$%^&*()<>
>
> The raw XML is as follows :
> <PLAN>
> !"£$%^&*()<>
> </PLAN>
> When I run this through xpath as suggested in the post below I find that
> some codes are converted and some aren't
>
> select (xpath('/PLAN/text()',
> ('
> <PLAN>
> !"£$%^&*()<>
> </PLAN>
> ')::xml))[1]::text
>
> The result is this : !"£$%^&*()<>
>
> It seems the &, < and > chars are not being handled but quote and GBP
> symbols are OK, when I run the query in the post below the "magic" and
> "toaster" works but I still get "s&witch"
>
> http://www.postgresql.org/message-id/jm6hla$bld$

> 1@.ath

>
> I am using PostgreSQL 9.3 on windows but see this behaivour on 9.1 on
> linux.
>
> Does anyone have any ideas what the problem is here or do you think it's
> some kind of bug ?

Discussed at the following.

http://www.postgresql.org/message-id/CAAY5AM1L83y79rtOZAUJioREO6n4=XAFKcGu6qO3hCZE1yJytg@mail.gmail.com

Note: I have not read the referred-to threads in the above thread...they
would be authoritative for reasoning but the following facts seem pertinent.

The behavior is indeed intentional - though it can be argued that this part
of the code could use some new features.

If &, <, and > were to be resolved then the resultant text would not be
valid xml.

Specifically - and the reasons those entities must be encoded in xml - there
is the possibility of those special characters imparting their meaning into
the document and causing it to be malformed: and so it is deemed malformed
by definition instead.

None of the other entities you show have that potential and so can be
directly output in the resulting text - assuming the xml engine recognizes
them, which it does for these.

The root of the issue is that "xpath(...)" has no comprehension whether you
are returning "text" or xml as a result of the matching - and so it picks
xml.  Thus, if you have special knowledge that what you are retrieving
should be "text". you need to send the output to some other function that
will decode the "text" using whatever encoding scheme you know that "text"
to conform to.

The missing feature currently is that there is no native PostgreSQL function
to decode html encoded "text".

If you were to go from xml to some encoding you also would need to pass the
xml result through an encoded to ensure that entities that xml is allowed to
unescape get re-escaped in the new format.

There is probably much more to this should you read the other threads but
that's it from me for now.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/xpath-does-not-seem-to-escape-HTML-correctly-tp5817659p5817667.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.