Thread: xpath does not seem to escape HTML correctly
I am parsing XML into a table using xpath and having issues with regards to the HTML encoded characters, it seems that some escaped HTML codes are being converted but others are not.
Prior to injection into XML the raw data is : !"£$%^&*()<>
The raw XML is as follows : <PLAN>!"£$%^&*()<></PLAN>
When I run this through xpath as suggested in the post below I find that some codes are converted and some aren't
select (xpath('/PLAN/text()', ('<PLAN>!"£$%^&*()<></PLAN>')::xml))[1]::text
The result is this : !"£$%^&*()<>
It seems the &, < and > chars are not being handled but quote and GBP symbols are OK, when I run the query in the post below the "magic" and "toaster" works but I still get "s&witch"
I am using PostgreSQL 9.3 on windows but see this behaivour on 9.1 on linux.
Does anyone have any ideas what the problem is here or do you think it's some kind of bug ?
John Lamb wrote > I am parsing XML into a table using xpath and having issues with regards > to > the HTML encoded characters, it seems that some escaped HTML codes are > being converted but others are not. > > Prior to injection into XML the raw data is : !"£$%^&*()<> > > The raw XML is as follows : > <PLAN> > !"£$%^&*()<> > </PLAN> > When I run this through xpath as suggested in the post below I find that > some codes are converted and some aren't > > select (xpath('/PLAN/text()', > (' > <PLAN> > !"£$%^&*()<> > </PLAN> > ')::xml))[1]::text > > The result is this : !"£$%^&*()<> > > It seems the &, < and > chars are not being handled but quote and GBP > symbols are OK, when I run the query in the post below the "magic" and > "toaster" works but I still get "s&witch" > > http://www.postgresql.org/message-id/jm6hla$bld$ > 1@.ath > > I am using PostgreSQL 9.3 on windows but see this behaivour on 9.1 on > linux. > > Does anyone have any ideas what the problem is here or do you think it's > some kind of bug ? Discussed at the following. http://www.postgresql.org/message-id/CAAY5AM1L83y79rtOZAUJioREO6n4=XAFKcGu6qO3hCZE1yJytg@mail.gmail.com Note: I have not read the referred-to threads in the above thread...they would be authoritative for reasoning but the following facts seem pertinent. The behavior is indeed intentional - though it can be argued that this part of the code could use some new features. If &, <, and > were to be resolved then the resultant text would not be valid xml. Specifically - and the reasons those entities must be encoded in xml - there is the possibility of those special characters imparting their meaning into the document and causing it to be malformed: and so it is deemed malformed by definition instead. None of the other entities you show have that potential and so can be directly output in the resulting text - assuming the xml engine recognizes them, which it does for these. The root of the issue is that "xpath(...)" has no comprehension whether you are returning "text" or xml as a result of the matching - and so it picks xml. Thus, if you have special knowledge that what you are retrieving should be "text". you need to send the output to some other function that will decode the "text" using whatever encoding scheme you know that "text" to conform to. The missing feature currently is that there is no native PostgreSQL function to decode html encoded "text". If you were to go from xml to some encoding you also would need to pass the xml result through an encoded to ensure that entities that xml is allowed to unescape get re-escaped in the new format. There is probably much more to this should you read the other threads but that's it from me for now. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/xpath-does-not-seem-to-escape-HTML-correctly-tp5817659p5817667.html Sent from the PostgreSQL - sql mailing list archive at Nabble.com.