[PATCH] Add CANONICAL option to xmlserialize - Mailing list pgsql-hackers

From Jim Jones
Subject [PATCH] Add CANONICAL option to xmlserialize
Date
Msg-id 2f04079e-b237-1a2f-0682-9f60ca59805e@uni-muenster.de
Whole thread Raw
In response to [PoC] Add CANONICAL option to xmlserialize  (Jim Jones <jim.jones@uni-muenster.de>)
Responses Re: [PATCH] Add CANONICAL option to xmlserialize  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On 27.02.23 14:16, I wrote:
> Hi,
>
> In order to compare pairs of XML documents for equivalence it is 
> necessary to convert them first to their canonical form, as described 
> at W3C Canonical XML 1.1.[1] This spec basically defines a standard 
> physical representation of xml documents that have more then one 
> possible representation, so that it is possible to compare them, e.g. 
> forcing UTF-8 encoding, entity reference replacement, attributes 
> normalization, etc.
>
> Although it is not part of the XML/SQL standard, it would be nice to 
> have the option CANONICAL in xmlserialize. Additionally, we could also 
> add the attribute WITH [NO] COMMENTS to keep or remove xml comments 
> from the documents.
>
> Something like this:
>
> WITH t(col) AS (
>  VALUES
>   ('<?xml version="1.0" encoding="ISO-8859-1"?>
>   <!DOCTYPE doc SYSTEM "doc.dtd" [
>   <!ENTITY val "42">
>   <!ATTLIST xyz attr CDATA "default">
>   ]>
>
>   <!-- ordering of attributes -->
>   <foo ns:c = "3" ns:b = "2" ns:a = "1"
>     xmlns:ns="http://postgresql.org">
>
>     <!-- Normalization of whitespace in start and end tags -->
>     <!-- Elimination of superfluous namespace declarations,
>          as already declared in <foo> -->
>  <bar     xmlns:ns="http://postgresql.org" >&val;</bar     >
>
>     <!-- Empty element conversion to start-end tag pair -->
>     <empty/>
>
>     <!-- Effect of transcoding from a sample encoding to UTF-8 -->
>     <iso8859>©</iso8859>
>
>     <!-- Addition of default attribute -->
>     <!-- Whitespace inside tag preserved -->
>     <xyz> 321 </xyz>
>   </foo>
>   <!-- comment outside doc -->'::xml)
> )
> SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t;
> xmlserialize
>
--------------------------------------------------------------------------------------------------------------------------------------------------------

>
>  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" 
> ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz 
> attr="default"> 321 </xyz></foo>
> (1 row)
>
> -- using WITH COMMENTS
>
> WITH t(col) AS (
>  VALUES
>   (' <foo ns:c = "3" ns:b = "2" ns:a = "1"
>     xmlns:ns="http://postgresql.org">
>     <!-- very important comment -->
>     <xyz> 321 </xyz>
>   </foo>'::xml)
> )
> SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
> xmlserialize
>
------------------------------------------------------------------------------------------------------------------------

>
>  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!-- 
> very important comment --><xyz> 321 </xyz></foo>
> (1 row)
>
>
> Another option would be to simply create a new function, e.g. 
> xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this 
> would be the right approach.
>
> Attached a very short draft. What do you think?
>
> Best, Jim
>
> 1- https://www.w3.org/TR/xml-c14n11/

The attached version includes documentation and tests to the patch.

I hope things are clearer now :)

Best, Jim

Attachment

pgsql-hackers by date:

Previous
From: Ankit Kumar Pandey
Date:
Subject: Re: [Question] Similar Cost but variable execution time in sort
Next
From: Melanie Plageman
Date:
Subject: Re: Should vacuum process config file reload more often