Re: [PATCH] Add CANONICAL option to xmlserialize - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: [PATCH] Add CANONICAL option to xmlserialize
Msg-id CAFj8pRDN7yEPhKrr85ZWP_udF20R7qs0nvwsZhzQGpVp9fPRUg@mail.gmail.com
In response to Re: [PATCH] Add CANONICAL option to xmlserialize  (Jim Jones <jim.jones@uni-muenster.de>)
Responses Re: [PATCH] Add CANONICAL option to xmlserialize
List pgsql-hackers


On Mon, Aug 26, 2024 at 11:32, Jim Jones <jim.jones@uni-muenster.de> wrote:
Hi Pavel

On 25.08.24 20:57, Pavel Stehule wrote:
>
> There is unwanted white space in the patch
>
> -<-><--><-->xmlFreeDoc(doc);
> +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
> XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
> + <>{
> +<-><-->xmlChar    *xmlbuf = NULL;
> +<-><-->int         nbytes;
> +<-><-->int    
>
I missed that one. Just removed it, thanks!
> 1. The xml is always serialized to a UTF8 string, but when the target
> type is varchar or text, it should always be encoded in the
> database encoding. It is not possible to hold a UTF8 string in a
> latin2 database varchar.
I'm calling xml_parse using GetDatabaseEncoding(), so I thought I would
be on the safe side

if (format == XMLSERIALIZE_CANONICAL ||
    format == XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
    doc = xml_parse(data, XMLOPTION_DOCUMENT, false,
                    GetDatabaseEncoding(), NULL, NULL, NULL);
... or you mean something else?

Maybe I was confused by the initial message.
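
For reference, the encoding concern in point 1 can be illustrated outside PostgreSQL. The following is only a sketch in Python, not code from the patch: it assumes a database encoding of LATIN2 (ISO-8859-2) and uses a made-up sample string to show why UTF8 bytes must be transcoded before they are stored as text.

```python
# Sample text containing a character that exists in both UTF-8 and Latin-2.
text = "<val>č</val>"

utf8_bytes = text.encode("utf-8")         # what a UTF8 serializer would emit
latin2_bytes = text.encode("iso8859_2")   # what a LATIN2 database expects

# Reading the UTF-8 bytes as if they were Latin-2 yields different characters.
misread = utf8_bytes.decode("iso8859_2")

# Transcoding (decode as UTF-8, re-encode as Latin-2) gives the right bytes.
converted = utf8_bytes.decode("utf-8").encode("iso8859_2")

# Some characters cannot be represented in Latin-2 at all.
try:
    "€".encode("iso8859_2")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(utf8_bytes, latin2_bytes, converted == latin2_bytes, euro_fits)
```

In the backend the equivalent step would be a conversion to the server encoding; which function the patch should call for that is not settled by this sketch.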
 

> 2. The proposed feature can add some confusion around the implementation
> of NO INDENT. I am not an expert in this area, so I checked other
> databases. DB2 does not have anything similar. But Oracle's "NO INDENT"
> clause is very similar to the proposed "CANONICAL". Unfortunately,
> the behaviour of NO INDENT differs: Oracle's really removes
> formatting, while Postgres does nothing.

Coincidentally, the [NO] INDENT support for xmlserialize is an old patch
of mine.
It basically "does nothing" and prints the xml as is, e.g.

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text INDENT);
                xmlserialize                
--------------------------------------------
 <foo>                                     +
   <bar>                                   +
     <val z="1" a="8"><![CDATA[0&1]]></val>+
   </bar>                                  +
 </foo>                                    +
 
(1 row)

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text NO INDENT);
                         xmlserialize                         
--------------------------------------------------------------
 <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
(1 row)

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text);
                         xmlserialize                         
--------------------------------------------------------------
 <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
(1 row)

.. while CANONICAL converts the xml to its canonical form,[1,2] e.g.
sorting attributes and replacing CDATA sections with their content:

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text CANONICAL);
                     xmlserialize                     
------------------------------------------------------
 <foo><bar><val a="8" z="1">0&amp;1</val></bar></foo>
(1 row)
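
The same transformation can be tried without the patch. This is a minimal sketch using the C14N 2.0 implementation in Python's standard library (xml.etree.ElementTree.canonicalize, Python 3.8+), not libxml2, so it only approximates what the patch does; for this small document the attribute sorting and CDATA handling should be the same.

```python
from xml.etree.ElementTree import canonicalize

# Same input document as in the xmlserialize examples above.
doc = '<foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>'

# Canonicalization sorts the attributes and replaces the CDATA section
# with its character content, escaping "&" as "&amp;".
canonical = canonicalize(doc)
print(canonical)
```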

xmlserialize CANONICAL does not exist in any other database and it's not
part of the SQL/XML standard.

Regarding the different behaviour of NO INDENT in Oracle and PostgreSQL:
it is not entirely clear to me if SQL/XML states that NO INDENT must
remove the indentation from xml strings.
It says:

"INDENT — the choice of whether to “pretty-print” the serialized XML by
means of indentation, either
True or False.
....
i) If <XML serialize indent> is specified and does not contain NO, then
let IND be True.
ii) Otherwise, let IND be False."

When I wrote the patch I assumed it meant to leave the xml as is .. but
I might be wrong.
Perhaps it would be best if we open a new thread for this topic.

I think the purpose of CANONICAL should be specified: is it a partial replacement for NO INDENT, or does it produce a format intended just for comparison? The CANONICAL format is probably not strictly standardized, because libxml2 removes indentation, but the examples in https://www.w3.org/TR/xml-c14n11/ don't. So this format makes sense just for local operations.
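
To make the whitespace question concrete, here is a small sketch with Python's standard-library canonicalizer (an assumption for illustration; it is not libxml2 and not the patch). By default it keeps the whitespace between elements; stripping it requires the strip_text option, which is specific to that Python function.

```python
from xml.etree.ElementTree import canonicalize

# An indented document, as xmlserialize ... INDENT might produce it.
indented = "<foo>\n  <bar>\n    <val>1</val>\n  </bar>\n</foo>"

# Default behaviour: whitespace-only text between elements is preserved.
kept = canonicalize(indented)

# Opt-in behaviour: leading and trailing whitespace of text is stripped,
# which removes the indentation.
stripped = canonicalize(indented, strip_text=True)

print(repr(kept))
print(repr(stripped))
```

Two outputs that both claim to be "canonical" can therefore differ byte for byte, which matters if the result is meant for comparison across tools.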

I like this functionality, and it is great that the functionality from libxml2 can be used, but I think having four incompatible implementations of xmlserialize is messy. It would be nice if we could find some intersection between SQL/XML and Oracle instead of introducing new proprietary syntax.

In Oracle syntax, is CANONICAL more or less NO INDENT SHOW DEFAULT?

My objection to CANONICAL is that SQL/XML and Oracle allow XMLSERIALIZE to be parametrized more precisely, and before implementing a new feature we should start with a clean slate and say what we want to have in XMLSERIALIZE.

As an alternative to extending XMLSERIALIZE, I can imagine just a function "to_canonical(xml, without_comments bool default false)". In that case we would not need to reconcile it with SQL/XML or Oracle.
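
A rough sketch of the semantics such a function could have, written in Python only to show the interface. The name and the without_comments parameter come from the suggestion above; the body uses Python's standard-library canonicalizer as a stand-in for libxml2 and is an assumption, not a proposed implementation.

```python
from xml.etree.ElementTree import canonicalize


def to_canonical(xml: str, without_comments: bool = False) -> str:
    """Return the canonical form of xml, keeping comments by default."""
    # The Python function takes the inverse flag (with_comments).
    return canonicalize(xml, with_comments=not without_comments)


sample = '<a z="1" b="2"><!--c--><b/></a>'
print(to_canonical(sample))
print(to_canonical(sample, without_comments=True))
```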


Thank you for reviewing this patch. Much appreciated!

Best,

--
Jim

1 - https://www.w3.org/TR/xml-c14n11/
2 - https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-c14n.html
