Re: [PATCH] Add CANONICAL option to xmlserialize - Mailing list pgsql-hackers
From | Jim Jones |
---|---|
Subject | Re: [PATCH] Add CANONICAL option to xmlserialize |
Date | |
Msg-id | 8e904363-586c-4de9-9763-f9d8362f8306@uni-muenster.de |
In response to | Re: [PATCH] Add CANONICAL option to xmlserialize (Pavel Stehule <pavel.stehule@gmail.com>) |
Responses |
Re: [PATCH] Add CANONICAL option to xmlserialize
|
List | pgsql-hackers |
On 26.08.24 12:30, Pavel Stehule wrote:
> I think so there should be specified the target of CANONICAL - it is a
> partial replacement of NO INDENT or it produces format just for
> comparing? The CANONICAL format is not probably extra standardized,
> because libxml2 removes indenting, but examples in
> https://www.w3.org/TR/xml-c14n11/ doesn't do it. So this format makes
> sense just for local operations.

My idea with CANONICAL was not to replace NO INDENT. The intent was to
format xml strings in a standardized way, so that they can be compared:
for instance, by removing comments, sorting attributes, converting CDATA
sections, converting empty elements to start-end tag pairs, removing
whitespace between elements, etc.

The W3C recommendation for Canonical XML[1] dictates the following
regarding the removal of whitespace between elements:

* Whitespace outside of the document element and within start and end
  tags is normalized
* All whitespace in character content is retained (excluding characters
  removed during line feed normalization)

>
> I like this functionality, and it is great so the functionality from
> libxml2 can be used, but I think, so the fact that there are four not
> compatible implementations of xmlserialize is messy. Can be nice, if
> we find some intersection between SQL/XML, Oracle instead of new
> proprietary syntax.
>
> In Oracle syntax the CANONICAL is +/- NO INDENT SHOW DEFAULT ?

No.

XMLSERIALIZE ... NO INDENT is supposed, as the name suggests, to
serialize an xml string without indenting it. One could argue that "not
indenting" can be read as "removing existing indentation", but I couldn't
find anything concrete about this in the SQL/XML spec. If that is indeed
the case, we should correct XMLSERIALIZE ... NO INDENT, but that is
unrelated to this patch.

CANONICAL serializes a physical representation of an xml document. In a
nutshell, XMLSERIALIZE ... CANONICAL sort of "rewrites" the xml string
according to the following rules (list from the W3C recommendation):

* The document is encoded in UTF-8
* Line breaks normalized to #xA on input, before parsing
* Attribute values are normalized, as if by a validating processor
* Character and parsed entity references are replaced
* CDATA sections are replaced with their character content
* The XML declaration and document type declaration are removed
* Empty elements are converted to start-end tag pairs
* Whitespace outside of the document element and within start and end
  tags is normalized
* All whitespace in character content is retained (excluding characters
  removed during line feed normalization)
* Attribute value delimiters are set to quotation marks (double quotes)
* Special characters in attribute values and character content are
  replaced by character references
* Superfluous namespace declarations are removed from each element
* Default attributes are added to each element
* Fixup of xml:base attributes [C14N-Issues] is performed
* Lexicographic order is imposed on the namespace declarations and
  attributes of each element

btw: Oracle's SIZE =, HIDE DEFAULTS, and SHOW DEFAULTS are not part of
the SQL/XML standard either :)

> My objection against CANONICAL so SQL/XML and Oracle allows to
> parametrize XMLSERIALIZE more precious and before implementing new
> feature, we should to clean table and say, what we want to have in
> XMLSERIALIZE.
>
> An alternative of enhancing of XMLSERIALIZE I can imagine just
> function "to_canonical(xml, without_comments bool default false)". In
> this case we don't need to solve relations against SQL/XML or Oracle.

Creating a separate serialization function would IMHO be way less
elegant than parametrizing XMLSERIALIZE, but it is something I could
live with in case we decide to go down this path.

Thanks!

--
Jim

1 - https://www.w3.org/TR/xml-c14n11/
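
To make the "format so that they can be compared" use case above a bit
more concrete, here is a minimal sketch outside of PostgreSQL. It
assumes Python's standard-library xml.etree.ElementTree.canonicalize(),
which implements C14N 2.0 rather than the C14N 1.1 that libxml2 and the
patch rely on, so it only illustrates the idea and not the patch's exact
output. The sample documents are invented for this example.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical sample documents, invented for illustration only.
# doc_a and doc_b differ in: XML declaration, a comment, attribute order,
# attribute quoting, empty-element syntax, and CDATA vs. escaped text.
doc_a = (
    '<?xml version="1.0"?><!-- note -->'
    "<doc b=\"2\" a='1'><empty/><t><![CDATA[x < y]]></t></doc>"
)
doc_b = '<doc a="1" b="2"><empty></empty><t>x &lt; y</t></doc>'

# doc_c equals doc_b except for one space in the character content of <doc>.
doc_c = '<doc a="1" b="2"> <empty></empty><t>x &lt; y</t></doc>'

# canonicalize() returns the canonical form as a string; comments are
# dropped by default (with_comments=False).
canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)
canon_c = canonicalize(doc_c)

# Lexically different but logically equivalent documents compare equal.
print(canon_a == canon_b)

# Whitespace in character content is retained, so doc_c stays different.
print(canon_a == canon_c)
```

The last comparison is the point raised in the two whitespace bullets
quoted from the recommendation: canonicalization by itself keeps
whitespace that is part of character content, so any removal of
indentation is a separate decision made by the caller or the parser.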