Re: [PATCH] Add CANONICAL option to xmlserialize - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: [PATCH] Add CANONICAL option to xmlserialize
Date
Msg-id CAFj8pRDrgOoJzxxOAswGcr7E+JZ-1SOoX+Oy3_RTPV=Jg4YGHw@mail.gmail.com
In response to Re: [PATCH] Add CANONICAL option to xmlserialize  (Jim Jones <jim.jones@uni-muenster.de>)
Responses Re: [PATCH] Add CANONICAL option to xmlserialize
List pgsql-hackers


On Tue, 27 Aug 2024 at 13:57, Jim Jones <jim.jones@uni-muenster.de> wrote:


On 26.08.24 16:59, Pavel Stehule wrote:
>
> 1. what about behaviour of NO INDENT - the implementation is not too
> old, so it can be changed if we want (I think), and it is better to do
> early than too late

While checking the feasibility of removing indentation with NO INDENT I
may have found a bug in XMLSERIALIZE ... INDENT.
xmlSaveToBuffer seems to ignore elements if there is whitespace
between them:

SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>' AS text INDENT);
  xmlserialize   
-----------------
 <foo>          +
   <bar>42</bar>+
 </foo>         +
 
(1 row)

SELECT xmlserialize(DOCUMENT '<foo> <bar>42</bar> </foo>'::xml AS text
INDENT);
        xmlserialize        
----------------------------
 <foo> <bar>42</bar> </foo>+
 
(1 row)

I'll take a look at it.

+1


Regarding removing indentation: yes, it would be possible with libxml2.
The question is if it would be right to do so.
> 2. Are we able to implement SQL/XML syntax with libxml2?
>
> 3. Are we able to implement Oracle syntax with libxml2? And there are
> benefits other than higher possible compatibility?
I guess it would be beneficial if you're migrating from Oracle to
Postgres - or the other way around. It certainly wouldn't hurt, but so
far I personally have had little use for Oracle's extra xmlserialize
features.
>
> 4. Can there be some possible collision (functionality, syntax) with
> CANONICAL?
I couldn't find anything in the SQL/XML spec that might refer to
canonical xml.
>
> 5. SQL/XML XMLSERIALIZE supports other target types than varchar. I
> can imagine XMLSERIALIZE with CANONICAL to bytea (then we don't need
> to force database encoding). Does it make sense? Are the results
> comparable?
As of pg16 bytea is not supported. Currently type can be character,
character varying, or text - also their other flavours like 'name'.

I know, but theoretically there could be some benefit for CANONICAL if pg supported bytea there. A lot of databases still use a non-UTF8 encoding.

It is more of a theoretical question: if pg supports other target types there in the future (because of SQL/XML or Oracle), can CANONICAL be used with all of them, or only with text? And are you sure that you can compare text with text, instead of xml with xml?
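To make the text-versus-text comparison concrete, here is a minimal sketch outside PostgreSQL. It assumes Python's standard-library xml.etree.ElementTree.canonicalize (C14N 2.0) as a stand-in for whatever libxml2 does in the patch, so the exact output may differ from the proposed CANONICAL option. The sample documents are made up for illustration. The W3C Canonical XML recommendations define the canonical form as UTF-8, which is why a bytea target could avoid the database-encoding question.

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical documents that differ lexically (attribute order,
# quoting, spacing inside the tag, empty-element syntax) but carry
# the same content.
doc_a = '<doc b="2" a="1"><empty/></doc>'
doc_b = "<doc a='1'   b='2'><empty></empty></doc>"

# Canonicalization normalizes those lexical differences.
canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

raw_equal = doc_a == doc_b        # plain text comparison of the inputs
canon_equal = canon_a == canon_b  # text comparison of the canonical forms

# Canonical XML is specified as UTF-8, so the byte form is independent
# of any database or client encoding.
canon_bytes = canon_a.encode("utf-8")

print(raw_equal, canon_equal)
print(canon_bytes)
```

In this sketch the raw strings differ while the canonical strings are equal, so comparing canonical text does behave like comparing the documents, at least for these lexical differences.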

+SELECT xmlserialize(CONTENT doc AS text CANONICAL) = xmlserialize(CONTENT doc AS text CANONICAL WITH COMMENTS) FROM xmltest_serialize;
+ ?column?
+----------
+ t
+ t
+(2 rows)

Maybe I am a little bit confused by these regress tests, because in the end they are not very useful - you compare two identical XML documents, and WITH COMMENTS and WITHOUT COMMENTS are tested elsewhere. I tried to find the point of this test. It would be better to use really different documents (columns) instead.
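A small sketch of why such an equality test says little unless the inputs really differ. It again assumes Python's xml.etree.ElementTree.canonicalize as a stand-in, not the patch itself. Note that Python's default drops comments, which may not match the default of the proposed CANONICAL option, and the documents are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical test documents: one without comments, one with a comment.
doc_plain = "<foo><bar>42</bar></foo>"
doc_commented = "<foo><!-- note --><bar>42</bar></foo>"


def same_with_and_without_comments(doc: str) -> bool:
    """Compare the canonical form without comments to the one with comments."""
    return canonicalize(doc) == canonicalize(doc, with_comments=True)


# With no comments in the input, both variants must be identical,
# so the comparison cannot fail and tests nothing.
plain_result = same_with_and_without_comments(doc_plain)

# With a comment in the input, the two variants differ,
# so the comparison actually exercises the option.
commented_result = same_with_and_without_comments(doc_commented)

print(plain_result, commented_result)
```

A test built on really different documents, as in the second case, can fail when the option misbehaves; the first case passes regardless.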

Regards

Pavel
 


--
Jim
