On 14.03.23 18:40, Tom Lane wrote:
> Jim Jones <jim.jones@uni-muenster.de> writes:
>> [ v22-0001-Add-pretty-printed-XML-output-option.patch ]
> I poked at this for awhile and ran into a problem that I'm not sure
> how to solve: it misbehaves for input with embedded DOCTYPE.
>
> regression=# SELECT xmlserialize(DOCUMENT '<!DOCTYPE a><a/>' as text indent);
> xmlserialize
> --------------
> <!DOCTYPE a>+
> <a></a> +
>
> (1 row)
The issue was the flag XML_SAVE_NO_EMPTY. It was forcing empty elements
to be serialized with start-end tag pairs. Removing it did the trick ...
postgres=# SELECT xmlserialize(DOCUMENT '<!DOCTYPE a><a/>' AS text INDENT);
xmlserialize
--------------
<!DOCTYPE a>+
<a/> +
(1 row)
... but as a side effect, empty start-end tag pairs will now be
serialized as empty-element tags:
postgres=# SELECT xmlserialize(CONTENT '<foo><bar></bar></foo>' AS text INDENT);
xmlserialize
--------------
<foo> +
<bar/> +
</foo>
(1 row)
This seems to be the standard behavior of other XML indentation tools
(including Oracle).
> regression=# SELECT xmlserialize(CONTENT '<!DOCTYPE a><a/>' as text indent);
> xmlserialize
> --------------
>
> (1 row)
>
> The bad result for CONTENT is because xml_parse() decides to
> parse_as_document, but xmlserialize_indent has no idea that happened
> and tries to use the content_nodes list anyway. I don't especially
> care for the laissez faire "maybe we'll set *content_nodes and maybe
> we won't" API you adopted for xml_parse, which seems to be contributing
> to the mess. We could pass back more info so that xmlserialize_indent
> knows what really happened.
I added a new (nullable) output parameter to the xml_parse function that
returns the XmlOptionType actually used to parse the XML data. Now
xmlserialize_indent knows how the data was really parsed:
postgres=# SELECT xmlserialize(CONTENT '<!DOCTYPE a><a/>' AS text INDENT);
xmlserialize
--------------
<!DOCTYPE a>+
<a/> +
(1 row)
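For anyone who wants the shape of the xml_parse change without opening the patch, here is a rough standalone sketch of the nullable out-parameter idea. It is not the patch code: every sketch_* name and the SketchXmlOption enum are made up for illustration, and the strstr() check is only a stand-in for however xml_parse really decides to parse_as_document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for XmlOptionType; not the PostgreSQL definition. */
typedef enum
{
    SKETCH_XMLOPTION_DOCUMENT,
    SKETCH_XMLOPTION_CONTENT
} SketchXmlOption;

/*
 * Assumption: a crude substring test stands in for the real logic that
 * makes the parser treat CONTENT input as a document.
 */
bool
sketch_has_doctype(const char *input)
{
    return strstr(input, "<!DOCTYPE") != NULL;
}

/*
 * Pretend parser.  "parsed_as" is the nullable out-parameter: callers that
 * do not care pass NULL, callers that do care learn which mode was really
 * used, regardless of what they requested.
 */
void
sketch_parse(const char *input, SketchXmlOption requested,
             SketchXmlOption *parsed_as)
{
    SketchXmlOption actual = requested;

    assert(input != NULL);

    /* CONTENT input carrying a DOCTYPE is handled as a document. */
    if (requested == SKETCH_XMLOPTION_CONTENT && sketch_has_doctype(input))
        actual = SKETCH_XMLOPTION_DOCUMENT;

    if (parsed_as != NULL)
        *parsed_as = actual;
}

/*
 * Caller side: choose the serialization path from what the parser
 * reports, not from what was requested.
 */
bool
sketch_serialize_uses_document_path(const char *input,
                                    SketchXmlOption requested)
{
    SketchXmlOption parsed_as;

    sketch_parse(input, requested, &parsed_as);
    return parsed_as == SKETCH_XMLOPTION_DOCUMENT;
}
```

With this shape the caller no longer has to guess whether the content node list was filled in: it branches on the reported mode.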
I added test cases for these queries.
v23 attached.
Thanks!
Best, Jim