Hi all,
As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT should accept all valid XML. At this time, XML with a DOCTYPE declaration is not accepted with this setting even though it is considered valid XML. I'd like to work on a patch to address this issue and make it work as advertised.
I traced the source of the error to line ~1500 in /src/backend/utils/adt/xml.c:
res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string + count, NULL);
It looks like libxml's xmlParseBalancedChunkMemory is what fails when there's a DOCTYPE in the XML data. My assumption is that the DOCTYPE declaration makes the XML not well-balanced. From the libxml documentation, this function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise
I see that xmlParseBalancedChunkMemoryRecover might provide the functionality needed. That function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise. In case recover is set to 1, the nodelist will not be empty even if the parsed chunk is not well balanced, assuming the parsing succeeded to some extent.
I haven't tested yet whether this parses the data with a DOCTYPE successfully. If it does, I don't think it would be difficult to update the check on res_code so that it does not fail. I'm also assuming that libxml returns a distinct code that differentiates this case from other errors, but I couldn't find those codes quickly. The current check is this:
if (res_code != 0 || xmlerrcxt->err_occurred)
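To make the idea concrete, here is a minimal sketch of what a relaxed check could look like. It rests on the unconfirmed assumption above that libxml reports the DOCTYPE case with one distinct code. DOCTYPE_ERRCODE_PLACEHOLDER and chunk_parse_failed are hypothetical names used only for illustration, and the placeholder value is invented; neither exists in libxml or in xml.c.

```c
#include <stdbool.h>

/*
 * HYPOTHETICAL placeholder: stands in for whichever libxml error code (if
 * any) identifies the DOCTYPE case. The value is invented for illustration
 * and is not a real libxml code.
 */
#define DOCTYPE_ERRCODE_PLACEHOLDER (-9999)

/*
 * Hypothetical helper: returns true when the parse result should still be
 * reported as an error. res_code mirrors the documented return value:
 * 0 if the chunk is well balanced, -1 for an args problem, otherwise the
 * parser error code.
 */
static bool
chunk_parse_failed(int res_code, bool err_occurred)
{
    /* Clean parse: fail only if the error context recorded a problem. */
    if (res_code == 0)
        return err_occurred;

    /* Tolerate only the (assumed) DOCTYPE-specific code. */
    if (res_code == DOCTYPE_ERRCODE_PLACEHOLDER)
        return false;

    /* Args problem or any other parser error: keep failing as today. */
    return true;
}
```

With a helper like this, the check would read if (chunk_parse_failed(res_code, xmlerrcxt->err_occurred)). Note that the sketch ignores err_occurred in the DOCTYPE case, on the guess that the error handler would also fire there; that would need to be verified so that unrelated errors are not masked.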
Does this sound reasonable? Have I missed some major aspect? If this is on the right track I can work on creating a patch to move this forward.
Thanks,
Ryan Lambert
RustProof Labs