Fix XML handling with DOCTYPE - Mailing list pgsql-hackers

From Ryan Lambert
Subject Fix XML handling with DOCTYPE
Date
Msg-id CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com
Whole thread Raw
Responses Re: Fix XML handling with DOCTYPE
Re: Fix XML handling with DOCTYPE
List pgsql-hackers
Hi all,


As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT should accept all valid XML.  At this time, XML with a DOCTYPE declaration is not accepted with this setting even though it is considered valid XML.  I'd like to work on a patch to address this issue and make it work as advertised.

I traced the source of the error to line ~1500 in /src/backend/utils/adt/xml.c

res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string + count, NULL);

It looks like it is xmlParseBalancedChunkMemory from libxml that doesn't work when there's a DOCTYPE in the XML data. My assumption is the DOCTYPE element makes the XML not well-balanced.  From:


This function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise

I see xmlParseBalancedChunkMemoryRecover that might provide the functionality needed. That function returns:

0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise In case recover is set to 1, the nodelist will not be empty even if the parsed chunk is not well balanced, assuming the parsing succeeded to some extent.

I haven't tested yet to see if this parses the data w/ DOCTYPE successfully yet.  If it does, I don't think it would be difficult to update the check on res_code to not fail.  I'm making another assumption that there is a distinct code from libxml to differentiate from other errors, but I couldn't find those codes quickly.  The current check is this:

if (res_code != 0 || xmlerrcxt->err_occurred)

Does this sound reasonable?  Have I missed some major aspect?  If this is on the right track I can work on creating a patch to move this forward.

Thanks,

Ryan Lambert
RustProof Labs

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Next
From: Chapman Flack
Date:
Subject: Re: Fix XML handling with DOCTYPE