Re: Another issue with invalid XML values - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Another issue with invalid XML values
Date
Msg-id 20110620204548.GC17037@tornado.leadboat.com
Whole thread Raw
In response to Re: Another issue with invalid XML values  (Florian Pflug <fgp@phlo.org>)
Responses Re: Another issue with invalid XML values
List pgsql-hackers
On Mon, Jun 20, 2011 at 09:15:51PM +0200, Florian Pflug wrote:
> On Jun20, 2011, at 19:57 , Noah Misch wrote:
> > I tested this patch using two systems, a CentOS 5 box with
> > libxml2-2.6.26-2.1.2.8.el5_5.1 and an Ubuntu 8.04 box with libxml2
> > 2.6.31.dfsg-2ubuntu1.6.  Both failed to build with this error:
> > 
> > xml.c: In function `pg_xml_init':
> > xml.c:934: error: `xmlStructuredErrorContext' undeclared (first use in this function)
> > xml.c:934: error: (Each undeclared identifier is reported only once
> > xml.c:934: error: for each function it appears in.)
> > make[4]: *** [xml.o] Error 1
> > 
> > It seems `xmlStructuredErrorContext' was added rather recently.  It's not
> > obvious which version added it, but 2.7.8 has it and 2.7.2 does not.  Looking
> > at 2.6.26's sources, that version seems to use `xmlGenericErrorContext' for
> > both structured and generic handler functions.  Based on that, I tried with a
> > change from `xmlStructuredErrorContext' to `xmlGenericErrorContext' in our
> > xml.c.  I ran the test suite and hit some failures in the xml test; see
> > attached regression.diffs.  I received warnings where the tests expected
> > nothing or expected errors. 
> 
> Great :-(. I wonder if maybe I should simply remove the part of the patch
> which try to restore the error handler and context in pg_xml_done(). I've
> added that because I feared that some 3rd-party libraries setup their libxml
> error handle just once, not before every library class. The code previously
> didn't care about this either, but since we're using structured instead of
> generic error reporting now, we'd now collide with 3-rd party libs which
> also use structured error handling, whereas previously that would have worked.
> OTOH, once there's more than one such library...
> 
> Whats your stand on this?

Assuming it's sufficient to save/restore xmlStructuredErrorContext when it's
available and xmlGenericErrorContext otherwise, I would just add an Autoconf
check to identify which one to use.

> > I couldn't reconcile this description with the code.  I do see that
> > xpath_internal() uses XML_PURPOSE_OTHER, but so does xmlelement().  Why is
> > xmlelement() also ready for the strictest error reporting?
> 
> xmlelement() doesn't use libxml's parser, only the xml writer. I didn't
> want to suppress any errors the writer might raise (I'm not sure it raises
> error at all, though).

Makes sense.

> > Also note that we may be able to be more strict when xmloption = document.
> 
> Yeah, probably. I think that better done in a separate patch, though - I've
> tried not to change existing behaviour with this patch except where absolutely
> necessary (i.e. for XPATH)

Likewise.

> > Ideally, an invalid namespace URI should always be an error.  While namespace
> > prefixes could become valid as you assemble a larger document, a namespace
> > URI's validity is context-free.
> 
> I agree. That, however, would require xmlelement() to check all xmlns*
> attributes, and I didn't find a way to do that with libxml() (Well,
> other than to parse the element after creating it, but that's a huge
> kludge). So I left that issue for a later time...

Likewise.

> > Remembering to put a pg_xml_done() in every relevant elog(ERROR, ...) path
> > seems fragile.  How about using RegisterXactCallback/RegisterSubXactCallback
> > in pgxml_parser_init() to handle those cases?  You can also use it to Assert
> > at non-abort transaction shutdown that pg_xml_done() was called as needed.
> 
> Hm, isn't that a bit fragile too? It seems that PG_TRY/PG_CATCH blocks don't
> necessarily create sub-transactions, do they?

PG_TRY never creates a subtransaction.  However, elog(ERROR, ...) aborts
either the subtransaction, or if none, the top-level transaction.  Therefore,
registering the same callback with both RegisterXactCallback and
RegisterSubXactCallback ensures that it gets called for any elog/ereport
non-local exit.  Then you only explicitly call pg_xml_done() in the non-error
path.

However, I see now that in the core xml.c, every error-condition pg_xml_done()
appears in a PG_CATCH block.  That's plenty reliable ...

> Also, most elog() paths already contained xmlFree() calls, because libxml
> seemingly cannot be made to use context-based memory management. So one
> already needs to be pretty careful when touching these...

... and this is a good reason to keep doing it that way.  So, scratch that idea.

> > Consider renaming XML_REPORT_ERROR it to something like XML_CHECK_ERROR,
> > emphasizing that it wraps a conditional.
> 
> Hm, I think it was named XML_CHECK_ERROR at one point, and I changed it
> because I felt it didn't convey the fact that it's actually ereport()
> and not just return false on error. But I agree that XML_REPORT_ERROR isn't
> very self-explanatory, either. What about XML_REPORT_PENDING_ERROR()
> or XML_CHECK_AND_REPORT_ERROR()?

XML_CHECK_AND_REPORT_ERROR() seems fine.  Or XML_CHECK_AND_EREPORT()?

> > On Thu, Jun 09, 2011 at 06:11:46PM +0200, Florian Pflug wrote:
> >> *************** xml_parse(text *data, XmlOptionType xmlo
> >> *** 1175,1182 ****
> >>      int32        len;
> >>      xmlChar    *string;
> >>      xmlChar    *utf8string;
> >> !     xmlParserCtxtPtr ctxt;
> >> !     xmlDocPtr    doc;
> >> 
> >>      len = VARSIZE(data) - VARHDRSZ;        /* will be useful later */
> >>      string = xml_text2xmlChar(data);
> >> --- 1244,1251 ----
> >>      int32        len;
> >>      xmlChar    *string;
> >>      xmlChar    *utf8string;
> >> !     xmlParserCtxtPtr ctxt = NULL;
> >> !     xmlDocPtr    doc = NULL;
> >> 
> >>      len = VARSIZE(data) - VARHDRSZ;        /* will be useful later */
> >>      string = xml_text2xmlChar(data);
> > 
> > Is this change needed?
> 
> For "doc" it definitely is. Since we can no report an error even if
> the parser returned a valid doc, there's now a "if (doc) xmlFree(doc)"
> in the PG_CATCH patch.
> 
> For "ctxt", strictly, speaking no. But only because "ctxt = xmlNewParserCtxt()"
> is the first line in the PG_TRY block, and therefore made the "xmlFree(ctxt)"
> unconditional. But doing it this way seems more consistent and less error-prone.

I see it now; thanks.

> > Is there something about this patch's other changes that make it necessary to
> > strip multiple newlines, vs. one newline as we did before, and to do it in a
> > different place in the code?
> 
> There previously wasn't a separate function for that. I made it strip off
> multiple newlines because that allowed me to keep appendStringInfoLineSeparator()
> simple while still making sure that the end result is exactly one newline
> at the end of the string. Previously, the coded depended on libxml always
> adding a newline at the end of messages, which I'm not even sure it true
> for messages reported via the structured error handler.

Fair enough.

> > It's odd to have a purpose of "NONE" that pg_xml_init() callers can actually
> > use.  How about XML_PURPOSE_TRIVIAL, for callers that only call libxml2
> > functions that can't error?  Also, I think it would be better to require a
> > call to pg_xml_done() in all cases, just to keep things simple.  Actually, I
> > wonder if it's worth having XML_PURPOSE_TRIVIAL for the one user thereof.  It
> > doesn't save much.
> 
> Yeah, that name really is a perfect example of horrible naming ;-)
> 
> The pg_xml_done()-not-required semantics are necessary though. The problem is
> that parse_xml_decl() is sometimes called after the caller has already called
> pg_xml_init() himself. Not always though, so one cannot simply remove the
> pg_xml_init() call from parse_xml_decl() - it might then use libxml without
> it being initialized. So I made pg_xml_init() skip the error-handler setup
> when purpose == XML_PURPOSE_NONE. This requires then, however, that pg_xml_done()
> is *not* called, because it'd restore settings that were never saved (well,
> actually it'd tripe an assert that protects against that...).
> 
> But yeah, this is all quite contorted. Maybe pg_xml_init() should simply be
> split into two functions, one which initialized the library and one which
> sets up error handling. That means that two calls instead of one are required,
> but it makes the purpose of the individual functions much clearer...

Oh, tricky.  That's an option, and the error-handler-initialization function
could just call the more-primitive one, so most callers would not need to
know.  Another way is to make xml_error_initialized a counter.  If it's called
a second time, increment and do nothing more.  Only unregister the handler
when it's decremented to zero.  No strong preference here.

> >> !     XML_PURPOSE_LEGACY /* Save error message only, don't set error flag */,
> > 
> > It's fine to set the flag for legacy users, considering they could just not
> > read it.  The important distinction seems to be that we don't emit warnings or
> > notices in this case.  Is that correct?  If so, the comment should reflect
> > that emphasis.  Then, consider updating the flag unconditionally.
> 
> I took me a while to remember while I did it that way, but I think I have now.
> 
> I initialled wanted to add an Assert(!xml_error_occurred) to catch any
> missing XML_REPORT_ERROR() calls. I seems to have forgotten to do that,
> though...
> 
> So I guess I should either refrain from setting the flag as you suggested,
> or add such an Assert(). I think I very slightly prefer the latter, what
> do you think?

I do like the idea of that assert.  How about setting the flag anyway, but
making it "Assert(xml_purpose == XML_PURPOSE_LEGACY || !xml_error_occurred)"?

> >> *** a/src/test/regress/expected/xml.out
> >> --- b/src/test/regress/expected/xml.out
> >> *************** INSERT INTO xmltest VALUES (3, '<wrong')
> >> *** 8,14 ****
> >>  ERROR:  invalid XML content
> >>  LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
> >>                                         ^
> >> ! DETAIL:  Entity: line 1: parser error : Couldn't find end of Start Tag wrong line 1
> >>  <wrong
> >>        ^
> >>  SELECT * FROM xmltest;
> >> --- 8,14 ----
> >>  ERROR:  invalid XML content
> >>  LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
> >>                                         ^
> >> ! DETAIL:  line 1: Couldn't find end of Start Tag wrong line 1
> > 
> > Any reason to change the error text this way?
> 
> The "Entity:" prefix is added by libxml only for messages reported
> to the generic error handler. It *doesn't* refer to entities as defined
> in xml, but instead used in place of the file  name if libxml
> doesn't have that at hand (because it's parsing from memory).
> 
> So that "Entity:" prefix really doesn't have any meaning whatsoever.

So xmlParserPrintFileContext() sends different content to the generic error
handler from what "natural" calls to the handler would send?

> I believe that the compatibility risk is pretty low here, and removing
> the meaningless prefix makes the error message are bit nicer to read.
> But if you are concerned that there maybe applications out there who
> parse the error text, we could of course add it back. I must say that
> I don't really know what our policy regarding error message stability is...

If the libxml2 API makes it a major pain to preserve exact messages, I
wouldn't worry.

Thanks,
nm


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Re: [COMMITTERS] pgsql: Fixed string in German translation that causes segfault.
Next
From: Jeroen Vermeulen
Date:
Subject: Re: [BUG] Denormal float values break backup/restore