Re: [HACKERS] possible encoding issues with libxml2 functions - Mailing list pgsql-hackers

From Noah Misch
Subject Re: [HACKERS] possible encoding issues with libxml2 functions
Date
Msg-id 20170820193402.GA4047044@rfd.leadboat.com
Whole thread Raw
In response to Re: [HACKERS] possible encoding issues with libxml2 functions  (Pavel Stehule <pavel.stehule@gmail.com>)
Responses Re: [HACKERS] possible encoding issues with libxml2 functions
Re: [HACKERS] possible encoding issues with libxml2 functions
List pgsql-hackers
On Sun, Aug 20, 2017 at 10:54:57AM +0200, Pavel Stehule wrote:
> 2017-08-20 9:21 GMT+02:00 Noah Misch <noah@leadboat.com>:
> > On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
> > > On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> > > > One possible fix - and similar technique is used more times in xml.c code
> > > > .. xmlroot
> > >
> > > > +   /* remove header */
> > > > +   parse_xml_decl(string, &header_len, NULL, NULL, NULL);
> > >
> > > > -       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
> > > > +       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
> >
> > > I like this parse_xml_decl() approach better.  Would you update
> > > your patch to use it and to add the test case I give above?
> >
> > Again, would you make those two edits?
> 
> Now, I am not sure so it is correct fix. We will fix case, when server is
> in UTF8, but maybe we will break some cases when server is not in UTF8
> (although it is broken already).

That's right.

> I am think so correct solution is encoding to UTF8 and passing encoding
> parameter. It will works safely in all cases - and we don't need cut
> header. We should not to cut header if server encoding is not in UTF8 and
> we don't pass encoding as parameter. When we pass encoding as parameter,
> then cutting header is not necessary.

xpath-bugfix.patch affected only xml values containing an xml declaration with
"encoding" attribute.  In UTF8 databases, this latest proposal
(xpath-parsing-error-fix.patch) is equivalent to xpath-bugfix.patch.  In
non-UTF8 databases, xpath-parsing-error-fix.patch affects all xml values
containing non-ASCII data.  In a LATIN1 database, the following works today
but breaks under your latest proposal:
 SELECT xpath('text()', ('<x>' || convert_from('\xc2b0', 'LATIN1') || '</x>')::xml);

It's acceptable to break that, since the documentation explicitly disclaims
support for it.  xpath-bugfix.patch breaks different use cases, which are
likewise acceptable to break.  See my 2017-08-08 review for details.

We have xpath-bugfix.patch and xpath-parsing-error-fix.patch.  Both are
equivalent under supported use cases (xpath in UTF8 databases).  Among
non-supported use cases, they each make different things better and different
things worse.  We should prefer to back-patch the version harming fewer
applications.  I expect non-ASCII data is more common than xml declarations
with "encoding" attribute, so xpath-bugfix.patch will harm fewer applications.

Having said that, I now see a third option.  Condition this thread's patch's
effects on GetDatabaseEncoding()==PG_UTF8.  That way, we fix supported cases,
and we remain bug-compatible in unsupported cases.  I think that's better than
the other options discussed so far.  If you agree, please send a patch based
on xpath-bugfix.patch with the GetDatabaseEncoding()==PG_UTF8 change and the
two edits I described earlier.

Thanks,
nm



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [HACKERS] POC: Sharing record typmods between backends
Next
From: Markus Sintonen
Date:
Subject: Fwd: [HACKERS] [PATCH] Pattern based listeners for asynchronousmessaging (LISTEN/NOTIFY)