Re: [HACKERS] possible encoding issues with libxml2 functions - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: [HACKERS] possible encoding issues with libxml2 functions
Date
Msg-id CAFj8pRBx5rqUvssw+TvhTLoqeTxQtL1KV7OaMt1joHHp1pOwqg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] possible encoding issues with libxml2 functions  (Noah Misch <noah@leadboat.com>)
Responses Re: [HACKERS] possible encoding issues with libxml2 functions
List pgsql-hackers
Hi

2017-08-08 2:10 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> 2017-03-17 4:23 GMT+01:00 Noah Misch <noah@leadboat.com>:
> > On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
> > > 2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
> > > > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
> > > > > 2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
> > > > Please add a test case.
>
> I am sending test case.
>
> I am afraid so this cannot be part of regress tests due strong dependency
> on locale win1250.

Right.  Based on that, I've distilled this portable test case:

-- Round-trip non-ASCII data through xpath().
DO $$
DECLARE
        xml_declaration text := '<?xml version="1.0" encoding="ISO-8859-1"?>';
        degree_symbol text;
        res xml[];
BEGIN
        -- Per the documentation, xpath() doesn't work on non-ASCII data when
        -- the server encoding is not UTF8.  The EXCEPTION block below,
        -- currently dead code, will be relevant if we remove this limitation.
        IF current_setting('server_encoding') <> 'UTF8' THEN
                RAISE LOG 'skip: encoding % unsupported for xml',
                        current_setting('server_encoding');
                RETURN;
        END IF;

        degree_symbol := convert_from('\xc2b0', 'UTF8');
        res := xpath('text()', (xml_declaration ||
                '<x>' || degree_symbol || '</x>')::xml);
        IF degree_symbol <> res[1]::text THEN
                RAISE 'expected % (%), got % (%)',
                        degree_symbol, convert_to(degree_symbol, 'UTF8'),
                        res[1], convert_to(res[1]::text, 'UTF8');
        END IF;
EXCEPTION
        -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has no equivalent in encoding "LATIN8"
        WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
        -- default conversion function for encoding "UTF8" to "MULE_INTERNAL" does not exist
        WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
END
$$

yes, probably libXML2 try to do check from utf8 encoding to header specified encoding. 

 
> > > > Why not use xml_parse() instead of calling xmlCtxtReadMemory()
> > > > directly?  The answer is probably in the archives, because someone
> > > > understood the problem enough to document "Some XML-related functions
> > > > may not work at all on non-ASCII data when the server encoding is not
> > > > UTF-8. This is known to be an issue for xpath() in particular."
> > >
> > > Probably there are two possible issues
> >
> > Would you research in the archives to confirm?
> >
> > > 1. what I touched - recv function does encoding to database encoding -
> > > but document encoding is not updated.
> >
> > Using xml_parse() would fix that, right?
> >
>
> It should to help, because it cuts a header - but it does little bit more
> work, than we would - it check if xml is doc or not too.

That's true.  xpath() currently works on both DOCUMENT and CONTENT xml values.
If xpath() used xml_parse(), this would stop working:

  SELECT xpath('text()', XMLPARSE (DOCUMENT '<!DOCTYPE foo []><x>bar</x>'));

> One possible fix - and similar technique is used more times in xml.c code
> .. xmlroot

> +   /* remove header */
> +   parse_xml_decl(string, &header_len, NULL, NULL, NULL);

> -       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
> +       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -

> another possibility is using xml_out_internal - that is used in XMLTABLE
> function - what was initial fix.
>
> xml_out_internal uses parse_xml_decl too, but it is little bit more
> expensive due call print_xml_decl that has not any effect in this case
> (where only encoding is interesting) - but it can has sense, when server
> encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 -
> so now, I am think the correct solution should be using xml_out_internal -
> because it appends a header with server encoding to XML doc

As you may have been implying here, your patch never adds an xml declaration
that bears an encoding.  (That is because it passes targetencoding=0 to
xml_out_internal().)  If we were going to fix xpath() to support non-ASCII
characters in non-UTF8 databases, we would not do that by adding xml
declarations.  We would do our own conversion to UTF8, like xml_parse() does.
In that light, I like this parse_xml_decl() approach better.  Would you update
your patch to use it and to add the test case I give above?


I need to do some recapitulation (my analyses was wrong):

a) all values created by  xml_in iterface are in database encoding - input string is stored without any change. xml_parse is called only due validation.

b) inside xml_parse, the input is converted to UTF8, and document is read by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8" and removed decl section.

So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document, wellformated_xml) the database encoding is not important

c) xml_recv function does validation by xml_parse and translation to database encoding.

Now I don't see a difference between @b and @c - so my hypotheses about necessity to use recv interface was wrong. 

So Looks like libXML2 behave - when we push document with encoding decl, then this library expects document in this encoding

So test case is simple

 -- should to work
select xpath('/enprimeur/vino/id'/, '<enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');

 -- should to fail
select xpath('/enprimeur/vino/id'/, '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');

I didn't expect this error on recv API, because we do implicit encoding to database encoding - and I expected implicit removing of encoding declaration. The problematic char is ok, the issue is different length

Other functions is working - so I have a expectation so xpath should to work

postgres=# select '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'::xml;
┌────────────────────────────────────────────────────────────────────┐
│                                xml                                 │
╞════════════════════════════════════════════════════════════════════╡
│ <enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur> │
└────────────────────────────────────────────────────────────────────┘
(1 row)

postgres=# select '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>' is document
postgres-# ;
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)
postgres=# select xml_is_well_formed_document('<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌─────────────────────────────┐
│ xml_is_well_formed_document │
╞═════════════════════════════╡
│ t                           │
└─────────────────────────────┘
(1 row)

postgres=# select xml_is_well_formed_content('<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌────────────────────────────┐
│ xml_is_well_formed_content │
╞════════════════════════════╡
│ t                          │
└────────────────────────────┘
(1 row)

d) XMLEXISTS doesn't work too .. because it share code wit xpath function



This change can make things worse in a non-UTF8 database.  The following
happens to work today in a LATIN1 database, but it will cease to work:

SELECT xpath('string-length()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' ||
                '<x>' || chr(176) || '</x>')::xml);

However, extracting actual text already gives wrong answers, because we treat
the UTF8 response from libxml2 as though it were LATIN1:

SELECT xpath('string()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' ||
                '<x>' || chr(176) || '</x>')::xml);

I plan to back-patch this.  The documentation says "Some XML-related functions
may not work at all on non-ASCII data when the server encoding is not
UTF-8. This is known to be an issue for xpath() in particular."  Your patch
fixes a case where even a UTF8 database mishandles non-ASCII data.  (Sometimes
it mishandles the data silently.  Other times, it raises an error.)  It's
worth further breaking a case where we already disclaim support to repair a
supported case.  Other opinions?

Isn't the most correct solution to call xml_parse function?

Regards

Pavel

pgsql-hackers by date:

Previous
From: Douglas Doole
Date:
Subject: Re: [HACKERS] [PATCH] Push limit to sort through a subquery
Next
From: David Steele
Date:
Subject: Re: [HACKERS] Update low-level backup documentation to match actual behavior