Re: possible encoding issues with libxml2 functions - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: possible encoding issues with libxml2 functions
Date
Msg-id CAFj8pRAbR5bjBD6WakseDt265WT7E+G=YXndGbk9CTYZvX9peQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] possible encoding issues with libxml2 functions  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers


2017-03-17 4:23 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
> 2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
> > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
> > > 2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
> > Please add a test case.

I am sending test case.

I am afraid so this cannot be part of regress tests due strong dependency on locale win1250.
 
>
> It needs a application - currently there is not possibility to import XML
> document via recv API :(

I think xml_in() can create every value that xml_recv() can create; xml_recv()
is just more convenient given diverse source encodings.  If you make your
application store the value into a table, does "pg_dump --inserts" emit code
that reproduces the same value?  If so, you can use that in your test case.
If not, please provide precise instructions (code, SQL commands) for
reproducing the bug manually.

I was wrong - both methods produces broken internal XML document

        <?xml version="1.0" encoding="windows-1250"?>
        <enprimeur>
                    <vino>
                        <id>1</id>
                        <nazev>Alter Ego de Palmer</nazev>
                        <vyrobce>63</vyrobce>
                        <rocnik>2012</rocnik>
                        <cena0375>0</cena0375>
                        <cena1500>0</cena1500>
                        <cena3000>0</cena3000>
                        <cena6000>0</cena6000>
                        <cena0750>1425</cena0750>
                        <cenastart>1085</cenastart>
                        <min0375>0</min0375>
                        <min0750>0</min0750>
                        <odrudy>51 % Merlot, 40 % Cabernet Sauvignon,9 % Petit Verdot</odrudy>
                        <bestin>2017 - 2026</bestin>
                        <klas>2</klas>
                        <sklad0375>0</sklad0375>
                        <sklad0750>0</sklad0750>
                        <sklad1500>0</sklad1500>
                        <sklad3000>0</sklad3000>
                        <sklad6000>0</sklad6000>
                        <alk>13,4 %</alk>
                        <remark>Premiant oblasti Margaux Ch. Palmer tentokrát ve svých obou vínech těžil z dokonale zralého Merlotu, kterého do svých směsí naládoval více než Cabernet Sauvignonu. 2. víno Alter Ego de Palmer 2012 se může p
ochlubit skvělou kondicí. Není pochyb o tom, že na Ch. Palmer sklízeli z vinice dokonale zralé hrozny. Alter Ego je mohutné, husté a syté víno, nabízí objemnou dávku ovoce, voní po ostružinách, chuť je krásně kulatá, pevná a svěží, taniny
 vykazují fenomenální strukturu, takto krásné spojení všech aromatických a chuťových složek s příměsí úžasných kyselin a alkoholu je radost mít ve sklepě.</remark>
                        <rating>Robert Parker: /100
        TOPVINO SCORE: 92-94/100
        James Suckling: 92-93/100
        Wine Spectator: 90-93/100</rating>
                        <zalozeno></zalozeno>
                        <rozloha></rozloha>
                        <stari></stari>
                        <puda></puda>
                        <produkce></produkce>
                        <zrani></zrani>
                        <active>1</active>
                        <stitky>
                                <stitek>8</stitek>
                                <stitek>1</stitek>
                                </stitky>
                        </vino>
        </enprimeur>

Document is well encoded from win1250 to UTF8 - it is well readable, but the header is wrong - it is not in win1250 now (after encoding). This xml is attached (in original form (win1250 encoding)).

Import with

create table target(x xml);
\set xml `cat ~/Downloads/e6.xml`
postgres=# set client_encoding to win1250;
postgres=# insert into target values(:'xml');

disconnect or reset client encoding

Document should be correctly inserted. 

Then run a query

postgres=# select * from target, xpath('/enprimeur/vino', x);
ERROR:  could not parse XML document
DETAIL:  input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75
input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75
encoder error
line 25: Premature end of data in tag remark line 25
line 25: Premature end of data in tag vino line 3
line 25: Premature end of data in tag enprimeur line 2


select xmltable.* from target, xmltable('/enprimeur/vino' passing x columns id int, nazev text, remark text);

XMLTABLE is not vulnerable against this bug and result should be correct.

when you do 

select x from target

you can see a correct xml without header

but select x::text from target; shows wrong header

 

> > Why not use xml_parse() instead of calling xmlCtxtReadMemory() directly?
> > The
> > answer is probably in the archives, because someone understood the problem
> > enough to document "Some XML-related functions may not work at all on
> > non-ASCII data when the server encoding is not UTF-8. This is known to be
> > an
> > issue for xpath() in particular."
>
>
> Probably there are two possible issues

Would you research in the archives to confirm?

> 1. what I touched - recv function does encoding to database encoding - but
> document encoding is not updated.

Using xml_parse() would fix that, right?

It should to help, because it cuts a header - but it does little bit more work, than we would - it check if xml is doc or not too. 

In mostly functions there the important work is done in parse_xml_decl function

One possible fix - and similar technique is used more times in xml.c code .. xmlroot

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 2f87151..45faf83 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3834,6 +3834,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
    Datum      *ns_names_uris;
    bool       *ns_names_uris_nulls;
    int         ns_count;
+   size_t      header_len;
 
    /*
     * Namespace mappings are passed as text[].  If an empty array is passed
@@ -3882,6 +3883,10 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
                 errmsg("empty XPath expression")));
 
    string = pg_xmlCharStrndup(datastr, len);
+
+   /* remove header */
+   parse_xml_decl(string, &header_len, NULL, NULL, NULL);
+
    xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
 
    xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3898,7 +3903,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
        if (ctxt == NULL || xmlerrcxt->err_occurred)
            xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
                        "could not allocate parser context");
-       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len - header_len, NULL, NULL, 0);
        if (doc == NULL || xmlerrcxt->err_occurred)
            xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
                        "could not parse XML document");




another possibility is using xml_out_internal - that is used in XMLTABLE function - what was initial fix.

xml_out_internal uses parse_xml_decl too, but it is little bit more expensive due call print_xml_decl that has not any effect in this case (where only encoding is interesting) - but it can has sense, when server encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 - so now, I am think the correct solution should be using xml_out_internal - because it appends a header with server encoding to XML doc

Regards

Pavel

 

> 2. there are not possibility to encode from document encoding to database
> encoding.

Both xml_in() and xml_recv() require the value to be representable in the
database encoding, so I don't think this particular problem can remain by the
time we reach an xpath_internal() call.

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: pg_stat_wal_write statistics view
Next
From: Tom Lane
Date:
Subject: Re: PoC plpgsql - possibility to force custom or generic plan