Re: possible encoding issues with libxml2 functions - Mailing list pgsql-hackers
From | Pavel Stehule |
---|---|
Subject | Re: possible encoding issues with libxml2 functions |
Date | |
Msg-id | CAFj8pRAbR5bjBD6WakseDt265WT7E+G=YXndGbk9CTYZvX9peQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] possible encoding issues with libxml2 functions (Noah Misch <noah@leadboat.com>) |
List | pgsql-hackers |
2017-03-17 4:23 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
> 2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
> > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
> > > 2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
> > Please add a test case.
I am sending test case.
I am afraid so this cannot be part of regress tests due strong dependency on locale win1250.
>
> It needs a application - currently there is not possibility to import XML
> document via recv API :(
I think xml_in() can create every value that xml_recv() can create; xml_recv()
is just more convenient given diverse source encodings. If you make your
application store the value into a table, does "pg_dump --inserts" emit code
that reproduces the same value? If so, you can use that in your test case.
If not, please provide precise instructions (code, SQL commands) for
reproducing the bug manually.
I was wrong - both methods produces broken internal XML document
<?xml version="1.0" encoding="windows-1250"?>
<enprimeur>
<vino>
<id>1</id>
<nazev>Alter Ego de Palmer</nazev>
<vyrobce>63</vyrobce>
<rocnik>2012</rocnik>
<cena0375>0</cena0375>
<cena1500>0</cena1500>
<cena3000>0</cena3000>
<cena6000>0</cena6000>
<cena0750>1425</cena0750>
<cenastart>1085</cenastart>
<min0375>0</min0375>
<min0750>0</min0750>
<odrudy>51 % Merlot, 40 % Cabernet Sauvignon,9 % Petit Verdot</odrudy>
<bestin>2017 - 2026</bestin>
<klas>2</klas>
<sklad0375>0</sklad0375>
<sklad0750>0</sklad0750>
<sklad1500>0</sklad1500>
<sklad3000>0</sklad3000>
<sklad6000>0</sklad6000>
<alk>13,4 %</alk>
<remark>Premiant oblasti Margaux Ch. Palmer tentokrát ve svých obou vínech těžil z dokonale zralého Merlotu, kterého do svých směsí naládoval více než Cabernet Sauvignonu. 2. víno Alter Ego de Palmer 2012 se může p
ochlubit skvělou kondicí. Není pochyb o tom, že na Ch. Palmer sklízeli z vinice dokonale zralé hrozny. Alter Ego je mohutné, husté a syté víno, nabízí objemnou dávku ovoce, voní po ostružinách, chuť je krásně kulatá, pevná a svěží, taniny
vykazují fenomenální strukturu, takto krásné spojení všech aromatických a chuťových složek s příměsí úžasných kyselin a alkoholu je radost mít ve sklepě.</remark>
<rating>Robert Parker: /100
TOPVINO SCORE: 92-94/100
James Suckling: 92-93/100
Wine Spectator: 90-93/100</rating>
<zalozeno></zalozeno>
<rozloha></rozloha>
<stari></stari>
<puda></puda>
<produkce></produkce>
<zrani></zrani>
<active>1</active>
<stitky>
<stitek>8</stitek>
<stitek>1</stitek>
</stitky>
</vino>
</enprimeur>
Document is well encoded from win1250 to UTF8 - it is well readable, but the header is wrong - it is not in win1250 now (after encoding). This xml is attached (in original form (win1250 encoding)).
Import with
create table target(x xml);
\set xml `cat ~/Downloads/e6.xml`
postgres=# set client_encoding to win1250;
postgres=# insert into target values(:'xml');
disconnect or reset client encoding
Document should be correctly inserted.
Then run a query
postgres=# select * from target, xpath('/enprimeur/vino', x);
ERROR: could not parse XML document
DETAIL: input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75
input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75
encoder error
line 25: Premature end of data in tag remark line 25
line 25: Premature end of data in tag vino line 3
line 25: Premature end of data in tag enprimeur line 2
select xmltable.* from target, xmltable('/enprimeur/vino' passing x columns id int, nazev text, remark text);
XMLTABLE is not vulnerable against this bug and result should be correct.
when you do
select x from target
you can see a correct xml without header
but select x::text from target; shows wrong header
> > Why not use xml_parse() instead of calling xmlCtxtReadMemory() directly?
> > The
> > answer is probably in the archives, because someone understood the problem
> > enough to document "Some XML-related functions may not work at all on
> > non-ASCII data when the server encoding is not UTF-8. This is known to be
> > an
> > issue for xpath() in particular."
>
>
> Probably there are two possible issues
Would you research in the archives to confirm?
> 1. what I touched - recv function does encoding to database encoding - but
> document encoding is not updated.
Using xml_parse() would fix that, right?
It should to help, because it cuts a header - but it does little bit more work, than we would - it check if xml is doc or not too.
In mostly functions there the important work is done in parse_xml_decl function
One possible fix - and similar technique is used more times in xml.c code .. xmlroot
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 2f87151..45faf83 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3834,6 +3834,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
Datum *ns_names_uris;
bool *ns_names_uris_nulls;
int ns_count;
+ size_t header_len;
/*
* Namespace mappings are passed as text[]. If an empty array is passed
@@ -3882,6 +3883,10 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
errmsg("empty XPath expression")));
string = pg_xmlCharStrndup(datastr, len);
+
+ /* remove header */
+ parse_xml_decl(string, &header_len, NULL, NULL, NULL);
+
xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3898,7 +3903,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len - header_len, NULL, NULL, 0);
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
another possibility is using xml_out_internal - that is used in XMLTABLE function - what was initial fix.
xml_out_internal uses parse_xml_decl too, but it is little bit more expensive due call print_xml_decl that has not any effect in this case (where only encoding is interesting) - but it can has sense, when server encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 - so now, I am think the correct solution should be using xml_out_internal - because it appends a header with server encoding to XML doc
Regards
Pavel
> 2. there are not possibility to encode from document encoding to database
> encoding.
Both xml_in() and xml_recv() require the value to be representable in the
database encoding, so I don't think this particular problem can remain by the
time we reach an xpath_internal() call.
Attachment
pgsql-hackers by date: