Thread: XPATH vs. server_encoding != UTF-8

XPATH vs. server_encoding != UTF-8

From
Florian Pflug
Date:
Hi

The current thread about JSON and the ensuing discussion about the
XML types' behaviour in non-UTF8 databases made me try out how well
XPATH() copes with that situation. The code, at least, looks
suspicious - XPATH neither verifies that the server encoding is UTF-8,
nor does it pass the server encoding on to libxml's xpath functions.

So I created a database with encoding ISO-8859-1 (LATIN1), and did
(with a client_encoding matching my terminal's settings)

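  -- Setup sketch - the database name and flags here are examples,
  -- not my literal commands:
  --   CREATE DATABASE latin1_test ENCODING 'LATIN1' TEMPLATE template0;
  --   \c latin1_test
  SET client_encoding = 'UTF8';   -- matches my UTF-8 terminal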
  CREATE TABLE X (d XML);
  INSERT INTO X VALUES ('<r a="ä"/>');

i.e., I inserted the XML document <r a="ä"/>, but without using
an entity reference for the German Umlaut-A. Then I attempted to extract
the length of r's attribute "a" with the XPATH /r/@a, both with the XPath
function string-length (which works now! yay!) and with postgres'
LENGTH() function.

  SELECT
    (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
    LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
  FROM X;

The XPATH() function itself doesn't complain, but libxml does - it expects
UTF-8 encoded data, and screams bloody murder when it encounters the
ISO-8859-1-encoded Umlaut-A

  ERROR:  could not parse XML document
  DETAIL:  line 1: Input is not proper UTF-8, indicate encoding !
  Bytes: 0xE4 0x22 0x2F 0x3E
  <r a="ä"/>

That might seem fine on the surface - we did, after all, error out instead
of producing potentially nonsensical results. However, libxml's ability to
detect this error relies on its ability to distinguish between UTF-8 and
non-UTF-8 encoded strings. Which, of course, doesn't work in the general case.
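
To illustrate, here's a quick sketch using convert_from() - the byte
sequence 0xC3 0xA4 is simultaneously valid LATIN1 and valid UTF-8, so
no validator can tell which was meant:

  SELECT convert_from('\xc3a4'::bytea, 'LATIN1');  -- 'Ã¤', two characters
  SELECT convert_from('\xc3a4'::bytea, 'UTF8');    -- 'ä', one character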

So for my next try, I deliberately set client_encoding to ISO-8859-1, even
though my terminal uses UTF-8, removed all data from table X, and did

  INSERT INTO X VALUES ('<r a="ä"/>');

again. The effect is that X now contains ISO-8859-1 encoded data
which *happens* to look like valid UTF-8 - my terminal sent the UTF-8
bytes 0xC3 0xA4 for 'ä', and with client_encoding set to ISO-8859-1 the
server stored them verbatim as the two LATIN1 characters 'Ã' and '¤'.
After changing the client_encoding back to UTF-8, the value we just
inserted looks like this

  <r a="ä"/>

Now I invoked the XPATH query from above again.

  SELECT
    (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
    LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
  FROM X;

As predicted, it doesn't raise an error this time, since libxml is unable
to distinguish the ISO-8859-1 string '<r a="ä"/>' from valid UTF-8. But the
result is still wrong, since the string-length() function counts 'ä' as just
one character, when in reality it of course contains two - libxml decodes
the stored bytes 0xC3 0xA4 as the single UTF-8 character 'ä', while
postgres' LENGTH() correctly sees two LATIN1 characters.

 xpath_length | pg_length
--------------+-----------
 1            |         2

The easiest way to fix this would be to make XPATH() flat-out refuse to
do anything if the server encoding isn't UTF-8. But that seems a bit harsh -
things actually do work correctly as long as the XML document contains only
ASCII characters, and existing applications might depend on that.

So what I think we should do is tell libxml that the encoding is ASCII
if the server encoding isn't UTF-8. With that change, the query above
produces

  ERROR:  could not parse XML document
  DETAIL:  encoder error

which seems sane. Replacing the data in X with ASCII-only data makes the
error go away, and the result is then also correct.

  DELETE FROM X;
  INSERT INTO X VALUES ('<r a="a"/>');
  SELECT
    (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
    LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
  FROM X;

gives

 xpath_length | pg_length
--------------+-----------
 1            |         1

Proof-of-concept patch attached, but doesn't yet include documentation
updates.

Comments? Thoughts? Suggestions?

best regards,
Florian Pflug


Re: XPATH vs. server_encoding != UTF-8

From
Florian Pflug
Date:
[Resent with pgsql-hackers re-added to the recipient list.
I presume you didn't remove it on purpose]

On Jul23, 2011, at 18:11 , Joey Adams wrote:
> On Sat, Jul 23, 2011 at 11:49 AM, Florian Pflug <fgp@phlo.org> wrote:
>> So what I think we should do is tell libxml that the encoding is ASCII
>> if the server encoding isn't UTF-8. With that change, the query above
>> produces
>
> I haven't had time to digest this situation, but there is a function
> called pg_encoding_to_char for getting a string representation of the
> encoding.  However, it might not produce a string that libxml
> understands in all cases.
>
> Would it be better to tell libxml the server encoding, whatever it may be?

Ultimately, yes. However, I figured if it was as easy as translating our
encoding names to those of libxml, the current code would probably do that
instead of converting the XML to UTF-8 before validating it.
(Validation and XPATH processing use different code paths there!)
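
For illustration, the function Joey mentions is exposed at the SQL
level too:

  SELECT pg_encoding_to_char(encoding)
    FROM pg_database WHERE datname = current_database();

but at least some of the names it returns (LATIN1, say, rather than
ISO-8859-1) don't obviously match the spellings libxml uses.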

I'm also not aware of any actual complaints about XPATH's restriction
to UTF-8, and it's not a case that I personally care for, so I'm
a bit hesitant to put in the time and energy required to extend it to
other encodings.

But once I had stumbled over this, I didn't want to ignore it altogether,
so I looked for a simple way to make the current behaviour more bullet-proof.
The patch accomplishes that, I think, and without any major change in
behaviour. You only observe the difference if you indeed have non-UTF-8
XMLs which look like valid UTF-8.

> In the JSON encoding discussion, the last idea (the one I was planning
> to go with) was to allow non-ASCII characters in any server encoding
> (like ä in ISO-8859-1), but not allow non-ASCII escapes (like \u00E4)
> unless the server encoding is UTF-8.

Yeah, that's how I understood your proposal, and it seems sensible.

> I think your patch would more
> closely match the opposite: allow any escapes, but only allow ASCII
> text if the server encoding is not UTF-8.

Yeah, but only for XPATH(). XML input validation uses a different
code path, and seems to convert the XML to UTF-8 before verifying
its well-formedness with libxml (as you already discovered previously).
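
To illustrate the asymmetry in a LATIN1 database (same setup as in my
first mail; the second statement fails exactly as shown there):

  SELECT '<r a="ä"/>'::xml;                 -- OK, recoded to UTF-8 first
  SELECT xpath('/r/@a', '<r a="ä"/>'::xml); -- ERROR, raw bytes hit libxml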

The difference between JSON and XML here is that the XML type has to
live with libxml's idiosyncrasies and restrictions. If we could make
libxml use our encoding and text handling infrastructure, the UTF-8
restrictions would probably not exist. But as it stands, libxml has
its own machinery for dealing with encodings...

I wonder, BTW, what happens if you attempt to store an XML document
containing a character not representable in Unicode. If the conversion
to UTF-8 simply replaces it with a placeholder, we'd be fine, since a
mere replacement cannot affect well-formedness. If OTOH it raised an
error, that'd be a bit unfortunate...

best regards,
Florian Pflug



Re: XPATH vs. server_encoding != UTF-8

From
Peter Eisentraut
Date:
On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote:
> The current thread about JSON and the ensuing discussion about the
> XML types' behaviour in non-UTF8 databases made me try out how well
> XPATH() copes with that situation. The code, at least, looks
> suspicious - XPATH neither verifies that the server encoding is UTF-8,
> nor does it pass the server encoding on to libxml's xpath functions.

This issue is on the Todo list, and there are some archive links there.



Re: XPATH vs. server_encoding != UTF-8

From
Florian Pflug
Date:
On Jul23, 2011, at 22:49 , Peter Eisentraut wrote:

> On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote:
>> The current thread about JSON and the ensuing discussion about the
>> XML types' behaviour in non-UTF8 databases made me try out how well
>> XPATH() copes with that situation. The code, at least, looks
>> suspicious - XPATH neither verifies that the server encoding is UTF-8,
>> nor does it pass the server encoding on to libxml's xpath functions.
>
> This issue is on the Todo list, and there are some archive links there.

Thanks for the pointer, but I think the discussion there doesn't
really apply here.

First, I didn't suggest (or implement) full support for XPATH() together
with server encodings other than UTF-8. My suggested patch simply
closes a hole in the implementation of the current behaviour. Instead of
relying on libxml to be able to detect that the encoding isn't UTF-8, it
relies on it only to detect that the encoding isn't ASCII. Since supported
server encodings are supersets of ASCII, the latter is trivial.

xml.c also seems to have changed quite a bit since this was last
discussed. Tom Lane argued against the proposed patch on the grounds
that there are many more places in xml.c which pass strings to libxml
without charset conversion. However, looking at it now, it seems that
all XML validation goes through xml_parse(), which actually converts
the XML to UTF-8. Only XPATH contains a separate code path, and chooses
to ignore encoding issues altogether.

best regards,
Florian Pflug





Re: XPATH vs. server_encoding != UTF-8

From
Florian Pflug
Date:
On Jul24, 2011, at 01:25 , Florian Pflug wrote:
> On Jul23, 2011, at 22:49 , Peter Eisentraut wrote:
>
>> On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote:
>>> The current thread about JSON and the ensuing discussion about the
>>> XML types' behaviour in non-UTF8 databases made me try out how well
>>> XPATH() copes with that situation. The code, at least, looks
>>> suspicious - XPATH neither verifies that the server encoding is UTF-8,
>>> nor does it pass the server encoding on to libxml's xpath functions.
>>
>> This issue is on the Todo list, and there are some archive links there.
>
> Thanks for the pointer, but I think the discussion there doesn't
> really apply here.

Upon further reflection, I came to realize that it in fact does apply.

All the non-XPath related XML *parsing* seems to go through xml_parse(),
but we also use libxml to write XML, making XMLELEMENT() and friends
equally susceptible to all kinds of encoding trouble. For the fun of it,
try the following in an ISO-8859-1 database (with client_encoding correctly
set up, so the umlaut-a reaches the backend unharmed)

  select xmlelement(name "r", xmlattributes('ä' as a));

you get
  xmlelement
--------------
 <r a="䀀"/>

Well, actually, you only get that about 9 times out of 10. Sometimes
you instead get
     xmlelement
--------------------
 <r a="䀁\x01\x01"/>

It seems that libxml reads past the terminating zero byte if it's
preceded by an invalid UTF-8 byte sequence (like 0xe4 0x00 in the example
above) - 0xe4 announces a three-byte UTF-8 sequence, so libxml apparently
swallows the terminating zero byte and whatever follows as continuation
bytes. Ouch!

Also, passing encoding ASCII to libxml's parser doesn't prevent it from
expanding entity references referring to characters outside the ASCII
range. So even with my patch applied you can make XPATH() return wrong
results. For example (0xe4 is the Unicode codepoint of Umlaut-A)

  select xpath('/r/@a', '<r a="&#xe4;"/>'::xml);

gives (*with* my patch applied)
 xpath
-------
 {ä}

So scratch the whole idea. There doesn't seem to be a simple way to
make the XML type work sanely in a non-UTF-8 setting :-(. Apart from
simple input and output that is, which already seems to work correctly
regardless of the server encoding.

BTW, for the sake of getting this into the archives just in case someone
decides to fix this and stumbles over this thread:

It seems to me that the easiest way to fix XML generation in the non-UTF-8
case would be to cease using libxml for emitting XML at all. The only
non-trivial use of libxml there is the escaping of attribute values, and
we do already have our own escape_xml() function - it just needs to be
taught the additional escapes needed for attribute values. (libxml is
also used to convert binary values to base64 or hexadecimal notation,
but there are no encoding issues there)
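
Purely as a sketch of the rule, not of any actual xml.c code: beyond
the escapes text nodes already get (&, <, ...), attribute values -
which we'd emit in double quotes - also need the quote character
escaped, along the lines of

  SELECT replace(replace(replace('say "a < b" & smile',
           '&', '&amp;'),
           '<', '&lt;'),
           '"', '&quot;');
  -- gives: say &quot;a &lt; b&quot; &amp; smile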

best regards,
Florian Pflug