XPATH vs. server_encoding != UTF-8 - Mailing list pgsql-hackers

From Florian Pflug
Subject XPATH vs. server_encoding != UTF-8
Date
Msg-id AA288A66-0451-4AC4-9DBD-1AFC7892F25D@phlo.org
Whole thread Raw
Responses Re: XPATH vs. server_encoding != UTF-8
List pgsql-hackers
Hi

The current thread about JSON and the ensuing discussion about the
XML types' behaviour in non-UTF8 databases made me try out how well
XPATH() copes with that situation. The code, at least, looks
suspicious - XPATH neither verifies that the server encoding is UTF-8,
not does it pass the server encoding on to libxml's xpath functions.

So I created a database with encoding ISO-8859-1 (LATIN1), and did
(which aclient encoding matching my terminal's settings)

  CREATE TABLE X (d XML);
  INSERT INTO X VALUES ('<r a="ä"/>');

i.e, I inserted the XML document <r a="ä"/>, but without using
an entity reference for the german Umlaut-A. Then I attempted to extract
the length of r's attribute "a" with the XPATH /r/@a, both with the XPath
function string-length (which works now! yay!) and with postgres'
LENGTH() function.

  SELECT
    (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
    LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
  FROM X;

The XPATH() function itself doesn't complain, but libxml does - it expects
UTF-8 encoded data, and screams bloody murder when it encounters the
ISO-8859-1-encoded Umlaut-A

  ERROR:  could not parse XML document
  DETAIL:  line 1: Input is not proper UTF-8, indicate encoding !
  Bytes: 0xE4 0x22 0x2F 0x3E
  <r a="ä"/>

That might seem fine on the surface - we did, after all, error out instead
of producing potentially non-sensical results. However, libxml's ability to
detect this error relies on it's ability to distinguish between UTF-8 and
non-UTF-8 encoded strings. Which, of course, doesn't work in the general case.

So for my next try, I deliberately set client_encoding to ISO-8859-1, even
though my terminal uses UTF-8, removed all data from table X, and did

  INSERT INTO X VALUES ('<r a="ä"/>');

again. The effect is that is that X now contains ISO-8859-1 encoded data
which *happens* to look like valid UTF-8. After changing the client_encoding
back to UTF-8, the value we just inserted looks like that

  <r a="ä"/>

Now I invoked the XPATH query from above again.

  SELECT
    (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
    LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
  FROM X;

As predicted, it doesn't raise an error this time, since libxml is unable
to distinguish the ISO-8859-1 string '<r a="ä"/' from valid UTF-8. But the
result is still wrongs, since the string-length() function counts 'ä' as just
one character, when it reality it are of course contains two.

 xpath_length | pg_length
--------------+-----------
 1            |         2

The easiest way to fix this would be to make XPATH() flat-out refuse to
do anything if the server encoding isn't UTF-8. But that seems a bit harsh -
things actually do work correctly as long as the XML document contains only
ASCII characters, and existing applications might depend on that.

So what I think we should do is tell libxml that the encoding is ASCII
if the server encoding isn't UTF-8. With that change, the query above
produces

  ERROR:  could not parse XML document
  DETAIL:  encoder error

which seems sane. Replacing the data in X with ASCII-only data makes the
error go away, and the result is then correct also.

  DELETE FROM X;
  INSERT INTO X VALUES ('<r a="a"/>');
  SELECT
    (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
    LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
  FROM X;

gives

 xpath_length | pg_length
--------------+-----------
 1            |         1

Proof-of-concept patch attached, but doesn't yet include documentation
updates.

Comments? Thoughts? Suggestions?

best regards,
Florian Pflug

Attachment

pgsql-hackers by date:

Previous
From: Dimitri Fontaine
Date:
Subject: Re: proposal: a validator for configuration files
Next
From: Florian Pflug
Date:
Subject: Re: XPATH vs. server_encoding != UTF-8