Re: xPath in a database with LATIN1 encoding - Mailing list pgsql-admin

From Holger Jakobs
Subject Re: xPath in a database with LATIN1 encoding
Date
Msg-id caa976c4-6070-7eb8-1570-dee06b876a58@jakobs.com
Whole thread Raw
In response to Re: xPath in a database with LATIN1 encoding  (Jorge Silva <jorge.silva93@gmail.com>)
Responses Re: xPath in a database with LATIN1 encoding  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-admin


Am 24.04.21 um 20:20 schrieb Jorge Silva:
Thanks for the quick reply, Tom. I am trying something simpler. I am trying to find a way to run the xPath function with a xml file type, which has latin characters, such as:
SELECT xpath(‘//xml_test/text()’, convert_from(convert_to(‘<xml_test>çã</xml_test>','utf-8’),'utf-8')::xml)

This line returns the same error, as follows:
[Code: 0, SQL State: 2200M]  ERROR: could not parse XML document
  Detail: line 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0xE3 0x3C 0x2F
<xml_test>çã</xml_test>
          ^
This happens because the “text”, which is the output from convert_from() function is encoded with LATIN1, the database encoding set, and not UTF-8.  Is there a way that a text variable is not encoded as the database encoding set, but some other encoding set, such as UTF-8? From what I’ve searched for, it seems that it would be something similar to nvarchar that only exists in Microsoft SQL Server.



On 24 Apr 2021, at 13:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jorge Silva <jorge.silva93@gmail.com> writes:
The characters that it is not recognizing are both “ç” and “ã” because I think they are encoded differently in latin1 and utf-8. Is it possible to somehow use the xPath function with special characters in the XML and in a database which is not encoded with utf-8?

I don't have a lot of expertise in this area, but I think you need
an explicit encoding indicator in the xml header, a la

   <?xml encoding="latin1"?> ...

On the whole, the xml type is definitely easier to use with database
encoding set to utf8.  I think you'll be paying for encoding conversion
every time we interact with libxml, for instance.

regards, tom lane

The point is that the XML standard dictates that every document without an explicit encoding, must be encodied in UTF-8. So, whenever you want to use a different encoding, this has to be stated in the XML header, as indicated in Tom Lane's answer.

So the encoding of XML columns isn't actually the one of the database (although UTF-8 is best in any case), but simply is the XML standard. Your client encoding may vary, though, but this would limit the characters which can be transferred.

Regards,

Holger


-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
Attachment

pgsql-admin by date:

Previous
From: Jorge Silva
Date:
Subject: Re: xPath in a database with LATIN1 encoding
Next
From: adnan salam
Date:
Subject: Facing issue while upgrading from 11 ro 13.2