Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode? - Mailing list pgsql-general

From Vick Khera
Subject Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
Msg-id CALd+dcfA2-p2CquiokLPxQKWzFP-ggtQ7uqcab3ozYsdajkGAQ@mail.gmail.com
In response to Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
List pgsql-general

On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> A PostgreSQL database with encoding=UTF8 just accepts the whole
> range of Unicode, regardless that a character is defined for the
> code or not.

Interesting... when I converted my application and database to UTF-8 encoding, I discovered that Postgres is picky about UTF-8. Specifically, it rejects the UTF-8 byte sequence 0xed 0xa0 0x8d, which maps to Unicode code point 0xd80d. This looks like a proper character but is in fact not a defined character code point.
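To make the byte-to-code-point mapping concrete, here is a minimal Python sketch that unpacks the three-byte UTF-8 pattern by hand. The function names are made up for illustration, and this is not how Postgres validates input. It only shows the arithmetic, plus the fact that the result lands in the Unicode surrogate range (0xD800 to 0xDFFF), which is reserved for UTF-16 and is not a character on its own.

```python
def decode_three_byte_utf8(b0: int, b1: int, b2: int) -> int:
    """Unpack the payload bits of a 3-byte UTF-8 pattern (no validity checks)."""
    # Layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxx yyyyyy zzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True if the code point is in the range reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


cp = decode_three_byte_utf8(0xED, 0xA0, 0x8D)
print(hex(cp), is_surrogate(cp))  # 0xd80d True
```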

Given the above unicode table:

insert into unicode(id, string) values(1, E'\xed\xa0\x8d');
ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xa0 0x8d

So I think when you present an actual string of UTF-8 encoded bytes, Postgres does refuse unknown characters. However, as you observe, inserting the Unicode code point directly does not produce an error:

insert into unicode(id, string) values(1, U&'\d80d');
INSERT 0 1

I discovered this when that specific byte sequence was found in my database during the conversion. I have no idea what my customer entered in the form to make that sequence, but it was part of the Vietnamese spelling of Ho Chi Minh City as best I could figure.
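For anyone doing a similar conversion, one way to locate such sequences before loading the data is to run the raw bytes through a strict UTF-8 decoder and record where it fails. The sketch below uses Python's built-in decoder as a stand-in. It is an assumption that its notion of "invalid" matches what Postgres rejects in every case, and `find_invalid_utf8` is a hypothetical helper name.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each span the strict decoder rejects."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")  # strict mode raises on the first bad span
            break                       # the rest of the buffer is clean
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = max(pos + exc.end, start + 1)  # always make progress
            problems.append((start, data[start:end]))
            pos = end
    return problems


sample = b"Ho \xed\xa0\x8d Minh"
for offset, bad in find_invalid_utf8(sample):
    print(offset, bad.hex())
```

The decoder may report one rejected sequence as several adjacent one-byte spans, so treat the offsets as pointers to where to look rather than as exact sequence boundaries.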
