On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> A PostgreSQL database with encoding=UTF8 just accepts the whole
> range of Unicode, regardless that a character is defined for the
> code or not.
Interesting... when I converted my application and database to utf8 encoding, I discovered that Postgres is picky about UTF-8. Specifically the UTF-8 code point 0xed 0xa0 0x8d which maps to UNICODE code point 0xd80d. This looks like a proper character but in fact is not a defined character code point.
Given the above unicode table:
insert into unicode(id, string) values(1, E'\xed\xa0\x8d');
ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0x8d
So I think when you present an actual string of UTF8 encoded characters, Postgres does refuse characters unknown. However, as you observe, inserting the unicode code point directly does not produce an error:
insert into unicode(id, string) values(1, U&'\d80d');
INSERT 0 1
I discovered this when that specific byte sequence was found in my database during the conversion. I have no idea what my customer entered in the form to make that sequence, but it was part of the Vietnamese spelling of Ho Chi Minh City as best I could figure.