Thread: Unicode restriction

Unicode restriction

From
Oliver Elphick
Date:
In src/backend/utils/mb/wchar.c there is a check to exclude Unicode
characters above 0x10000.  I can't see anything to explain this
restriction, except possibly this in the release notes for 7.2:
       Reject invalid multibyte character sequences (Tatsuo)

It does not explain why part of the Unicode character range is invalid. 
There is a Debian bug report from someone whose client is trying to
store characters in the excluded range.  What would be needed to enable
support for it?

-- 
Oliver Elphick                                          olly@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
========================================
   "Love is patient, love is kind. It does not envy, it
    does not boast, it is not proud. It is not rude, it is
    not self seeking, it is not easily angered, it keeps
    no record of wrongs. Love does not delight in evil but
    rejoices with the truth. It always protects, always
    trusts, always hopes, always perseveres."
                                    I Corinthians 13:4-7
 



Re: Unicode restriction

From
Tatsuo Ishii
Date:
> In src/backend/utils/mb/wchar.c there is a check to exclude Unicode
> characters above 0x10000.  I can't see anything to explain this
> restriction, except possibly this in the release notes for 7.2:
> 
>         Reject invalid multibyte character sequences (Tatsuo)
> 
> It does not explain why part of the Unicode character range is invalid. 
> There is a Debian bug report from someone whose client is trying to
> store characters in the excluded range.  What would be needed to enable
> support for it?

Before 7.4, UTF-8 was converted to ISO 10646 so that it could be handled
by the regex routines. Those regex routines had a limitation: they could
not handle multibyte characters wider than 2 bytes. In other words, only
16-bit UCS-2 was supported. That's why ISO 10646 > 0x10000 is rejected.

I'm not sure whether the regex routines included in 7.4 or later still
have this restriction. If not, we could probably remove the check (at
the cost of losing data compatibility).
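
To make the limitation concrete, here is a minimal sketch of the conversion described above: one UTF-8 sequence is decoded to its ISO 10646 code point, and the result is then tested against the 16-bit limit. This is not the code in src/backend/utils/mb/wchar.c. The function names utf8_to_codepoint and fits_in_ucs2 are made up for illustration, and the decoder skips checks a real validator would need (overlong forms, surrogates, the 0x10FFFF upper bound).

```python
# Illustrative sketch only; NOT the PostgreSQL implementation.
# utf8_to_codepoint and fits_in_ucs2 are hypothetical names.

def utf8_to_codepoint(seq: bytes) -> int:
    """Decode exactly one UTF-8 encoded character to its code point."""
    if not seq:
        raise ValueError("empty sequence")
    first = seq[0]
    if first < 0x80:                 # 1 byte:  0xxxxxxx
        length, cp = 1, first
    elif first >> 5 == 0b110:        # 2 bytes: 110xxxxx 10xxxxxx
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 4 bytes: 11110xxx followed by 3 x 10xxxxxx
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if len(seq) != length:
        raise ValueError("sequence length does not match lead byte")
    for b in seq[1:]:
        if b >> 6 != 0b10:           # continuation bytes must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits
    return cp


def fits_in_ucs2(cp: int) -> bool:
    """True if the code point can be stored in a single 16-bit unit."""
    return cp <= 0xFFFF


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\U00010000"):
        cp = utf8_to_codepoint(ch.encode("utf-8"))
        print(hex(cp), len(ch.encode("utf-8")), fits_in_ucs2(cp))
```

Every character that needs a 4-byte UTF-8 sequence decodes to a code point that no longer fits in 16 bits, so a regex engine built on 16-bit characters cannot represent it.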
--
Tatsuo Ishii


Re: Unicode restriction

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Before 7.4, UTF-8 was converted to ISO 10646 so that it could be handled
> by the regex routines. Those regex routines had a limitation: they could
> not handle multibyte characters wider than 2 bytes. In other words, only
> 16-bit UCS-2 was supported. That's why ISO 10646 > 0x10000 is rejected.

> I'm not sure whether the regex routines included in 7.4 or later still
> have this restriction. If not, we could probably remove the check (at
> the cost of losing data compatibility).

It looks to me like the regex routines now use pg_wchar, so I don't
think we need the restriction any longer.
        regards, tom lane