Thread: UNICODE characters above 0x10000

UNICODE characters above 0x10000

From
"John Hansen"
Date:
I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.


Regards,

John



Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
Attached, as promised, small patch removing the limitation, adding
correct utf8 validation.

Regards,

John

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen
Sent: Friday, August 06, 2004 2:20 PM
To: 'Hackers'
Subject: [HACKERS] UNICODE characters above 0x10000

I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.


Regards,

John





Attachment

Re: UNICODE characters above 0x10000

From
Tom Lane
Date:
"John Hansen" <john@geeknet.com.au> writes:
> Attached, as promised, small patch removing the limitation, adding
> correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

            regards, tom lane

Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:
> Attached, as promised, small patch removing the limitation, adding
> correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

            regards, tom lane



Attachment

Re: UNICODE characters above 0x10000

From
Tom Lane
Date:
"John Hansen" <john@geeknet.com.au> writes:
> My apologies for not reading the code properly.

> Attached patch using pg_utf_mblen() instead of an indexed table.
> It now also does bounds checks.

I think you missed my point.  If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable).  What's really at stake here is whether anything else
breaks if we do that.  What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.

            regards, tom lane

Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
Possibly, since I got it wrong once more...
About to give up, but attached is an updated patch.


Regards,

John Hansen

-----Original Message-----
From: Oliver Elphick [mailto:olly@lfix.co.uk]
Sent: Saturday, August 07, 2004 3:56 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
> I'm not an expert on this stuff, so I don't know what the UTF8 spec
> actually says.  But I do think you are fixing the code at the wrong
> level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many
of our supported platforms don't have these?  If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

--
Oliver Elphick                                          olly@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
                 ========================================
     "Be still before the LORD and wait patiently for him;
      do not fret when men succeed in their ways, when they
      carry out their wicked schemes."
                            Psalms 37:7




Attachment

Re: UNICODE characters above 0x10000

From
Dennis Bjorklund
Date:
On Sat, 7 Aug 2004, Tom Lane wrote:

> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

I can give some general info about utf-8. This is how it is encoded:

character            encoding
-------------------  ---------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1, then the number of leading ones gives
the length of the utf-8 sequence, and the rest of the bytes in the sequence
always start with 10 (this makes it possible to look anywhere in the
string and quickly find the start of a character).

This also means that the start byte can never start with 7 or 8 ones, that
is illegal and should be tested for and rejected. So the longest utf-8
sequence is 6 bytes (and the longest character needs 4 bytes (or 31
bits)).
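
To make the table above concrete, here is a rough C sketch (not taken from
any of the attached patches, and the function names are invented) of how the
first byte gives the sequence length and how the continuation bytes can be
checked:

    #include <stddef.h>

    /* Sequence length implied by the first byte, per the table above,
     * or -1 for an illegal start byte (a stray 10xxxxxx, 0xFE, or 0xFF). */
    int
    utf8_seq_len(unsigned char first)
    {
        if ((first & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
        if ((first & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
        if ((first & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
        if ((first & 0xF8) == 0xF0) return 4;   /* 11110xxx */
        if ((first & 0xFC) == 0xF8) return 5;   /* 111110xx */
        if ((first & 0xFE) == 0xFC) return 6;   /* 1111110x */
        return -1;                              /* illegal start byte */
    }

    /* Check one sequence: valid start byte, enough bytes available,
     * and every continuation byte looks like 10xxxxxx. */
    int
    utf8_seq_ok(const unsigned char *s, size_t avail)
    {
        int len = utf8_seq_len(s[0]);
        int i;

        if (len < 0 || (size_t) len > avail)
            return 0;                           /* bad start byte or truncated */
        for (i = 1; i < len; i++)
            if ((s[i] & 0xC0) != 0x80)
                return 0;                       /* continuation byte not 10xxxxxx */
        return 1;
    }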

--
/Dennis Björklund


Re: UNICODE characters above 0x10000

From
Tom Lane
Date:
Oliver Elphick <olly@lfix.co.uk> writes:
> glibc provides various routines (mb...) for handling Unicode.  How many
> of our supported platforms don't have these?

Every one that doesn't use glibc.  Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).

            regards, tom lane

Re: UNICODE characters above 0x10000

From
Oliver Elphick
Date:
On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many
of our supported platforms don't have these?  If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

--
Oliver Elphick                                          olly@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
                 ========================================
     "Be still before the LORD and wait patiently for him;
      do not fret when men succeed in their ways, when they
      carry out their wicked schemes."
                            Psalms 37:7


Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF8; the bytes FE and FF never
appear in a UTF8 byte sequence, for instance.
UTF8 is more than two bytes, by the way: up to 6 bytes are used to
represent a UTF8 character.
The 5- and 6-byte characters are currently not in use, though.

I didn't actually notice the difference in UTF8 width between my
original patch and my last, so attached, updated patch.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:
> My apologies for not reading the code properly.

> Attached patch using pg_utf_mblen() instead of an indexed table.
> It now also does bounds checks.

I think you missed my point.  If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable).  What's really at stake here is whether anything else
breaks if we do that.  What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.

            regards, tom lane



Attachment

Re: UNICODE characters above 0x10000

From
Tom Lane
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:
> ... This also means that the start byte can never start with 7 or 8
> ones, that is illegal and should be tested for and rejected. So the
> longest utf-8 sequence is 6 bytes (and the longest character needs 4
> bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width).  I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width.  The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

            regards, tom lane

Re: UNICODE characters above 0x10000

From
Dennis Bjorklund
Date:
On Sat, 7 Aug 2004, Tom Lane wrote:

> question at hand is whether we can support 32-bit characters or not ---
> and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple
testing to make sure regexps work, and then treat anything else that might
not work as bugs to be fixed later on when found.

The alternative is to inspect all code paths that involve strings, which
is not fun at all :-)

My previous mail talked about the utf-8 encoding. Not all characters that
can be formed with utf-8 are assigned by the Unicode org. However, the
part that interprets the Unicode strings is in the OS, so different OSes
can give different results. So I think pg should just accept even 6-byte
utf-8 sequences, even if some characters are not currently assigned.

--
/Dennis Björklund


Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
This should do it.

Regards,

John Hansen

-----Original Message-----
From: Dennis Bjorklund [mailto:db@zigo.dhs.org]
Sent: Saturday, August 07, 2004 5:02 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tom Lane wrote:

> question at hand is whether we can support 32-bit characters or not
> --- and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple
testing to make sure regexps work, and then treat anything else that might
not work as bugs to be fixed later on when found.

The alternative is to inspect all code paths that involve strings, which
is not fun at all :-)

My previous mail talked about the utf-8 encoding. Not all characters that
can be formed with utf-8 are assigned by the Unicode org. However, the
part that interprets the Unicode strings is in the OS, so different OSes
can give different results. So I think pg should just accept even 6-byte
utf-8 sequences, even if some characters are not currently assigned.

--
/Dennis Björklund




Attachment

Re: [PATCHES] UNICODE characters above 0x10000

From
Tatsuo Ishii
Date:
> Dennis Bjorklund <db@zigo.dhs.org> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones, that is illegal and should be tested for and rejected. So the
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4
> > bytes (or 31 bits)).
>
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide internal
> characters (ie, 16-bit pg_wchar datatype width).  I believe that the
> regex library limitation here is gone, and that as far as that library
> is concerned we could assume a 32-bit internal character width.  The
> question at hand is whether we can support 32-bit characters or not ---
> and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype.  However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only
uses code points up to 0x0010FFFF, so 24 bits should be enough...
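
As a small illustration of that point (not from any attached patch; the
helper name is invented), even the largest code point Unicode assigns,
U+10FFFF, decodes from its 4-byte utf-8 form into a value that fits easily
in a 32-bit pg_wchar:

    #include <stdio.h>

    /* Illustration only: decode one well-formed utf-8 sequence of known length. */
    unsigned int
    utf8_decode(const unsigned char *s, int len)
    {
        unsigned int ch;
        int i;

        if (len == 1)
            return s[0];                        /* plain ASCII */
        ch = s[0] & (0xFF >> (len + 1));        /* strip the 110../1110../11110.. marker */
        for (i = 1; i < len; i++)
            ch = (ch << 6) | (s[i] & 0x3F);     /* append 6 bits per continuation byte */
        return ch;
    }

    int
    main(void)
    {
        const unsigned char max[] = {0xF4, 0x8F, 0xBF, 0xBF};  /* U+10FFFF */

        printf("0x%06X\n", utf8_decode(max, 4));               /* prints 0x10FFFF */
        return 0;
    }
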
--
Tatsuo Ishii

Re: UNICODE characters above 0x10000

From
Christopher Kings-Lynne
Date:
> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

Surely there are UTF-8 codes that are at least 3 bytes.  I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to something like 4 bytes for some Asian code points?

Chris


Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
Four, actually; 0x10FFFF needs four bytes:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
10FFFF = 00010000 11111111 11111111

Fill in the blanks, starting from the bottom, and you get:
11110100 10001111 10111111 10111111
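
A minimal C sketch of that bit-filling, for the 4-byte case only
(illustration only, not part of the attached patch):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned int cp = 0x10FFFF;
        unsigned char b[4];

        b[0] = 0xF0 | (cp >> 18);               /* 11110xxx */
        b[1] = 0x80 | ((cp >> 12) & 0x3F);      /* 10xxxxxx */
        b[2] = 0x80 | ((cp >> 6) & 0x3F);       /* 10xxxxxx */
        b[3] = 0x80 | (cp & 0x3F);              /* 10xxxxxx */

        printf("%02X %02X %02X %02X\n", b[0], b[1], b[2], b[3]);  /* F4 8F BF BF */
        return 0;
    }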

Regards,

John Hansen

-----Original Message-----
From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
> I'm not an expert on this stuff, so I don't know what the UTF8 spec
> actually says.  But I do think you are fixing the code at the wrong
> level.

Surely there are UTF-8 codes that are at least 3 bytes.  I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to something like 4 bytes for some Asian code points?

Chris




Re: UNICODE characters above 0x10000

From
Oliver Elphick
Date:
On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
> Oliver Elphick <olly@lfix.co.uk> writes:
> > glibc provides various routines (mb...) for handling Unicode.  How many
> > of our supported platforms don't have these?
>
> Every one that doesn't use glibc.  Don't bother proposing a glibc-only
> solution (and that's from someone who works for a glibc-only company;
> you don't even want to think about the push-back you'll get from other
> quarters).

No, that's not what I was proposing.  My suggestion was to use these
routines if they are sufficiently widely implemented, and our own
routines where standard ones are not available.

The man page for mblen says
"CONFORMING TO
       ISO/ANSI C, UNIX98"

Is glibc really the only C library to conform?

If using the mb... routines isn't feasible, IBM's ICU library
(http://oss.software.ibm.com/icu/) is available under the X licence,
which is compatible with BSD as far as I can see.  Besides character
conversion, ICU can also do collation in various locales and encodings.
My point is, we shouldn't be writing a new set of routines to do half a
job if there are already libraries available to do all of it.
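
For what it's worth, here is a minimal sketch of validating a string with
the standard mblen() routine, assuming a UTF-8 locale such as en_US.UTF-8
is installed (which is exactly the portability question):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(void)
    {
        const char *s = "caf\xC3\xA9";          /* "café" in utf-8 */
        size_t remaining = strlen(s);

        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
            return 1;                           /* no such locale installed */

        while (remaining > 0)
        {
            int n = mblen(s, remaining);        /* bytes in next multibyte char */

            if (n <= 0)
            {
                printf("invalid sequence\n");
                return 1;
            }
            s += n;
            remaining -= n;
        }
        printf("valid\n");
        return 0;
    }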

--
Oliver Elphick                                          olly@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
                 ========================================
     "Be still before the LORD and wait patiently for him;
      do not fret when men succeed in their ways, when they
      carry out their wicked schemes."
                            Psalms 37:7


Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
> -----Original Message-----
> From: Oliver Elphick [mailto:olly@lfix.co.uk]
> Sent: Sunday, August 08, 2004 7:43 AM
> To: Tom Lane
> Cc: John Hansen; Hackers; Patches
> Subject: Re: [HACKERS] UNICODE characters above 0x10000
>
> On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
> > Oliver Elphick <olly@lfix.co.uk> writes:
> > > glibc provides various routines (mb...) for handling Unicode.  How
> > > many of our supported platforms don't have these?
> >
> > Every one that doesn't use glibc.  Don't bother proposing a glibc-only
> > solution (and that's from someone who works for a glibc-only company;
> > you don't even want to think about the push-back you'll get from other
> > quarters).
>
> No. that's not what I was proposing.  My suggestion was to
> use these routines if they are sufficiently widely
> implemented, and our own routines where standard ones are not
> available.
>
> The man page for mblen says
> "CONFORMING TO
>        ISO/ANSI C, UNIX98"
>
> Is glibc really the only C library to conform?
>
> If using the mb... routines isn't feasible, IBM's ICU library
> (http://oss.software.ibm.com/icu/) is available under the X
> licence, which is compatible with BSD as far as I can see.
> Besides character conversion, ICU can also do collation in
> various locales and encodings.
> My point is, we shouldn't be writing a new set of routines to
> do half a job if there are already libraries available to do
> all of it.
>

This sounds like a brilliant move, if anything.

Kind Regards,

John Hansen


Re: UNICODE characters above 0x10000

From
Tom Lane
Date:
"John Hansen" <john@geeknet.com.au> writes:
> Ahh, but that's not the case. You cannot just delete the check, since
> not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
> appear in a byte sequence for instance.

Well, this is still working at the wrong level.  The code that's in
pg_verifymbstr is mainly intended to enforce the *system wide*
assumption that multibyte characters must have the high bit set in
every byte.  (We do not support encodings without this property in
the backend, because it breaks code that looks for ASCII characters
... such as the main parser/lexer ...)  It's not really intended to
check that the multibyte character is actually legal in its encoding.

The "special UTF-8 check" was never more than a very quick-n-dirty hack
that was in the wrong place to start with.  We ought to be getting rid
of it, not institutionalizing it.  If you want an exact encoding-specific
check on the legitimacy of a multibyte sequence, I think the right way
to do it is to add another function pointer to pg_wchar_table entries to
let each encoding have its own check routine.  Perhaps this could be
defined so as to avoid a separate call to pg_mblen inside the loop, and
thereby not add any new overhead.  I'm thinking about an API something
like

    int validate_mbchar(const unsigned char *str, int len)

with result +N if a valid character N bytes long is present at
*str, and -N if an invalid character is present at *str and
it would be appropriate to display N bytes in the complaint.
(N must be <= len in either case.)  This would reduce the main
loop of pg_verifymbstr to a call of this function and an
error-case-handling block.
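
For illustration only, a minimal sketch of what a UTF-8 check routine with
that signature might look like (the function name is invented here, and it
stops at 4-byte sequences; a real routine might also need to deal with the
5- and 6-byte forms discussed above):

    /* Returns +N for a valid N-byte character at *str, or -N when the
     * first N bytes should be shown in the complaint (N <= len). */
    int
    pg_utf8_validate_char(const unsigned char *str, int len)
    {
        int seqlen, i;

        if (str[0] < 0x80)
            return 1;                           /* plain ASCII */
        else if ((str[0] & 0xE0) == 0xC0)
            seqlen = 2;
        else if ((str[0] & 0xF0) == 0xE0)
            seqlen = 3;
        else if ((str[0] & 0xF8) == 0xF0)
            seqlen = 4;
        else
            return -1;                          /* 0xFE, 0xFF, or stray 10xxxxxx */

        if (seqlen > len)
            return -len;                        /* sequence runs off end of string */

        for (i = 1; i < seqlen; i++)
            if ((str[i] & 0xC0) != 0x80)
                return -i;                      /* bad continuation byte; show i bytes */

        return seqlen;
    }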

            regards, tom lane

Re: UNICODE characters above 0x10000

From
"John Hansen"
Date:
> Well, this is still working at the wrong level.  The code
> that's in pg_verifymbstr is mainly intended to enforce the
> *system wide* assumption that multibyte characters must have
> the high bit set in every byte.  (We do not support encodings
> without this property in the backend, because it breaks code
> that looks for ASCII characters ... such as the main
> parser/lexer ...)  It's not really intended to check that the
> multibyte character is actually legal in its encoding.
>

Ok, point taken.

> The "special UTF-8 check" was never more than a very
> quick-n-dirty hack that was in the wrong place to start with.
>  We ought to be getting rid of it not institutionalizing it.
> If you want an exact encoding-specific check on the
> legitimacy of a multibyte sequence, I think the right way to
> do it is to add another function pointer to pg_wchar_table
> entries to let each encoding have its own check routine.
> Perhaps this could be defined so as to avoid a separate call
> to pg_mblen inside the loop, and thereby not add any new
> overhead.  I'm thinking about an API something like
>
>     int validate_mbchar(const unsigned char *str, int len)
>
> with result +N if a valid character N bytes long is present
> at *str, and -N if an invalid character is present at *str
> and it would be appropriate to display N bytes in the complaint.
> (N must be <= len in either case.)  This would reduce the
> main loop of pg_verifymbstr to a call of this function and an
> error-case-handling block.
>

Sounds like a plan...

>             regards, tom lane
>
>

Regards,

John Hansen