Thread: Re: [HACKERS] Unicode combining characters

Re: [HACKERS] Unicode combining characters

From
Patrice Hédé
Date:
Hi,

I should have sent the patch earlier, but got delayed by other stuff.
Anyway, here is the patch:

- most of the functionality is only activated when MULTIBYTE is
  defined,

- checks for valid UTF-8 characters, so far only client-side and only
  on output; you can still send invalid UTF-8 to the server (so it is
  only partly compliant with Unicode 3.1, but that's better than
  nothing).

- formats output with the correct number of columns (that's why I wrote
  it in the first place, after all), but only for UNICODE. However, the
  code allows plugging in routines for other encodings, as Tatsuo did
  for the other multibyte functions.

- slightly corrects Tatsuo's UTF-8 code to allow Unicode 3.1
  characters (characters with values >= 0x10000, which are encoded in
  four bytes).

- doesn't depend on the locale capabilities of glibc (useful for
  remote telnet).
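
To illustrate the column-count point in the list above, here is a minimal
Python sketch. It is not the code from the patch. It assumes a simplified
rule: combining characters occupy zero columns, East Asian wide and
fullwidth characters occupy two, and everything else occupies one. The
name display_width is hypothetical.

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed to print text.

    Simplified rule (an assumption, not the patch's logic): combining
    marks take 0 columns, East Asian wide/fullwidth characters take 2,
    and every other character takes 1.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            continue  # combining mark: drawn over the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one column.
print(display_width("e\u0301"))
print(display_width("\u65e5\u672c\u8a9e"))  # three wide CJK characters
```

Counting bytes or code points instead would misalign any table column that
contains such characters, which is the problem the patch targets.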

I would like somebody to check it closely, as it is my first patch to
pgsql.  Also, I created dummy .orig files so that the two files I
created are included in the patch; I hope that's the right way.

Now, a lot of functionality is NOT included here, but I will keep that
for 7.3 :) That includes all string checking on the server side (which
will have to be a bit more optimised ;) ), and the input checking on
the client side for UTF-8, though that should not be difficult: it is
just a matter of passing the strings through mbvalidate() before
sending them to the server. Strict checking of UTF-8 strings is
mandatory for compliance with Unicode 3.1+.
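
As a rough idea of what strict checking involves, here is a Python sketch
of a well-formedness test. It is not mbvalidate(), whose behaviour is not
shown in this thread; the name is_valid_utf8 is hypothetical. It rejects
truncated sequences, stray continuation bytes, overlong encodings,
surrogates, and values above U+10FFFF.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (strict check)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                # 1 byte: U+0000..U+007F
            length, cp, low = 1, lead, 0x0
        elif 0xC2 <= lead <= 0xDF:     # 2 bytes (0xC0/0xC1 are always overlong)
            length, cp, low = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:     # 3 bytes
            length, cp, low = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:     # 4 bytes: U+10000..U+10FFFF
            length, cp, low = 4, lead & 0x07, 0x10000
        else:
            return False               # stray continuation byte or invalid lead
        if i + length > n:
            return False               # truncated sequence
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                return False           # not a continuation byte
            cp = (cp << 6) | (b & 0x3F)
        if cp < low:
            return False               # overlong encoding
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            return False               # surrogate or out of range
        i += length
    return True

print(is_valid_utf8("h\u00e9d\u00e9".encode("utf-8")))
print(is_valid_utf8(b"\xc0\xaf"))  # overlong encoding of "/"
```

A client could run such a check on each string before sending it, which is
the step described above as still missing on the input side.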

Do I have time to work on a patch to include iso-8859-15 for 7.2?
The euro arrives on 1 January 2002 (before 7.3!), and over 280
million people in Europe will need the euro sign; only iso-8859-15
and iso-8859-16 have it (and unfortunately, I don't think all Unices
will switch to Unicode in the meantime)...
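
The encoding difference can be checked quickly with Python's standard
codecs. This is only an illustration of the character sets themselves, not
of anything in PostgreSQL; the helper name euro_byte is hypothetical.

```python
EURO = "\u20ac"  # EURO SIGN

def euro_byte(codec: str):
    """Return the euro sign encoded in codec, or None if the codec lacks it."""
    try:
        return EURO.encode(codec)
    except UnicodeEncodeError:
        return None

for codec in ("iso-8859-1", "iso-8859-15", "iso-8859-16"):
    print(codec, euro_byte(codec))
```

iso-8859-1 has no slot for the character, while iso-8859-15 and
iso-8859-16 both place it at byte 0xA4.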

Err... yes, I know that not every single person in Europe uses
PostgreSQL, so it's not exactly 280m, but it's just a matter of
time! ;)

I'll come back (on pgsql-hackers) later to ask a few questions
regarding full Unicode support (normalisation, collation,
regexes, ...) on the server side :)

Here is the patch !

Patrice.

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
  --  Isn't it weird  how scientists  can imagine  all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
  -- What would _you_ call the creation of the universe ?
  -- "The HORRENDOUS SPACE KABLOOIE !"               - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----

Re: [HACKERS] Unicode combining characters

From
Tatsuo Ishii
Date:
> - slightly corrects Tatsuo's UTF-8 code to allow Unicode 3.1
>   characters (characters with values >= 0x10000, which are encoded in
>   four bytes).

After applying your patches, are 4-byte UTF-8 sequences converted to
UCS-2 (2 bytes) or UCS-4 (4 bytes) in pg_utf2wchar_with_len()? If it is
4 bytes, we are in trouble. The current regex implementation does not
handle 4-byte-wide charsets.
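
The size question can be made concrete with a small Python sketch. It does
not reproduce pg_utf2wchar_with_len(); the name decode_4byte is
hypothetical, and the sketch only shows the arithmetic: a four-byte UTF-8
sequence carries 21 payload bits, so the resulting code point is above
0xFFFF and cannot be held in a single 16-bit unit.

```python
def decode_4byte(seq: bytes) -> int:
    """Decode one well-formed 4-byte UTF-8 sequence to its code point."""
    b0, b1, b2, b3 = seq
    return (((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
            | ((b2 & 0x3F) << 6) | (b3 & 0x3F))

seq = "\U00010300".encode("utf-8")  # U+10300 OLD ITALIC LETTER A
cp = decode_4byte(seq)
print(seq.hex(), hex(cp))
print(cp > 0xFFFF)  # True: the value does not fit in 16 bits
```
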
--
Tatsuo Ishii

Re: [HACKERS] Unicode combining characters

From
Patrice Hédé
Date:
* Tatsuo Ishii <t-ishii@sra.co.jp> [011009 18:38]:
> > - slightly corrects Tatsuo's UTF-8 code to allow Unicode 3.1
> >   characters (characters with values >= 0x10000, which are encoded in
> >   four bytes).
>
> After applying your patches, are 4-byte UTF-8 sequences converted to
> UCS-2 (2 bytes) or UCS-4 (4 bytes) in pg_utf2wchar_with_len()? If it is
> 4 bytes, we are in trouble. The current regex implementation does not
> handle 4-byte-wide charsets.

*sigh* yes, it does encode to four bytes :(

Three solutions then :

1) we support these supplementary characters, knowing that they won't
   work with regexes,

2) I back out the change, but then anyone using these characters will
   get something weird, since the decoding would be faulty (they would
   be handled as 3 bytes UTF-8 chars, and then the fourth byte would
   become a "faulty char"... not very good, as the 3-byte version is
   still not a valid UTF-8 code !),

3) we fix the regex engine within the next 24 hours, before the beta
   deadline is activated :/
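
To show what goes wrong in option 2), here is a Python sketch of a length
rule that knows nothing beyond three-byte sequences. This is a model for
illustration only, assuming the unpatched code treats every lead byte with
the top three bits set as starting a three-byte character; the name
naive_split is hypothetical.

```python
from typing import List

def naive_split(data: bytes) -> List[bytes]:
    """Split data into 'characters' using a rule that stops at 3 bytes."""
    out, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead & 0x80 == 0:
            length = 1
        elif lead & 0xE0 == 0xC0:
            length = 2
        elif lead & 0xE0 == 0xE0:
            length = 3   # assumption: 0xF0..0xF7 leads also land here
        else:
            length = 1   # stray continuation byte treated as its own char
        out.append(data[i:i + length])
        i += length
    return out

data = "\U00010300a".encode("utf-8")  # one 4-byte character, then "a"
print([chunk.hex() for chunk in naive_split(data)])
```

Under this model the sequence f0 90 8c 80 is cut into f0 90 8c plus a lone
80, so one character turns into two pieces, neither of which is valid
UTF-8.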

I must say that I doubt that anyone will use these characters in the
next few months: these are mostly Chinese extended characters, along
with the Old Italic, Deseret, and Gothic scripts, Byzantine and Western
musical symbols, and the mathematical alphanumeric symbols.

I would prefer solution 1), as I think it is better to allow these
characters, even with a temporary restriction on the regex, than to
fail completely on them. As for solution 3), we may still work at it
in the next few months :) [I haven't even looked at the regex engine
yet, so I don't know the implications of what I have just said !]

What do you think ?

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/

Re: [HACKERS] Unicode combining characters

From
Tatsuo Ishii
Date:
> > After applying your patches, are 4-byte UTF-8 sequences converted to
> > UCS-2 (2 bytes) or UCS-4 (4 bytes) in pg_utf2wchar_with_len()? If it is
> > 4 bytes, we are in trouble. The current regex implementation does not
> > handle 4-byte-wide charsets.
>
> *sigh* yes, it does encode to four bytes :(
>
> Three solutions then :
>
> 1) we support these supplementary characters, knowing that they won't
>    work with regexes,
>
> 2) I back out the change, but then anyone using these characters will
>    get something weird, since the decoding would be faulty (they would
>    be handled as 3 bytes UTF-8 chars, and then the fourth byte would
>    become a "faulty char"... not very good, as the 3-byte version is
>    still not a valid UTF-8 code !),
>
> 3) we fix the regex engine within the next 24 hours, before the beta
>    deadline is activated :/
>
> I must say that I doubt that anyone will use these characters in the
> next few months: these are mostly Chinese extended characters, along
> with the Old Italic, Deseret, and Gothic scripts, Byzantine and Western
> musical symbols, and the mathematical alphanumeric symbols.
>
> I would prefer solution 1), as I think it is better to allow these
> characters, even with a temporary restriction on the regex, than to
> fail completely on them. As for solution 3), we may still work at it
> in the next few months :) [I haven't even looked at the regex engine
> yet, so I don't know the implications of what I have just said !]
>
> What do you think ?

I think 2) is not very good, and we should reject these 4-byte UTF-8
strings. After all, we are not ready for them.

BTW, the other parts of your patch look good. Peter, what do you think?
--
Tatsuo Ishii

Re: [HACKERS] Unicode combining characters

From
Patrice Hédé
Date:
> > 1) we support these supplementary characters, knowing that they won't
> >    work with regexes,
> >
> > 2) I back out the change, but then anyone using these characters will
> >    get something weird, since the decoding would be faulty (they would
> >    be handled as 3 bytes UTF-8 chars, and then the fourth byte would
> >    become a "faulty char"... not very good, as the 3-byte version is
> >    still not a valid UTF-8 code !),
> >
> > 3) we fix the regex engine within the next 24 hours, before the beta
> >    deadline is activated :/
> >
> > What do you think ?
>
> I think 2) is not very good, and we should reject these 4-byte UTF-8
> strings. After all, we are not ready for them.

If we still recognise them as 4-byte UTF-8 chars (in order to parse
the next char correctly) and reject them as invalid chars, that should
be OK :)
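
A minimal Python sketch of that compromise follows, assuming the scanner
only looks at lead bytes (continuation bytes are not checked here). It is
not the patch's code, and the name scan is hypothetical. Four-byte
sequences are consumed as one unit so that the following character is
parsed from the right offset, but they are flagged as rejected.

```python
def scan(data: bytes):
    """Yield (sequence, accepted) pairs for each character in data.

    4-byte sequences are recognised, so the scanner stays in sync,
    but they are reported as not accepted.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, accepted = 1, True
        elif lead & 0xE0 == 0xC0:
            length, accepted = 2, True
        elif lead & 0xF0 == 0xE0:
            length, accepted = 3, True
        elif lead & 0xF8 == 0xF0:
            length, accepted = 4, False   # recognised, but rejected
        else:
            length, accepted = 1, False   # stray continuation or bad lead
        yield data[i:i + length], accepted
        i += length

for seq, ok in scan("\u00e9\U00010300a".encode("utf-8")):
    print(seq.hex(), "ok" if ok else "rejected")
```

The "a" after the rejected character is still seen as a single, valid
character, which is the point of recognising the length first.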

> BTW, the other parts of your patch look good. Peter, what do you think?

Nice to hear :)

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/