Thread: Unicode combining characters
Hi all,

while working on a new project involving PostgreSQL and making some
tests, I have come up with the following output from psql :

 lang | length | length | text      | text
------+--------+--------+-----------+-----------
 isl  |      7 |      6 | álíta     | áleit
 isl  |      7 |      7 | álíta     | álitum
 isl  |      7 |      7 | álíta     | álitið
 isl  |      5 |      4 | maður     | mann
 isl  |      5 |      7 | maður     | mönnum
 isl  |      5 |      5 | maður     | manna
 isl  |      5 |      4 | óska      | -aði

[the misalignment is what I got, it's not a copy-paste error]

This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
querying a database created with -E UNICODE (by the way, these are
icelandic words :) ).

What you see above is misleading, since it's not possible to see that
'á', 'í', 'ó' and 'ö' are using combining marks, while 'ð' is not.

As a reminder, a combining mark in Unicode means that á is actually
encoded as a + ' (where ' is the acute combining mark).

Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301],
instead of <c3 a1> [UTF16: 00E1].

The "length" fields are what is returned by length(a.text) and
length(b.text).

So, this shows two problems :

- length() on the server side doesn't handle Unicode correctly [I have
  the same result with char_length()], and returns the number of chars
  (as it is however advertised to do), rather than the length of the
  string.

- the psql frontend makes the same mistake.

I am using version 7.1.3 (debian sid), so it may have been corrected
in the meantime (in this case, I apologise, but I have only recently
started again to use PostgreSQL and I haven't followed -hackers long
enough).

=> I think fixing psql shouldn't be too complicated, as the glibc
should be providing the locale, and return the right values (is this
the case ? and what happens for combined latin + chinese characters
for example ? I'll have to try that later). If it's not fixed already,
do you want me to look at this ? [it will take some time, as I haven't
set up any development environment for postgres yet, and I'm away for
one week from thursday].

=> regarding the backend, it may be more complex, as the underlying
system may not provide any UTF-8 locale to use (!= from being UTF-8
aware : an administrator may have decided that UTF-8 locales are
useless on a server, as only root connections are made, and he wants
only the C locale on the console - I've seen that quite often ;) ).

This brings me to another subject : I will need to support the full
Unicode collation algorithm (UCA, as described in TR#10 [1] of the
Unicode consortium), and I will need to be able to sort according to
locales which may not be installed on the backend server (some of
which may not even be recognised by GNU libc, which already supports
more than 140 locales -- artificial languages would be an example). I
will also need to be able to normalise the unicode strings (TR#15 [2])
so that I don't have some characters in legacy codepoints [as 00E1
above], and others with combining marks.

There is today an implementation in perl of the needed functionality,
in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
:( ). But as they are Perl modules, the untrusted version of perl,
plperlu, will be needed, and it's a pity for what I consider a core
functionality in the future (not that plperlu isn't a good thing - I
can't wait for it ! - but that an untrusted pl language is needed to
support normalisation and collation).

Note also that there are a lot of data associated with these
algorithms, as you could expect.
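To make the byte-level difference concrete, here is a minimal standalone
C sketch (an illustration only, not PostgreSQL code): it prints the byte
count of each of the two encodings of 'á' given above, which is
effectively what a byte-oriented length() reports.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* 'á' precomposed: U+00E1, UTF-8 bytes C3 A1 */
        const char precomposed[] = "\xc3\xa1";
        /* 'á' decomposed: 'a' (U+0061) + combining acute (U+0301),
           UTF-8 bytes 61 CC 81 */
        const char decomposed[] = "\x61\xcc\x81";

        printf("precomposed: %d bytes\n", (int) strlen(precomposed)); /* 2 */
        printf("decomposed : %d bytes\n", (int) strlen(decomposed));  /* 3 */

        /* Both forms display as a single character, so neither the byte
           count nor the code-point count matches what the user sees. */
        return 0;
    }

Normalising both forms to a single representation (TR#15, discussed
below) is what makes such counts agree again.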
I was wondering if some people have already thought about this, or
already done something, or if some of you are interested in this. If
nobody does anything, I'll do something eventually, probably before
Christmas (I don't have much time for this, and I don't need the
functionality right now), but if there is an interest, I could team
with others and develop it faster :)

Anyway, I'm open to suggestions :

- implement it in C, in the core,

- implement it in C, as contributed custom functions,

- implement it in perl (by reusing Unicode:: work), in a trusted plperl,

- implement it in perl, calling Unicode:: modules, in an untrusted
  plperl.

and then :

- provide the data in tables (system and/or user) - which should be
  available across databases,

- load the data from the original text files provided in Unicode (and
  others as needed), if the functionality is compiled into the server.

- I believe the basic unicode information should be standard, and the
  locales should be provided as contrib/ files to be plugged in as
  needed.

I can't really accept a solution which would rely on the underlying
libc, as it may not provide the necessary locales (or maybe, then,
have a way to override the collating tables by user tables - actually,
this would certainly be the best solution if it's in the core, as the
tables will put an extra burden on the distribution and the
installation footprint, especially if the tables are already there,
for glibc, for perl5.6+, for other software dealing with Unicode).

The main functions I foresee are :

- provide a normalisation function to all 4 forms,

- provide a collation_key(text, language) function, as the calculation
  of the key may be expensive, some may want to index on the result (I
  would :) ),

- provide a collation algorithm, using the two previous facilities,
  which can do primary to tertiary collation (cf TR#10 for a detailed
  explanation).

I haven't looked at PostgreSQL code yet (shame !), so I may be
completely off-track, in which case I'll retract myself and won't
bother you again (on that subject, that is ;) )...

Comments ?

Patrice.

[1] http://www.unicode.org/unicode/reports/tr10/

[2] http://www.unicode.org/unicode/reports/tr15/

--
Patrice HÉDÉ ------------------------------- patrice à islande.org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
 -- What would _you_ call the creation of the universe?
 -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
> So, this shows two problems :
>
> - length() on the server side doesn't handle correctly Unicode [I have
>   the same result with char_length()], and returns the number of chars
>   (as it is however advertised to do), rather the length of the
>   string.

This is a known limitation.

> - the psql frontend makes the same mistake.
>
> I am using version 7.1.3 (debian sid), so it may have been corrected
> in the meantime (in this case, I apologise, but I have only recently
> started again to use PostgreSQL and I haven't followed -hackers long
> enough).
>
> => I think fixing psql shouldn't be too complicated, as the glibc
> should be providing the locale, and return the right values (is this
> the case ? and what happens for combined latin + chinese characters
> for example ? I'll have to try that later). If it's not fixed already,
> do you want me to look at this ? [it will take some time, as I haven't
> set up any development environment for postgres yet, and I'm away for
> one week from thursday].

Sounds great.

> I was wondering if some people have already thought about this, or
> already done something, or if some of you are interested in this. If
> nobody does anything, I'll do something eventually, probably before
> Christmas (I don't have much time for this, and I don't need the
> functionality right now), but if there is an interest, I could team
> with others and develop it faster :)

I'm very interested in your point. I will start studying [1][2] after
the beta freeze.

> Anyway, I'm open to suggestions :
>
> - implement it in C, in the core,
>
> - implement it in C, as contributed custom functions,

This may be a good starting point.

> I can't really accept a solution which would rely on the underlaying
> libc, as it may not provide the necessary locales (or maybe, then,

I totally agree here.

> The main functions I foresee are :
>
> - provide a normalisation function to all 4 forms,
>
> - provide a collation_key(text, language) function, as the calculation
>   of the key may be expensive, some may want to index on the result (I
>   would :) ),
>
> - provide a collation algorithm, using the two previous facilities,
>   which can do primary to tertiary collation (cf TR#10 for a detailed
>   explanation).
>
> I haven't looked at PostgreSQL code yet (shame !), so I may be
> completely off-track, in which case I'll retract myself and won't
> bother you again (on that subject, that is ;) )...
>
> Comments ?

--
Tatsuo Ishii
Looks like a good project for 7.3. Probably the best starting point
would be to develop contrib/unicode with a smooth transition to core.

	Oleg

On Mon, 24 Sep 2001, Patrice [iso-8859-15] Hédé wrote:

> Hi all,
>
> while working on a new project involving PostgreSQL and making some
> tests, I have come up with the following output from psql :
>
>  lang | length | length | text      | text
> ------+--------+--------+-----------+-----------
>  isl  |      7 |      6 | álíta     | áleit
>  isl  |      7 |      7 | álíta     | álitum
>  isl  |      7 |      7 | álíta     | álitið
>  isl  |      5 |      4 | maður     | mann
>  isl  |      5 |      7 | maður     | mönnum
>  isl  |      5 |      5 | maður     | manna
>  isl  |      5 |      4 | óska      | -aði
>
> [the misalignment is what I got, it's not a copy-paste error]
>
> [...]

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Hi,

* Tatsuo Ishii <t-ishii@sra.co.jp> [010925 18:18]:
> > So, this shows two problems :
> >
> > - length() on the server side doesn't handle correctly Unicode [I
> >   have the same result with char_length()], and returns the number
> >   of chars (as it is however advertised to do), rather the length
> >   of the string.
>
> This is a known limitation.

To solve this, we could use wcwidth() (there is a custom
implementation for the systems which don't have it in the glibc). I'll
have a look at it later.

> > - the psql frontend makes the same mistake.

Same thing here. I have just installed the CVS and downloaded the
development version (thanks Baldvin), tested that the stock version
compiles fine, and I'll now have a look at how to make this work. :)

I'll send a patch when I have this working here.

> Sounds great.

[Unicode normalisation and collation in the backend]

> I'm very interested in your point. I will start studying [1][2] after
> the beta freeze.
>
> > Anyway, I'm open to suggestions :
> >
> > - implement it in C, in the core,
> >
> > - implement it in C, as contributed custom functions,
>
> This may be a good starting point.
>
> > I can't really accept a solution which would rely on the underlaying
> > libc, as it may not provide the necessary locales (or maybe, then,
>
> I totally agree here.

As Oleg suggested, I will try to aim for 7.3, first with a version in
contrib, and later, if the implementation is fine, it could be moved
to the core (or not ? Though it would be nice to make sure every
PostgreSQL installation which supports unicode has it, so that users
won't need to have administrative rights to use the functionality).

I think I will go for a C version, and probably the collation and
normalisation data in tables, with some way to override the defaults
with secondary tables... I'll report as soon as I have something +/-
working.

> --
> Tatsuo Ishii

Patrice.

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
 -- What would _you_ call the creation of the universe?
 -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
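A rough sketch of the wcwidth()-based approach mentioned above, written
as a standalone program rather than the actual psql patch; the helper
name display_width() is invented for the illustration. Note that it goes
through mbstowcs()/wcswidth() and therefore relies on the process
locale, which is precisely the dependency questioned in the next
message.

    #define _XOPEN_SOURCE 700   /* for wcswidth() */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    /* Number of terminal columns a multibyte string occupies, or -1 on
     * a conversion error.  Combining marks have width 0, so "a" followed
     * by U+0301 counts as one column, not two. */
    static int
    display_width(const char *s)
    {
        size_t   n = mbstowcs(NULL, s, 0);  /* length in wide characters */
        wchar_t *ws;
        int      width;

        if (n == (size_t) -1)
            return -1;
        ws = malloc((n + 1) * sizeof(wchar_t));
        if (ws == NULL)
            return -1;
        mbstowcs(ws, s, n + 1);
        width = wcswidth(ws, n);
        free(ws);
        return width;
    }

    int
    main(void)
    {
        setlocale(LC_CTYPE, "");            /* needs a UTF-8 locale */
        printf("%d\n", display_width("\x61\xcc\x81"));  /* prints 1 */
        return 0;
    }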
> > > - length() on the server side doesn't handle correctly Unicode [I
> > >   have the same result with char_length()], and returns the number
> > >   of chars (as it is however advertised to do), rather the length
> > >   of the string.
> >
> > This is a known limitation.
>
> To solve this, we could use wcwidth() (there is a custom
> implementation for the systems which don't have it in the glibc). I'll
> have a look at it later.

And wcwidth() depends on the locale. That is another reason we could
not use it.

> As Oleg suggested, I will try to aim for 7.3, first with a version in
> contrib, and later, if the implementation is fine, it could be moved
> to the core (or not ? Though it would be nice to make sure every
> PostgreSQL installation which supports unicode has it, so that users
> won't need to have administrative rights to use the functionality).

I would like to see SQL99's charset, collate functionality for 7.3 (or
later). If this happens, the current multibyte implementation would be
dramatically changed. That would be a good time to merge your Unicode
stuff into the main source tree.

--
Tatsuo Ishii
> I would like to see SQL99's charset, collate functionality for 7.3 (or
> later). If this happens, current multibyte implementation would be
> dramatically changed...

I'm *still* interested in working on this (an old story I know). I'm
working on date/time stuff for 7.2, but hopefully 7.3 will see some
advances in the SQL99 direction on charset etc.

                      - Thomas
> > I would like to see SQL99's charset, collate functionality for 7.3 (or
> > later). If this happens, current multibyte implementation would be
> > dramatically changed...
>
> I'm *still* interested in working on this (an old story I know). I'm
> working on date/time stuff for 7.2, but hopefully 7.3 will see some
> advances in the SQL99 direction on charset etc.

BTW, I see "CHARACTER SET" in gram.y. Does current already support
that syntax?

--
Tatsuo Ishii
> BTW, I see "CHARACTER SET" in gram.y. Does current already support
> that syntax?

Yes and no. gram.y knows about CHARACTER SET, but only for the long
form, the clause is in the wrong position (it precedes the length
specification) and it does not do much useful (it generates a data type
based on the character set name which does not get recognized farther
back). Examples:

thomas=# create table t1 (c varchar(20) character set sql_ascii);
ERROR:  parser: parse error at or near "character"
thomas=# create table t1 (c character varying character set sql_ascii (20));
ERROR:  Unable to locate type name 'varsql_ascii' in catalog

I'm pretty sure I'll get shift/reduce troubles when trying to move that
clause to *after* the length specifier. I'll try to do something with
the syntax for 7.2 once I've finished the date/time stuff.

                      - Thomas
Can someone give me TODO items for this discussion?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Can someone give me TODO items for this discussion?

What about:

	Improve Unicode combined character handling

--
Tatsuo Ishii
> > Can someone give me TODO items for this discussion?
>
> What about:
>
> 	Improve Unicode combined character handling

Done. I can't update the web version because I don't have permission.

Also, have we decided if multibyte should be the configure default now?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Also, have we decided if multibyte should be the configure default now?

Not sure.

Anyway, I have tested LIKE/REGEX queries using current. The queries
executed were:

explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) like 'aaa';

and

explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) ~ 'aaa';

Here is the result:

          no MB        with MB
LIKE      0.09 msec    0.08 msec
REGEX     0.09 msec    0.10 msec

LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.

--
Tatsuo Ishii
If no one can find a case where multibyte is slower, I think we should
enable it by default. Comments?

> > Also, have we decided if multibyte should be the configure default now?
>
> Not sure.
>
> Anyway I have tested LIKE/REGEX query test using current. The query
> executed is:
>
> explain analyze select '0000000 5089 474e...( 16475
> bytes long text containing only 0-9a-z chars) like 'aaa';
>
> and
>
> explain analyze select '0000000 5089 474e...( 16475
> bytes long text containing only 0-9a-z chars) ~ 'aaa';
>
> Here is the result:
>
>           no MB        with MB
> LIKE      0.09 msec    0.08 msec
> REGEX     0.09 msec    0.10 msec
>
> LIKE with MB seemed to be resonably fast, but REGEX with MB seemed a
> little bit slow. Probably this is due the wide character conversion
> overhead.
> --
> Tatsuo Ishii

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> If no one can find a case where multibyte is slower, I think we should
> enable it by default. Comments?

Well, he just did point out such a case:

>>           no MB        with MB
>> LIKE      0.09 msec    0.08 msec
>> REGEX     0.09 msec    0.10 msec

But I agree with your conclusion. If the worst penalty we can find is
that a regex comparison operator is 10% slower, we may as well turn it
on by default. Most people will never notice the difference, and anyone
who really cares can always turn it off again.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > If no one can find a case where multibyte is slower, I think we should
> > enable it by default. Comments?
>
> Well, he just did point out such a case:
>
> >>           no MB        with MB
> >> LIKE      0.09 msec    0.08 msec
> >> REGEX     0.09 msec    0.10 msec
>
> But I agree with your conclusion. If the worst penalty we can find is
> that a regex comparison operator is 10% slower, we may as well turn it
> on by default. Most people will never notice the difference, and anyone
> who really cares can always turn it off again.

But the strange thing is that LIKE is faster, perhaps meaning his
measurements can't even see the difference, or is it because the LIKE
optimization is off for multibyte?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> But the strange thing is that LIKE is faster, perhaps meaning his
> measurements can't even see the difference,

Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
difference at all ...

I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes. I don't trust
fractional-second time measurements on Unix boxen; too much chance of
bogus results due to activity of other processes.

			regards, tom lane
Tatsuo Ishii writes:

> LIKE with MB seemed to be resonably fast, but REGEX with MB seemed a
> little bit slow. Probably this is due the wide character conversion
> overhead.

Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
> Yeah, I suspect there's 10% or more noise in these numbers. But then
> one could read the results as saying we can't reliably measure any
> difference at all ...
>
> I'd feel more confident if the measurements were done using operators
> repeated enough times to yield multiple-second runtimes. I don't
> trust fractional-second time measurements on Unix boxen; too much chance
> of bogus results due to activity of other processes.

Any idea to do that? I tried to do a measurement using something like
"SELECT * FROM t1 WHERE very-long-string-column LIKE 'aaa'", but I'm
afraid the I/O time masks the difference...

--
Tatsuo Ishii
> > LIKE with MB seemed to be resonably fast, but REGEX with MB seemed a
> > little bit slow. Probably this is due the wide character conversion
> > overhead.
>
> Could this conversion be optimized to recognize when it's dealing with a
> single-byte character encoding?

Not sure, will look into...

--
Tatsuo Ishii
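One possible shape of the optimization Peter asks about, sketched as a
small standalone program; the encoding table and helper names below are
invented for the illustration and are not the backend's real data
structures. The idea is simply that when the database encoding's maximum
character length is one byte, byte and character positions coincide, so
the wide-character conversion can be skipped.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-encoding table: the maximum number of bytes one
     * character can occupy.  The backend keeps equivalent information
     * in its encoding tables; the names here are made up. */
    typedef struct
    {
        const char *name;
        int         max_bytes;
    } encoding_info;

    static const encoding_info encodings[] = {
        {"SQL_ASCII", 1},
        {"LATIN1",    1},
        {"EUC_JP",    3},
        {"UNICODE",   3},    /* UTF-8 as supported at the time */
    };

    /* Single-byte encodings never need the conversion. */
    static bool
    needs_wide_conversion(const encoding_info *enc)
    {
        return enc->max_bytes > 1;
    }

    int
    main(void)
    {
        size_t i;

        for (i = 0; i < sizeof(encodings) / sizeof(encodings[0]); i++)
            printf("%-10s convert to wide chars? %s\n", encodings[i].name,
                   needs_wide_conversion(&encodings[i]) ? "yes" : "no");
        return 0;
    }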
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I'd feel more confident if the measurements were done using operators
>> repeated enough times to yield multiple-second runtimes.

> Any idea to do that?

Maybe something like this: declare a plpgsql function that takes two
text parameters and has a body like

	for (i = 0 to a million)
		boolvar := $1 like $2;

Then call it with strings of different lengths and see how the runtime
varies. You need to apply the LIKE to function parameters, else the
system will probably collapse the LIKE operation to a constant...

			regards, tom lane
> Maybe something like this: declare a plpgsql function that takes two
> text parameters and has a body like
>
> 	for (i = 0 to a million)
> 		boolvar := $1 like $2;
>
> Then call it with strings of different lengths and see how the runtime
> varies. You need to apply the LIKE to function parameters, else the
> system will probably collapse the LIKE operation to a constant...

Good idea. I did tests for both LIKE and REGEX using PL/pgSQL functions
(see source code below). Here are the results. What I did was calling
the functions while changing the target strings from 32 bytes to 8192
bytes. Times are all in msec.

(1) LIKE

bytes   Without MB   With MB
32      8121.94      8094.73
64      8167.98      8105.24
128     8151.30      8108.61
256     8090.12      8098.20
512     8111.05      8101.07
1024    8110.49      8099.61
2048    8095.32      8106.00
4096    8094.88      8091.19
8192    8123.02      8121.63

(2) REGEX

bytes   Without MB   With MB
32       117.93       119.47
64       126.41       127.61
128      143.97       146.55
256      180.49       183.69
512      255.53       256.16
1024     410.59       409.22
2048    5176.38      5181.99
4096    6000.82      5627.84
8192    6529.15      6547.10

------------- shell script -------------------
for i in 32 64 128 256 512 1024 2048 4096 8192
do
	psql -c "explain analyze select liketest(a,'aaa') from
	(select substring('very_long_text' from 0 for $i) as a) as a" test
done
------------- shell script -------------------

------------- functions -----------------
drop function liketest(text,text);
create function liketest(text,text) returns bool as '
declare
	i int;
	rtn boolean;
begin
	i := 1000000;
	while i > 0 loop
		rtn := $1 like $2;
		i := i - 1;
	end loop;
	return rtn;
end;
' language 'plpgsql';

drop function regextest(text,text);
create function regextest(text,text) returns bool as '
declare
	i int;
	rtn boolean;
begin
	i := 10000;
	while i > 0 loop
		rtn := $1 ~ $2;
		i := i - 1;
	end loop;
	return rtn;
end;
' language 'plpgsql';
------------- functions -----------------
> ------------- shell script -------------------
> for i in 32 64 128 256 512 1024 2048 4096 8192
> do
> 	psql -c "explain analyze select liketest(a,'aaa') from
> 	(select substring('very_long_text' from 0 for $i) as a) as a" test
> done
> ------------- shell script -------------------

I don't think your search string is sufficient for a test.
With 'aaa' it actually knows that it only needs to look at the
first three characters of a. Imho you need to try something
like liketest(a,'%aaa%').

Andreas
> I don't think your search string is sufficient for a test.
> With 'aaa' it actually knows that it only needs to look at the
> first three characters of a. Imho you need to try something
> like liketest(a,'%aaa%').

Ok. I ran the modified test (now the iteration count is reduced to
100000 in liketest()). As you can see, there's a huge difference. MB
seems up to ~8 times slower :-< There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...

MB+LIKE
Total runtime: 1321.58 msec
Total runtime: 1718.03 msec
Total runtime: 2519.97 msec
Total runtime: 4187.05 msec
Total runtime: 7629.24 msec
Total runtime: 14456.45 msec
Total runtime: 17320.14 msec
Total runtime: 17323.65 msec
Total runtime: 17321.51 msec

noMB+LIKE
Total runtime: 964.90 msec
Total runtime: 993.09 msec
Total runtime: 1057.40 msec
Total runtime: 1192.68 msec
Total runtime: 1494.59 msec
Total runtime: 2078.75 msec
Total runtime: 2328.77 msec
Total runtime: 2326.38 msec
Total runtime: 2330.53 msec

--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> ... There seems some problems existing in the
> implementation. Considering REGEX is not so slow, maybe we should
> employ the same design as REGEX. i.e. using wide charcters, not
> multibyte streams...

Seems like a good thing to put on the to-do list. In the meantime, we
still have the question of whether to enable multibyte in the default
configuration. I'd still vote YES, as these results seem to me to
demonstrate that there is no wide-ranging performance penalty. A
problem confined to LIKE on long strings isn't a showstopper IMHO.

			regards, tom lane
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > ... There seems some problems existing in the
> > implementation. Considering REGEX is not so slow, maybe we should
> > employ the same design as REGEX. i.e. using wide charcters, not
> > multibyte streams...
>
> Seems like a good thing to put on the to-do list. In the meantime,
> we still have the question of whether to enable multibyte in the
> default configuration. I'd still vote YES, as these results seem
> to me to demonstrate that there is no wide-ranging performance penalty.
> A problem confined to LIKE on long strings isn't a showstopper IMHO.

As I said, with a valid not anchored LIKE expression the performance
difference was substantial; even for shorter strings it was 37%.
The test with "like 'aaa'" was not a good test case, and we should
not deduce anything from that.

Andreas
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > ... There seems some problems existing in the
> > implementation. Considering REGEX is not so slow, maybe we should
> > employ the same design as REGEX. i.e. using wide charcters, not
> > multibyte streams...
>
> Seems like a good thing to put on the to-do list. In the meantime,
> we still have the question of whether to enable multibyte in the
> default configuration. I'd still vote YES, as these results seem
> to me to demonstrate that there is no wide-ranging performance penalty.
> A problem confined to LIKE on long strings isn't a showstopper IMHO.

Added to TODO:

	* Use wide characters to evaluate regular expressions, for performance
	  (Tatsuo)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Added to TODO:
> 	* Use wide characters to evaluate regular expressions, for performance
> 	  (Tatsuo)

Regexes are fine; it's LIKE that's slow.

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Added to TODO:
> > 	* Use wide characters to evaluate regular expressions, for performance
> > 	  (Tatsuo)
>
> Regexes are fine; it's LIKE that's slow.

Oops, thanks. Changed to:

	* Use wide characters to evaluate LIKE, for performance (Tatsuo)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane writes:

> In the meantime, we still have the question of whether to enable
> multibyte in the default configuration.

This would make more sense if all of multibyte, locale, and NLS became
defaults in one release. I haven't quite sold people on the second item
yet, although I have a design how to do that (see below). And the
third, well, who knows...

Perhaps we could make it a release goal for 7.3 to

* Optimize i18n stuff to have a minimal performance penalty when it's
  not used. (locale=C etc.)

* Make i18n stuff sufficiently well-behaved to make it the default.
  (Especially, add initdb options and GUC parameters to set the locale.
  Don't rely on environment variables -- too complicated.)

Meanwhile, quadratic performance penalties (or so it seems) for LIKE
expressions aren't exactly a "minor" problem.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> In the meantime, we still have the question of whether to enable
>> multibyte in the default configuration.

> Perhaps we could make it a release goal for 7.3

Yeah, that's probably the best way to proceed ... it's awfully late in
the 7.2 cycle to be deciding to do this now ...

			regards, tom lane
> Tom Lane writes:
>
> > In the meantime, we still have the question of whether to enable
> > multibyte in the default configuration.
>
> This would make more sense if all of multibyte, locale, and NLS became
> defaults in one release. I haven't quite sold people on the second item
> yet, although I have a design how to do that (see below). And the third,
> well who knows...
>
> Perhaps we could make it a release goal for 7.3 to
>
> * Optimize i18n stuff to have a minimal performance penalty when it's
>   not used. (locale=C etc.)
>
> * Make i18n stuff sufficiently well-behaved to make it the default.
>   (Especially, add initdb options and GUC parameters to set the locale.
>   Don't rely on environment variables -- too complicated.)
>
> Meanwhile, quadratic performance penalties (or so it seems) for LIKE
> expressions aren't exactly a "minor" problem.

Added to TODO:

	* Optimize locale to have minimal performance impact when not used
	  (Peter E)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Ok. I ran the modified test (now the iteration is reduced to 100000 in
> liketest()). As you can see, there's huge difference. MB seems up to
> ~8 times slower:-< There seems some problems existing in the
> implementation. Considering REGEX is not so slow, maybe we should
> employ the same design as REGEX. i.e. using wide charcters, not
> multibyte streams...
>
> MB+LIKE
> Total runtime: 1321.58 msec
> Total runtime: 1718.03 msec
> Total runtime: 2519.97 msec
> Total runtime: 4187.05 msec
> Total runtime: 7629.24 msec
> Total runtime: 14456.45 msec
> Total runtime: 17320.14 msec
> Total runtime: 17323.65 msec
> Total runtime: 17321.51 msec
>
> noMB+LIKE
> Total runtime: 964.90 msec
> Total runtime: 993.09 msec
> Total runtime: 1057.40 msec
> Total runtime: 1192.68 msec
> Total runtime: 1494.59 msec
> Total runtime: 2078.75 msec
> Total runtime: 2328.77 msec
> Total runtime: 2326.38 msec
> Total runtime: 2330.53 msec

I did some trials with a wide character implementation and saw
virtually no improvement. My guess is that the logic employed in LIKE
is too simple to hide the overhead of the multibyte and wide character
conversion. The reason why REGEX with MB is not so slow would be the
complexity of its logic, I think. As you can see in my previous
postings, the $1 ~ $2 operation (which is logically the same as a LIKE
'%a%') is, for example, almost 80 times slower than LIKE (remember that
liketest() loops 10 times more than regextest()).

So I decided to use a completely different approach. Now LIKE has two
matching engines, one for single byte encodings (MatchText()), the
other for multibyte ones (MBMatchText()). MatchText() is identical to
the non-MB version of it, so there is virtually no performance penalty
for single byte encodings. MBMatchText() is for multibyte encodings and
is identical to the one used in 7.1.

Here is the MB case result with SQL_ASCII encoding:

Total runtime: 901.69 msec
Total runtime: 939.08 msec
Total runtime: 993.60 msec
Total runtime: 1148.18 msec
Total runtime: 1434.92 msec
Total runtime: 2024.59 msec
Total runtime: 2288.50 msec
Total runtime: 2290.53 msec
Total runtime: 2316.00 msec

To accomplish this, I moved MatchText etc. to a separate file and now
like.c includes it *twice* (a similar technique is used in regexec()).
This makes like.o a little bit larger, but I believe this is worth it
for the optimization.

--
Tatsuo Ishii
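For readers unfamiliar with the trick Tatsuo describes, here is a
compilable toy version. The real change generates the two engines by
#including one template file twice with different macros defined; the
sketch below collapses that into a single macro so it fits in one file,
and every identifier in it (CharCountSB, CharCountMB, sb_len, mb_len)
is invented for the illustration rather than taken from like.c.

    #include <stdio.h>

    /* One body, written once; expanded twice below into a single-byte
     * engine and a multibyte engine. */
    #define DEFINE_CHAR_COUNTER(NAME, CHARLEN)      \
    static int NAME(const char *s)                  \
    {                                               \
        int n = 0;                                  \
        while (*s)                                  \
        {                                           \
            s += CHARLEN(s);                        \
            n++;                                    \
        }                                           \
        return n;                                   \
    }

    /* single-byte rule: every byte is one character */
    static int sb_len(const char *s) { (void) s; return 1; }

    /* multibyte rule: bytes in one UTF-8 sequence, from its lead byte
     * (assumes well-formed input) */
    static int mb_len(const char *s)
    {
        unsigned char c = (unsigned char) *s;
        if (c < 0x80) return 1;
        if (c < 0xe0) return 2;
        if (c < 0xf0) return 3;
        return 4;
    }

    DEFINE_CHAR_COUNTER(CharCountSB, sb_len)   /* role of MatchText()   */
    DEFINE_CHAR_COUNTER(CharCountMB, mb_len)   /* role of MBMatchText() */

    int main(void)
    {
        const char *s = "\x61\xcc\x81";        /* 'a' + combining acute */
        printf("bytes: %d, UTF-8 characters: %d\n",
               CharCountSB(s), CharCountMB(s));    /* prints 3 and 2 */
        return 0;
    }

The point is the same either way: one matching body is compiled twice,
once advancing byte by byte and once advancing a whole multibyte
character at a time, so the single-byte path pays no conversion
overhead, which is why the SQL_ASCII numbers above come back down to
the no-MB level.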
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> To accomplish this, I moved MatchText etc. to a separate file and now
> like.c includes it *twice* (similar technique used in regexec()). This
> makes like.o a little bit larger, but I believe this is worth for the
> optimization.

That sounds great.

What's your feeling now about the original question: whether to enable
multibyte by default now, or not? I'm still thinking that Peter's
counsel is the wisest: plan to do it in 7.3, not today. But this fix
seems to eliminate the only hard reason we have not to do it today ...

			regards, tom lane
> What's your feeling now about the original question: whether to enable
> multibyte by default now, or not? I'm still thinking that Peter's
> counsel is the wisest: plan to do it in 7.3, not today. But this fix
> seems to eliminate the only hard reason we have not to do it today ...

If SQL99's i18n stuff were added in 7.3, it would mean both the
multibyte support and the locale support might disappear, and the two
might be merged into it (not sure about NLS -- is that compatible with,
for example, per-column charsets?). So, for non-multibyte users, the
multibyte support would be a one-release-only functionality: it
suddenly appears in 7.2 and disappears in 7.3. I'm afraid this would
cause more trouble than usefulness. What do you think?

--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> What do you think?

I think that we were supposed to go beta a month ago, and so this is
no time to start adding new features to this release. Let's plan to
make this happen (one way or the other) in 7.3, instead.

			regards, tom lane
> I think that we were supposed to go beta a month ago, and so this is
> no time to start adding new features to this release. Let's plan to
> make this happen (one way or the other) in 7.3, instead.

Agreed.

--
Tatsuo Ishii
> > Ok. I ran the modified test (now the iteration is reduced to 100000 in
> > liketest()). As you can see, there's huge difference. MB seems up to
> > ~8 times slower:-< There seems some problems existing in the
> > implementation. Considering REGEX is not so slow, maybe we should
> > employ the same design as REGEX. i.e. using wide charcters, not
> > multibyte streams...

Let me add I think our regex code is very slow. It is the standard BSD
regex library by Henry Spencer. He rewrote it a few years ago for TCL
8.X and said he was working on a standalone library version. I have
asked him several times via email over the years but he still has not
released a standalone version of the new optimized regex code. It is on
our TODO list.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
>
> Can I ask about the status of this?

I have sent a patch a few days ago solving the client-side issue (on
the pgsql-patches mailing list) for review. I think Tatsuo said it
looked OK, however he should confirm/infirm this.

There is still the issue about unicode characters which have code
points above U00FFFF, which probably should be rejected on the server
side. I have yet to update my patch for that. I'll probably do that
tomorrow, as I don't have more time tonight, but I think this will be
trivial, so maybe Tatsuo can do it, if he has some time before that :)

If there are other issues, I'd like to know :)

Regarding the implementation of Unicode functionality (normalisation,
collation, Unicode-aware regexes, uc/lc/tc (title-case) functions, ...)
on the server side, it's definitely something for 7.3 (though it might
be available sooner). It will probably be just a contributed extension
first. I'm currently making an alpha version of the project I'm working
on in order to have sufficient "real-life" Unicode data to work with,
and make sure the design choices make sense :)

Patrice.

BTW, I tried to find web-accessible archives of pgsql-patches, are
there some, or should each and every discussion be followed up on
pgsql-hackers (even though the description for pgsql-patches includes
discussions on patches) ?

> > Hi all,
> >
> > while working on a new project involving PostgreSQL and making some
> > tests, I have come up with the following output from psql :
> >
> > [...]

--
Patrice Hédé   email: patrice hede à islande org
www : http://www.islande.org/
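The check Patrice mentions (rejecting code points above U+FFFF) could
look roughly like the standalone sketch below; this is an illustration,
not the actual patch. It assumes well-formed UTF-8 and relies on the
fact that, at the time, the server's UTF-8 support only covered
characters encoded in up to three bytes, so any four-byte lead byte
marks a character above U+FFFF.

    #include <stdio.h>

    /* Returns 1 if every character in the (well-formed) UTF-8 string is
     * at most U+FFFF, i.e. encoded in at most three bytes; returns 0 as
     * soon as a four-byte lead byte (U+10000 and above) is seen. */
    static int
    utf8_within_bmp(const unsigned char *s)
    {
        while (*s)
        {
            if (*s < 0x80)
                s += 1;             /* ASCII           */
            else if ((*s & 0xe0) == 0xc0)
                s += 2;             /* U+0080 - U+07FF */
            else if ((*s & 0xf0) == 0xe0)
                s += 3;             /* U+0800 - U+FFFF */
            else
                return 0;           /* U+10000 and up: reject */
        }
        return 1;
    }

    int
    main(void)
    {
        /* "á" (two bytes) passes; U+1D11E (a four-byte character) does not */
        printf("%d\n", utf8_within_bmp((const unsigned char *) "\xc3\xa1"));          /* 1 */
        printf("%d\n", utf8_within_bmp((const unsigned char *) "\xf0\x9d\x84\x9e"));  /* 0 */
        return 0;
    }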
Can I ask about the status of this?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> * Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
> >
> > Can I ask about the status of this?
>
> I sent a patch a few days ago solving the client-side issue (on the
> pgsql-patches mailing list) for review. I think Tatsuo said it looked
> OK, but he should confirm or deny this.

I've been waiting for Peter's opinion. His understanding of psql is much
better than mine.

> There is still the issue of Unicode characters whose code points are
> above U+FFFF, which should probably be rejected on the server side. I
> have yet to update my patch for that. I'll probably do it tomorrow, as
> I don't have more time tonight, but I think it will be trivial, so
> maybe Tatsuo can do it if he has some time before then :)

Rejecting characters over U+FFFF is considered a bug fix, which means we
can still fix it after the beta test begins :-)

BTW, have you tried my updates for supporting ISO 8859 characters?
Please let me know if you have trouble, especially with the "euro". I
don't understand any of these charsets...
--
Tatsuo Ishii
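For illustration, here is a minimal standalone C sketch of the check
being discussed: in UTF-8, any character at or above U+10000 is encoded
on four bytes, so it is enough to inspect the lead byte of each
sequence. The file and function names are made up for the example and
are not the code from the patch; continuation bytes are not validated
here.

    /* utf8_reject_astral.c -- illustrative sketch, not the backend code.
     * Reports whether a UTF-8 string contains a character encoded on four
     * or more bytes, i.e. a code point >= U+10000.
     */
    #include <stdio.h>
    #include <stddef.h>

    static int
    utf8_char_len(unsigned char lead)
    {
        if (lead < 0x80) return 1;           /* ASCII */
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4; /* code point >= U+10000 */
        return -1;                           /* invalid lead byte */
    }

    /* Returns 1 if every character is below U+10000, 0 otherwise
     * (a four-byte or invalid sequence was found). */
    static int
    utf8_below_10000(const unsigned char *s, size_t len)
    {
        size_t i = 0;

        while (i < len)
        {
            int l = utf8_char_len(s[i]);

            if (l < 0 || l > 3 || i + l > len)
                return 0;
            i += l;
        }
        return 1;
    }

    int
    main(void)
    {
        /* U+1D11E MUSICAL SYMBOL G CLEF, encoded on four bytes */
        const unsigned char clef[]    = { 0xF0, 0x9D, 0x84, 0x9E };
        /* a + U+0301 combining acute, both below U+10000 */
        const unsigned char a_acute[] = { 0x61, 0xCC, 0x81 };

        printf("clef accepted:    %d\n", utf8_below_10000(clef, 4));    /* 0 */
        printf("a+acute accepted: %d\n", utf8_below_10000(a_acute, 3)); /* 1 */
        return 0;
    }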
> * Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
> >
> > Can I ask about the status of this?
>
> I sent a patch a few days ago solving the client-side issue (on the
> pgsql-patches mailing list) for review. I think Tatsuo said it looked
> OK, but he should confirm or deny this.

OK, I saw the client encoding function appear today. I assume Tatsuo and
friends will finish this up.

> Regarding the implementation of Unicode functionality (normalisation,
> collation, Unicode-aware regexes, uc/lc/tc (title-case) functions, ...)
> on the server side, it's definitely something for 7.3 (though it might
> be available sooner). It will probably be just a contributed extension
> first. I'm currently making an alpha version of the project I'm working
> on in order to have sufficient "real-life" Unicode data to work with,
> and to make sure the design choices make sense :)

If you would like to add some TODO items, please let me know. It is good
to document them even if you can't get to them for a while.

> BTW, I tried to find web-accessible archives of pgsql-patches. Are
> there any, or should every discussion be followed up on pgsql-hackers
> (even though the description of pgsql-patches includes discussions on
> patches)?

I use:

    http://fts.postgresql.org/db/mw/

You can discuss on patches or hackers. I usually do patch discussion on
patches unless I need a larger audience for comments.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
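As a rough illustration of what the server-side normalisation function
mentioned above has to do, the toy program below decomposes a single
character: U+00E1 (precomposed á, bytes c3 a1) into U+0061 U+0301
(a + combining acute, bytes 61 cc 81). It is not how a real
implementation would work: NFC/NFD need the full decomposition and
combining-class data from UnicodeData.txt, which is exactly the
data-size question raised earlier in the thread.

    /* toy_normalize.c -- illustration only: a "decomposition table"
     * reduced to one entry. */
    #include <stdio.h>
    #include <string.h>

    static const char *precomposed = "\xc3\xa1";     /* U+00E1 */
    static const char *decomposed  = "\x61\xcc\x81"; /* U+0061 U+0301 */

    /* Replace every precomposed form by base char + combining mark. */
    static void
    to_nfd(const char *in, char *out, size_t outsize)
    {
        size_t o = 0;

        while (*in && o + 4 < outsize)
        {
            if (strncmp(in, precomposed, 2) == 0)
            {
                memcpy(out + o, decomposed, 3);
                o += 3;
                in += 2;
            }
            else
                out[o++] = *in++;
        }
        out[o] = '\0';
    }

    int
    main(void)
    {
        char        buf[64];
        const char *p;

        to_nfd("\xc3\xa1leit", buf, sizeof(buf));  /* "áleit", precomposed */
        printf("NFD bytes:");
        for (p = buf; *p; p++)
            printf(" %02x", (unsigned char) *p);   /* 61 cc 81 6c 65 69 74 */
        printf("\n");
        return 0;
    }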
I have committed part of Patrice's patches with minor fixes. The
uncommitted changes are related to the backend side; the reasons can be
found in the previous discussions (basically, the current regex code
does not support UTF-8 characters >= 0x10000). Instead, pg_verifymbstr()
now rejects UTF-8 characters >= 0x10000.
--
Tatsuo Ishii

> Hi,
>
> I should have sent the patch earlier, but got delayed by other stuff.
> Anyway, here is the patch:
>
> - most of the functionality is only activated when MULTIBYTE is
>   defined,
>
> - it checks for valid UTF-8 characters, client-side only for now, and
>   only on output; you can still send invalid UTF-8 to the server (so
>   it's only partly compliant with Unicode 3.1, but that's better than
>   nothing).
>
> - it formats with the correct number of columns (that's why I made it
>   in the first place, after all), but only for UNICODE. However, the
>   code allows plugging in routines for other encodings, as Tatsuo did
>   for the other multibyte functions.
>
> - it slightly corrects Tatsuo's UTF-8 code to allow Unicode 3.1
>   characters (characters with values >= 0x10000, which are encoded on
>   four bytes).
>
> - it doesn't depend on the locale capabilities of glibc (useful for
>   remote telnet).
>
> I would like somebody to check it closely, as it is my first patch to
> pgsql. Also, I created dummy .orig files so that the two files I
> created are included; I hope that's the right way.
>
> Now, a lot of functionality is NOT included here, but I will keep that
> for 7.3 :) That includes all string checking on the server side (which
> will have to be a bit more optimised ;) ), and the input checking on
> the client side for UTF-8, though that should not be difficult: it's
> just a matter of sending the strings through mbvalidate() before
> sending them to the server. Strong checking of UTF-8 strings is
> mandatory to be compliant with Unicode 3.1+.
>
> Do I have time to look at a patch to include iso-8859-15 for 7.2? The
> euro is coming on 1 January 2002 (before 7.3!) and over 280 million
> people in Europe will need the euro sign, and only iso-8859-15 and
> iso-8859-16 have it (and unfortunately, I don't think all Unices will
> switch to Unicode in the meantime)...
>
> err... yes, I know that not every single person in Europe uses
> PostgreSQL, so it's not exactly 280m, but it's just a matter of
> time! ;)
>
> I'll come back (on pgsql-hackers) later to ask a few questions
> regarding the full Unicode support (normalisation, collation,
> regexes, ...) on the server side :)
>
> Here is the patch!
>
> Patrice.
>
> --
> Patrice HÉDÉ ------------------------------- patrice à islande org -----
> ------------------------------------------ http://www.islande.org/ -----
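The column-width problem the patch addresses can be sketched in a few
lines of standalone C: decode each UTF-8 character and count zero
columns for combining marks, so that "a" + U+0301 occupies one column.
The sketch below treats only the U+0300..U+036F block as combining and
ignores double-width CJK characters, so it is a simplification of what
psql actually needs; the names are illustrative and not taken from the
patch, and invalid input is not handled.

    /* utf8_width.c -- simplified column-width sketch for UTF-8 text. */
    #include <stdio.h>
    #include <stddef.h>

    /* Decode one UTF-8 character, return its code point and advance *i. */
    static unsigned int
    utf8_decode(const unsigned char *s, size_t *i)
    {
        unsigned char c = s[*i];
        unsigned int  cp;

        if (c < 0x80)
        {
            *i += 1;
            return c;
        }
        if ((c & 0xE0) == 0xC0)
        {
            cp = (c & 0x1F) << 6 | (s[*i + 1] & 0x3F);
            *i += 2;
            return cp;
        }
        if ((c & 0xF0) == 0xE0)
        {
            cp = (c & 0x0F) << 12 | (s[*i + 1] & 0x3F) << 6 | (s[*i + 2] & 0x3F);
            *i += 3;
            return cp;
        }
        /* four-byte form, code point >= U+10000 */
        cp = (c & 0x07) << 18 | (s[*i + 1] & 0x3F) << 12
           | (s[*i + 2] & 0x3F) << 6 | (s[*i + 3] & 0x3F);
        *i += 4;
        return cp;
    }

    /* Terminal columns used by a UTF-8 string (combining marks = 0). */
    static int
    utf8_columns(const unsigned char *s, size_t len)
    {
        size_t i = 0;
        int    cols = 0;

        while (i < len)
        {
            unsigned int cp = utf8_decode(s, &i);

            if (cp >= 0x0300 && cp <= 0x036F)   /* combining diacriticals */
                continue;                       /* zero columns */
            cols++;
        }
        return cols;
    }

    int
    main(void)
    {
        /* "álíta" in decomposed form: 7 characters, 9 bytes, 5 columns. */
        const unsigned char word[] = "a\xcc\x81li\xcc\x81ta";

        printf("columns: %d\n", utf8_columns(word, sizeof(word) - 1));  /* 5 */
        return 0;
    }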