Thread: Unicode upper() bug still present

Unicode upper() bug still present

From

Oliver Elphick

Date:

19 October 2003, 20:56:19

There is a bug in Unicode upper() which has been present since 7.2:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=139389

I had thought I had reported it before, but I can't find a record of it.

The attached Perl script illustrates the bug (the script needs DBI).

--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight, UK                             http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839  932A 614D 4C34 3E1D 0C1C
                 ========================================
     "For the LORD God is a sun and shield; the LORD will
      give grace and glory; no good thing will he withhold
      from them that walk uprightly."        Psalms 84:11

Attachment

Re: Unicode upper() bug still present

From

Tom Lane

Date:

19 October 2003, 21:36:25

Oliver Elphick <olly@lfix.co.uk> writes:
> There is a bug in Unicode upper() which has been present since 7.2:

We don't support upper/lower in multibyte character sets, and can't as
long as the functionality is dependent on <ctype.h>'s toupper()/tolower().
It's been suggested that we could use <wctype.h> where available.
However there are a bunch of issues that would have to be solved to make
that happen.  (How do we convert between the database character encoding 
and the wctype representation?  How do we even find out what
representation the current locale setting expects to use?)

In short, don't hold your breath ...
        regards, tom lane
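
A minimal sketch of the byte-versus-character problem described above. Python is used here only for illustration; this is not the backend's C code, and the helper names are hypothetical. `bytes.upper()` stands in for a byte-at-a-time `toupper()` loop that knows only ASCII, so every byte of a multibyte UTF-8 sequence passes through unchanged.

```python
def upper_bytewise(text: str) -> str:
    """Uppercase the UTF-8 bytes one byte at a time (ASCII letters only)."""
    raw = text.encode("utf-8")
    # bytes.upper() changes only the bytes for ASCII a-z. Every byte of a
    # multibyte UTF-8 sequence is >= 0x80 and is left untouched.
    return raw.upper().decode("utf-8")


def upper_by_character(text: str) -> str:
    """Uppercase decoded characters (code points) instead of raw bytes."""
    return text.upper()


sample = "as šõä"
bytewise = upper_bytewise(sample)          # only "as" is converted
by_character = upper_by_character(sample)  # all five letters are converted
```

Working on decoded characters is roughly what a `<wctype.h>`-based approach would do, which is why the conversion between the database encoding and the wide-character representation matters.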

Re: Unicode upper() bug still present

From

Hannu Krosing

Date:

20 October 2003, 04:24:10

Tom Lane wrote on Mon, 20.10.2003 at 03:35:
> Oliver Elphick <olly@lfix.co.uk> writes:
> > There is a bug in Unicode upper() which has been present since 7.2:
>
> We don't support upper/lower in multibyte character sets, and can't as
> long as the functionality is dependent on <ctype.h>'s toupper()/tolower().
> It's been suggested that we could use <wctype.h> where available.
> However there are a bunch of issues that would have to be solved to make
> that happen.  (How do we convert between the database character encoding
> and the wctype representation?

How do we do it for sorting ?

> How do we even find out what
> representation the current locale setting expects to use?)

Why not use the same locale settings as for sorting (i.e. database
encoding) until we have proper multi-locale support in the backend?

It seems inconsistent that we do use locale-aware sorts but not
upper/lower.

This is for a UNICODE database using locale et_EE.UTF-8:

ucdb=# select t, upper(t) from tt order by 1;
 t | upper
---+-------
 a | A
 s | S
 Š | Š
 š | š
 Õ | Õ
 õ | õ
 Ä | Ä
 ä | ä
(8 rows)

As you can see, the sort order is right, but only some characters are
converted and others are not; the result is a complete mess ;(

-------------------
Hannu

Re: Unicode upper() bug still present

From

Tatsuo Ishii

Date:

20 October 2003, 09:37:59

> Tom Lane wrote on Mon, 20.10.2003 at 03:35:
> > Oliver Elphick <olly@lfix.co.uk> writes:
> > > There is a bug in Unicode upper() which has been present since 7.2:
> > 
> > We don't support upper/lower in multibyte character sets, and can't as
> > long as the functionality is dependent on <ctype.h>'s toupper()/tolower().
> > It's been suggested that we could use <wctype.h> where available.
> > However there are a bunch of issues that would have to be solved to make
> > that happen.  (How do we convert between the database character encoding 
> > and the wctype representation?  
> 
> How do we do it for sorting ?
> 
> > How do we even find out what
> > representation the current locale setting expects to use?)
> 
> Why not use the same locale settings as for sorting (i.e. database
> encoding) until we have proper multi-locale support in the backend?

There's absolutely no relationship between database encoding and
locale. IMO depending on the system locale is a completely wrong
design decision, and we should move toward having our own collation
data.  (I think Oracle does it this way.)
--
Tatsuo Ishii

Re: Unicode upper() bug still present

From

Hannu Krosing

Date:

20 October 2003, 09:44:10

Tatsuo Ishii wrote on Mon, 20.10.2003 at 15:37:
> > Tom Lane wrote on Mon, 20.10.2003 at 03:35:
> > > Oliver Elphick <olly@lfix.co.uk> writes:
> > > > There is a bug in Unicode upper() which has been present since 7.2:
> > > 
> > > We don't support upper/lower in multibyte character sets, and can't as
> > > long as the functionality is dependent on <ctype.h>'s toupper()/tolower().
> > > It's been suggested that we could use <wctype.h> where available.
> > > However there are a bunch of issues that would have to be solved to make
> > > that happen.  (How do we convert between the database character encoding 
> > > and the wctype representation?  
> > 
> > How do we do it for sorting ?
> > 
> > > How do we even find out what
> > > representation the current locale setting expects to use?)
> > 
> > Why not use the same locale settings as for sorting (i.e. database
> > encoding) until we have proper multi-locale support in the backend?
> 
> There's absolutely no relationship between database encoding and
> locale. 

How does the system then use the locale for sorting but not for
upper/lower?

I would rather have expected the opposite, as lower/upper rules are a
little more locale-independent than collation.

> IMO depending on the system locale is a completely wrong
> design decision, and we should move toward having our own collation
> data.

I agree completely. We could probably lift something from IBM's ICU.

-----------------
Hannu

Re: Unicode upper() bug still present

From

Tom Lane

Date:

20 October 2003, 10:07:15

Hannu Krosing <hannu@tm.ee> writes:
>> It's been suggested that we could use <wctype.h> where available.
>> However there are a bunch of issues that would have to be solved to make
>> that happen.  (How do we convert between the database character encoding 
>> and the wctype representation?  

> How do we do it for sorting ?

We don't --- strcoll() handles it all internally.

> It seems inconsistent that we do use locale-aware sorts but not
> upper/lower.

We do have locale-aware upper/lower ... but only in single-byte
encodings.  I think it works for the 7-bit-ASCII subset of multibyte
encodings, too.
        regards, tom lane
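
A sketch of why a per-byte table is enough for single-byte encodings. It assumes Latin-1 as the example encoding and uses Python's own case tables in place of the C library's `toupper()`. The names `build_upper_table` and `upper_single_byte` are made up for this illustration.

```python
def build_upper_table(encoding: str = "latin-1") -> bytes:
    """Build a 256-entry byte-to-byte uppercase table for a single-byte encoding."""
    table = bytearray(range(256))
    for byte in range(256):
        try:
            upper = bytes([byte]).decode(encoding).upper().encode(encoding)
        except UnicodeError:
            continue  # undefined byte, or uppercase form not in this encoding
        if len(upper) == 1:  # skip one-to-many mappings such as sharp s -> SS
            table[byte] = upper[0]
    return bytes(table)


LATIN1_UPPER = build_upper_table()


def upper_single_byte(raw: bytes, table: bytes = LATIN1_UPPER) -> bytes:
    """Apply the table byte by byte, the way a toupper() loop would."""
    return raw.translate(table)


latin1_result = upper_single_byte("õä".encode("latin-1"))  # converted
utf8_result = upper_single_byte("õä".encode("utf-8"))      # left unchanged
ascii_result = upper_single_byte(b"as")                    # converted
```

In a single-byte encoding one byte is one character, so the table is a complete case mapping. Applied to UTF-8 bytes, the same table sees only fragments of characters, and only the 7-bit ASCII subset comes out right.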

Re: Unicode upper() bug still present

From

Tom Lane

Date:

20 October 2003, 10:19:25

Hannu Krosing <hannu@tm.ee> writes:
> Tatsuo Ishii wrote on Mon, 20.10.2003 at 15:37:
>> There's absolutely no relationship between database encoding and
>> locale. 

> How does the system then use locale for sorting and not for upper/lower
> ?

LC_COLLATE and LC_CTYPE are independent settings.  But in any case
Tatsuo is correct about the long-term direction we need to take ---
in order to come anywhere near SQL-standard behavior, we have to support
multiple locales simultaneously, and that means that the standard C
library's API isn't gonna do it.
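
A small sketch of the independence of the two categories, using Python's `locale` module as a thin wrapper over the C library's `setlocale()`. It uses only the "C" locale, which should exist everywhere; the variable names are illustrative.

```python
import locale

# Query LC_CTYPE without changing it (calling setlocale with no locale argument).
ctype_before = locale.setlocale(locale.LC_CTYPE)

# Change only the collation category. This is process-wide state, which is
# why this API cannot serve several locales at the same time.
locale.setlocale(locale.LC_COLLATE, "C")

ctype_after = locale.setlocale(locale.LC_CTYPE)
collate_now = locale.setlocale(locale.LC_COLLATE)

# Sorting through strxfrm() follows LC_COLLATE. In the "C" locale that is
# plain code order, so uppercase ASCII sorts before lowercase.
words = sorted(["b", "a", "B"], key=locale.strxfrm)
```

Changing LC_COLLATE leaves LC_CTYPE as it was, so sorting and case mapping can follow different rules within the same process.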

>> IMO depending on the system locale is a completely wrong
>> design decision, and we should move toward having our own collation
>> data.

I noticed by chance that glibc has a "reentrant locale" API that seems
to allow for efficient access to multiple locales concurrently.  Perhaps
it would be a reasonable solution to support multiple locales only on
machines that have this library.  If we have to write our own locale
support it's likely to be a long time coming :-(
        regards, tom lane

Re: Unicode upper() bug still present

From

Dennis Bjorklund

Date:

20 October 2003, 15:02:26

On Mon, 20 Oct 2003, Tom Lane wrote:

> I noticed by chance that glibc has a "reentrant locale" API that seems
> to allow for efficient access to multiple locales concurrently.

Where have you found this?

I've been looking for that but have not found it. I run a rh9 system, do
you have something newer? Maybe I have just not looked in the right place
in the documentation.

-- 
/Dennis

Re: Unicode upper() bug still present

From

Peter Eisentraut

Date:

20 October 2003, 15:02:53

Tom Lane writes:

> I noticed by chance that glibc has a "reentrant locale" API that seems
> to allow for efficient access to multiple locales concurrently.  Perhaps
> it would be a reasonable solution to support multiple locales only on
> machines that have this library.  If we have to write our own locale
> support it's likely to be a long time coming :-(

Naturally, I cannot promise anything, but this is at the top of my list
for the next release.  I already have sorted out the specifications and
algorithms and collected locale data for most corners of the world, so
it's just the coding left.  Unfortunately, a real, sustainable fix of this
situation requires us to start at the very bottom, namely the character
set conversion interface, then the gettext interface, then the new locale
library, then integrating the per-column granularity into the
parser/planner/executor.  So you may be looking at a two-release process.

-- 
Peter Eisentraut   peter_e@gmx.net

Re: Unicode upper() bug still present

From

Hannu Krosing

Date:

20 October 2003, 15:13:28

Peter Eisentraut wrote on Mon, 20.10.2003 at 21:02:
> Tom Lane writes:
> 
> > I noticed by chance that glibc has a "reentrant locale" API that seems
> > to allow for efficient access to multiple locales concurrently.  Perhaps
> > it would be a reasonable solution to support multiple locales only on
> > machines that have this library.  If we have to write our own locale
> > support it's likely to be a long time coming :-(
> 
> Naturally, I cannot promise anything, but this is at the top of my list
> for the next release.  I already have sorted out the specifications and
> algorithms and collected locale data for most corners of the world, so
> it's just the coding left. 

Have you checked ICU ( http://oss.software.ibm.com/icu/ ) ?

It seems to have all the needed data at least.

> Unfortunately, a real, sustainable fix of this
> situation requires us to start at the very bottom, namely the character
> set conversion interface, then the gettext interface, then the new locale
> library, then integrating the per-column granularity into the
> parser/planner/executor.  So you may be looking at a two-release process.

---------------
Hannu

Re: Unicode upper() bug still present

From

Tom Lane

Date:

20 October 2003, 16:30:58

Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Mon, 20 Oct 2003, Tom Lane wrote:
>> I noticed by chance that glibc has a "reentrant locale" API that seems
>> to allow for efficient access to multiple locales concurrently.

> Where have you found this?

It's present in RH8 --- there is <xlocale.h> plus various other headers
that include it.  Dunno where to look for documentation exactly.
        regards, tom lane

Re: Unicode upper() bug still present

From

Tom Lane

Date:

20 October 2003, 16:38:10

Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> If we have to write our own locale
>> support it's likely to be a long time coming :-(

> Naturally, I cannot promise anything, but this is at the top of my list
> for the next release.  I already have sorted out the specifications and
> algorithms and collected locale data for most corners of the world, so
> it's just the coding left.  Unfortunately, a real, sustainable fix of this
> situation requires us to start at the very bottom,

I'm not sure that "supporting our own locale subsystem" really qualifies
as "sustainable" ... can you give an estimate of how big the code +
supporting data is likely to be?

(Perhaps the size of the corresponding parts of glibc would do as an
estimate --- I don't know that offhand, but surely we could find it
out.)

I agree that depending on the system-provided locale behavior has its
downsides, but it has its upsides too; compatibility with the behavior
of everything else on the machine being one big one.  So the idea of
being able to use glibc where available shouldn't be rejected out of
hand, I think.
        regards, tom lane

Re: Unicode upper() bug still present

From

Peter Eisentraut

Date:

20 October 2003, 17:58:46

Tom Lane writes:

> I'm not sure that "supporting our own locale subsystem" really qualifies
> as "sustainable" ... can you give an estimate of how big the code +
> supporting data is likely to be?

It's not much worse than supporting our own character conversion subsystem
(which, btw., is something we could more likely do without, because the
standard system facilities tend to be quite adequate), and certainly much
less bad than maintaining our own set of translated strings.

For the "ctype" category, you can generate the code straight out of the
Unicode tables, with a handful of hardcoded exceptions (like the Turkish
i).  For the "collate" category we need about 40 kB of language-specific
data files plus a big master data file that is maintained by the Unicode
consortium.  (Those 40 kB correspond to the 22 files I currently have,
which, together with the big default file, cover about 70 languages.)
The other locale categories aren't of interest for string processing.
The code isn't large, but of course someone needs to write it.  The
algorithms are standardized (Unicode collation algorithm) and have several
existing implementations.  So this isn't something that we would need to
maintain in a vacuum.

(Note that I say Unicode a lot here because those people do a lot of
research and standardization in this area, which is available for free,
but this does not constrain the result to work only with the Unicode
character set.)
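
A minimal sketch of the "Unicode tables plus a handful of exceptions" idea for the ctype category. Python's `str.upper()` and `str.lower()` stand in for the generated generic tables, and the exception dictionaries and function names are hypothetical; only the Turkish dotted and dotless i is covered.

```python
# Hypothetical per-language overrides layered on top of the generic mapping.
_UPPER_EXCEPTIONS = {"tr": {"i": "\u0130", "\u0131": "I"}}  # i -> İ, ı -> I
_LOWER_EXCEPTIONS = {"tr": {"I": "\u0131", "\u0130": "i"}}  # I -> ı, İ -> i


def upper_for(text: str, language: str = "") -> str:
    """Uppercase per character, consulting language overrides first."""
    overrides = _UPPER_EXCEPTIONS.get(language, {})
    return "".join(overrides.get(ch, ch.upper()) for ch in text)


def lower_for(text: str, language: str = "") -> str:
    """Lowercase per character, consulting language overrides first.

    Working one character at a time ignores context-dependent rules.
    """
    overrides = _LOWER_EXCEPTIONS.get(language, {})
    return "".join(overrides.get(ch, ch.lower()) for ch in text)


turkish_upper = upper_for("istanbul", "tr")  # starts with dotted capital İ
generic_upper = upper_for("istanbul")        # plain ASCII capitals
turkish_lower = lower_for("ISPARTA", "tr")   # starts with dotless ı
```

The generic path needs no per-language data at all; the language tag only selects the small override table.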

> I agree that depending on the system-provided locale behavior has its
> downsides, but it has its upsides too; compatibility with the behavior
> of everything else on the machine being one big one.  So the idea of
> being able to use glibc where available shouldn't be rejected out of
> hand, I think.

I like to think that in the end we can do much better than the POSIX
framework can do.  For instance, the character classification can have
more useful categories, the case conversion can be context-dependent
(which is a requirement in some languages), and users could more directly
add their own collations or parametrize existing ones (because no one ever
seems to agree on the details).
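
Two small examples of case conversion that a one-character-in, one-character-out interface such as `toupper()` cannot express. This is a sketch using Python's built-in Unicode case mapping, which is not the proposed locale library: Greek capital sigma lowercases differently at the end of a word, and German sharp s uppercases to two letters.

```python
# Greek word written with escapes: capital omicron, delta, omicron, sigma.
word = "\u039f\u0394\u039f\u03a3"

# Whole-string conversion can look at context: the last sigma becomes the
# final form (U+03C2).
lowered_in_context = word.lower()

# Character-at-a-time conversion has no context and always yields the
# ordinary small sigma (U+03C3).
lowered_per_char = "".join(ch.lower() for ch in word)

# One character can map to several: sharp s (U+00DF) uppercases to "SS".
street = "stra\u00dfe".upper()
```

Both cases need an interface that works on whole strings and allows the result length to differ from the input length.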

-- 
Peter Eisentraut   peter_e@gmx.net

Re: Unicode upper() bug still present

From

Tom Lane

Date:

20 October 2003, 18:16:30

Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> I agree that depending on the system-provided locale behavior has its
>> downsides, but it has its upsides too;

> I like to think that in the end we can do much better than the POSIX
> framework can do.

Sure, we can make it work exactly the way we want.  I'm just wondering
whether the benefits are worth the significant expenditure of effort...

But if you want to do the work, I won't hold you back.
        regards, tom lane

Re: Unicode upper() bug still present

From

Karel Zak

Date:

21 October 2003, 04:51:09

On Mon, Oct 20, 2003 at 10:58:00PM +0200, Peter Eisentraut wrote:

> (Note that I say Unicode a lot here because those people do a lot of
> research and standardization in this area, which is available for free,
> but this does not constrain the result to work only with the Unicode
> character set.)

 Why can't PostgreSQL be a 100% pure Unicode system? We could do
 conversion from/to other encodings as a client/server communication
 extension, but internally in the backend we could use only pure Unicode
 data. I think a lot of things would be simpler...

    Karel

-- 
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

Re: Unicode upper() bug still present

From

Hannu Krosing

Date:

21 October 2003, 05:46:27

Karel Zak wrote on Tue, 21.10.2003 at 10:50:
> On Mon, Oct 20, 2003 at 10:58:00PM +0200, Peter Eisentraut wrote:
> 
> > (Note that I say Unicode a lot here because those people do a lot of
> > research and standardization in this area, which is available for free,
> > but this does not constrain the result to work only with the Unicode
> > character set.)
> 
>  Why can't PostgreSQL be a 100% pure Unicode system? We could do
>  conversion from/to other encodings as a client/server communication
>  extension, but internally in the backend we could use only pure Unicode
>  data. I think a lot of things would be simpler...

I've heard that some far-east languages have had some issues with 16-bit
UNICODE, but the 32-bit should have fixed it.

I would also support a move to UNICODE (store as SCSU, process as 16 or
32 bit wchars, i/o as UTF-8) for NCHAR/NVARCHAR/NTEXT and pure 7-bit
byte-value ordered ASCII for CHAR/VARCHAR/TEXT.

But this would surely have some issues with backward compatibility.

------------
Hannu

Re: Unicode upper() bug still present

From

Tatsuo Ishii

Date:

21 October 2003, 06:08:28

>  Why can't PostgreSQL be a 100% pure Unicode system? We could do
>  conversion from/to other encodings as a client/server communication
>  extension, but internally in the backend we could use only pure Unicode
>  data. I think a lot of things would be simpler...

Please don't do that. There's a known issue with round-trip conversion
between Unicode and other encodings, and the existing encodings are
still very important for many users.

Also, I think a DBMS should not rely on a particular encoding
implementation.
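
One commonly cited kind of round-trip problem can be sketched as follows; it may or may not be the exact issue meant above. Some legacy encodings contain several byte sequences that map to the same Unicode character, so converting to Unicode and back cannot restore the original bytes. The sketch assumes Python's `cp932` codec (a Shift_JIS variant) as the example encoding, and the function name is hypothetical.

```python
def non_round_tripping(encoding: str = "cp932"):
    """Return (original_bytes, text, re_encoded_bytes) for double-byte
    sequences that do not survive bytes -> Unicode -> bytes."""
    lossy = []
    # Double-byte lead bytes in Shift_JIS-style encodings.
    leads = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    for lead in leads:
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError:
                continue  # not a defined character in this encoding
            try:
                back = text.encode(encoding)
            except UnicodeEncodeError:
                back = None
            if back != raw:
                lossy.append((raw, text, back))
    return lossy


lossy = non_round_tripping()
lossy_count = len(lossy)  # number of byte sequences changed by a round trip
```

Every entry in `lossy` is a stored value that would come back as different bytes if the backend kept only Unicode internally.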
--
Tatsuo Ishii

Re: Unicode upper() bug still present

From

Hannu Krosing

Date:

21 October 2003, 06:28:54

Tatsuo Ishii wrote on Tue, 21.10.2003 at 12:07:
> >  Why can't PostgreSQL be a 100% pure Unicode system? We could do
> >  conversion from/to other encodings as a client/server communication
> >  extension, but internally in the backend we could use only pure Unicode
> >  data. I think a lot of things would be simpler...
> 
> Please don't do that. There's a known issue of round trip conversion
> between Unicode and other encodings 

Are these unsolvable even in theory ?

-------------
Hannu

Re: Unicode upper() bug still present

From

Tatsuo Ishii

Date:

21 October 2003, 06:30:48

> Tatsuo Ishii wrote on Tue, 21.10.2003 at 12:07:
> > >  Why can't PostgreSQL be a 100% pure Unicode system? We could do
> > >  conversion from/to other encodings as a client/server communication
> > >  extension, but internally in the backend we could use only pure Unicode
> > >  data. I think a lot of things would be simpler...
> > 
> > Please don't do that. There's a known issue of round trip conversion
> > between Unicode and other encodings 
> 
> Are these unsolvable even in theory ?

Right.
--
Tatsuo Ishii

Automatic conversion from Unicode

From

Michael Brusser

Date:

21 October 2003, 11:51:11

With Pg 7.3.x we initialize the database with the -E UNICODE option.
I'd like to provide support for automatic conversion to a Chinese character
set by putting "client_encoding big5" into postgresql.conf.

But when I try "\encoding big5" in a psql session I get this:
big5: invalid encoding name or conversion procedure not found
I looked into table pg_conversion - it's empty.

Is this something I missed with compile options?
Thanks in advance,
Mike

Re: Automatic conversion from Unicode

From

Tom Lane

Date:

21 October 2003, 12:46:37

Michael Brusser <michael@synchronicity.com> writes:
> I looked into table pg_conversion - it's empty.

We've seen that reported before.  IIRC, it is possible to have
dynamic-linking problems with loading the pg_conversion support
libraries, and the 7.3 version of initdb has a bad habit of sending
the resulting error messages to /dev/null rather than letting you
know there's a problem.

You might try
	initdb -d -D some_junk_directory 2>initdb.err
(beware, this will produce megabytes worth of uninteresting messages
on stderr), and then looking at the last few hundred lines for relevant
error messages.

I spent a little time digging in the archives for previous reports,
without much success, but I remember it's come up before.
        regards, tom lane

Re: Reentrant Locale API

From

butlerm@middle.net (Mark Butler)

Date:

19 November 2003, 15:21:07

> Where have you found this?
> 
> I've been looking for that but have not found it. I run a rh9 system, do
> you have something newer? Maybe I have just not looked in the right place
> in the documentation.


Glibc 2.3 implements both a reentrant and a thread-local locale API.

The reentrant API provides versions of isalpha, isupper, toupper,
strcoll, and so on that take a separate locale parameter.

The thread-local API is simpler: it adds a new uselocale() function
that, once called, places a thread in its own thread-specific locale,
after which all locale-dependent functions use the thread locale
instead of the global one (which is the default for backward
compatibility).

See this paper for details:
  http://people.redhat.com/drepper/lt2002talk.pdf