Re: Unicode upper() bug still present - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: Unicode upper() bug still present
Date
Msg-id Pine.LNX.4.44.0310202235580.29086-100000@peter.localdomain
In response to Re: Unicode upper() bug still present  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unicode upper() bug still present  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Unicode upper() bug still present  (Karel Zak <zakkr@zf.jcu.cz>)
List pgsql-hackers
Tom Lane writes:

> I'm not sure that "supporting our own locale subsystem" really qualifies
> as "sustainable" ... can you give an estimate of how big the code +
> supporting data is likely to be?

It's not much worse than supporting our own character conversion subsystem
(which, btw., is something we could more likely do without, because the
standard system facilities tend to be quite adequate), and certainly much
less of a burden than maintaining our own set of translated strings.

For the "ctype" category, you can generate the code straight out of the
Unicode tables, with a handful of hardcoded exceptions (like the Turkish
i).  For the "collate" category we need about 40 kB of language-specific
data files plus a big master data file that is maintained by the Unicode
consortium.  (Those 40 kB correspond to the 22 files I currently have,
which, together with the big default file, cover about 70 languages.)
The other locale categories aren't of interest for string processing.
The code isn't large, but of course someone needs to write it.  The
algorithms are standardized (Unicode collation algorithm) and have several
existing implementations.  So this isn't something that we would need to
maintain in a vacuum.
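
To illustrate the "ctype" point, here is a minimal sketch in Python, not a proposal for the actual implementation.  It assumes the default mappings come from the Unicode tables (Python's built-in str.upper()/str.lower() stand in for generated code here) and layers a small hardcoded exception table on top.  The names UPPER_EXCEPTIONS, LOWER_EXCEPTIONS, upper_for_language and lower_for_language are hypothetical.

```python
# Hypothetical per-language exception tables layered over the default
# Unicode case mappings.  Only the Turkish/Azerbaijani dotted and
# dotless i are listed; a real table would be generated and reviewed.
UPPER_EXCEPTIONS = {
    "tr": {"i": "\u0130"},            # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "az": {"i": "\u0130"},
}
LOWER_EXCEPTIONS = {
    "tr": {"I": "\u0131", "\u0130": "i"},   # I -> dotless i, dotted capital I -> i
    "az": {"I": "\u0131", "\u0130": "i"},
}


def upper_for_language(text, lang=None):
    """Uppercase character by character, consulting the exception table first."""
    exceptions = UPPER_EXCEPTIONS.get(lang, {})
    return "".join(exceptions.get(ch, ch.upper()) for ch in text)


def lower_for_language(text, lang=None):
    """Lowercase character by character, consulting the exception table first."""
    exceptions = LOWER_EXCEPTIONS.get(lang, {})
    return "".join(exceptions.get(ch, ch.lower()) for ch in text)


if __name__ == "__main__":
    print(upper_for_language("istanbul"))         # default mapping
    print(upper_for_language("istanbul", "tr"))   # Turkish exception applied
```

The mapping here is deliberately context-free (one character at a time), so it covers only the simple exceptions; rules that depend on neighbouring characters need more than a lookup table.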

(Note that I say Unicode a lot here because those people do a lot of
research and standardization in this area, which is available for free,
but this does not constrain the result to work only with the Unicode
character set.)

> I agree that depending on the system-provided locale behavior has its
> downsides, but it has its upsides too; compatibility with the behavior
> of everything else on the machine being one big one.  So the idea of
> being able to use glibc where available shouldn't be rejected out of
> hand, I think.

I like to think that in the end we can do much better than the POSIX
framework can do.  For instance, the character classification can have
more useful categories, the case conversion can be context-dependent
(which is a requirement in some languages), and users could more directly
add their own collations or parametrize existing ones (because no one ever
seems to agree on the details).
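
As an example of what context-dependent case conversion means, consider the Greek capital sigma, which lowercases to a different letter at the end of a word.  The following Python sketch is a simplification under stated assumptions: it treats "inside a word" as "adjacent to an alphabetic character", which is cruder than the condition the Unicode standard defines, and the function name lower_with_final_sigma is hypothetical.

```python
CAPITAL_SIGMA = "\u03a3"   # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"     # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"     # GREEK SMALL LETTER FINAL SIGMA


def lower_with_final_sigma(text):
    """Lowercase text, choosing the final form of sigma at the end of a word.

    Simplified rule (an assumption of this sketch): a capital sigma becomes
    the final form when it is preceded by a letter and not followed by one.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == CAPITAL_SIGMA:
            letter_before = i > 0 and text[i - 1].isalpha()
            letter_after = i + 1 < len(text) and text[i + 1].isalpha()
            if letter_before and not letter_after:
                out.append(FINAL_SIGMA)
            else:
                out.append(SMALL_SIGMA)
        else:
            out.append(ch.lower())
    return "".join(out)


if __name__ == "__main__":
    print(lower_with_final_sigma("\u039f\u0394\u03a5\u03a3\u03a3\u0395\u03a5\u03a3"))
```

A per-character mapping in the style of toupper()/tolower() cannot express this, because the result depends on the neighbouring characters.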

-- 
Peter Eisentraut   peter_e@gmx.net


