Thread: Patch for collation using ICU

Patch for collation using ICU

From
Palle Girgensohn
Date:
Hi!

I've put together a patch for using IBM's ICU package for collation.

If your OS does not have full support for collation ur uppercase/lowercase 
in multibyte locales, this might be useful. If you are using a multibyte 
character encoding in your database and want collation, i.e. order by, and 
also lower(), upper() and initcap() to work properly, this patch will do 
just that.

This patch is needed for FreeBSD, since this OS has no support for 
collation of for example unicode locales (that is, wcscoll(3) does not do 
what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the 
patch is *not* necessary for Linux, although IBM claims ICU collation to be 
about twice as fast as glibc for simple western locales.

It adds a configure switch, `--with-icu', which will set up the code to use 
ICU instead of wchar_t and wcscoll.

This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it 
seems to run well. I've not had the time to do any comparative performance 
tests yet, but it seems it is at least not slower than using LATIN1 with 
sv_SE.ISO8859-1 locale, perhaps even faster.

I'd be delighted if some more experienced postgresql hackers would review 
this stuff. The patch is pretty compact, so it's fast reading :)  I'm 
planning to add this patch as an option (tagged "experimental") to 
FreeBSD's postgresql port. Any ideas about whether this is a good idea or 
not?

Any thoughts or ideas are welcome!

Cheers,
Palle

Patch at:
<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>

ICU at sourceforge: <http://icu.sf.net/>



Re: Patch for collation using ICU

From
"John Hansen"
Date:
Useful if it's going to support earlier releases of ICU....

Not all os's come with ICU3.2, debian for example, currently has 2.1 in
testing, and 2.6 in unstable.

... John

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
> Palle Girgensohn
> Sent: Friday, March 25, 2005 10:40 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] Patch for collation using ICU
>
> Hi!
>
> I've put together a patch for using IBM's ICU package for collation.
>
> If your OS does not have full support for collation ur
> uppercase/lowercase in multibyte locales, this might be
> useful. If you are using a multibyte character encoding in
> your database and want collation, i.e. order by, and also
> lower(), upper() and initcap() to work properly, this patch
> will do just that.
>
> This patch is needed for FreeBSD, since this OS has no
> support for collation of for example unicode locales (that
> is, wcscoll(3) does not do what you expect if you set
> LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not*
> necessary for Linux, although IBM claims ICU collation to be
> about twice as fast as glibc for simple western locales.
>
> It adds a configure switch, `--with-icu', which will set up
> the code to use ICU instead of wchar_t and wcscoll.
>
> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable,
> where it seems to run well. I've not had the time to do any
> comparative performance tests yet, but it seems it is at
> least not slower than using LATIN1 with
> sv_SE.ISO8859-1 locale, perhaps even faster.
>
> I'd be delighted if some more experienced postgresql hackers
> would review this stuff. The patch is pretty compact, so it's
> fast reading :)  I'm planning to add this patch as an option
> (tagged "experimental") to FreeBSD's postgresql port. Any
> ideas about whether this is a good idea or not?
>
> Any thoughts or ideas are welcome!
>
> Cheers,
> Palle
>
> Patch at:
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
005-03-14.diff>
>
> ICU at sourceforge: <http://icu.sf.net/>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>
>


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On fredag, mars 25, 2005 16.34.41 +1100 John Hansen <john@geeknet.com.au> 
wrote:

> Useful if it's going to support earlier releases of ICU....
>
> Not all os's come with ICU3.2, debian for example, currently has 2.1 in
> testing, and 2.6 in unstable.

Oh, OK. FreeBSD has only the 3.2 as port. I can check the older version, I 
doubt it would too much difference. Some autoconf sorcery needed, perhaps.

/Palle

>
> ... John
>
>> -----Original Message-----
>> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
>> Palle Girgensohn
>> Sent: Friday, March 25, 2005 10:40 AM
>> To: pgsql-hackers@postgresql.org
>> Subject: [HACKERS] Patch for collation using ICU
>>
>> Hi!
>>
>> I've put together a patch for using IBM's ICU package for collation.
>>
>> If your OS does not have full support for collation ur
>> uppercase/lowercase in multibyte locales, this might be
>> useful. If you are using a multibyte character encoding in
>> your database and want collation, i.e. order by, and also
>> lower(), upper() and initcap() to work properly, this patch
>> will do just that.
>>
>> This patch is needed for FreeBSD, since this OS has no
>> support for collation of for example unicode locales (that
>> is, wcscoll(3) does not do what you expect if you set
>> LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not*
>> necessary for Linux, although IBM claims ICU collation to be
>> about twice as fast as glibc for simple western locales.
>>
>> It adds a configure switch, `--with-icu', which will set up
>> the code to use ICU instead of wchar_t and wcscoll.
>>
>> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable,
>> where it seems to run well. I've not had the time to do any
>> comparative performance tests yet, but it seems it is at
>> least not slower than using LATIN1 with
>> sv_SE.ISO8859-1 locale, perhaps even faster.
>>
>> I'd be delighted if some more experienced postgresql hackers
>> would review this stuff. The patch is pretty compact, so it's
>> fast reading :)  I'm planning to add this patch as an option
>> (tagged "experimental") to FreeBSD's postgresql port. Any
>> ideas about whether this is a good idea or not?
>>
>> Any thoughts or ideas are welcome!
>>
>> Cheers,
>> Palle
>>
>> Patch at:
>> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> 005-03-14.diff>
>>
>> ICU at sourceforge: <http://icu.sf.net/>
>>
>>
>> ---------------------------(end of
>> broadcast)---------------------------
>> TIP 7: don't forget to increase your free space map settings
>>
>>






Re: Patch for collation using ICU

From
"John Hansen"
Date:
> --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen
> <john@geeknet.com.au>
> wrote:
>
> > Useful if it's going to support earlier releases of ICU....
> >
> > Not all os's come with ICU3.2, debian for example,
> currently has 2.1
> > in testing, and 2.6 in unstable.
>
> Oh, OK. FreeBSD has only the 3.2 as port. I can check the
> older version, I doubt it would too much difference. Some
> autoconf sorcery needed, perhaps.

Naww, it's no biggie, we'll just need to include ICU with pg I think.
I tried that, there are several functions from ICU that you use, that
are not in ICU2.1

Dono about 2.6.

However, ICU3.2 compiles on debian with a small change to the
debian/rules file.
debian/tmp/etc is missing, so add mkdir debian/tmp/etc

... John

>
> /Palle
>
> >
> > ... John
> >
> >> -----Original Message-----
> >> From: pgsql-hackers-owner@postgresql.org
> >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Palle
> >> Girgensohn
> >> Sent: Friday, March 25, 2005 10:40 AM
> >> To: pgsql-hackers@postgresql.org
> >> Subject: [HACKERS] Patch for collation using ICU
> >>
> >> Hi!
> >>
> >> I've put together a patch for using IBM's ICU package for
> collation.
> >>
> >> If your OS does not have full support for collation ur
> >> uppercase/lowercase in multibyte locales, this might be useful. If
> >> you are using a multibyte character encoding in your database and
> >> want collation, i.e. order by, and also lower(), upper() and
> >> initcap() to work properly, this patch will do just that.
> >>
> >> This patch is needed for FreeBSD, since this OS has no support for
> >> collation of for example unicode locales (that is, wcscoll(3) does
> >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for
> example).
> >> AFAIK the patch is *not* necessary for Linux, although IBM
> claims ICU
> >> collation to be about twice as fast as glibc for simple western
> >> locales.
> >>
> >> It adds a configure switch, `--with-icu', which will set
> up the code
> >> to use ICU instead of wchar_t and wcscoll.
> >>
> >> This has been tested only on FreeBSD-4.11 &
> FreeBSD-5-stable, where
> >> it seems to run well. I've not had the time to do any comparative
> >> performance tests yet, but it seems it is at least not slower than
> >> using LATIN1 with
> >> sv_SE.ISO8859-1 locale, perhaps even faster.
> >>
> >> I'd be delighted if some more experienced postgresql hackers would
> >> review this stuff. The patch is pretty compact, so it's
> fast reading
> >> :)  I'm planning to add this patch as an option (tagged
> >> "experimental") to FreeBSD's postgresql port. Any ideas
> about whether
> >> this is a good idea or not?
> >>
> >> Any thoughts or ideas are welcome!
> >>
> >> Cheers,
> >> Palle
> >>
> >> Patch at:
> >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> > 005-03-14.diff>
> >>
> >> ICU at sourceforge: <http://icu.sf.net/>
> >>
> >>
> >> ---------------------------(end of
> >> broadcast)---------------------------
> >> TIP 7: don't forget to increase your free space map settings
> >>
> >>
>
>
>
>
>
>


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Ok,.. tested on debian sarge with ICU 3.2
UNICODE Database, C locale.

upper() and lower() returns an empty string for any input, including
7bit ascii, regardless of client_encoding, so something is obviously
broken.

Have you tested this patch on a UNICODE DB with locale C/POSIX ?

... John

> -----Original Message-----
> From: John Hansen
> Sent: Friday, March 25, 2005 10:27 PM
> To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org'
> Subject: RE: [HACKERS] Patch for collation using ICU
>
> > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen
> > <john@geeknet.com.au>
> > wrote:
> >
> > > Useful if it's going to support earlier releases of ICU....
> > >
> > > Not all os's come with ICU3.2, debian for example,
> > currently has 2.1
> > > in testing, and 2.6 in unstable.
> >
> > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older
> > version, I doubt it would too much difference. Some
> autoconf sorcery
> > needed, perhaps.
>
> Naww, it's no biggie, we'll just need to include ICU with pg I think.
> I tried that, there are several functions from ICU that you
> use, that are not in ICU2.1
>
> Dono about 2.6.
>
> However, ICU3.2 compiles on debian with a small change to the
> debian/rules file.
> debian/tmp/etc is missing, so add mkdir debian/tmp/etc
>
> ... John
>
> >
> > /Palle
> >
> > >
> > > ... John
> > >
> > >> -----Original Message-----
> > >> From: pgsql-hackers-owner@postgresql.org
> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Palle
> > >> Girgensohn
> > >> Sent: Friday, March 25, 2005 10:40 AM
> > >> To: pgsql-hackers@postgresql.org
> > >> Subject: [HACKERS] Patch for collation using ICU
> > >>
> > >> Hi!
> > >>
> > >> I've put together a patch for using IBM's ICU package for
> > collation.
> > >>
> > >> If your OS does not have full support for collation ur
> > >> uppercase/lowercase in multibyte locales, this might be
> useful. If
> > >> you are using a multibyte character encoding in your
> database and
> > >> want collation, i.e. order by, and also lower(), upper() and
> > >> initcap() to work properly, this patch will do just that.
> > >>
> > >> This patch is needed for FreeBSD, since this OS has no
> support for
> > >> collation of for example unicode locales (that is,
> wcscoll(3) does
> > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for
> > example).
> > >> AFAIK the patch is *not* necessary for Linux, although IBM
> > claims ICU
> > >> collation to be about twice as fast as glibc for simple western
> > >> locales.
> > >>
> > >> It adds a configure switch, `--with-icu', which will set
> > up the code
> > >> to use ICU instead of wchar_t and wcscoll.
> > >>
> > >> This has been tested only on FreeBSD-4.11 &
> > FreeBSD-5-stable, where
> > >> it seems to run well. I've not had the time to do any
> comparative
> > >> performance tests yet, but it seems it is at least not
> slower than
> > >> using LATIN1 with
> > >> sv_SE.ISO8859-1 locale, perhaps even faster.
> > >>
> > >> I'd be delighted if some more experienced postgresql
> hackers would
> > >> review this stuff. The patch is pretty compact, so it's
> > fast reading
> > >> :)  I'm planning to add this patch as an option (tagged
> > >> "experimental") to FreeBSD's postgresql port. Any ideas
> > about whether
> > >> this is a good idea or not?
> > >>
> > >> Any thoughts or ideas are welcome!
> > >>
> > >> Cheers,
> > >> Palle
> > >>
> > >> Patch at:
> > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> > > 005-03-14.diff>
> > >>
> > >> ICU at sourceforge: <http://icu.sf.net/>
> > >>
> > >>
> > >> ---------------------------(end of
> > >> broadcast)---------------------------
> > >> TIP 7: don't forget to increase your free space map settings
> > >>
> > >>
> >
> >
> >
> >
> >
> >


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:
--On fredag, mars 25, 2005 23.39.33 +1100 John Hansen <john@geeknet.com.au> 
wrote:

> Ok,.. tested on debian sarge with ICU 3.2
> UNICODE Database, C locale.
>
> upper() and lower() returns an empty string for any input, including
> 7bit ascii, regardless of client_encoding, so something is obviously
> broken.
>
> Have you tested this patch on a UNICODE DB with locale C/POSIX ?

No, honestly not. Mostly tested it with my needs, sv_SE.UTF-8 and UNICODE, 
and also de_DE.UTF-8.

How will PostgreSQL react to this combo? A database cluster initdb:ed with 
locale=C/POSIX, and then a database in UNICODE (really utf-8) 
representation... hmm... I think I might have made a false assumption that 
the locale string would contain the character encoding. I do something like 
encoding = strchr(locale, '.') + 1... That code will be confused by a 'C' 
locale, indeed. I'll check it out!

/Palle



>
> ... John
>
>> -----Original Message-----
>> From: John Hansen
>> Sent: Friday, March 25, 2005 10:27 PM
>> To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org'
>> Subject: RE: [HACKERS] Patch for collation using ICU
>>
>> > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen
>> > <john@geeknet.com.au>
>> > wrote:
>> >
>> > > Useful if it's going to support earlier releases of ICU....
>> > >
>> > > Not all os's come with ICU3.2, debian for example,
>> > currently has 2.1
>> > > in testing, and 2.6 in unstable.
>> >
>> > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older
>> > version, I doubt it would too much difference. Some
>> autoconf sorcery
>> > needed, perhaps.
>>
>> Naww, it's no biggie, we'll just need to include ICU with pg I think.
>> I tried that, there are several functions from ICU that you
>> use, that are not in ICU2.1
>>
>> Dono about 2.6.
>>
>> However, ICU3.2 compiles on debian with a small change to the
>> debian/rules file.
>> debian/tmp/etc is missing, so add mkdir debian/tmp/etc
>>
>> ... John
>>
>> >
>> > /Palle
>> >
>> > >
>> > > ... John
>> > >
>> > >> -----Original Message-----
>> > >> From: pgsql-hackers-owner@postgresql.org
>> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Palle
>> > >> Girgensohn
>> > >> Sent: Friday, March 25, 2005 10:40 AM
>> > >> To: pgsql-hackers@postgresql.org
>> > >> Subject: [HACKERS] Patch for collation using ICU
>> > >>
>> > >> Hi!
>> > >>
>> > >> I've put together a patch for using IBM's ICU package for
>> > collation.
>> > >>
>> > >> If your OS does not have full support for collation ur
>> > >> uppercase/lowercase in multibyte locales, this might be
>> useful. If
>> > >> you are using a multibyte character encoding in your
>> database and
>> > >> want collation, i.e. order by, and also lower(), upper() and
>> > >> initcap() to work properly, this patch will do just that.
>> > >>
>> > >> This patch is needed for FreeBSD, since this OS has no
>> support for
>> > >> collation of for example unicode locales (that is,
>> wcscoll(3) does
>> > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for
>> > example).
>> > >> AFAIK the patch is *not* necessary for Linux, although IBM
>> > claims ICU
>> > >> collation to be about twice as fast as glibc for simple western
>> > >> locales.
>> > >>
>> > >> It adds a configure switch, `--with-icu', which will set
>> > up the code
>> > >> to use ICU instead of wchar_t and wcscoll.
>> > >>
>> > >> This has been tested only on FreeBSD-4.11 &
>> > FreeBSD-5-stable, where
>> > >> it seems to run well. I've not had the time to do any
>> comparative
>> > >> performance tests yet, but it seems it is at least not
>> slower than
>> > >> using LATIN1 with
>> > >> sv_SE.ISO8859-1 locale, perhaps even faster.
>> > >>
>> > >> I'd be delighted if some more experienced postgresql
>> hackers would
>> > >> review this stuff. The patch is pretty compact, so it's
>> > fast reading
>> > >> :)  I'm planning to add this patch as an option (tagged
>> > >> "experimental") to FreeBSD's postgresql port. Any ideas
>> > about whether
>> > >> this is a good idea or not?
>> > >>
>> > >> Any thoughts or ideas are welcome!
>> > >>
>> > >> Cheers,
>> > >> Palle
>> > >>
>> > >> Patch at:
>> > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
>> > > 005-03-14.diff>
>> > >>
>> > >> ICU at sourceforge: <http://icu.sf.net/>
>> > >>
>> > >>
>> > >> ---------------------------(end of
>> > >> broadcast)---------------------------
>> > >> TIP 7: don't forget to increase your free space map settings
>> > >>
>> > >>
>> >
>> >
>> >
>> >
>> >
>> >






Re: Patch for collation using ICU

From
Tom Lane
Date:
Palle Girgensohn <girgen@pingpong.net> writes:
> hmm... I think I might have made a false assumption that 
> the locale string would contain the character encoding.

You certainly cannot assume that.  Would that it were so easy to find
out the character set for a locale :-(.

There's some code in initdb that you might emulate, though I can't say
that I trust it a whole lot.
        regards, tom lane


Re: Patch for collation using ICU

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Palle Girgensohn <girgen@pingpong.net> writes:
>  
>
>>hmm... I think I might have made a false assumption that 
>>the locale string would contain the character encoding.
>>    
>>
>
>You certainly cannot assume that.  Would that it were so easy to find
>out the character set for a locale :-(.
>
>There's some code in initdb that you might emulate, though I can't say
>that I trust it a whole lot.
>
>
>  
>

Me either. On FC3 I saw this the other day, haven't had time to dig into 
what's going on:

[andrew inst]$ echo $LANG
en_US.UTF-8
[andrew inst]$ bin/initdb floobl
The files belonging to this database system will be owned by user "andrew".
This user must also own the server process.

The database cluster will be initialized with locale en_US.UTF-8.
initdb: could not find suitable encoding for locale "en_US.UTF-8"
Rerun initdb with the -E option.
Try "initdb --help" for more information.



cheers

andrew


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:
--On fredag, mars 25, 2005 09.53.38 -0500 Tom Lane <tgl@sss.pgh.pa.us> 
wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> hmm... I think I might have made a false assumption that
>> the locale string would contain the character encoding.
>
> You certainly cannot assume that.  Would that it were so easy to find
> out the character set for a locale :-(.
>
> There's some code in initdb that you might emulate, though I can't say
> that I trust it a whole lot.

Best thing would of course be to check the encoding of the database. I 
realized I've made a bad assumption that the initdb locale has anything to 
do with the character encoding of the database - it doesn't. There must be 
some already existsing trusted way to get the character encoding of a 
database?

/Palle



Re: Patch for collation using ICU

From
"John Hansen"
Date:
> --On fredag, mars 25, 2005 23.39.33 +1100 John Hansen
> <john@geeknet.com.au>
> wrote:
>
> > Ok,.. tested on debian sarge with ICU 3.2 UNICODE Database,
> C locale.
> >
> > upper() and lower() returns an empty string for any input,
> including
> > 7bit ascii, regardless of client_encoding, so something is
> obviously
> > broken.
> >
> > Have you tested this patch on a UNICODE DB with locale C/POSIX ?

FYI, I also found that initdb crashes with error 139 on any locale other
than C/POSIX.

>
> No, honestly not. Mostly tested it with my needs, sv_SE.UTF-8
> and UNICODE, and also de_DE.UTF-8.
>
> How will PostgreSQL react to this combo? A database cluster
> initdb:ed with locale=C/POSIX, and then a database in UNICODE
> (really utf-8) representation... hmm... I think I might have
> made a false assumption that the locale string would contain
> the character encoding. I do something like encoding =
> strchr(locale, '.') + 1... That code will be confused by a 'C'
> locale, indeed. I'll check it out!
>
> /Palle
>
>
>
> >
> > ... John
> >
> >> -----Original Message-----
> >> From: John Hansen
> >> Sent: Friday, March 25, 2005 10:27 PM
> >> To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org'
> >> Subject: RE: [HACKERS] Patch for collation using ICU
> >>
> >> > --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen
> >> > <john@geeknet.com.au>
> >> > wrote:
> >> >
> >> > > Useful if it's going to support earlier releases of ICU....
> >> > >
> >> > > Not all os's come with ICU3.2, debian for example,
> >> > currently has 2.1
> >> > > in testing, and 2.6 in unstable.
> >> >
> >> > Oh, OK. FreeBSD has only the 3.2 as port. I can check the older
> >> > version, I doubt it would too much difference. Some
> >> autoconf sorcery
> >> > needed, perhaps.
> >>
> >> Naww, it's no biggie, we'll just need to include ICU with
> pg I think.
> >> I tried that, there are several functions from ICU that
> you use, that
> >> are not in ICU2.1
> >>
> >> Dono about 2.6.
> >>
> >> However, ICU3.2 compiles on debian with a small change to the
> >> debian/rules file.
> >> debian/tmp/etc is missing, so add mkdir debian/tmp/etc
> >>
> >> ... John
> >>
> >> >
> >> > /Palle
> >> >
> >> > >
> >> > > ... John
> >> > >
> >> > >> -----Original Message-----
> >> > >> From: pgsql-hackers-owner@postgresql.org
> >> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf
> Of Palle
> >> > >> Girgensohn
> >> > >> Sent: Friday, March 25, 2005 10:40 AM
> >> > >> To: pgsql-hackers@postgresql.org
> >> > >> Subject: [HACKERS] Patch for collation using ICU
> >> > >>
> >> > >> Hi!
> >> > >>
> >> > >> I've put together a patch for using IBM's ICU package for
> >> > collation.
> >> > >>
> >> > >> If your OS does not have full support for collation ur
> >> > >> uppercase/lowercase in multibyte locales, this might be
> >> useful. If
> >> > >> you are using a multibyte character encoding in your
> >> database and
> >> > >> want collation, i.e. order by, and also lower(), upper() and
> >> > >> initcap() to work properly, this patch will do just that.
> >> > >>
> >> > >> This patch is needed for FreeBSD, since this OS has no
> >> support for
> >> > >> collation of for example unicode locales (that is,
> >> wcscoll(3) does
> >> > >> not do what you expect if you set LC_ALL=sv_SE.UTF-8, for
> >> > example).
> >> > >> AFAIK the patch is *not* necessary for Linux, although IBM
> >> > claims ICU
> >> > >> collation to be about twice as fast as glibc for
> simple western
> >> > >> locales.
> >> > >>
> >> > >> It adds a configure switch, `--with-icu', which will set
> >> > up the code
> >> > >> to use ICU instead of wchar_t and wcscoll.
> >> > >>
> >> > >> This has been tested only on FreeBSD-4.11 &
> >> > FreeBSD-5-stable, where
> >> > >> it seems to run well. I've not had the time to do any
> >> comparative
> >> > >> performance tests yet, but it seems it is at least not
> >> slower than
> >> > >> using LATIN1 with
> >> > >> sv_SE.ISO8859-1 locale, perhaps even faster.
> >> > >>
> >> > >> I'd be delighted if some more experienced postgresql
> >> hackers would
> >> > >> review this stuff. The patch is pretty compact, so it's
> >> > fast reading
> >> > >> :)  I'm planning to add this patch as an option (tagged
> >> > >> "experimental") to FreeBSD's postgresql port. Any ideas
> >> > about whether
> >> > >> this is a good idea or not?
> >> > >>
> >> > >> Any thoughts or ideas are welcome!
> >> > >>
> >> > >> Cheers,
> >> > >> Palle
> >> > >>
> >> > >> Patch at:
> >> > >> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> >> > > 005-03-14.diff>
> >> > >>
> >> > >> ICU at sourceforge: <http://icu.sf.net/>
> >> > >>
> >> > >>
> >> > >> ---------------------------(end of
> >> > >> broadcast)---------------------------
> >> > >> TIP 7: don't forget to increase your free space map settings
> >> > >>
> >> > >>
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
>
>
>
>
>
>


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:
--On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn
<girgen@pingpong.net> wrote:

> Hi!
>
> I've put together a patch for using IBM's ICU package for collation.
>
> If your OS does not have full support for collation ur
> uppercase/lowercase in multibyte locales, this might be useful. If you
> are using a multibyte character encoding in your database and want
> collation, i.e. order by, and also lower(), upper() and initcap() to work
> properly, this patch will do just that.
>
> This patch is needed for FreeBSD, since this OS has no support for
> collation of for example unicode locales (that is, wcscoll(3) does not do
> what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the
> patch is *not* necessary for Linux, although IBM claims ICU collation to
> be about twice as fast as glibc for simple western locales.
>
> It adds a configure switch, `--with-icu', which will set up the code to
> use ICU instead of wchar_t and wcscoll.
>
> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
> seems to run well. I've not had the time to do any comparative
> performance tests yet, but it seems it is at least not slower than using
> LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
>
> I'd be delighted if some more experienced postgresql hackers would review
> this stuff. The patch is pretty compact, so it's fast reading :)  I'm
> planning to add this patch as an option (tagged "experimental") to
> FreeBSD's postgresql port. Any ideas about whether this is a good idea or
> not?
>
> Any thoughts or ideas are welcome!
>
> Cheers,
> Palle
>
> Patch at:
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.d
> iff>
>
> ICU at sourceforge: <http://icu.sf.net/>


Hi!

There's a new patch to fix some reported problems.

<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>

This version uses the DatabaseEncoding and sets the ICU encoding at the
same time. I had to create a conversion table from PostgreSQL's own,
somewhat odd and non-standard, names of encodings, into the prefered IANA
names. On or two of the more odd ones might be slightly incorrect,
hopefully not too far off anyway?

I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:

- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)

- When the database is initialized using the C locale, upper() and lower()
normally does not work at all for non-ASCII characters even if the
database's encoding is say LATIN1 or UNICODE. (does not work for me anyway,
on FreeBSD, and this is probably correct since the locale is still `C', I
believe?). The ICU patch changes nothing for the LATIN1 case, since it does
not act on single byte encodings, but for the UNICODE representation, it
works and does what I expect it to, namely upper() and lower() neatly
upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') -> 'åäö'.
This is a good thing, although I'm surprised that upper/lower is dragged
along with the LC_COLLATE fixation at initdb. I never run initdb in the C
locale, but only now do I realize how broken that really is if you need to
store anything else than English :-)

I'd be delighted to get more feedback about this stuff.

Thanks,
Palle



Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, mars 26, 2005 10.42.19 +1100 John Hansen <john@geeknet.com.au>
wrote:

> FYI, I also found that initdb crashes with error 139 on any locale other
> than C/POSIX.

Odd, not for me, but I did make a bad assumption about character encoding.
Perhaps the new patch will help? (see previous mail)

What is error 139, anyone?

/Palle



Re: Patch for collation using ICU

From
"John Hansen"
Date:

> -----Original Message-----
> From: Palle Girgensohn [mailto:girgen@pingpong.net]
> Sent: Saturday, March 26, 2005 1:10 PM
> To: pgsql-hackers@postgresql.org
> Cc: John Hansen; Andrew Dunstan
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn
> <girgen@pingpong.net> wrote:
>
> > Hi!
> >
> > I've put together a patch for using IBM's ICU package for collation.
> >
> > If your OS does not have full support for collation ur
> > uppercase/lowercase in multibyte locales, this might be
> useful. If you
> > are using a multibyte character encoding in your database and want
> > collation, i.e. order by, and also lower(), upper() and
> initcap() to
> > work properly, this patch will do just that.
> >
> > This patch is needed for FreeBSD, since this OS has no support for
> > collation of for example unicode locales (that is,
> wcscoll(3) does not
> > do what you expect if you set LC_ALL=sv_SE.UTF-8, for
> example). AFAIK
> > the patch is *not* necessary for Linux, although IBM claims ICU
> > collation to be about twice as fast as glibc for simple
> western locales.
> >
> > It adds a configure switch, `--with-icu', which will set up
> the code
> > to use ICU instead of wchar_t and wcscoll.
> >
> > This has been tested only on FreeBSD-4.11 &
> FreeBSD-5-stable, where it
> > seems to run well. I've not had the time to do any comparative
> > performance tests yet, but it seems it is at least not slower than
> > using
> > LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
> >
> > I'd be delighted if some more experienced postgresql hackers would
> > review this stuff. The patch is pretty compact, so it's
> fast reading
> > :)  I'm planning to add this patch as an option (tagged
> > "experimental") to FreeBSD's postgresql port. Any ideas
> about whether
> > this is a good idea or not?
> >
> > Any thoughts or ideas are welcome!
> >
> > Cheers,
> > Palle
> >
> > Patch at:
> >
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1
> > 4.d
> > iff>
> >
> > ICU at sourceforge: <http://icu.sf.net/>
>
>
> Hi!
>
> There's a new patch to fix some reported problems.
>
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
005-03-26.diff>
>
> This version uses the DatabaseEncoding and sets the ICU
> encoding at the same time. I had to create a conversion table
> from PostgreSQL's own, somewhat odd and non-standard, names
> of encodings, into the prefered IANA names. On or two of the
> more odd ones might be slightly incorrect, hopefully not too
> far off anyway?
>
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might
> break the SQL standard (?), but sure is nice :)

This would mean that indexes are also case insensitive right?
Which makes it a Bad Thing(tm).

> - When the database is initialized using the C locale,
> upper() and lower() normally does not work at all for
> non-ASCII characters even if the database's encoding is say
> LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD,
> and this is probably correct since the locale is still `C', I
> believe?). The ICU patch changes nothing for the LATIN1 case,
> since it does not act on single byte encodings, but for the
> UNICODE representation, it works and does what I expect it
> to, namely upper() and lower() neatly
> upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ')
> -> 'åäö'.
> This is a good thing, although I'm surprised that upper/lower
> is dragged along with the LC_COLLATE fixation at initdb. I
> never run initdb in the C locale, but only now do I realize
> how broken that really is if you need to store anything else
> than English :-)

That is what I would have expected. However, it probably won't work for the more exotic cases, like turkish I, which
dependson the locale. 

>
> I'd be delighted to get more feedback about this stuff.
>
> Thanks,
> Palle
>
>
>


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Still doesn't work for me :(
UNICODE DB
C locale

set client_encoding = iso88591;
select upper('æ');upper
-------

(1 row)

Trying to initdb with en_IN.utf8

/usr/lib/postgresql/bin/initdb -D /var/lib/postgres/data/ -E UNICODE --locale=en_IN.utf8
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale en_IN.utf8.

fixing permissions on existing directory /var/lib/postgres/data ... ok
creating directory /var/lib/postgres/data/global ... ok
creating directory /var/lib/postgres/data/pg_xlog ... ok
creating directory /var/lib/postgres/data/pg_xlog/archive_status ... ok
creating directory /var/lib/postgres/data/pg_clog ... ok
creating directory /var/lib/postgres/data/pg_subtrans ... ok
creating directory /var/lib/postgres/data/base ... ok
creating directory /var/lib/postgres/data/base/1 ... ok
creating directory /var/lib/postgres/data/pg_tblspc ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 1000
creating configuration files ... ok
creating template1 database in /var/lib/postgres/data/base/1 ... ok
initializing pg_shadow ... ok
enabling unlimited row size for system tables ... ok
initializing pg_depend ... ok
creating system views ... ok
loading pg_description ... ok
creating conversions ... ok
setting privileges on built-in objects ... ok
creating information schema ... child process exited with exit code 139
initdb: removing contents of data directory "/var/lib/postgres/data"
... John


> -----Original Message-----
> From: Palle Girgensohn [mailto:girgen@pingpong.net]
> Sent: Saturday, March 26, 2005 1:10 PM
> To: pgsql-hackers@postgresql.org
> Cc: John Hansen; Andrew Dunstan
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn
> <girgen@pingpong.net> wrote:
>
> > Hi!
> >
> > I've put together a patch for using IBM's ICU package for collation.
> >
> > If your OS does not have full support for collation ur
> > uppercase/lowercase in multibyte locales, this might be
> useful. If you
> > are using a multibyte character encoding in your database and want
> > collation, i.e. order by, and also lower(), upper() and
> initcap() to
> > work properly, this patch will do just that.
> >
> > This patch is needed for FreeBSD, since this OS has no support for
> > collation of for example unicode locales (that is,
> wcscoll(3) does not
> > do what you expect if you set LC_ALL=sv_SE.UTF-8, for
> example). AFAIK
> > the patch is *not* necessary for Linux, although IBM claims ICU
> > collation to be about twice as fast as glibc for simple
> western locales.
> >
> > It adds a configure switch, `--with-icu', which will set up
> the code
> > to use ICU instead of wchar_t and wcscoll.
> >
> > This has been tested only on FreeBSD-4.11 &
> FreeBSD-5-stable, where it
> > seems to run well. I've not had the time to do any comparative
> > performance tests yet, but it seems it is at least not slower than
> > using
> > LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
> >
> > I'd be delighted if some more experienced postgresql hackers would
> > review this stuff. The patch is pretty compact, so it's
> fast reading
> > :)  I'm planning to add this patch as an option (tagged
> > "experimental") to FreeBSD's postgresql port. Any ideas
> about whether
> > this is a good idea or not?
> >
> > Any thoughts or ideas are welcome!
> >
> > Cheers,
> > Palle
> >
> > Patch at:
> >
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1
> > 4.d
> > iff>
> >
> > ICU at sourceforge: <http://icu.sf.net/>
>
>
> Hi!
>
> There's a new patch to fix some reported problems.
>
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2
> 005-03-26.diff>
>
> This version uses the DatabaseEncoding and sets the ICU
> encoding at the same time. I had to create a conversion table
> from PostgreSQL's own, somewhat odd and non-standard, names
> of encodings, into the prefered IANA names. On or two of the
> more odd ones might be slightly incorrect, hopefully not too
> far off anyway?
>
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might
> break the SQL standard (?), but sure is nice :)
>
> - When the database is initialized using the C locale,
> upper() and lower() normally does not work at all for
> non-ASCII characters even if the database's encoding is say
> LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD,
> and this is probably correct since the locale is still `C', I
> believe?). The ICU patch changes nothing for the LATIN1 case,
> since it does not act on single byte encodings, but for the
> UNICODE representation, it works and does what I expect it
> to, namely upper() and lower() neatly
> upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ')
> -> 'åäö'.
> This is a good thing, although I'm surprised that upper/lower
> is dragged along with the LC_COLLATE fixation at initdb. I
> never run initdb in the C locale, but only now do I realize
> how broken that really is if you need to store anything else
> than English :-)
>
> I'd be delighted to get more feedback about this stuff.
>
> Thanks,
> Palle
>
>
>


Re: Patch for collation using ICU

From
Stephan Szabo
Date:
On Sat, 26 Mar 2005, Palle Girgensohn wrote:
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might break the SQL
> standard (?), but sure is nice :)

Err, I think if your system implements strcoll correctly 8.0.1 can do this
if the chosen collation is set up that way (or at least naive tests I've
done seem to imply that). Or are you speaking about C locale?


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, mars 26, 2005 08.16.01 -0800 Stephan Szabo
<sszabo@megazone.bigpanda.com> wrote:

> On Sat, 26 Mar 2005, Palle Girgensohn wrote:
>> I've noticed a couple of things about using the ICU patch vs. pristine
>> pg-8.0.1:
>>
>> - ORDER BY is case insensitive when using ICU. This might break the SQL
>> standard (?), but sure is nice :)
>
> Err, I think if your system implements strcoll correctly 8.0.1 can do this
> if the chosen collation is set up that way (or at least naive tests I've
> done seem to imply that). Or are you speaking about C locale?

No, I doubt this.

Example: set up a cluster:
$ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
$ createdb foo
CREATE DATABASE
$ psql foo
foo=# create table bar (val text);
CREATE TABLE
foo=# insert into bar values ('aaa');
INSERT 18354409 1
foo=# insert into bar values ('BBB');
INSERT 18354412 1
foo=# select val from bar order by val;val
-----BBBaaa
(2 rows)


Order by is not case insensitive. It shouldn't be for any system, AFAIK. As
John Hansen noted, this might be a bad thing. I'm not sure about that,
though...

As for general collation of unicode, the reason for me to use ICU is that
my system does not support strcoll correctly for multibyte locales, as I
mentioned earlier. I also noted that even for systems that do handle
strcoll correctly for unicode, ICU claims to be a couple of magnitudes
faster, so this patch might be useful for other systems (read Linux) as
well. See previous emails for details.

Regards,
Palle



Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, mars 26, 2005 13.59.19 +1100 John Hansen <john@geeknet.com.au>
wrote:

>> - ORDER BY is case insensitive when using ICU. This might
>> break the SQL standard (?), but sure is nice :)
>
> This would mean that indexes are also case insensitive right?
> Which makes it a Bad Thing(tm).

Well, no, not really. Indices use collation rules, yes, but upper and lower
case strings are not considered *equal*, just "closer related". In
collation, characters are compared at four levels. See [1] for a good
explaination. This means that indices will use a case insensitive sort
order, but equality will not be different, so it shouldn't break anything.

>> - When the database is initialized using the C locale,
>> upper() and lower() normally does not work at all for
>> non-ASCII characters even if the database's encoding is say
>> LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD,
>> and this is probably correct since the locale is still `C', I
>> believe?). The ICU patch changes nothing for the LATIN1 case,
>> since it does not act on single byte encodings, but for the
>> UNICODE representation, it works and does what I expect it
>> to, namely upper() and lower() neatly
>> upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ')
>> -> 'åäö'.
>> This is a good thing, although I'm surprised that upper/lower
>> is dragged along with the LC_COLLATE fixation at initdb. I
>> never run initdb in the C locale, but only now do I realize
>> how broken that really is if you need to store anything else
>> than English :-)
>
> That is what I would have expected. However, it probably won't work for
> the more exotic cases, like turkish I, which depends on the locale.

Nope, Turkish must of course have its locale to for example handle their
special capital "i". Let's just say it is less broken :)

/Palle


[1]
<http://icu.sourceforge.net/userguide/Collate_Concepts.html#Comparison_Levels>



Re: Patch for collation using ICU

From
Stephan Szabo
Date:
On Sun, 27 Mar 2005, Palle Girgensohn wrote:

>
>
> --On l�rdag, mars 26, 2005 08.16.01 -0800 Stephan Szabo
> <sszabo@megazone.bigpanda.com> wrote:
>
> > On Sat, 26 Mar 2005, Palle Girgensohn wrote:
> >> I've noticed a couple of things about using the ICU patch vs. pristine
> >> pg-8.0.1:
> >>
> >> - ORDER BY is case insensitive when using ICU. This might break the SQL
> >> standard (?), but sure is nice :)
> >
> > Err, I think if your system implements strcoll correctly 8.0.1 can do this
> > if the chosen collation is set up that way (or at least naive tests I've
> > done seem to imply that). Or are you speaking about C locale?
>
> No, I doubt this.
>
> Example: set up a cluster:
> $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
> $ createdb foo
> CREATE DATABASE
> $ psql foo
> foo=# create table bar (val text);
> CREATE TABLE
> foo=# insert into bar values ('aaa');
> INSERT 18354409 1
> foo=# insert into bar values ('BBB');
> INSERT 18354412 1
> foo=# select val from bar order by val;
>  val
> -----
>  BBB
>  aaa
> (2 rows)
>
>
> Order by is not case insensitive. It shouldn't be for any system, AFAIK. As

It is on my machine... for the same test:

foo=# select val from bar order by val;val
-----aaaBBB
(2 rows)

I think this just implies even greater breakage of either the collation or
strcoll on the system you're trying on. ;)  Which, of course, is a fairly
reasonable reason to offer an alternative.  Especially if it's generically
useful.


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, mars 26, 2005 17.40.01 -0800 Stephan Szabo
<sszabo@megazone.bigpanda.com> wrote:

>
> On Sun, 27 Mar 2005, Palle Girgensohn wrote:
>
>>
>>
>> --On lördag, mars 26, 2005 08.16.01 -0800 Stephan Szabo
>> <sszabo@megazone.bigpanda.com> wrote:
>>
>> > On Sat, 26 Mar 2005, Palle Girgensohn wrote:
>> >> I've noticed a couple of things about using the ICU patch vs. pristine
>> >> pg-8.0.1:
>> >>
>> >> - ORDER BY is case insensitive when using ICU. This might break the
>> >> SQL standard (?), but sure is nice :)
>> >
>> > Err, I think if your system implements strcoll correctly 8.0.1 can do
>> > this if the chosen collation is set up that way (or at least naive
>> > tests I've done seem to imply that). Or are you speaking about C
>> > locale?
>>
>> No, I doubt this.
>>
>> Example: set up a cluster:
>> $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
>> $ createdb foo
>> CREATE DATABASE
>> $ psql foo
>> foo=# create table bar (val text);
>> CREATE TABLE
>> foo=# insert into bar values ('aaa');
>> INSERT 18354409 1
>> foo=# insert into bar values ('BBB');
>> INSERT 18354412 1
>> foo=# select val from bar order by val;
>>  val
>> -----
>>  BBB
>>  aaa
>> (2 rows)
>>
>>
>> Order by is not case insensitive. It shouldn't be for any system, AFAIK.
>> As
>
> It is on my machine... for the same test:
>
> foo=# select val from bar order by val;
>  val
> -----
>  aaa
>  BBB
> (2 rows)
>
> I think this just implies even greater breakage of either the collation or
> strcoll on the system you're trying on. ;)  Which, of course, is a fairly
> reasonable reason to offer an alternative.  Especially if it's generically
> useful.

Interesting! Indeed, just tried on an old Linux Redhat system... BTW,
that's pretty odd for a unix system. "ls -l" sorts aaa before BBB, I've
never seen the likes of it! Call me old fashion if you like ;-)

Still, as you say, FreeBSD does it capital letters first, and does not
handle unicode locales' collation, so I need an alternative. Perhaps the
best way would be to inject ICU into BSD instead :-)

/Palle




Re: Patch for collation using ICU

From
"Magnus Hagander"
Date:
>As for general collation of unicode, the reason for me to use
>ICU is that
>my system does not support strcoll correctly for multibyte
>locales, as I
>mentioned earlier. I also noted that even for systems that do handle
>strcoll correctly for unicode, ICU claims to be a couple of magnitudes
>faster, so this patch might be useful for other systems (read
>Linux) as
>well. See previous emails for details.

Does it work on win32? If so, perhaps it could help with the lack of
UTF-8 strcoll there?

//Magnus


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On söndag, mars 27, 2005 20.11.48 +0200 Magnus Hagander
<mha@sollentuna.net> wrote:

>> As for general collation of unicode, the reason for me to use
>> ICU is that
>> my system does not support strcoll correctly for multibyte
>> locales, as I
>> mentioned earlier. I also noted that even for systems that do handle
>> strcoll correctly for unicode, ICU claims to be a couple of magnitudes
>> faster, so this patch might be useful for other systems (read
>> Linux) as
>> well. See previous emails for details.
>
> Does it work on win32? If so, perhaps it could help with the lack of
> UTF-8 strcoll there?

Hej Magnus!

Well, ICU exists for Windows. If you configure postgresql's source to link
with ICU on Windows, it would probably work. Please try it out.

/Palle



Re: Patch for collation using ICU

From
Hannu Krosing
Date:
On L, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote:

> Hi!
> 
...
> I've noticed a couple of things about using the ICU patch vs. pristine 
> pg-8.0.1:
> 
> - ORDER BY is case insensitive when using ICU. This might break the SQL 
> standard (?), but sure is nice :)

How does your patch interact with the ability to use indexes for
anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ?

-- 
Hannu Krosing <hannu@tm.ee>


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On söndag, mars 27, 2005 04.34.03 +0300 Hannu Krosing <hannu@tm.ee> wrote:

> On L, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote:
>
>> Hi!
>>
> ...
>> I've noticed a couple of things about using the ICU patch vs. pristine
>> pg-8.0.1:
>>
>> - ORDER BY is case insensitive when using ICU. This might break the SQL
>> standard (?), but sure is nice :)

Just a comment: ORDER BY *is* already case sensitive on Linux, since its
strcoll ignores case. I doubt very much it violates SQL standards.


> How does your patch interact with the ability to use indexes for
> anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ?

I don't think it matters. You still need to use the special non-locale
index functions described in the handbook to get anchored like queries use
indices. My ICU patch does not alter this. ICU is "injected" where strings
are compared and the database cluster was initialized with a multibyte
character encoding.

The problem, AFAIK, has to do with the nature of (some) locales, not with a
specific implementation of collation.

Regards,
Palle



Re: Patch for collation using ICU

From
Peter Eisentraut
Date:
Andrew Dunstan wrote:
> The database cluster will be initialized with locale en_US.UTF-8.
> initdb: could not find suitable encoding for locale "en_US.UTF-8"

What does

$ LC_ALL=en_US.UTF-8 locale charmap

show?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Patch for collation using ICU

From
Peter Eisentraut
Date:
Palle Girgensohn wrote:
> Just a comment: ORDER BY *is* already case sensitive on Linux, since
> its strcoll ignores case. I doubt very much it violates SQL
> standards.

The behavior of collation sequences is implementation-defined.  So as 
long as you can put the behavior in words, it should be OK.

It would seem, however, that the behavior of a certain locale name 
should be the same with or without ICU, so perhaps some locale renaming 
might be needed, but that is speculation on my part.

> > How does your patch interact with the ability to use indexes for
> > anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use
> > index) ?

> The problem, AFAIK, has to do with the nature of (some) locales, not
> with a specific implementation of collation.

Yeah, pretty much the whole point of that code is to avoid collating 
stuff.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Patch for collation using ICU

From
Andrew Dunstan
Date:

Peter Eisentraut wrote:

>Andrew Dunstan wrote:
>  
>
>>The database cluster will be initialized with locale en_US.UTF-8.
>>initdb: could not find suitable encoding for locale "en_US.UTF-8"
>>    
>>
>
>What does
>
>$ LC_ALL=en_US.UTF-8 locale charmap
>
>show?
>
>  
>

[andrew pgsql]$ LC_ALL=en_US.UTF-8 locale charmap
UTF-8
[andrew pgsql]$

cheers

andrew


Re: Patch for collation using ICU

From
Peter Eisentraut
Date:
Andrew Dunstan wrote:
> [andrew pgsql]$ LC_ALL=en_US.UTF-8 locale charmap
> UTF-8

That seems normal.  Time to get out the debugger, I suppose.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Patch for collation using ICU

From
Bruce Momjian
Date:
Is this patch ready for application?
http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.diff.gz

The web site is:
http://people.freebsd.org/~girgen/postgresql-icu/readme.html

I do have a few questions:

Why don't you use the lc_ctype_is_c() part of this test?
    if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Why is so much code added, for example, in lower()?  The existing
multibyte code is much smaller, and lots of code is added in other
places too.

Why do you need to add a mapping of encoding names from iana to our
names?

---------------------------------------------------------------------------

Palle Girgensohn wrote:
> Hi!
> 
> I've put together a patch for using IBM's ICU package for collation.
> 
> If your OS does not have full support for collation ur uppercase/lowercase 
> in multibyte locales, this might be useful. If you are using a multibyte 
> character encoding in your database and want collation, i.e. order by, and 
> also lower(), upper() and initcap() to work properly, this patch will do 
> just that.
> 
> This patch is needed for FreeBSD, since this OS has no support for 
> collation of for example unicode locales (that is, wcscoll(3) does not do 
> what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the 
> patch is *not* necessary for Linux, although IBM claims ICU collation to be 
> about twice as fast as glibc for simple western locales.
> 
> It adds a configure switch, `--with-icu', which will set up the code to use 
> ICU instead of wchar_t and wcscoll.
> 
> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it 
> seems to run well. I've not had the time to do any comparative performance 
> tests yet, but it seems it is at least not slower than using LATIN1 with 
> sv_SE.ISO8859-1 locale, perhaps even faster.
> 
> I'd be delighted if some more experienced postgresql hackers would review 
> this stuff. The patch is pretty compact, so it's fast reading :)  I'm 
> planning to add this patch as an option (tagged "experimental") to 
> FreeBSD's postgresql port. Any ideas about whether this is a good idea or 
> not?
> 
> Any thoughts or ideas are welcome!
> 
> Cheers,
> Palle
> 
> Patch at:
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>
> 
> ICU at sourceforge: <http://icu.sf.net/>
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Is this patch ready for application?

Not until ICU is released under a BSD license ...
        regards, tom lane


Re: Patch for collation using ICU

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Is this patch ready for application?
> 
> Not until ICU is released under a BSD license ...

Well, readline isn't BSD either, but we use it.  It is any different?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
Andrew - Supernews
Date:
On 2005-05-07, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> Tom Lane wrote:
>> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> > Is this patch ready for application?
>> 
>> Not until ICU is released under a BSD license ...
>
> Well, readline isn't BSD either, but we use it.  It is any different?

ICU appears to be under the X license, which is no more restrictive than
BSD-with-no-advertising-clause.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Patch for collation using ICU

From
"John Hansen"
Date:
> Why do you need to add a mapping of encoding names from iana
> to our names?
>

The pg encoding names are not recognized by ICU, hence the mappings....

Install ICU 3.2 on your system, and run uconv -l, that will give you a
list of valid ICU encoding names.

... John


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Btw,

Does it feel right to have pg depend on the bleeding edge version of
ICU?
On many distro's, even gentoo (known for being bleeding edge) 2.8 is
still the default.


2.8 and 3.2 are however incompatible, and supporting both, would bloat
the source somewhat.

... John



Re: Patch for collation using ICU

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Not until ICU is released under a BSD license ...

> Well, readline isn't BSD either, but we use it.  It is any different?

Did you read the license?  Some of the more troubling bits:

: It is the understanding of INTERNATIONAL BUSINESS MACHINES CORPORATION
: that the purpose for which its publications are being reproduced is
: accurate and true as stated in your attached request.

(er, which attached request would that be?)

: Permission to quote from or reprint IBM publications is limited to the
: purpose and quantities originally requested and must not be construed as
: a blanket license to use the material for other purposes or to reprint
: other IBM copyrighted material.

: IBM reserves the right to withdraw permission to reproduce copyrighted
: material whenever, in its discretion, it feels that the privilege of
: reproducing its material is being used in a way detrimental to its
: interest or the above instructions are not being followed properly to
: protect its copyright.

: IBM may have patents or pending patent applications covering subject
: matter in this document. The furnishing of this document does not give
: you any license to these patents. You can send license inquiries, in
: writing, to:

: For license inquiries regarding double-byte (DBCS) information, contact
: the IBM Intellectual Property Department in your country or send
: inquiries, in writing, to:
        regards, tom lane


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Where'd you get the licence from?
None of that is in the licence I'm reading!

(http://www-306.ibm.com/software/globalization/icu/index.jsp)
(http://www-306.ibm.com/software/globalization/icu/license.jsp)

... John

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Saturday, May 07, 2005 3:17 PM
> To: Bruce Momjian
> Cc: Palle Girgensohn; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Not until ICU is released under a BSD license ...
>
> > Well, readline isn't BSD either, but we use it.  It is any
> different?
>
> Did you read the license?  Some of the more troubling bits:
>
> : It is the understanding of INTERNATIONAL BUSINESS MACHINES
> CORPORATION
> : that the purpose for which its publications are being reproduced is
> : accurate and true as stated in your attached request.
>
> (er, which attached request would that be?)
>
> : Permission to quote from or reprint IBM publications is
> limited to the
> : purpose and quantities originally requested and must not be
> construed as
> : a blanket license to use the material for other purposes or
> to reprint
> : other IBM copyrighted material.
>
> : IBM reserves the right to withdraw permission to reproduce
> copyrighted
> : material whenever, in its discretion, it feels that the privilege of
> : reproducing its material is being used in a way detrimental to its
> : interest or the above instructions are not being followed
> properly to
> : protect its copyright.
>
> : IBM may have patents or pending patent applications covering subject
> : matter in this document. The furnishing of this document
> does not give
> : you any license to these patents. You can send license inquiries, in
> : writing, to:
>
> : For license inquiries regarding double-byte (DBCS)
> information, contact
> : the IBM Intellectual Property Department in your country or send
> : inquiries, in writing, to:
>
>             regards, tom lane
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to
> majordomo@postgresql.org)
>
>


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On fredag, maj 06, 2005 23.31.20 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Is this patch ready for application?
>
> Not until ICU is released under a BSD license ...

It's not GPL anyway. Seems pretty much like the BSD license, at least more 
BSD-ish than GPL-ish.

<http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html>

/Palle



Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On fredag, maj 06, 2005 22.57.59 -0400 Bruce Momjian 
<pgman@candle.pha.pa.us> wrote:

>
> Is this patch ready for application?
>
>     http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.d
> iff.gz
>
> The web site is:
>
>     http://people.freebsd.org/~girgen/postgresql-icu/readme.html

I don't think so, not quite. I have not had any positive reports from linux 
users, this is only tested in a FreeBSD environment. I'd say it needs some 
more testing.

Also, apparently, ICU is installed by default in many linux distributions, 
and usually it is version 2.8. Some linux users have asked me if there are 
plans for a patch that works with ICU 2.8. That's probably a good idea. IBM 
and the ICU folks seem to consider 3.2 to be the stable version, older 
versions are hard to find on their sites, but most linux distributers seem 
to consider it too bleeding edge, even gentoo. I don't know why they don't 
agree.

> I do have a few questions:
>
> Why don't you use the lc_ctype_is_c() part of this test?
>
>      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Um, well, I didn't think about that. :)  What would be the locale in this 
case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide 
encoding, indeed. Then the strings will be handled like byte-wide chars. 
Yeah, it's a bug. I'll fix it! Thanks.

> Why is so much code added, for example, in lower()?  The existing
> multibyte code is much smaller, and lots of code is added in other
> places too.

ICU uses UTF-16 internally, so all strings must be converted from the 
database encoding to UTF-16. Since that means the strings need to be 
copied, I took the same approach as in varlena.c:varstr_cmp(), where small 
strings use the heap and only larger strings use a palloc. Comments in 
varstr_cmp about performance made me use that approach.

Also, in the latest patch, I also added checks and logging for *every* 
status returned from ICU. I hope this will help debugging on debian, where 
previous version didn't work. That excessive status checking is hardly be 
necessary once the stuff is better tested.

I think the string copying and heap/palloc choices stands for most of the 
code bloat, together with the excessive status checking and logging.


> Why do you need to add a mapping of encoding names from iana to our
> names?

This was already answered by John Hansen... There's an old thread here 
about the choice of the name "UNICODE" to describe an encoding, which it 
doesn't. There's half a dozen unicode based encodings... UTF-8 is used by 
postgresql, that would have been a better name... Similarly for most other 
encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL 
can't provide it, so a mapping table is required.

I use this patch in production on one FreeBSD 4.10 server at the moment. 
With the latest version, I've had no problems. Logging is swithed on for 
now, and it shows no signs of ICU complaining. I'd like more reports on 
Linux, though.

/Palle

>
> -------------------------------------------------------------------------
> --
>
> Palle Girgensohn wrote:
>> Hi!
>>
>> I've put together a patch for using IBM's ICU package for collation.
>>
>> If your OS does not have full support for collation ur
>> uppercase/lowercase  in multibyte locales, this might be useful. If you
>> are using a multibyte  character encoding in your database and want
>> collation, i.e. order by, and  also lower(), upper() and initcap() to
>> work properly, this patch will do  just that.
>>
>> This patch is needed for FreeBSD, since this OS has no support for
>> collation of for example unicode locales (that is, wcscoll(3) does not
>> do  what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK
>> the  patch is *not* necessary for Linux, although IBM claims ICU
>> collation to be  about twice as fast as glibc for simple western locales.
>>
>> It adds a configure switch, `--with-icu', which will set up the code to
>> use  ICU instead of wchar_t and wcscoll.
>>
>> This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
>> seems to run well. I've not had the time to do any comparative
>> performance  tests yet, but it seems it is at least not slower than
>> using LATIN1 with  sv_SE.ISO8859-1 locale, perhaps even faster.
>>
>> I'd be delighted if some more experienced postgresql hackers would
>> review  this stuff. The patch is pretty compact, so it's fast reading :)
>> I'm  planning to add this patch as an option (tagged "experimental") to
>> FreeBSD's postgresql port. Any ideas about whether this is a good idea
>> or  not?
>>
>> Any thoughts or ideas are welcome!
>>
>> Cheers,
>> Palle
>>
>> Patch at:
>> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.
>> diff>
>>
>> ICU at sourceforge: <http://icu.sf.net/>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 7: don't forget to increase your free space map settings
>>
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania
> 19073






Re: Patch for collation using ICU

From
"John Hansen"
Date:
>
> I use this patch in production on one FreeBSD 4.10 server at
> the moment.
> With the latest version, I've had no problems. Logging is
> swithed on for
> now, and it shows no signs of ICU complaining. I'd like more
> reports on
> Linux, though.

I currently use this on gentoo with ICU3.2 unmasked.

Works a dream, even with locale C and UNICODE database.

Small test:

createdb --encoding UNICODE --locale C test
psql test
set client_encoding=iso88591;
CREATE TABLE test (t text);
INSERT INTO test (t) VALUES ('æøå');
set client_encoding=unicode;
INSERT INTO test (t) SELECT upper(t) FROM test;
set client_encoding=iso88591;
SELECT * FROM test; t
-----æøåÆØÅ
(2 rows)

Just as I'd expect, as upper/lower/initcap are locale independent for these characters.



Re: Patch for collation using ICU

From
Bruce Momjian
Date:
Palle Girgensohn wrote:
> >
> > Is this patch ready for application?
> 
> I don't think so, not quite. I have not had any positive reports from linux 
> users, this is only tested in a FreeBSD environment. I'd say it needs some 
> more testing.

OK.

> Also, apparently, ICU is installed by default in many linux distributions, 
> and usually it is version 2.8. Some linux users have asked me if there are 
> plans for a patch that works with ICU 2.8. That's probably a good idea. IBM 
> and the ICU folks seem to consider 3.2 to be the stable version, older 
> versions are hard to find on their sites, but most linux distributers seem 
> to consider it too bleeding edge, even gentoo. I don't know why they don't 
> agree.

Good point.  Why would linux folks need ICU?  Doesn't their OS support
encodings natively?  I am particularly excited about this for OSs that
don't have such encodings, like UTF8 support for Win32.

Because ICU will not be used unless enabled by configure, it seems we
are fine with only supporting the newest version.  Do Linux users need
to use ICU for any reason?

> > I do have a few questions:
> >
> > Why don't you use the lc_ctype_is_c() part of this test?
> >
> >      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> 
> Um, well, I didn't think about that. :)  What would be the locale in this 
> case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide 
> encoding, indeed. Then the strings will be handled like byte-wide chars. 
> Yeah, it's a bug. I'll fix it! Thanks.

The additional test is more of an optmization, and it fixes a problem
with some OSs that have processing problems with UTF8 when the locale is
supposed to be turned off, like in "C".  I realize ICU might be fine
with it but the optimization still is an issue.

> > Why is so much code added, for example, in lower()?  The existing
> > multibyte code is much smaller, and lots of code is added in other
> > places too.
> 
> ICU uses UTF-16 internally, so all strings must be converted from the 
> database encoding to UTF-16. Since that means the strings need to be 
> copied, I took the same approach as in varlena.c:varstr_cmp(), where small 
> strings use the heap and only larger strings use a palloc. Comments in 
> varstr_cmp about performance made me use that approach.

Oh, interesting.   I think you need to create new functions that
factor out that common code so the patch is smaller and easier to
maintain.

> Also, in the latest patch, I also added checks and logging for *every* 
> status returned from ICU. I hope this will help debugging on debian, where 
> previous version didn't work. That excessive status checking is hardly be 
> necessary once the stuff is better tested.
> 
> I think the string copying and heap/palloc choices stands for most of the 
> code bloat, together with the excessive status checking and logging.

OK, move that into some common functions and I think it will be better.

> > Why do you need to add a mapping of encoding names from iana to our
> > names?
> 
> This was already answered by John Hansen... There's an old thread here 
> about the choice of the name "UNICODE" to describe an encoding, which it 
> doesn't. There's half a dozen unicode based encodings... UTF-8 is used by 
> postgresql, that would have been a better name... Similarly for most other 
> encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL 
> can't provide it, so a mapping table is required.

We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
help?

> I use this patch in production on one FreeBSD 4.10 server at the moment. 
> With the latest version, I've had no problems. Logging is swithed on for 
> now, and it shows no signs of ICU complaining. I'd like more reports on 
> Linux, though.

OK, I certainly would like this all done for 8.1 which should have
feature freeze on July 1.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Errm,... initdb --encoding UNICODE --locale C

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen
> Sent: Saturday, May 07, 2005 10:23 PM
> To: Palle Girgensohn; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> >
> > I use this patch in production on one FreeBSD 4.10 server at the
> > moment.
> > With the latest version, I've had no problems. Logging is
> swithed on
> > for now, and it shows no signs of ICU complaining. I'd like more
> > reports on Linux, though.
>
> I currently use this on gentoo with ICU3.2 unmasked.
>
> Works a dream, even with locale C and UNICODE database.
>
> Small test:
>
> createdb --encoding UNICODE --locale C test psql test set
> client_encoding=iso88591; CREATE TABLE test (t text); INSERT
> INTO test (t) VALUES ('æøå'); set client_encoding=unicode;
> INSERT INTO test (t) SELECT upper(t) FROM test; set
> client_encoding=iso88591; SELECT * FROM test;
>   t
> -----
>  æøå
>  ÆØÅ
> (2 rows)
>
> Just as I'd expect, as upper/lower/initcap are locale
> independent for these characters.
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>
>


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:
--On lördag, maj 07, 2005 22.53.46 +1000 John Hansen <john@geeknet.com.au>
wrote:

> Errm,... initdb --encoding UNICODE --locale C

You mean that ICU *shall* be used even for the C locale, and not as Bruce
suggested here:

>> I do have a few questions:
>>
>> Why don't you use the lc_ctype_is_c() part of this test?
>>
>>      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>
> Um, well, I didn't think about that. :)  What would be the locale in this
> case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide
> encoding, indeed. Then the strings will be handled like byte-wide chars.
> Yeah, it's a bug. I'll fix it! Thanks.

John disagrees here, and I'm obliged to agree. Using the C locale, one will
expect C collation, but upper/lower is better off still using ICU. Hence,
the above stuff is *not* a bug. Do we agree?

/Palle


>
>> -----Original Message-----
>> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen
>> Sent: Saturday, May 07, 2005 10:23 PM
>> To: Palle Girgensohn; Bruce Momjian
>> Cc: pgsql-hackers@postgresql.org
>> Subject: Re: [HACKERS] Patch for collation using ICU
>>
>> >
>> > I use this patch in production on one FreeBSD 4.10 server at the
>> > moment.
>> > With the latest version, I've had no problems. Logging is
>> swithed on
>> > for now, and it shows no signs of ICU complaining. I'd like more
>> > reports on Linux, though.
>>
>> I currently use this on gentoo with ICU3.2 unmasked.
>>
>> Works a dream, even with locale C and UNICODE database.
>>
>> Small test:
>>
>> createdb --encoding UNICODE --locale C test psql test set
>> client_encoding=iso88591; CREATE TABLE test (t text); INSERT
>> INTO test (t) VALUES ('æøå'); set client_encoding=unicode;
>> INSERT INTO test (t) SELECT upper(t) FROM test; set
>> client_encoding=iso88591; SELECT * FROM test;
>>   t
>> -----
>>  æøå
>>  ÆØÅ
>> (2 rows)
>>
>> Just as I'd expect, as upper/lower/initcap are locale
>> independent for these characters.
>>
>>
>> ---------------------------(end of
>> broadcast)---------------------------
>> TIP 5: Have you checked our extensive FAQ?
>>
>>                http://www.postgresql.org/docs/faq
>>
>>






Re: Patch for collation using ICU

From
"John Hansen"
Date:
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > >
> > > Is this patch ready for application?
> >
> > I don't think so, not quite. I have not had any positive
> reports from
> > linux users, this is only tested in a FreeBSD environment.
> I'd say it
> > needs some more testing.
>
> OK.
>
> > Also, apparently, ICU is installed by default in many linux
> > distributions, and usually it is version 2.8. Some linux users have
> > asked me if there are plans for a patch that works with ICU 2.8.
> > That's probably a good idea. IBM and the ICU folks seem to consider
> > 3.2 to be the stable version, older versions are hard to
> find on their
> > sites, but most linux distributers seem to consider it too bleeding
> > edge, even gentoo. I don't know why they don't agree.
>
> Good point.  Why would linux folks need ICU?  Doesn't their
> OS support encodings natively?  I am particularly excited
> about this for OSs that don't have such encodings, like UTF8
> support for Win32.
>
> Because ICU will not be used unless enabled by configure, it
> seems we are fine with only supporting the newest version.
> Do Linux users need to use ICU for any reason?

Yes, because on many linux platforms locale support is broken.
Also, ICU enables full unicode support, particularly in multi-language
situations where locale is C, and makes upper/lower/initcap work as
expected, except where it depends on locale information.

There are also many other useful things in ICU that could be
implemented. Transliteration, and break-iterators for example.
Break-iteration particularly interresting for converting a text to a
list of words. Another is it's builtin substring searches.

>
> > > I do have a few questions:
> > >
> > > Why don't you use the lc_ctype_is_c() part of this test?
> > >
> > >      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> >
> > Um, well, I didn't think about that. :)  What would be the
> locale in
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
> CTYPE=C and use
> > a wide encoding, indeed. Then the strings will be handled
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optmization, and it fixes a
> problem with some OSs that have processing problems with UTF8
> when the locale is supposed to be turned off, like in "C".  I
> realize ICU might be fine with it but the optimization still
> is an issue.

That the locale is supposed to be turned off, doesn't mean it shouldn't
use ICU.
ICU is more than just locales.

> > > Why is so much code added, for example, in lower()?  The existing
> > > multibyte code is much smaller, and lots of code is added
> in other
> > > places too.
> >
> > ICU uses UTF-16 internally, so all strings must be
> converted from the
> > database encoding to UTF-16. Since that means the strings
> need to be
> > copied, I took the same approach as in
> varlena.c:varstr_cmp(), where
> > small strings use the heap and only larger strings use a palloc.
> > Comments in varstr_cmp about performance made me use that approach.
>
> Oh, interesting.   I think you need to create new functions that
> factor out that common code so the patch is smaller and
> easier to maintain.
>
> > Also, in the latest patch, I also added checks and logging
> for *every*
> > status returned from ICU. I hope this will help debugging
> on debian,
> > where previous version didn't work. That excessive status
> checking is
> > hardly be necessary once the stuff is better tested.
> >
> > I think the string copying and heap/palloc choices stands
> for most of
> > the code bloat, together with the excessive status checking
> and logging.
>
> OK, move that into some common functions and I think it will
> be better.
>
> > > Why do you need to add a mapping of encoding names from
> iana to our
> > > names?
> >
> > This was already answered by John Hansen... There's an old
> thread here
> > about the choice of the name "UNICODE" to describe an
> encoding, which
> > it doesn't. There's half a dozen unicode based encodings...
> UTF-8 is
> > used by postgresql, that would have been a better name... Similarly
> > for most other encodings, really. ICU expect a setlocale(3) string
> > (i.e. IANA). PostgreSQL can't provide it, so a mapping
> table is required.
>
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).
>  Does that help?
>
> > I use this patch in production on one FreeBSD 4.10 server
> at the moment.
> > With the latest version, I've had no problems. Logging is
> swithed on
> > for now, and it shows no signs of ICU complaining. I'd like more
> > reports on Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should
> have feature freeze on July 1.
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square,
> Pennsylvania 19073
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
>


Re: Patch for collation using ICU

From
"John Hansen"
Date:
> --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen
> <john@geeknet.com.au>
> wrote:
>
> > Errm,... initdb --encoding UNICODE --locale C
>
> You mean that ICU *shall* be used even for the C locale, and
> not as Bruce suggested here:

Yes, that's exactly what I mean.

>
> >> I do have a few questions:
> >>
> >> Why don't you use the lc_ctype_is_c() part of this test?
> >>
> >>      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> >
> > Um, well, I didn't think about that. :)  What would be the
> locale in
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
> CTYPE=C and use
> > a wide encoding, indeed. Then the strings will be handled
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
>
> John disagrees here, and I'm obliged to agree. Using the C
> locale, one will expect C collation, but upper/lower is
> better off still using ICU. Hence, the above stuff is *not* a
> bug. Do we agree?
>
> /Palle
>
>
> >
> >> -----Original Message-----
> >> From: pgsql-hackers-owner@postgresql.org
> >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
> John Hansen
> >> Sent: Saturday, May 07, 2005 10:23 PM
> >> To: Palle Girgensohn; Bruce Momjian
> >> Cc: pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Patch for collation using ICU
> >>
> >> >
> >> > I use this patch in production on one FreeBSD 4.10 server at the
> >> > moment.
> >> > With the latest version, I've had no problems. Logging is
> >> swithed on
> >> > for now, and it shows no signs of ICU complaining. I'd like more
> >> > reports on Linux, though.
> >>
> >> I currently use this on gentoo with ICU3.2 unmasked.
> >>
> >> Works a dream, even with locale C and UNICODE database.
> >>
> >> Small test:
> >>
> >> createdb --encoding UNICODE --locale C test psql test set
> >> client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO
> >> test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO
> >> test (t) SELECT upper(t) FROM test; set client_encoding=iso88591;
> >> SELECT * FROM test;
> >>   t
> >> -----
> >>  æøå
> >>  ÆØÅ
> >> (2 rows)
> >>
> >> Just as I'd expect, as upper/lower/initcap are locale
> independent for
> >> these characters.
> >>
> >>
> >> ---------------------------(end of
> >> broadcast)---------------------------
> >> TIP 5: Have you checked our extensive FAQ?
> >>
> >>                http://www.postgresql.org/docs/faq
> >>
> >>
>
>
>
>
>
>


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls
toICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own
conversiontables. 

ICU also has a fair few conversions that we do not have at present.

Any thoughts?

... John

> -----Original Message-----
> From: John Hansen
> Sent: Saturday, May 07, 2005 11:09 PM
> To: 'Palle Girgensohn'; 'Bruce Momjian'
> Cc: 'pgsql-hackers@postgresql.org'
> Subject: RE: [HACKERS] Patch for collation using ICU
>
> > --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen
> > <john@geeknet.com.au>
> > wrote:
> >
> > > Errm,... initdb --encoding UNICODE --locale C
> >
> > You mean that ICU *shall* be used even for the C locale, and not as
> > Bruce suggested here:
>
> Yes, that's exactly what I mean.
>
> >
> > >> I do have a few questions:
> > >>
> > >> Why don't you use the lc_ctype_is_c() part of this test?
> > >>
> > >>      if (pg_database_encoding_max_length() > 1 &&
> !lc_ctype_is_c())
> > >
> > > Um, well, I didn't think about that. :)  What would be the
> > locale in
> > > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
> > CTYPE=C and use
> > > a wide encoding, indeed. Then the strings will be handled
> > like byte-wide chars.
> > > Yeah, it's a bug. I'll fix it! Thanks.
> >
> > John disagrees here, and I'm obliged to agree. Using the C
> locale, one
> > will expect C collation, but upper/lower is better off still using
> > ICU. Hence, the above stuff is *not* a bug. Do we agree?
> >
> > /Palle
> >
> >
> > >
> > >> -----Original Message-----
> > >> From: pgsql-hackers-owner@postgresql.org
> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
> > John Hansen
> > >> Sent: Saturday, May 07, 2005 10:23 PM
> > >> To: Palle Girgensohn; Bruce Momjian
> > >> Cc: pgsql-hackers@postgresql.org
> > >> Subject: Re: [HACKERS] Patch for collation using ICU
> > >>
> > >> >
> > >> > I use this patch in production on one FreeBSD 4.10
> server at the
> > >> > moment.
> > >> > With the latest version, I've had no problems. Logging is
> > >> swithed on
> > >> > for now, and it shows no signs of ICU complaining. I'd
> like more
> > >> > reports on Linux, though.
> > >>
> > >> I currently use this on gentoo with ICU3.2 unmasked.
> > >>
> > >> Works a dream, even with locale C and UNICODE database.
> > >>
> > >> Small test:
> > >>
> > >> createdb --encoding UNICODE --locale C test psql test set
> > >> client_encoding=iso88591; CREATE TABLE test (t text);
> INSERT INTO
> > >> test (t) VALUES ('æøå'); set client_encoding=unicode;
> INSERT INTO
> > >> test (t) SELECT upper(t) FROM test; set
> client_encoding=iso88591;
> > >> SELECT * FROM test;
> > >>   t
> > >> -----
> > >>  æøå
> > >>  ÆØÅ
> > >> (2 rows)
> > >>
> > >> Just as I'd expect, as upper/lower/initcap are locale
> > independent for
> > >> these characters.
> > >>
> > >>
> > >> ---------------------------(end of
> > >> broadcast)---------------------------
> > >> TIP 5: Have you checked our extensive FAQ?
> > >>
> > >>                http://www.postgresql.org/docs/faq
> > >>
> > >>
> >
> >
> >
> >
> >
> >


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

> Palle Girgensohn wrote:
>> >
>> > Is this patch ready for application?
>>
>> I don't think so, not quite. I have not had any positive reports from
>> linux  users, this is only tested in a FreeBSD environment. I'd say it
>> needs some  more testing.
>
> OK.

John Hansen just reported that it does work on linux. fine!


>> Also, apparently, ICU is installed by default in many linux
>> distributions,  and usually it is version 2.8. Some linux users have
>> asked me if there are  plans for a patch that works with ICU 2.8. That's
>> probably a good idea. IBM  and the ICU folks seem to consider 3.2 to be
>> the stable version, older  versions are hard to find on their sites, but
>> most linux distributers seem  to consider it too bleeding edge, even
>> gentoo. I don't know why they don't  agree.
>
> Good point.  Why would linux folks need ICU?  Doesn't their OS support
> encodings natively?  I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version.  Do Linux users need
> to use ICU for any reason?


There are corner cases where it is impossible to upper/lowercase one
character at the time. for example:

-- without ICUselect upper('Eßer');upper
-------EßER
(1 row)

-- with ICU
select upper('Eßer');upper
-------ESSER
(1 rad)

This is because in the standard postgres implementation, upper/lower is
done one character at the time. A proper upper/lower cannot do it that way.
Other known example is in Turkish, where an Ì (?) should look different
whether it is an initial letter or not. This fails in standard postgresql
for all platforms.

>> > I do have a few questions:
>> >
>> > Why don't you use the lc_ctype_is_c() part of this test?
>> >
>> >      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>>
>> Um, well, I didn't think about that. :)  What would be the locale in
>> this  case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a
>> wide  encoding, indeed. Then the strings will be handled like byte-wide
>> chars.  Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optmization, and it fixes a problem
> with some OSs that have processing problems with UTF8 when the locale is
> supposed to be turned off, like in "C".  I realize ICU might be fine
> with it but the optimization still is an issue.

Well, the results are quite different, depending on whether ICU is used or
not. See separate mail.


>> > Why is so much code added, for example, in lower()?  The existing
>> > multibyte code is much smaller, and lots of code is added in other
>> > places too.
>>
>> ICU uses UTF-16 internally, so all strings must be converted from the
>> database encoding to UTF-16. Since that means the strings need to be
>> copied, I took the same approach as in varlena.c:varstr_cmp(), where
>> small  strings use the heap and only larger strings use a palloc.
>> Comments in  varstr_cmp about performance made me use that approach.
>
> Oh, interesting.   I think you need to create new functions that
> factor out that common code so the patch is smaller and easier to
> maintain.

Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me...


>> Also, in the latest patch, I also added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on debian,
>> where  previous version didn't work. That excessive status checking is
>> hardly be  necessary once the stuff is better tested.
>>
>> I think the string copying and heap/palloc choices stands for most of
>> the  code bloat, together with the excessive status checking and logging.
>
> OK, move that into some common functions and I think it will be better.

Best way for upper/lower/initcap is probably to use a function pointer...
uhh...


>> > Why do you need to add a mapping of encoding names from iana to our
>> > names?
>>
>> This was already answered by John Hansen... There's an old thread here
>> about the choice of the name "UNICODE" to describe an encoding, which it
>> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
>> by  postgresql, that would have been a better name... Similarly for most
>> other  encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
>> PostgreSQL  can't provide it, so a mapping table is required.
>
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
> help?

I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence only
utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
fogiving, I don't remember/know, but I think we need the mappings,
unfortunately.


>> I use this patch in production on one FreeBSD 4.10 server at the moment.
>> With the latest version, I've had no problems. Logging is swithed on for
>> now, and it shows no signs of ICU complaining. I'd like more reports on
>> Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should have
> feature freeze on July 1.

That shouldn't be a problem.

/Palle


>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania
> 19073






Re: Patch for collation using ICU

From
"John Hansen"
Date:
Palle Girgensohn wrote:
> I'm aware of that. It might help for unicode, but there are a
> bunch of
> other encodings. IANA has decided that utf-8 has *no*
> aliases, hence only
> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
> fogiving, I don't remember/know, but I think we need the mappings,
> unfortunately.
>

Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)

UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
UTF-16BE x-utf-16be ibm-1200 ibm-1201 ibm-5297 ibm-13488 ibm-17584
windows-1201 cp1200 cp1201 UTF16_BigEndian
UTF-16LE x-utf-16le ibm-1202 ibm-13490 ibm-17586 UTF16_LittleEndian
windows-1200
UTF-32 ISO-10646-UCS-4 csUCS4 ucs-4
UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233
UTF-32LE UTF32_LittleEndian ibm-1234
UTF16_PlatformEndian
UTF16_OppositeEndian
UTF32_PlatformEndian
UTF32_OppositeEndian
UTF-7 windows-65000
IMAP-mailbox-name
SCSU
BOCU-1 csBOCU-1
CESU-8
ISO-8859-1 ibm-819 IBM819 cp819 latin1 8859_1 csISOLatin1 iso-ir-100
ISO_8859-1:1987 l1 819
US-ASCII ASCII ANSI_X3.4-1968 ANSI_X3.4-1986 ISO_646.irv:1991
iso_646.irv:1983 ISO646-US us csASCII iso-ir-6 cp367 ascii7 646
windows-20127
gb18030 ibm-1392 windows-54936
ibm-367_P100-1995 ibm-367 IBM367
ibm-912_P100-1995 ibm-912 iso-8859-2 ISO_8859-2:1987 latin2 csISOLatin2
iso-ir-101 l2 8859_2 cp912 912 windows-28592
ibm-913_P100-2000 ibm-913 iso-8859-3 ISO_8859-3:1988 latin3 csISOLatin3
iso-ir-109 l3 8859_3 cp913 913 windows-28593
ibm-914_P100-1995 ibm-914 iso-8859-4 latin4 csISOLatin4 iso-ir-110
ISO_8859-4:1988 l4 8859_4 cp914 914 windows-28594
ibm-915_P100-1995 ibm-915 iso-8859-5 cyrillic csISOLatinCyrillic
iso-ir-144 ISO_8859-5:1988 8859_5 cp915 915 windows-28595
ibm-1089_P100-1995 ibm-1089 iso-8859-6 arabic csISOLatinArabic
iso-ir-127 ISO_8859-6:1987 ECMA-114 ASMO-708 8859_6 cp1089 1089
windows-28596 ISO-8859-6-I ISO-8859-6-E
ibm-813_P100-1995 ibm-813 iso-8859-7 greek greek8 ELOT_928 ECMA-118
csISOLatinGreek iso-ir-126 ISO_8859-7:1987 8859_7 cp813 813
windows-28597
ibm-916_P100-1995 ibm-916 iso-8859-8 hebrew csISOLatinHebrew iso-ir-138
ISO_8859-8:1988 ISO-8859-8-I ISO-8859-8-E 8859_8 cp916 916 windows-28598
ibm-920_P100-1995 ibm-920 iso-8859-9 latin5 csISOLatin5 iso-ir-148
ISO_8859-9:1989 l5 8859_9 cp920 920 windows-28599 ECMA-128
ibm-921_P100-1995 ibm-921 iso-8859-13 8859_13 cp921 921
ibm-923_P100-1998 ibm-923 iso-8859-15 Latin-9 l9 8859_15 latin0
csisolatin0 csisolatin9 iso8859_15_fdis cp923 923 windows-28605
ibm-942_P12A-1999 ibm-942 ibm-932 cp932 shift_jis78 sjis78
ibm-942_VSUB_VPUA ibm-932_VSUB_VPUA
ibm-943_P15A-2003 ibm-943 Shift_JIS MS_Kanji csShiftJIS windows-31j
csWindows31J x-sjis x-ms-cp932 cp932 windows-932 cp943c IBM-943C ms932
pck sjis ibm-943_VSUB_VPUA
ibm-943_P130-1999 ibm-943 Shift_JIS cp943 943 ibm-943_VASCII_VSUB_VPUA
ibm-33722_P12A-1999 ibm-33722 ibm-5050 EUC-JP
Extended_UNIX_Code_Packed_Format_for_Japanese csEUCPkdFmtJapanese
X-EUC-JP eucjis windows-51932 ibm-33722_VPUA IBM-eucJP
ibm-33722_P120-1999 ibm-33722 ibm-5050 cp33722 33722
ibm-33722_VASCII_VPUA
ibm-954_P101-2000 ibm-954 EUC-JP
ibm-1373_P100-2002 ibm-1373 windows-950
windows-950-2000 Big5 csBig5 windows-950 x-big5
ibm-950_P110-1999 ibm-950 cp950 950
macos-2566-10.2 Big5-HKSCS big5hk HKSCS-BIG5
ibm-1375_P100-2003 ibm-1375 Big5-HKSCS
ibm-1386_P100-2002 ibm-1386 cp1386 windows-936 ibm-1386_VSUB_VPUA
windows-936-2000 GBK CP936 MS936 windows-936
ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 EUC-CN ibm-eucCN hp15CN
cp1383 1383 ibm-1383_VPUA
ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280
gb2312-1980 GB2312.1980-0
ibm-964_P110-1999 ibm-964 EUC-TW ibm-eucTW cns11643 cp964 964
ibm-964_VPUA
ibm-949_P110-1999 ibm-949 cp949 949 ibm-949_VASCII_VSUB_VPUA
ibm-949_P11A-1999 ibm-949 cp949c ibm-949_VSUB_VPUA
ibm-970_P110-1995 ibm-970 EUC-KR KS_C_5601-1987 windows-51949 csEUCKR
ibm-eucKR KSC_5601 5601 ibm-970_VPUA
ibm-971_P100-1995 ibm-971 ibm-971_VPUA
ibm-1363_P11B-1998 ibm-1363 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 5601 cp1363 ksc windows-949
ibm-1363_VSUB_VPUA
ibm-1363_P110-1997 ibm-1363 ibm-1363_VASCII_VSUB_VPUA
windows-949-2000 windows-949 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 ms949
ibm-1162_P100-1999 ibm-1162
ibm-874_P100-1995 ibm-874 ibm-9066 cp874 TIS-620 tis620.2533 eucTH
cp9066
windows-874-2000 TIS-620 windows-874 MS874
ibm-437_P100-1995 ibm-437 IBM437 cp437 437 csPC8CodePage437 windows-437
ibm-850_P100-1995 ibm-850 IBM850 cp850 850 csPC850Multilingual
windows-850
ibm-851_P100-1995 ibm-851 IBM851 cp851 851 csPC851
ibm-852_P100-1995 ibm-852 IBM852 cp852 852 csPCp852 windows-852
ibm-855_P100-1995 ibm-855 IBM855 cp855 855 csIBM855 csPCp855
ibm-856_P100-1995 ibm-856 cp856 856
ibm-857_P100-1995 ibm-857 IBM857 cp857 857 csIBM857 windows-857
ibm-858_P100-1997 ibm-858 IBM00858 CCSID00858 CP00858
PC-Multilingual-850+euro cp858
ibm-860_P100-1995 ibm-860 IBM860 cp860 860 csIBM860
ibm-861_P100-1995 ibm-861 IBM861 cp861 861 cp-is csIBM861 windows-861
ibm-862_P100-1995 ibm-862 IBM862 cp862 862 csPC862LatinHebrew DOS-862
windows-862
ibm-863_P100-1995 ibm-863 IBM863 cp863 863 csIBM863
ibm-864_X110-1999 ibm-864 IBM864 cp864 csIBM864
ibm-865_P100-1995 ibm-865 IBM865 cp865 865 csIBM865
ibm-866_P100-1995 ibm-866 IBM866 cp866 866 csIBM866 windows-866
ibm-867_P100-1998 ibm-867 cp867
ibm-868_P100-1995 ibm-868 IBM868 CP868 868 csIBM868 cp-ar
ibm-869_P100-1995 ibm-869 IBM869 cp869 869 cp-gr csIBM869 windows-869
ibm-878_P100-1996 ibm-878 KOI8-R koi8 csKOI8R cp878
ibm-901_P100-1999 ibm-901
ibm-902_P100-1999 ibm-902
ibm-922_P100-1999 ibm-922 cp922 922
ibm-4909_P100-1999 ibm-4909
ibm-5346_P100-1998 ibm-5346 windows-1250 cp1250
ibm-5347_P100-1998 ibm-5347 windows-1251 cp1251
ibm-5348_P100-1997 ibm-5348 windows-1252 cp1252
ibm-5349_P100-1998 ibm-5349 windows-1253 cp1253
ibm-5350_P100-1998 ibm-5350 windows-1254 cp1254
ibm-9447_P100-2002 ibm-9447 windows-1255 cp1255
windows-1256-2000 windows-1256 cp1256
ibm-9449_P100-2002 ibm-9449 windows-1257 cp1257
ibm-5354_P100-1998 ibm-5354 windows-1258 cp1258
ibm-1250_P100-1995 ibm-1250 windows-1250
ibm-1251_P100-1995 ibm-1251 windows-1251
ibm-1252_P100-2000 ibm-1252 windows-1252
ibm-1253_P100-1995 ibm-1253 windows-1253
ibm-1254_P100-1995 ibm-1254 windows-1254
ibm-1255_P100-1995 ibm-1255
ibm-5351_P100-1998 ibm-5351 windows-1255
ibm-1256_P110-1997 ibm-1256
ibm-5352_P100-1998 ibm-5352 windows-1256
ibm-1257_P100-1995 ibm-1257
ibm-5353_P100-1998 ibm-5353 windows-1257
ibm-1258_P100-1997 ibm-1258 windows-1258
macos-0_2-10.2 macintosh mac csMacintosh windows-10000
macos-6-10.2 x-mac-greek windows-10006 macgr
macos-7_3-10.2 x-mac-cyrillic windows-10007 maccy
macos-29-10.2 x-mac-centraleurroman windows-10029 x-mac-ce macce
macos-35-10.2 x-mac-turkish windows-10081 mactr
ibm-1051_P100-1995 ibm-1051 hp-roman8 roman8 r8 csHPRoman8
ibm-1276_P100-1995 ibm-1276 Adobe-Standard-Encoding
csAdobeStandardEncoding
ibm-1277_P100-1995 ibm-1277 Adobe-Latin1-Encoding
ibm-1006_P100-1995 ibm-1006 cp1006 1006
ibm-1098_P100-1995 ibm-1098 cp1098 1098
ibm-1124_P100-1996 ibm-1124 cp1124 1124
ibm-1125_P100-1997 ibm-1125 cp1125
ibm-1129_P100-1997 ibm-1129
ibm-1131_P100-1997 ibm-1131 cp1131
ibm-1133_P100-1997 ibm-1133
ibm-1381_P110-1999 ibm-1381 cp1381 1381
ISO_2022,locale=ja,version=0 ISO-2022-JP csISO2022JP
ISO_2022,locale=ja,version=1 ISO-2022-JP-1 JIS JIS_Encoding
ISO_2022,locale=ja,version=2 ISO-2022-JP-2 csISO2022JP2
ISO_2022,locale=ja,version=3 JIS7 csJISEncoding
ISO_2022,locale=ja,version=4 JIS8
ISO_2022,locale=ko,version=0 ISO-2022-KR csISO2022KR
ISO_2022,locale=ko,version=1 ibm-25546
ISO_2022,locale=zh,version=0 ISO-2022-CN
ISO_2022,locale=zh,version=1 ISO-2022-CN-EXT
HZ HZ-GB-2312
ibm-897_P100-1995 ibm-897 JIS_X0201 X0201 csHalfWidthKatakana
ISCII,version=0 x-iscii-de windows-57002 iscii-dev
ISCII,version=1 x-iscii-be windows-57003 iscii-bng windows-57006
x-iscii-as
ISCII,version=2 x-iscii-pa windows-57011 iscii-gur
ISCII,version=3 x-iscii-gu windows-57010 iscii-guj
ISCII,version=4 x-iscii-or windows-57007 iscii-ori
ISCII,version=5 x-iscii-ta windows-57004 iscii-tml
ISCII,version=6 x-iscii-te windows-57005 iscii-tlg
ISCII,version=7 x-iscii-ka windows-57008 iscii-knd
ISCII,version=8 x-iscii-ma windows-57009 iscii-mlm
LMBCS-1 lmbcs
LMBCS-2
LMBCS-3
LMBCS-4
LMBCS-5
LMBCS-6
LMBCS-8
LMBCS-11
LMBCS-16
LMBCS-17
LMBCS-18
LMBCS-19
ibm-37_P100-1995 ibm-37 IBM037 ibm-037 ebcdic-cp-us ebcdic-cp-ca
ebcdic-cp-wt ebcdic-cp-nl csIBM037 cp037 037 cpibm37 cp37
ibm-273_P100-1995 ibm-273 IBM273 CP273 csIBM273 ebcdic-de cpibm273 273
ibm-277_P100-1995 ibm-277 IBM277 cp277 EBCDIC-CP-DK EBCDIC-CP-NO
csIBM277 ebcdic-dk cpibm277 277
ibm-278_P100-1995 ibm-278 IBM278 cp278 ebcdic-cp-fi ebcdic-cp-se
csIBM278 ebcdic-sv cpibm278 278
ibm-280_P100-1995 ibm-280 IBM280 CP280 ebcdic-cp-it csIBM280 cpibm280
280
ibm-284_P100-1995 ibm-284 IBM284 CP284 ebcdic-cp-es csIBM284 cpibm284
284
ibm-285_P100-1995 ibm-285 IBM285 CP285 ebcdic-cp-gb csIBM285 ebcdic-gb
cpibm285 285
ibm-290_P100-1995 ibm-290 IBM290 cp290 EBCDIC-JP-kana csIBM290
ibm-297_P100-1995 ibm-297 IBM297 cp297 ebcdic-cp-fr csIBM297 cpibm297
297
ibm-420_X120-1999 ibm-420 IBM420 cp420 ebcdic-cp-ar1 csIBM420 420
ibm-424_P100-1995 ibm-424 IBM424 cp424 ebcdic-cp-he csIBM424 424
ibm-500_P100-1995 ibm-500 IBM500 CP500 ebcdic-cp-be csIBM500
ebcdic-cp-ch cpibm500 500
ibm-803_P100-1999 ibm-803 cp803
ibm-838_P100-1995 ibm-838 IBM-Thai csIBMThai cp838 838 ibm-9030
ibm-870_P100-1995 ibm-870 IBM870 CP870 ebcdic-cp-roece ebcdic-cp-yu
csIBM870
ibm-871_P100-1995 ibm-871 IBM871 ebcdic-cp-is csIBM871 CP871 ebcdic-is
cpibm871 871
ibm-875_P100-1995 ibm-875 IBM875 cp875 875
ibm-918_P100-1995 ibm-918 IBM918 CP918 ebcdic-cp-ar2 csIBM918
ibm-930_P120-1999 ibm-930 ibm-5026 cp930 cpibm930 930
ibm-933_P110-1995 ibm-933 cp933 cpibm933 933
ibm-935_P110-1999 ibm-935 cp935 cpibm935 935
ibm-937_P110-1999 ibm-937 cp937 cpibm937 937
ibm-939_P120-1999 ibm-939 ibm-931 ibm-5035 cp939 939
ibm-1025_P100-1995 ibm-1025 cp1025 1025
ibm-1026_P100-1995 ibm-1026 IBM1026 CP1026 csIBM1026 1026
ibm-1047_P100-1995 ibm-1047 IBM1047 cpibm1047
ibm-1097_P100-1995 ibm-1097 cp1097 1097
ibm-1112_P100-1995 ibm-1112 cp1112 1112
ibm-1122_P100-1999 ibm-1122 cp1122 1122
ibm-1123_P100-1995 ibm-1123 cp1123 1123 cpibm1123
ibm-1130_P100-1997 ibm-1130
ibm-1132_P100-1998 ibm-1132
ibm-1140_P100-1997 ibm-1140 IBM01140 CCSID01140 CP01140 cp1140 cpibm1140
ebcdic-us-37+euro
ibm-1141_P100-1997 ibm-1141 IBM01141 CCSID01141 CP01141 cp1141 cpibm1141
ebcdic-de-273+euro
ibm-1142_P100-1997 ibm-1142 IBM01142 CCSID01142 CP01142 cp1142 cpibm1142
ebcdic-dk-277+euro ebcdic-no-277+euro
ibm-1143_P100-1997 ibm-1143 IBM01143 CCSID01143 CP01143 cp1143 cpibm1143
ebcdic-fi-278+euro ebcdic-se-278+euro
ibm-1144_P100-1997 ibm-1144 IBM01144 CCSID01144 CP01144 cp1144 cpibm1144
ebcdic-it-280+euro
ibm-1145_P100-1997 ibm-1145 IBM01145 CCSID01145 CP01145 cp1145 cpibm1145
ebcdic-es-284+euro
ibm-1146_P100-1997 ibm-1146 IBM01146 CCSID01146 CP01146 cp1146 cpibm1146
ebcdic-gb-285+euro
ibm-1147_P100-1997 ibm-1147 IBM01147 CCSID01147 CP01147 cp1147 cpibm1147
ebcdic-fr-297+euro
ibm-1148_P100-1997 ibm-1148 IBM01148 CCSID01148 CP01148 cp1148 cpibm1148
ebcdic-international-500+euro
ibm-1149_P100-1997 ibm-1149 IBM01149 CCSID01149 CP01149 cp1149 cpibm1149
ebcdic-is-871+euro
ibm-1153_P100-1999 ibm-1153 cpibm1153
ibm-1154_P100-1999 ibm-1154 cpibm1154
ibm-1155_P100-1999 ibm-1155 cpibm1155
ibm-1156_P100-1999 ibm-1156 cpibm1156
ibm-1157_P100-1999 ibm-1157 cpibm1157
ibm-1158_P100-1999 ibm-1158 cpibm1158
ibm-1160_P100-1999 ibm-1160 cpibm1160
ibm-1164_P100-1999 ibm-1164 cpibm1164
ibm-1364_P110-1997 ibm-1364 cp1364
ibm-1371_P100-1999 ibm-1371 cpibm1371
ibm-1388_P103-2001 ibm-1388 ibm-9580
ibm-1390_P110-2003 ibm-1390 cpibm1390
ibm-1399_P110-2003 ibm-1399
ibm-16684_P110-2003 ibm-16684
ibm-4899_P100-1998 ibm-4899 cpibm4899
ibm-4971_P100-1999 ibm-4971 cpibm4971
ibm-12712_P100-1998 ibm-12712 cpibm12712 ebcdic-he
ibm-16804_X110-1999 ibm-16804 cpibm16804 ebcdic-ar
ibm-1137_P100-1999 ibm-1137
ibm-5123_P100-1999 ibm-5123
ibm-8482_P100-1999 ibm-8482
ibm-37_P100-1995,swaplfnl ibm-37-s390 ibm037-s390
ibm-1047_P100-1995,swaplfnl ibm-1047-s390
ibm-1140_P100-1997,swaplfnl ibm-1140-s390
ibm-1142_P100-1997,swaplfnl ibm-1142-s390
ibm-1143_P100-1997,swaplfnl ibm-1143-s390
ibm-1144_P100-1997,swaplfnl ibm-1144-s390
ibm-1145_P100-1997,swaplfnl ibm-1145-s390
ibm-1146_P100-1997,swaplfnl ibm-1146-s390
ibm-1147_P100-1997,swaplfnl ibm-1147-s390
ibm-1148_P100-1997,swaplfnl ibm-1148-s390
ibm-1149_P100-1997,swaplfnl ibm-1149-s390
ibm-1153_P100-1999,swaplfnl ibm-1153-s390
ibm-12712_P100-1998,swaplfnl ibm-12712-s390
ibm-16804_X110-1999,swaplfnl ibm-16804-s390
ebcdic-xml-us




Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, maj 07, 2005 23.25.15 +1000 John Hansen <john@geeknet.com.au>
wrote:

> Palle Girgensohn wrote:
>> I'm aware of that. It might help for unicode, but there are a
>> bunch of
>> other encodings. IANA has decided that utf-8 has *no*
>> aliases, hence only
>> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
>> fogiving, I don't remember/know, but I think we need the mappings,
>> unfortunately.
>>
>
> Here is the list of encoding names and aliases the ICU accepts as of
> 3.2:
> (it's a bit long...)
>
> UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208

No UTF8 in there. I think that's good, charset aliases are a hassle.

/Palle



Re: Patch for collation using ICU

From
"John Hansen"
Date:

> -----Original Message-----
> From: Palle Girgensohn [mailto:girgen@pingpong.net]
> Sent: Saturday, May 07, 2005 11:30 PM
> To: John Hansen; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Patch for collation using ICU
>
>
>
> --On lördag, maj 07, 2005 23.25.15 +1000 John Hansen
> <john@geeknet.com.au>
> wrote:
>
> > Palle Girgensohn wrote:
> >> I'm aware of that. It might help for unicode, but there
> are a bunch
> >> of other encodings. IANA has decided that utf-8 has *no* aliases,
> >> hence only
> >> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
> >> fogiving, I don't remember/know, but I think we need the mappings,
> >> unfortunately.
> >>
> >
> > Here is the list of encoding names and aliases the ICU accepts as of
> > 3.2:
> > (it's a bit long...)
> >
> > UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
>
> No UTF8 in there. I think that's good, charset aliases are a hassle.

Yup! :)

>
> /Palle
>
>
>


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, maj 07, 2005 22.22.52 +1000 John Hansen <john@geeknet.com.au>
wrote:

>>
>> I use this patch in production on one FreeBSD 4.10 server at
>> the moment.
>> With the latest version, I've had no problems. Logging is
>> swithed on for
>> now, and it shows no signs of ICU complaining. I'd like more
>> reports on
>> Linux, though.
>
> I currently use this on gentoo with ICU3.2 unmasked.
>
> Works a dream, even with locale C and UNICODE database.

Ah, good to hear, John. I beleive your report about linux is what's keeping
this back. Did you also manage to get it running on Debian?

/Palle


Re: Patch for collation using ICU

From
"John Hansen"
Date:

> -----Original Message-----
> From: Palle Girgensohn [mailto:girgen@pingpong.net]
> Sent: Saturday, May 07, 2005 11:33 PM
> To: John Hansen; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Patch for collation using ICU
>
>
>
> --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen
> <john@geeknet.com.au>
> wrote:
>
> >>
> >> I use this patch in production on one FreeBSD 4.10 server at the
> >> moment.
> >> With the latest version, I've had no problems. Logging is
> swithed on
> >> for now, and it shows no signs of ICU complaining. I'd like more
> >> reports on Linux, though.
> >
> > I currently use this on gentoo with ICU3.2 unmasked.
> >
> > Works a dream, even with locale C and UNICODE database.
>
> Ah, good to hear, John. I beleive your report about linux is
> what's keeping this back. Did you also manage to get it
> running on Debian?

Not without ICU3.2

>
> /Palle
>
>


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:
--On lördag, maj 07, 2005 23.15.29 +1000 John Hansen <john@geeknet.com.au>
wrote:

> Btw, I had been planning to propose replacing every single one of the
> built in charset conversion functions with calls to ICU (thus making pg
> _depend_ on ICU), as this would seem like a cleaner solution than for us
> to maintain our own conversion tables.
>
> ICU also has a fair few conversions that we do not have at present.
>
> Any thoughts?

I just had a similar though. And why use ICU only for multibyte charsets?
If I use LATIN1, I still expect upper('ß') => SS, and I don't get it...
Same for the Turkish example.

It does eat more memory, and can perhaps cush some performance bits? With
the current scheme, a strdup is often enough, or at least just one palloc.
With ICU, using UTF-16, you must allocate memory twice, once for the ICU
internal UTF-16 representation. That's not a very strong objection, though,
as this would be an option... :)

John, I have a hard time finding docs about what differs in ICU 2.8 from
3.2. Do you have any pointers?

It seems 3.2 has much more support and bug fixes, I'm not sure if we should
really consider 2.8?

/Palle

>
> ... John
>
>> -----Original Message-----
>> From: John Hansen
>> Sent: Saturday, May 07, 2005 11:09 PM
>> To: 'Palle Girgensohn'; 'Bruce Momjian'
>> Cc: 'pgsql-hackers@postgresql.org'
>> Subject: RE: [HACKERS] Patch for collation using ICU
>>
>> > --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen
>> > <john@geeknet.com.au>
>> > wrote:
>> >
>> > > Errm,... initdb --encoding UNICODE --locale C
>> >
>> > You mean that ICU *shall* be used even for the C locale, and not as
>> > Bruce suggested here:
>>
>> Yes, that's exactly what I mean.
>>
>> >
>> > >> I do have a few questions:
>> > >>
>> > >> Why don't you use the lc_ctype_is_c() part of this test?
>> > >>
>> > >>      if (pg_database_encoding_max_length() > 1 &&
>> !lc_ctype_is_c())
>> > >
>> > > Um, well, I didn't think about that. :)  What would be the
>> > locale in
>> > > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
>> > CTYPE=C and use
>> > > a wide encoding, indeed. Then the strings will be handled
>> > like byte-wide chars.
>> > > Yeah, it's a bug. I'll fix it! Thanks.
>> >
>> > John disagrees here, and I'm obliged to agree. Using the C
>> locale, one
>> > will expect C collation, but upper/lower is better off still using
>> > ICU. Hence, the above stuff is *not* a bug. Do we agree?
>> >
>> > /Palle
>> >
>> >
>> > >
>> > >> -----Original Message-----
>> > >> From: pgsql-hackers-owner@postgresql.org
>> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
>> > John Hansen
>> > >> Sent: Saturday, May 07, 2005 10:23 PM
>> > >> To: Palle Girgensohn; Bruce Momjian
>> > >> Cc: pgsql-hackers@postgresql.org
>> > >> Subject: Re: [HACKERS] Patch for collation using ICU
>> > >>
>> > >> >
>> > >> > I use this patch in production on one FreeBSD 4.10
>> server at the
>> > >> > moment.
>> > >> > With the latest version, I've had no problems. Logging is
>> > >> swithed on
>> > >> > for now, and it shows no signs of ICU complaining. I'd
>> like more
>> > >> > reports on Linux, though.
>> > >>
>> > >> I currently use this on gentoo with ICU3.2 unmasked.
>> > >>
>> > >> Works a dream, even with locale C and UNICODE database.
>> > >>
>> > >> Small test:
>> > >>
>> > >> createdb --encoding UNICODE --locale C test psql test set
>> > >> client_encoding=iso88591; CREATE TABLE test (t text);
>> INSERT INTO
>> > >> test (t) VALUES ('æøå'); set client_encoding=unicode;
>> INSERT INTO
>> > >> test (t) SELECT upper(t) FROM test; set
>> client_encoding=iso88591;
>> > >> SELECT * FROM test;
>> > >>   t
>> > >> -----
>> > >>  æøå
>> > >>  ÆØÅ
>> > >> (2 rows)
>> > >>
>> > >> Just as I'd expect, as upper/lower/initcap are locale
>> > independent for
>> > >> these characters.
>> > >>
>> > >>
>> > >> ---------------------------(end of
>> > >> broadcast)---------------------------
>> > >> TIP 5: Have you checked our extensive FAQ?
>> > >>
>> > >>                http://www.postgresql.org/docs/faq
>> > >>
>> > >>
>> >
>> >
>> >
>> >
>> >
>> >






Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, maj 07, 2005 23.33.31 +1000 John Hansen <john@geeknet.com.au>
wrote:

>
>
>> -----Original Message-----
>> From: Palle Girgensohn [mailto:girgen@pingpong.net]
>> Sent: Saturday, May 07, 2005 11:33 PM
>> To: John Hansen; Bruce Momjian
>> Cc: pgsql-hackers@postgresql.org
>> Subject: RE: [HACKERS] Patch for collation using ICU
>>
>>
>>
>> --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen
>> <john@geeknet.com.au>
>> wrote:
>>
>> >>
>> >> I use this patch in production on one FreeBSD 4.10 server at the
>> >> moment.
>> >> With the latest version, I've had no problems. Logging is
>> swithed on
>> >> for now, and it shows no signs of ICU complaining. I'd like more
>> >> reports on Linux, though.
>> >
>> > I currently use this on gentoo with ICU3.2 unmasked.
>> >
>> > Works a dream, even with locale C and UNICODE database.
>>
>> Ah, good to hear, John. I beleive your report about linux is
>> what's keeping this back. Did you also manage to get it
>> running on Debian?
>
> Not without ICU3.2

Did you try the latest patch? Maybe it will help, and if not, it will
(hopefully) give a lot more informative error messages.

/Palle



Re: Patch for collation using ICU

From
Andrew Dunstan
Date:

John Hansen wrote:

>Here is the list of encoding names and aliases the ICU accepts as of
>3.2:
>(it's a bit long...)
>
>UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
>UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
>
>  
>
[snip]

Don't we use "unicode" as an alias for UTF-8 ?

cheers

andrew


Re: Patch for collation using ICU

From
"John Hansen"
Date:
> Did you try the latest patch? Maybe it will help, and if not, it will
> (hopefully) give a lot more informative error messages.

No, and I got rid of my debian boxes @ home.
The patch required a certain amount of modifications too, to even
compile with 2.8.

So I guess it's a valid question to ask: it it worth supporting 2.8?
It is of course an option to bundle icu 3.2 with pg!


... John


Re: Patch for collation using ICU

From
Bruce Momjian
Date:
John Hansen wrote:
> > --On l?rdag, maj 07, 2005 22.53.46 +1000 John Hansen 
> > <john@geeknet.com.au>
> > wrote:
> > 
> > > Errm,... initdb --encoding UNICODE --locale C
> > 
> > You mean that ICU *shall* be used even for the C locale, and 
> > not as Bruce suggested here:
> 
> Yes, that's exactly what I mean.

There are two reasons for that optimization --- first, some locale
support is broken and Unicode encoding with a C locale crashes (not an
issue for ICU), and second, it is an optimization for languages like
Japanese that want to use unicode, but don't need a locale because
upper/lower means nothing in those character sets.

So, the first issue doesn't apply for ICU, and the second might not
depending on what characters you are using in the Unicode character set.

I guess I am little confused how ICU can do upper() when the locale is
C.  What is it using to determine A is upper for a?  Am I confused?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
"John Hansen"
Date:

> -----Original Message-----
> From: Andrew Dunstan [mailto:andrew@dunslane.net]
> Sent: Saturday, May 07, 2005 11:39 PM
> To: John Hansen
> Cc: Palle Girgensohn; Bruce Momjian; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
>
>
> John Hansen wrote:
>
> >Here is the list of encoding names and aliases the ICU accepts as of
> >3.2:
> >(it's a bit long...)
> >
> >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
> >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
> >
> >
> >
> [snip]
>
> Don't we use "unicode" as an alias for UTF-8 ?

Yes, unfortunately!

>
> cheers
>
> andrew
>
>


Re: Patch for collation using ICU

From
Bruce Momjian
Date:
Palle Girgensohn wrote:
> >> Also, apparently, ICU is installed by default in many linux
> >> distributions,  and usually it is version 2.8. Some linux users have
> >> asked me if there are  plans for a patch that works with ICU 2.8. That's
> >> probably a good idea. IBM  and the ICU folks seem to consider 3.2 to be
> >> the stable version, older  versions are hard to find on their sites, but
> >> most linux distributers seem  to consider it too bleeding edge, even
> >> gentoo. I don't know why they don't  agree.
> >
> > Good point.  Why would linux folks need ICU?  Doesn't their OS support
> > encodings natively?  I am particularly excited about this for OSs that
> > don't have such encodings, like UTF8 support for Win32.
> >
> > Because ICU will not be used unless enabled by configure, it seems we
> > are fine with only supporting the newest version.  Do Linux users need
> > to use ICU for any reason?
> 
> 
> There are corner cases where it is impossible to upper/lowercase one 
> character at the time. for example:
> 
> -- without ICU
>  select upper('E?er');
>  upper
> -------
>  E?ER
> (1 row)
> 
> -- with ICU
> select upper('E?er');
>  upper
> -------
>  ESSER
> (1 rad)
> 
> This is because in the standard postgres implementation, upper/lower is 
> done one character at the time. A proper upper/lower cannot do it that way. 
> Other known example is in Turkish, where an ? (?) should look different 
> whether it is an initial letter or not. This fails in standard postgresql 
> for all platforms.

Uh, where do you see that?  Our code has:
       workspace = texttowcs(string);
       for (i = 0; workspace[i] != 0; i++)           workspace[i] = towupper(workspace[i]);
       result = wcstotext(workspace, i);


> >> Also, in the latest patch, I also added checks and logging for *every*
> >> status returned from ICU. I hope this will help debugging on debian,
> >> where  previous version didn't work. That excessive status checking is
> >> hardly be  necessary once the stuff is better tested.
> >>
> >> I think the string copying and heap/palloc choices stands for most of
> >> the  code bloat, together with the excessive status checking and logging.
> >
> > OK, move that into some common functions and I think it will be better.
> 
> Best way for upper/lower/initcap is probably to use a function pointer... 
> uhh...

Uh, I don't think so.  Just send pointers to the the function and let
the function allocate the memory, and another function to free them, or
something like that.  I can probably do it if you want.

> >> > Why do you need to add a mapping of encoding names from iana to our
> >> > names?
> >>
> >> This was already answered by John Hansen... There's an old thread here
> >> about the choice of the name "UNICODE" to describe an encoding, which it
> >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
> >> by  postgresql, that would have been a better name... Similarly for most
> >> other  encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
> >> PostgreSQL  can't provide it, so a mapping table is required.
> >
> > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
> > help?
> 
> I'm aware of that. It might help for unicode, but there are a bunch of 
> other encodings. IANA has decided that utf-8 has *no* aliases, hence only 
> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is 
> fogiving, I don't remember/know, but I think we need the mappings, 
> unfortunately.

OK.  I guess I am just confused why the native implementations are OK.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
Bruce Momjian
Date:
Palle Girgensohn wrote:
> 
> --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen <john@geeknet.com.au> 
> wrote:
> 
> > Btw, I had been planning to propose replacing every single one of the
> > built in charset conversion functions with calls to ICU (thus making pg
> > _depend_ on ICU), as this would seem like a cleaner solution than for us
> > to maintain our own conversion tables.
> >
> > ICU also has a fair few conversions that we do not have at present.

That is a much larger issue, similar to our shipping our own timezone
database.  What does it buy us?o  Do we ship it in our tarball?o  Is the license compatible?o  Does it remove utils/mb
conversions?o Does it allow us to index LIKE (next high char)?o  Does it allow us to support multiple encodings in   a
singledatabase easier?o  performance?
 

> I just had a similar though. And why use ICU only for multibyte charsets? 
> If I use LATIN1, I still expect upper('?') => SS, and I don't get it... 
> Same for the Turkish example.

We assume the native toupper() can handle single-byte character
encodings.  We use towupper() only for wide character sets.


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> 
> 
> John Hansen wrote:
> 
> >Here is the list of encoding names and aliases the ICU accepts as of
> >3.2:
> >(it's a bit long...)
> >
> >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
> >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
> >
> >  
> >
> [snip]
> 
> Don't we use "unicode" as an alias for UTF-8 ?

We do.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, maj 07, 2005 09.52.59 -0400 Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

> Palle Girgensohn wrote:
>> >> Also, apparently, ICU is installed by default in many linux
>> >> distributions,  and usually it is version 2.8. Some linux users have
>> >> asked me if there are  plans for a patch that works with ICU 2.8.
>> >> That's probably a good idea. IBM  and the ICU folks seem to consider
>> >> 3.2 to be the stable version, older  versions are hard to find on
>> >> their sites, but most linux distributers seem  to consider it too
>> >> bleeding edge, even gentoo. I don't know why they don't  agree.
>> >
>> > Good point.  Why would linux folks need ICU?  Doesn't their OS support
>> > encodings natively?  I am particularly excited about this for OSs that
>> > don't have such encodings, like UTF8 support for Win32.
>> >
>> > Because ICU will not be used unless enabled by configure, it seems we
>> > are fine with only supporting the newest version.  Do Linux users need
>> > to use ICU for any reason?
>>
>>
>> There are corner cases where it is impossible to upper/lowercase one
>> character at the time. for example:
>>
>> -- without ICU
>>  select upper('E?er');
>>  upper
>> -------
>>  E?ER
>> (1 row)
>>
>> -- with ICU
>> select upper('E?er');
>>  upper
>> -------
>>  ESSER
>> (1 rad)
>>
>> This is because in the standard postgres implementation, upper/lower is
>> done one character at the time. A proper upper/lower cannot do it that
>> way.  Other known example is in Turkish, where an ? (?) should look
>> different  whether it is an initial letter or not. This fails in
>> standard postgresql  for all platforms.
>
> Uh, where do you see that?  Our code has:
>
>         workspace = texttowcs(string);
>
>         for (i = 0; workspace[i] != 0; i++)
>             workspace[i] = towupper(workspace[i]);

as you see, the loop runs towupper for one character at the time. I cannot
consider whether the letter is the initial, as required in Turkish, and it
cannot really convert one character into two ('ß' -> 'SS')

>
>         result = wcstotext(workspace, i);
>
>
>> >> Also, in the latest patch, I also added checks and logging for *every*
>> >> status returned from ICU. I hope this will help debugging on debian,
>> >> where  previous version didn't work. That excessive status checking is
>> >> hardly be  necessary once the stuff is better tested.
>> >>
>> >> I think the string copying and heap/palloc choices stands for most of
>> >> the  code bloat, together with the excessive status checking and
>> >> logging.
>> >
>> > OK, move that into some common functions and I think it will be better.
>>
>> Best way for upper/lower/initcap is probably to use a function
>> pointer...  uhh...
>
> Uh, I don't think so.  Just send pointers to the the function and let
> the function allocate the memory, and another function to free them, or
> something like that.  I can probably do it if you want.

I'll check it out, it seems simple enough.

>> > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
>> > that help?
>>
>> I'm aware of that. It might help for unicode, but there are a bunch of
>> other encodings. IANA has decided that utf-8 has *no* aliases, hence
>> only  utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
>> fogiving, I don't remember/know, but I think we need the mappings,
>> unfortunately.
>
> OK.  I guess I am just confused why the native implementations are OK.

They're OK since they understand that UNICODE (or UTF8) is really utf-8.
Problem is the strings used to describe them are not understood by ICU.

BTW, the pg_enc2iananame_tbl is only used *from* internal representation
*to* IANA, not the other way around. Maybe that fact lowers the rate of
confusion? ;-)

/Palle



Re: Patch for collation using ICU

From
"John Hansen"
Date:
Bruce Momjian wrote:
>
> There are two reasons for that optimization --- first, some
> locale support is broken and Unicode encoding with a C locale
> crashes (not an issue for ICU), and second, it is an
> optimization for languages like Japanese that want to use
> unicode, but don't need a locale because upper/lower means
> nothing in those character sets.

No, upper/lower means nothing in those languages, so why would you need
to optimize upper/lower if they're not used??
And if they are, it's obviously because the text contains characters
from other languages (probably english) and as such they should behave
correctly.

Did I mention that for japanese and the like, ICU would also offer
transliteration...

>
> So, the first issue doesn't apply for ICU, and the second
> might not depending on what characters you are using in the
> Unicode character set.
>
> I guess I am little confused how ICU can do upper() when the
> locale is C.  What is it using to determine A is upper for a?
>  Am I confused?

Simple, UNICODE basically consist of a table of characters
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)

Excerpt:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
...
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

From this you can see, that for 0041, which is capital letter A, there
is a mapping to it's lowercase counterpart, 0061
Likewise, there is a mapping for 0061 which says it's uppercase
counterpart is 0041.
There is also SpecialCasing.txt which covers those mappings that haven't
got a 1-1 mapping, such as the german SS.

These mappings are fixed, independent of locale, only a few cases from
specialcasing.txt depend on locale/context.




Re: Patch for collation using ICU

From
"John Hansen"
Date:
> It seems 3.2 has much more support and bug fixes, I'm not
> sure if we should really consider 2.8?

As I said, probably not worth the effort.


Re: Patch for collation using ICU

From
Bruce Momjian
Date:
Palle Girgensohn wrote:
> >> This is because in the standard postgres implementation, upper/lower is
> >> done one character at the time. A proper upper/lower cannot do it that
> >> way.  Other known example is in Turkish, where an ? (?) should look
> >> different  whether it is an initial letter or not. This fails in
> >> standard postgresql  for all platforms.
> >
> > Uh, where do you see that?  Our code has:
> >
> >         workspace = texttowcs(string);
> >
> >         for (i = 0; workspace[i] != 0; i++)
> >             workspace[i] = towupper(workspace[i]);
> 
> as you see, the loop runs towupper for one character at the time. I cannot 
> consider whether the letter is the initial, as required in Turkish, and it 
> cannot really convert one character into two ('?' -> 'SS')

Oh, OK. I thought texttowcs() would expand the string to allow such
conversions.

> >> > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
> >> > that help?
> >>
> >> I'm aware of that. It might help for unicode, but there are a bunch of
> >> other encodings. IANA has decided that utf-8 has *no* aliases, hence
> >> only  utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
> >> fogiving, I don't remember/know, but I think we need the mappings,
> >> unfortunately.
> >
> > OK.  I guess I am just confused why the native implementations are OK.
> 
> They're OK since they understand that UNICODE (or UTF8) is really utf-8. 
> Problem is the strings used to describe them are not understood by ICU.
> 
> BTW, the pg_enc2iananame_tbl is only used *from* internal representation 
> *to* IANA, not the other way around. Maybe that fact lowers the rate of 
> confusion? ;-)

OK, got it.  I am still a little confused why every native
implementation understands our existing names but ICU does not.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> >
> > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen
> > <john@geeknet.com.au>
> > wrote:
> >
> > > Btw, I had been planning to propose replacing every single one of
> > > the built in charset conversion functions with calls to ICU (thus
> > > making pg _depend_ on ICU), as this would seem like a cleaner
> > > solution than for us to maintain our own conversion tables.
> > >
> > > ICU also has a fair few conversions that we do not have
> at present.
>
> That is a much larger issue, similar to our shipping our own
> timezone database.  What does it buy us?
>
>     o  Do we ship it in our tarball?
>     o  Is the license compatible?
>     o  Does it remove utils/mb conversions?
>     o  Does it allow us to index LIKE (next high char)?
>     o  Does it allow us to support multiple encodings in
>        a single database easier?
>     o  performance?
>
> > I just had a similar though. And why use ICU only for
> multibyte charsets?
> > If I use LATIN1, I still expect upper('?') => SS, and I
> don't get it...
> > Same for the Turkish example.
>
> We assume the native toupper() can handle single-byte
> character encodings.  We use towupper() only for wide character sets.

That assumption is wrong,...

Encoding latin1
Locale <> de*

Select Upper('ß'); (lowercase german SS)
Should return SS, but returns ß

... John




Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On lördag, maj 07, 2005 10.06.43 -0400 Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

> Palle Girgensohn wrote:
>>
>> --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen
>> <john@geeknet.com.au>  wrote:
>>
>> > Btw, I had been planning to propose replacing every single one of the
>> > built in charset conversion functions with calls to ICU (thus making pg
>> > _depend_ on ICU), as this would seem like a cleaner solution than for
>> > us to maintain our own conversion tables.
>> >
>> > ICU also has a fair few conversions that we do not have at present.
>
> That is a much larger issue, similar to our shipping our own timezone
> database.  What does it buy us?
>
>     o  Do we ship it in our tarball?
>     o  Is the license compatible?

It looks pretty similar to BSD, although I'm a novice on the subject.

>     o  Does it remove utils/mb conversions?

Yes, it would probably be possible to remove pg's own conversions.

>     o  Does it allow us to index LIKE (next high char)?

I beleive so, using ICU's substring stuff.

>     o  Does it allow us to support multiple encodings in
>        a single database easier?

Heh, the ultimate dream. Perhaps?

>     o  performance?

ICU in itself is said to be much faster than for example glibc. Problem is
the need for conversion via UTF-16, which requires extra memory allocations
and cpu cycles. I don't use glibc, but my very simple performance tests for
FreeBSD show that it is similiar in speed.

>
>> I just had a similar though. And why use ICU only for multibyte
>> charsets?  If I use LATIN1, I still expect upper('?') => SS, and I don't
>> get it...  Same for the Turkish example.
>
> We assume the native toupper() can handle single-byte character
> encodings.  We use towupper() only for wide character sets.

True, problem is that native toupper/towupper run one char at the time.
This is a bad design decision in POSIX, there is no way it can handle the
examples above unless considering more than one character. ICU does just
that.

/Palle



Re: Patch for collation using ICU

From
Bruce Momjian
Date:
John Hansen wrote:
> Bruce Momjian wrote:
> > 
> > There are two reasons for that optimization --- first, some 
> > locale support is broken and Unicode encoding with a C locale 
> > crashes (not an issue for ICU), and second, it is an 
> > optimization for languages like Japanese that want to use 
> > unicode, but don't need a locale because upper/lower means 
> > nothing in those character sets.
> 
> No, upper/lower means nothing in those languages, so why would you need
> to optimize upper/lower if they're not used??

True.  I suppose it is for databases that use both Japanese and Latin
alphabets and run upper() on all values.

> And if they are, it's obviously because the text contains characters
> from other languages (probably english) and as such they should behave
> correctly.
> 
> Did I mention that for japanese and the like, ICU would also offer
> transliteration...

Interesting.

> > So, the first issue doesn't apply for ICU, and the second 
> > might not depending on what characters you are using in the 
> > Unicode character set.
> > 
> > I guess I am little confused how ICU can do upper() when the 
> > locale is C.  What is it using to determine A is upper for a? 
> >  Am I confused?
> 
> Simple, UNICODE basically consist of a table of characters
> (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)
> 
> Excerpt:
> 
> 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
> ...
> 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
> 
> >From this you can see, that for 0041, which is capital letter A, there
> is a mapping to it's lowercase counterpart, 0061
> Likewise, there is a mapping for 0061 which says it's uppercase
> counterpart is 0041.
> There is also SpecialCasing.txt which covers those mappings that haven't
> got a 1-1 mapping, such as the german SS.
> 
> These mappings are fixed, independent of locale, only a few cases from
> specialcasing.txt depend on locale/context.

As far as I know, the only way to use Unicode currently is to use a
locale that is unicode-aware.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Patch for collation using ICU

From
Tom Lane
Date:
"John Hansen" <john@geeknet.com.au> writes:
> Where'd you get the licence from?

It was the first thing I came across in their docs:

http://icu.sourceforge.net/userguide/intro.html

Looking more closely, it may be that this license is only intended to
apply to the documentation and not the code ... though "free" code with
un-free documentation isn't real useful.
        regards, tom lane


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:
--On lördag, maj 07, 2005 10.58.09 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "John Hansen" <john@geeknet.com.au> writes:
>> Where'd you get the licence from?
>
> It was the first thing I came across in their docs:
>
> http://icu.sourceforge.net/userguide/intro.html
>
> Looking more closely, it may be that this license is only intended to
> apply to the documentation and not the code ... though "free" code with
> un-free documentation isn't real useful.
>
>             regards, tom lane

Someone should ask the ICU folks about that...


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tom Lane wrote:
> "John Hansen" <john@geeknet.com.au> writes:
> > Where'd you get the licence from?
>
> It was the first thing I came across in their docs:
>
> http://icu.sourceforge.net/userguide/intro.html
>
> Looking more closely, it may be that this license is only
> intended to apply to the documentation and not the code ...
> though "free" code with un-free documentation isn't real useful.
>

Right, it seems to apply only to the resources found on sourceforge.

>             regards, tom lane
>
>


Re: Patch for collation using ICU

From
Tom Lane
Date:
"John Hansen" <john@geeknet.com.au> writes:
> Btw, I had been planning to propose replacing every single one of the
> built in charset conversion functions with calls to ICU (thus making
> pg _depend_ on ICU),

I find that fairly unacceptable ... especially given the licensing
questions, but in any case.

It might be OK to rip out the existing conversion support and say
that *if* you want encoding conversion, you have to use ICU.  But
I don't want to be told you cannot build PG without ICU period.

The 3.2 vs 2.8 business is disturbing also; specifically, I don't
think we get to require 3.2 on a platform where 2.8 is installed.
People just aren't going to hold still for that, even assuming
that ICU supports installing both versions at once, which isn't
clear to me at the moment ...
        regards, tom lane


Re: Patch for collation using ICU

From
Tatsuo Ishii
Date:
> Palle Girgensohn wrote:
> > 
> > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen <john@geeknet.com.au> 
> > wrote:
> > 
> > > Btw, I had been planning to propose replacing every single one of the
> > > built in charset conversion functions with calls to ICU (thus making pg
> > > _depend_ on ICU), as this would seem like a cleaner solution than for us
> > > to maintain our own conversion tables.

I don't buy it. If current conversion tables does the right thing, why
we need to replace. Or if conversion tables are not correct, why don't
you fix it? I think the rule of character conversion will not change
frequently, especially for LATIN languages. Thus maintaining cost is
not too high.
--
Tatsuo Ishii

> > > ICU also has a fair few conversions that we do not have at present.
> 
> That is a much larger issue, similar to our shipping our own timezone
> database.  What does it buy us?
>     
>     o  Do we ship it in our tarball?
>     o  Is the license compatible?
>     o  Does it remove utils/mb conversions?
>     o  Does it allow us to index LIKE (next high char)?
>     o  Does it allow us to support multiple encodings in
>        a single database easier?
>     o  performance?


Re: Patch for collation using ICU

From
Tatsuo Ishii
Date:
> Bruce Momjian wrote:
> > 
> > There are two reasons for that optimization --- first, some 
> > locale support is broken and Unicode encoding with a C locale 
> > crashes (not an issue for ICU), and second, it is an 
> > optimization for languages like Japanese that want to use 
> > unicode, but don't need a locale because upper/lower means 
> > nothing in those character sets.
> 
> No, upper/lower means nothing in those languages, so why would you need
> to optimize upper/lower if they're not used??
> And if they are, it's obviously because the text contains characters
> from other languages (probably english) and as such they should behave
> correctly.

Yes, Japanese (and probably Chinese and Korean) languages include
ASCII character. More precisely ASCII is part of Japanese
encodings(LATIN1 is not, however). And we have no problem at all with
glibc/C locale. See below("unitest" is an UNICODE database).

unitest=# create table t1(t text);
CREATE TABLE
unitest=# \encoding EUC_JP
unitest=# insert into t1 values('abcあいう');
INSERT 1842628 1
unitest=# select upper(t) from t1;  upper   
-----------ABCあいう
(1 row)

So Japanese(including ASCII)/UNICODE behavior is perfectly correct at
this moment. So I strongly object removing that optimization.
--
Tatsuo Ishii


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tom Lane wrote:
> "John Hansen" <john@geeknet.com.au> writes:
> > Btw, I had been planning to propose replacing every single
> one of the
> > built in charset conversion functions with calls to ICU
> (thus making
> > pg _depend_ on ICU),
>
> I find that fairly unacceptable ... especially given the
> licensing questions, but in any case.

The licencing seems pretty clear to me.
http://www-306.ibm.com/software/globalization/icu/license.jsp

>
> It might be OK to rip out the existing conversion support and
> say that *if* you want encoding conversion, you have to use
> ICU.  But I don't want to be told you cannot build PG without
> ICU period.

Right, that could be done, but I think the issue at heart is _are_
we going to use it at all, and if so, locale support would certainly
benefit from going that way as well.

>
> The 3.2 vs 2.8 business is disturbing also; specifically, I
> don't think we get to require 3.2 on a platform where 2.8 is
> installed.

There seems to be nothing in the ICU licence that would prevent us from
bundling it.
This would solve both the 3.2 vs 2.8 problems, and would remove the
'dependency'.

> People just aren't going to hold still for that, even
> assuming that ICU supports installing both versions at once,
> which isn't clear to me at the moment ...

There's no problems with having both installed.
I did that on debian to get the patch going.
Tho, bundling it seems cleaner to me.

>
>             regards, tom lane
>
>


Re: Patch for collation using ICU

From
"John Hansen"
Date:
> I don't buy it. If current conversion tables does the right
> thing, why we need to replace. Or if conversion tables are
> not correct, why don't you fix it? I think the rule of
> character conversion will not change frequently, especially
> for LATIN languages. Thus maintaining cost is not too high.

I never said we need to, but if we're going to implement ICU,
then we might as well go all the way.

... John



Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 10:09 AM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > Bruce Momjian wrote:
> > > 
> > > There are two reasons for that optimization --- first, 
> some locale 
> > > support is broken and Unicode encoding with a C locale 
> crashes (not 
> > > an issue for ICU), and second, it is an optimization for 
> languages 
> > > like Japanese that want to use unicode, but don't need a locale 
> > > because upper/lower means nothing in those character sets.
> > 
> > No, upper/lower means nothing in those languages, so why would you 
> > need to optimize upper/lower if they're not used??
> > And if they are, it's obviously because the text contains 
> characters 
> > from other languages (probably english) and as such they 
> should behave 
> > correctly.
> 
> Yes, Japanese (and probably Chinese and Korean) languages 
> include ASCII character. More precisely ASCII is part of Japanese
> encodings(LATIN1 is not, however). And we have no problem at 
> all with glibc/C locale. See below("unitest" is an UNICODE database).
> 
> unitest=# create table t1(t text);
> CREATE TABLE
> unitest=# \encoding EUC_JP
> unitest=# insert into t1 values('abcあいう');
> INSERT 1842628 1
> unitest=# select upper(t) from t1;
>    upper   
> -----------
>  ABCあいう
> (1 row)
> 
> So Japanese(including ASCII)/UNICODE behavior is perfectly 
> correct at this moment. 

Right, so you _never_ use accented ascii characters in Japanese? 
(like è for example, whose uppercase is È)

> So I strongly object removing that optimization.

I'm guessing this would call for a vote then, since if implementing ICU, then
I'd have to object to leaving it in.

Changing the bahaviour of ICU doesn't seem right. Changing the behaviour of pg, 
so that it works as it should when using unicode, seems the right solution to me.

> --
> Tatsuo Ishii
> 
> 

Re: Patch for collation using ICU

From
Alvaro Herrera
Date:
On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> Tatsuo Ishii wrote:

> > So Japanese(including ASCII)/UNICODE behavior is perfectly 
> > correct at this moment. 
> 
> Right, so you _never_ use accented ascii characters in Japanese? 
> (like è for example, whose uppercase is È)

That isn't ASCII.  It's latin1 or some other ASCII extension.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"La principal característica humana es la tontería"
(Augusto Monterroso)


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Alvaro Herrera wrote:
> Sent: Sunday, May 08, 2005 2:49 PM
> To: John Hansen
> Cc: Tatsuo Ishii; pgman@candle.pha.pa.us;
> girgen@pingpong.net; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > Tatsuo Ishii wrote:
>
> > > So Japanese(including ASCII)/UNICODE behavior is
> perfectly correct
> > > at this moment.
> >
> > Right, so you _never_ use accented ascii characters in Japanese?
> > (like è for example, whose uppercase is È)
>
> That isn't ASCII.  It's latin1 or some other ASCII extension.

Point taken...
But...

If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode).
UTF-8 encoded databases are very useful for representing multiple languages in the same database,
but this usefulness vanishes if functions like upper/lower doesn't work correctly.

So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair!

>
> --
> Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "La principal
> característica humana es la tontería"
> (Augusto Monterroso)
>
>

... John


Re: Patch for collation using ICU

From
Tatsuo Ishii
Date:
> Alvaro Herrera wrote:
> > Sent: Sunday, May 08, 2005 2:49 PM
> > To: John Hansen
> > Cc: Tatsuo Ishii; pgman@candle.pha.pa.us;
> > girgen@pingpong.net; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] Patch for collation using ICU
> >
> > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > Tatsuo Ishii wrote:
> >
> > > > So Japanese(including ASCII)/UNICODE behavior is
> > perfectly correct
> > > > at this moment.
> > >
> > > Right, so you _never_ use accented ascii characters in Japanese?
> > > (like è for example, whose uppercase is È)
> >
> > That isn't ASCII.  It's latin1 or some other ASCII extension.
>
> Point taken...
> But...
>
> If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode).
> UTF-8 encoded databases are very useful for representing multiple languages in the same database,
> but this usefulness vanishes if functions like upper/lower doesn't work correctly.

I'm just curious if Germany/French/Spanish mixed text can be sorted
correctly. I think these languages need their own locales even with
UNICODE/ICU.

> So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair!

Why don't you add a GUC variable or some such to control the
upper/lower behavior?
--
Tatsuo Ishii


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tatsuo Ishii
> Sent: Sunday, May 08, 2005 3:41 PM
> To: John Hansen
> Cc: alvherre@dcc.uchile.cl; pgman@candle.pha.pa.us;
> girgen@pingpong.net; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > Alvaro Herrera wrote:
> > > Sent: Sunday, May 08, 2005 2:49 PM
> > > To: John Hansen
> > > Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; girgen@pingpong.net;
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > >
> > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > > Tatsuo Ishii wrote:
> > >
> > > > > So Japanese(including ASCII)/UNICODE behavior is
> > > perfectly correct
> > > > > at this moment.
> > > >
> > > > Right, so you _never_ use accented ascii characters in
> Japanese?
> > > > (like è for example, whose uppercase is È)
> > >
> > > That isn't ASCII.  It's latin1 or some other ASCII extension.
> >
> > Point taken...
> > But...
> >
> > If you want EUC_JP (Japanese + ASCII) then use that as your
> backend encoding, not UTF-8 (unicode).
> > UTF-8 encoded databases are very useful for representing multiple
> > languages in the same database, but this usefulness
> vanishes if functions like upper/lower doesn't work correctly.
>
> I'm just curious if Germany/French/Spanish mixed text can be
> sorted correctly. I think these languages need their own
> locales even with UNICODE/ICU.

No, they will not sort correctly, for that you still need the locale.

>
> > So optimizing for 3 languages breaks more than a hundred,
> that's doesn't seem fair!

That is a compromise I'd be willing to agree on. :)
> Why don't you add a GUC variable or some such to control the
> upper/lower behavior?
> --
> Tatsuo Ishii
>
>


Re: Patch for collation using ICU

From
"Magnus Hagander"
Date:
>> Is this patch ready for application?
>>
>>
>http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-200
>5-05-06.d
>> iff.gz
>>
>> The web site is:
>>
>>     http://people.freebsd.org/~girgen/postgresql-icu/readme.html
>
>I don't think so, not quite. I have not had any positive
>reports from linux
>users, this is only tested in a FreeBSD environment. I'd say
>it needs some more testing.

I've just finished some simple testing on win32, and it does seem to work fine there as well, with a few modifications
tothe build step. As I don't have a  working autoconf, I applied the stuff that would come from your configure.in
changesdirectly to the files. Meaning putting USE_ICU in pg_config.h, and the following changes to Makefile.global: 
1) Add the directory for the ICU include files to CPPFLAGS
2) Add -licuuc -licuin to LIBS. I notice these are different names from those used on Unix, so a different configure
testwill be needed there. 
3) Add the icu lib directory to LDFLAGS
4) Remove encodings "PG_WIN_1258"/PG_TCVN and "IBM866"/PG_ALT. Didn't investigate further why this was needed, but this
probablyhas something to do with my tests being off CVS tip vs the patch being for 8.0-stable. 


After doing this, I can properly get upper and lowercase for the swedish characters åäö/ÅÄÖ, as well as the ß->SS
conversion.That's all I really tested at this point, but it did *not* work before in a unicode/UTF8 database. 

I've been working off the ICU 3.2 binaries available from the ICU page, the version compiled with MSVC 6 (because it
usesthe same runtime DLLs as stuff compiled with mingw. It should work with the MSVC7 version as well, but that would
introduceadditional DLL dependencies). 

//Magnus


Re: Patch for collation using ICU

From
"Magnus Hagander"
Date:
>> The 3.2 vs 2.8 business is disturbing also; specifically, I
>> don't think we get to require 3.2 on a platform where 2.8 is
>> installed.
>
>There seems to be nothing in the ICU licence that would prevent us from
>bundling it.
>This would solve both the 3.2 vs 2.8 problems, and would remove the
>'dependency'.
>
>> People just aren't going to hold still for that, even
>> assuming that ICU supports installing both versions at once,
>> which isn't clear to me at the moment ...
>
>There's no problems with having both installed.

... unless you're on win32, it seems.  For some reason, they name their
libs with the version on unix (libicu18n.so.32), but not on win32 where
they all have the same name. And they don't stuff versioning information
in the DLL files.
That can be lived with as long as libpq doesn't depend on it, though -
you can just stick the DLL in the same directory as the EXE, which is
also what the ICU people recommend in their docs. Unnecessarily ugly,
but it works.


>I did that on debian to get the patch going.
>Tho, bundling it seems cleaner to me.

The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2.
That means the size of the distribution would almost *double* if we
bundled ICU.

It's probably fine bundling it in the binary distributions (at least
we'd probably do it on win32, since not many ppl will have it already
there), but bundling the source seems a bit excessive to me.

//Magnus


Re: Patch for collation using ICU

From
"John Hansen"
Date:
> The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2.
> That means the size of the distribution would almost *double*
> if we bundled ICU.

Ermm,. Don't forget to remove the current charset conversions and locale
support before making your size estimation.

>
> It's probably fine bundling it in the binary distributions
> (at least we'd probably do it on win32, since not many ppl
> will have it already there), but bundling the source seems a
> bit excessive to me.
>
> //Magnus
>
>


Re: Patch for collation using ICU

From
Tatsuo Ishii
Date:
> > I don't buy it. If current conversion tables does the right 
> > thing, why we need to replace. Or if conversion tables are 
> > not correct, why don't you fix it? I think the rule of 
> > character conversion will not change frequently, especially 
> > for LATIN languages. Thus maintaining cost is not too high.
> 
> I never said we need to, but if we're going to implement ICU,
> then we might as well go all the way.

So you admit there's no benefit using ICU for replacing existing
conversions?

Besides ICU does not support all existing conversions, I think ICU has
serious flaw for using conversion. If I understand correctly, ICU uses
UNICODE internally to do the conversion. For example, to implement
SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
converts UNICODE to EUC_JP. Problem is these conversion is not roud
trip(conversion between SJIS/EUC_JP and UNICODE will lose some
information). Thus SJIS->EUC_JP->SJIS conversion using ICU does not
preserve original text.
--
Tatsuo Ishii


Re: Patch for collation using ICU

From
Tatsuo Ishii
Date:
> > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > > > Tatsuo Ishii wrote:
> > > >
> > > > > > So Japanese(including ASCII)/UNICODE behavior is
> > > > perfectly correct
> > > > > > at this moment.
> > > > >
> > > > > Right, so you _never_ use accented ascii characters in
> > Japanese?
> > > > > (like è for example, whose uppercase is È)
> > > >
> > > > That isn't ASCII.  It's latin1 or some other ASCII extension.
> > >
> > > Point taken...
> > > But...
> > >
> > > If you want EUC_JP (Japanese + ASCII) then use that as your
> > backend encoding, not UTF-8 (unicode).
> > > UTF-8 encoded databases are very useful for representing multiple
> > > languages in the same database, but this usefulness
> > vanishes if functions like upper/lower doesn't work correctly.
> >
> > I'm just curious if Germany/French/Spanish mixed text can be
> > sorted correctly. I think these languages need their own
> > locales even with UNICODE/ICU.
>
> No, they will not sort correctly, for that you still need the locale.

I'm confused. I thought the ICU patches is intended for using on
broken locale platforms?
--
Tatsuo Ishii


Re: Patch for collation using ICU

From
Palle Girgensohn
Date:

--On söndag, maj 08, 2005 22.19.25 +0900 Tatsuo Ishii <t-ishii@sra.co.jp>
wrote:

>> > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
>> > > > > Tatsuo Ishii wrote:
>> > > >
>> > > > > > So Japanese(including ASCII)/UNICODE behavior is
>> > > > perfectly correct
>> > > > > > at this moment.
>> > > > >
>> > > > > Right, so you _never_ use accented ascii characters in
>> > Japanese?
>> > > > > (like è for example, whose uppercase is È)
>> > > >
>> > > > That isn't ASCII.  It's latin1 or some other ASCII extension.
>> > >
>> > > Point taken...
>> > > But...
>> > >
>> > > If you want EUC_JP (Japanese + ASCII) then use that as your
>> > backend encoding, not UTF-8 (unicode).
>> > > UTF-8 encoded databases are very useful for representing multiple
>> > > languages in the same database, but this usefulness
>> > vanishes if functions like upper/lower doesn't work correctly.
>> >
>> > I'm just curious if Germany/French/Spanish mixed text can be
>> > sorted correctly. I think these languages need their own
>> > locales even with UNICODE/ICU.
>>
>> No, they will not sort correctly, for that you still need the locale.
>
> I'm confused. I thought the ICU patches is intended for using on
> broken locale platforms?

It will sort correctly in *one* locale, using ICU. You still cannot mix
different locales in the same database cluster, the collation locale is
still fixed at initdb time, unfortunately.

/Palle




Re: Patch for collation using ICU

From
Tom Lane
Date:
Palle Girgensohn <girgen@pingpong.net> writes:
>> I'm confused. I thought the ICU patches is intended for using on
>> broken locale platforms?

> It will sort correctly in *one* locale, using ICU. You still cannot mix 
> different locales in the same database cluster, the collation locale is 
> still fixed at initdb time, unfortunately.

I thought the point of using ICU was to be able to dig out from under
that restriction?  It's a bit of a large pill to swallow if we will
still have to throw it away someday to become SQL spec compliant.
        regards, tom lane


Re: Patch for collation using ICU

From
Andrew Dunstan
Date:

Magnus Hagander wrote:

>The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2.
>That means the size of the distribution would almost *double* if we
>bundled ICU.
>
>It's probably fine bundling it in the binary distributions (at least
>we'd probably do it on win32, since not many ppl will have it already
>there), but bundling the source seems a bit excessive to me.
>
>
>  
>

I'm also mildly curious to know what effect using ICU will have on 
memory consumption. Has anyone looked? My suspicion was aroused by this 
library that it installed on my FC3 box:

-rwxr-xr-x  1 root root 9777876 Jan  1 20:18 /usr/lib/libicudata.so.32.0



cheers

andrew


Re: Patch for collation using ICU

From
"Palle Girgensohn"
Date:
> Palle Girgensohn <girgen@pingpong.net> writes:
>>> I'm confused. I thought the ICU patches is intended for using on
>>> broken locale platforms?
>
>> It will sort correctly in *one* locale, using ICU. You still cannot mix
>> different locales in the same database cluster, the collation locale is
>> still fixed at initdb time, unfortunately.
>
> I thought the point of using ICU was to be able to dig out from under
> that restriction?

I think it might be quite possible to mix several locales, using ICU. It's
just that this is not what the patch does at moment. It just finds out the
locale set at initdb and uses it for collation with ICU.

Handling mixed locales for collation has a few hard problems, AFAIK.
First, isn't the main obstacle for mixing collations that indices require
a single well defined locale? I assume that locale dependant comparison
(collation) is used when indexing tuples, right? As long as a specific
locales collation is used for indexing text fields, I believe we cannot
easily mix different locales, right? Second, how do we tell the backend
which locale to use? Is there some SQL spec for this?

> It's a bit of a large pill to swallow if we will still
> have to throw it away someday to become SQL spec compliant.

What do we need to be SQL spec compliant in this respect?

/Palle




Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 11:19 PM
> To: John Hansen
> Cc: alvherre@dcc.uchile.cl; pgman@candle.pha.pa.us;
> girgen@pingpong.net; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > > > > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > > > > Tatsuo Ishii wrote:
> > > > >
> > > > > > > So Japanese(including ASCII)/UNICODE behavior is
> > > > > perfectly correct
> > > > > > > at this moment.
> > > > > >
> > > > > > Right, so you _never_ use accented ascii characters in
> > > Japanese?
> > > > > > (like è for example, whose uppercase is È)
> > > > >
> > > > > That isn't ASCII.  It's latin1 or some other ASCII extension.
> > > >
> > > > Point taken...
> > > > But...
> > > >
> > > > If you want EUC_JP (Japanese + ASCII) then use that as your
> > > backend encoding, not UTF-8 (unicode).
> > > > UTF-8 encoded databases are very useful for
> representing multiple
> > > > languages in the same database, but this usefulness
> > > vanishes if functions like upper/lower doesn't work correctly.
> > >
> > > I'm just curious if Germany/French/Spanish mixed text can
> be sorted
> > > correctly. I think these languages need their own locales
> even with
> > > UNICODE/ICU.
> >
> > No, they will not sort correctly, for that you still need
> the locale.
>
> I'm confused. I thought the ICU patches is intended for using
> on broken locale platforms?

Initially yes, but why duplicate code?
What I meant was, that they will not sort correctly using the C locale.
Locale _name_ needs to be known to ICU for it to sort correctly.

> --
> Tatsuo Ishii
>
>


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tom Lane wrote:
> Sent: Monday, May 09, 2005 2:47 AM
> To: Palle Girgensohn
> Cc: Tatsuo Ishii; John Hansen; alvherre@dcc.uchile.cl;
> pgman@candle.pha.pa.us; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> Palle Girgensohn <girgen@pingpong.net> writes:
> >> I'm confused. I thought the ICU patches is intended for using on
> >> broken locale platforms?
>
> > It will sort correctly in *one* locale, using ICU. You still cannot
> > mix different locales in the same database cluster, the collation
> > locale is still fixed at initdb time, unfortunately.
>
> I thought the point of using ICU was to be able to dig out
> from under that restriction?  It's a bit of a large pill to
> swallow if we will still have to throw it away someday to
> become SQL spec compliant.

That is not a limitation of ICU but of postgresql.
I don't know what the specs say, but imagine something like:
SELECT foo FROM bar ORDER BY foo WITH LOCALE 'en_US', foobar WITH LOCALE
'jp_JP';

Which would be less difficult to implement using ICU.

>
>             regards, tom lane
>
>

... John


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 11:08 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > > I don't buy it. If current conversion tables does the
> right thing,
> > > why we need to replace. Or if conversion tables are not
> correct, why
> > > don't you fix it? I think the rule of character
> conversion will not
> > > change frequently, especially for LATIN languages. Thus
> maintaining
> > > cost is not too high.
> >
> > I never said we need to, but if we're going to implement
> ICU, then we
> > might as well go all the way.
>
> So you admit there's no benefit using ICU for replacing
> existing conversions?
>
> Besides ICU does not support all existing conversions, I
> think ICU has serious flaw for using conversion. If I
> understand correctly, ICU uses UNICODE internally to do the
> conversion. For example, to implement
> SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> converts UNICODE to EUC_JP. Problem is these conversion is
> not roud trip(conversion between SJIS/EUC_JP and UNICODE will
> lose some information). Thus SJIS->EUC_JP->SJIS conversion
> using ICU does not preserve original text.

Could you please send me a sample text as an attachment encoded in SJIS
where this would happen?

> --
> Tatsuo Ishii
>
>


Re: Patch for collation using ICU

From
"John Hansen"
Date:

> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: Sunday, May 08, 2005 11:08 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > > I don't buy it. If current conversion tables does the
> right thing,
> > > why we need to replace. Or if conversion tables are not
> correct, why
> > > don't you fix it? I think the rule of character
> conversion will not
> > > change frequently, especially for LATIN languages. Thus
> maintaining
> > > cost is not too high.
> >
> > I never said we need to, but if we're going to implement
> ICU, then we
> > might as well go all the way.
>
> So you admit there's no benefit using ICU for replacing
> existing conversions?
>
> Besides ICU does not support all existing conversions, I
> think ICU has serious flaw for using conversion. If I
> understand correctly, ICU uses UNICODE internally to do the
> conversion. For example, to implement
> SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> converts UNICODE to EUC_JP. Problem is these conversion is
> not roud trip(conversion between SJIS/EUC_JP and UNICODE will
> lose some information). Thus SJIS->EUC_JP->SJIS conversion
> using ICU does not preserve original text.

Just for the record, I fetched a web page encoded in sjis, and converted
it to euc-jp and back using uconv from ICU 3.2, and the result is the
original is identical to the transformed file.
uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.htmluconv -f EUC-JP -t Shift_JIS -o index.html.sjis
index.html.eucdiffindex.html index.html.sjis 

... John


Re: Patch for collation using ICU

From
Tatsuo Ishii
Date:
> > -----Original Message-----
> > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] 
> > Sent: Sunday, May 08, 2005 11:08 PM
> > To: John Hansen
> > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; 
> > pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] Patch for collation using ICU
> > 
> > > > I don't buy it. If current conversion tables does the 
> > right thing, 
> > > > why we need to replace. Or if conversion tables are not 
> > correct, why 
> > > > don't you fix it? I think the rule of character 
> > conversion will not 
> > > > change frequently, especially for LATIN languages. Thus 
> > maintaining 
> > > > cost is not too high.
> > > 
> > > I never said we need to, but if we're going to implement 
> > ICU, then we 
> > > might as well go all the way.
> > 
> > So you admit there's no benefit using ICU for replacing 
> > existing conversions?
> > 
> > Besides ICU does not support all existing conversions, I 
> > think ICU has serious flaw for using conversion. If I 
> > understand correctly, ICU uses UNICODE internally to do the 
> > conversion. For example, to implement
> > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> > converts UNICODE to EUC_JP. Problem is these conversion is 
> > not roud trip(conversion between SJIS/EUC_JP and UNICODE will 
> > lose some information). Thus SJIS->EUC_JP->SJIS conversion 
> > using ICU does not preserve original text.
> 
> Just for the record, I fetched a web page encoded in sjis, and converted
> it to euc-jp and back using uconv from ICU 3.2, and the result is the
> original is identical to the transformed file.
> 
>  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
>  uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
>  diff index.html index.html.sjis

Not all SJIS/EUC_JP characters have the problem. You might want to
try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.

BTW, I got this with ICU 3.2:

$ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found

The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.

This makes me nervous in using ICU...
--
Tatsuo Ishii


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 12:32 AM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > > -----Original Message-----
> > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> > > Sent: Sunday, May 08, 2005 11:08 PM
> > > To: John Hansen
> > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > >
> > > > > I don't buy it. If current conversion tables does the
> > > right thing,
> > > > > why we need to replace. Or if conversion tables are not
> > > correct, why
> > > > > don't you fix it? I think the rule of character
> > > conversion will not
> > > > > change frequently, especially for LATIN languages. Thus
> > > maintaining
> > > > > cost is not too high.
> > > >
> > > > I never said we need to, but if we're going to implement
> > > ICU, then we
> > > > might as well go all the way.
> > >
> > > So you admit there's no benefit using ICU for replacing existing
> > > conversions?
> > >
> > > Besides ICU does not support all existing conversions, I
> think ICU
> > > has serious flaw for using conversion. If I understand correctly,
> > > ICU uses UNICODE internally to do the conversion. For example, to
> > > implement
> > > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> > > converts UNICODE to EUC_JP. Problem is these conversion
> is not roud
> > > trip(conversion between SJIS/EUC_JP and UNICODE will lose some
> > > information). Thus SJIS->EUC_JP->SJIS conversion using
> ICU does not
> > > preserve original text.
> >
> > Just for the record, I fetched a web page encoded in sjis, and
> > converted it to euc-jp and back using uconv from ICU 3.2, and the
> > result is the original is identical to the transformed file.
> >
> >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> uconv -f
> > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc  diff
> index.html
> > index.html.sjis
>
> Not all SJIS/EUC_JP characters have the problem. You might want to
> try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
>
> BTW, I got this with ICU 3.2:
>
> $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
> Conversion from Unicode to codepage failed at input byte
> position 0. Unicode: 301c Error: Invalid character found
>
> The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.

That actually makes perfect sense, since according to unicode.org's
database:
301C ~ WAVE DASH      This character was encoded to match JIS C 6226-1978 1-33 "wave
dash".      The JIS standards and some industry practise disagree in mapping. - 3030 wavy dash - FF5E full width tilde

In PG FF5E is the mapping currently used. That is obviously wrong
(according to the standards), as that is only a 'similar character'.

Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis
doesn't define "WAVE DASH".
In all, I believe this behaviour to be correct according to the
standards.

There'd be nothing to stop us from defining alternative mappings for the
cases where we deviate from the standard, but the question is, should we
be non-standard?

>
> This makes me nervous in using ICU...
> --
> Tatsuo Ishii
>
>

... John


Re: Patch for collation using ICU

From
Tatsuo Ishii
Date:
> Tatsuo Ishii wrote:
> > Sent: Tuesday, May 10, 2005 12:32 AM
> > To: John Hansen
> > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; 
> > pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] Patch for collation using ICU
> > 
> > > > -----Original Message-----
> > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> > > > Sent: Sunday, May 08, 2005 11:08 PM
> > > > To: John Hansen
> > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; 
> > > > pgsql-hackers@postgresql.org
> > > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > > 
> > > > > > I don't buy it. If current conversion tables does the
> > > > right thing,
> > > > > > why we need to replace. Or if conversion tables are not
> > > > correct, why
> > > > > > don't you fix it? I think the rule of character
> > > > conversion will not
> > > > > > change frequently, especially for LATIN languages. Thus
> > > > maintaining
> > > > > > cost is not too high.
> > > > > 
> > > > > I never said we need to, but if we're going to implement
> > > > ICU, then we
> > > > > might as well go all the way.
> > > > 
> > > > So you admit there's no benefit using ICU for replacing existing 
> > > > conversions?
> > > > 
> > > > Besides ICU does not support all existing conversions, I 
> > think ICU 
> > > > has serious flaw for using conversion. If I understand correctly, 
> > > > ICU uses UNICODE internally to do the conversion. For example, to 
> > > > implement
> > > > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE then
> > > > converts UNICODE to EUC_JP. Problem is these conversion 
> > is not roud 
> > > > trip(conversion between SJIS/EUC_JP and UNICODE will lose some 
> > > > information). Thus SJIS->EUC_JP->SJIS conversion using 
> > ICU does not 
> > > > preserve original text.
> > > 
> > > Just for the record, I fetched a web page encoded in sjis, and 
> > > converted it to euc-jp and back using uconv from ICU 3.2, and the 
> > > result is the original is identical to the transformed file.
> > > 
> > >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html  
> > uconv -f 
> > > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc  diff 
> > index.html 
> > > index.html.sjis
> > 
> > Not all SJIS/EUC_JP characters have the problem. You might want to
> > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> > 
> > BTW, I got this with ICU 3.2:
> > 
> > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt 
> > Conversion from Unicode to codepage failed at input byte 
> > position 0. Unicode: 301c Error: Invalid character found
> > 
> > The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.
> 
> That actually makes perfect sense, since according to unicode.org's
> database:
> 301C ~ WAVE DASH
>        This character was encoded to match JIS C 6226-1978 1-33 "wave
> dash".
>        The JIS standards and some industry practise disagree in mapping.
>      - 3030 wavy dash
>      - FF5E full width tilde
> 
> In PG FF5E is the mapping currently used. That is obviously wrong
> (according to the standards), as that is only a 'similar character'.
> 
> Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis
> doesn't define "WAVE DASH".
> In all, I believe this behaviour to be correct according to the
> standards.
> 
> There'd be nothing to stop us from defining alternative mappings for the
> cases where we deviate from the standard, but the question is, should we
> be non-standard?

You missed the point. EUC_JP 0xa1c1 is a perfect valid data and 
uconv -f EUC_JP -t Shift_JIS should convert it to Shift_JIS 0x8160
regardless of the internal of uconv.
--
Tatsuo Ishii


Re: Patch for collation using ICU

From
"John Hansen"
Date:
Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 5:45 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > Tatsuo Ishii wrote:
> > > Sent: Tuesday, May 10, 2005 12:32 AM
> > > To: John Hansen
> > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > >
> > > > > -----Original Message-----
> > > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> > > > > Sent: Sunday, May 08, 2005 11:08 PM
> > > > > To: John Hansen
> > > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> > > > > pgsql-hackers@postgresql.org
> > > > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > > >
> > > > > > > I don't buy it. If current conversion tables does the
> > > > > right thing,
> > > > > > > why we need to replace. Or if conversion tables are not
> > > > > correct, why
> > > > > > > don't you fix it? I think the rule of character
> > > > > conversion will not
> > > > > > > change frequently, especially for LATIN languages. Thus
> > > > > maintaining
> > > > > > > cost is not too high.
> > > > > >
> > > > > > I never said we need to, but if we're going to implement
> > > > > ICU, then we
> > > > > > might as well go all the way.
> > > > >
> > > > > So you admit there's no benefit using ICU for
> replacing existing
> > > > > conversions?
> > > > >
> > > > > Besides ICU does not support all existing conversions, I
> > > think ICU
> > > > > has serious flaw for using conversion. If I understand
> > > > > correctly, ICU uses UNICODE internally to do the
> conversion. For
> > > > > example, to implement
> > > > > SJIS->EUC_JP conversion, ICU first converts SJIS to
> UNICODE then
> > > > > converts UNICODE to EUC_JP. Problem is these conversion
> > > is not roud
> > > > > trip(conversion between SJIS/EUC_JP and UNICODE will
> lose some
> > > > > information). Thus SJIS->EUC_JP->SJIS conversion using
> > > ICU does not
> > > > > preserve original text.
> > > >
> > > > Just for the record, I fetched a web page encoded in sjis, and
> > > > converted it to euc-jp and back using uconv from ICU
> 3.2, and the
> > > > result is the original is identical to the transformed file.
> > > >
> > > >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> > > uconv -f
> > > > EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc  diff
> > > index.html
> > > > index.html.sjis
> > >
> > > Not all SJIS/EUC_JP characters have the problem. You might want to
> > > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> > >
> > > BTW, I got this with ICU 3.2:
> > >
> > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
> Conversion
> > > from Unicode to codepage failed at input byte position 0.
> Unicode:
> > > 301c Error: Invalid character found
> > >
> > > The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.
> >
> > That actually makes perfect sense, since according to unicode.org's
> > database:
> > 301C ~ WAVE DASH
> >        This character was encoded to match JIS C 6226-1978
> 1-33 "wave
> > dash".
> >        The JIS standards and some industry practise
> disagree in mapping.
> >      - 3030 wavy dash
> >      - FF5E full width tilde
> >
> > In PG FF5E is the mapping currently used. That is obviously wrong
> > (according to the standards), as that is only a 'similar character'.
> >
> > Unfortunately, there is no mapping from 301C to shift_jis, as
> > shift_jis doesn't define "WAVE DASH".
> > In all, I believe this behaviour to be correct according to the
> > standards.
> >
> > There'd be nothing to stop us from defining alternative
> mappings for
> > the cases where we deviate from the standard, but the question is,
> > should we be non-standard?
>
> You missed the point. EUC_JP 0xa1c1 is a perfect valid data
> and uconv -f EUC_JP -t Shift_JIS should convert it to
> Shift_JIS 0x8160 regardless of the internal of uconv.

Studying ICU forther, I found that it works fine, provided you use the
_correct_ charset for the conversion..

a.txt contains 0x81 0x60
uconv -f ibm-943_P130-1999 -t EUC_JP a.txt -o b.txt
b.txt now contains 0xa1 0xc1
uconv -t ibm-943_P130-1999 -f EUC_JP b.txt -o a.txt
a.txt still contains 0x81 0x60

The mapping table you want is ibm-943_P130-1999
Similar, we'd need to find the right euc-jp (and plain jis) mapping,
assuming we want the one that strictly defines JIS X 0208 right?

I trust this to put your fears to rest...

> --
> Tatsuo Ishii
>
>

... John