Re: Patch for collation using ICU - Mailing list pgsql-hackers

From	Palle Girgensohn
Subject	Re: Patch for collation using ICU
Date	May 7, 2005 13:37:00
Msg-id	1B9F2612297F6479B9E5E7B1@palle.girgensohn.se
In response to	Re: Patch for collation using ICU ("John Hansen" <john@geeknet.com.au>)
Responses	Re: Patch for collation using ICU
List	pgsql-hackers

--On Saturday, May 07, 2005 23.15.29 +1000 John Hansen <john@geeknet.com.au>
wrote:

> Btw, I had been planning to propose replacing every single one of the
> built in charset conversion functions with calls to ICU (thus making pg
> _depend_ on ICU), as this would seem like a cleaner solution than for us
> to maintain our own conversion tables.
>
> ICU also has a fair few conversions that we do not have at present.
>
> Any thoughts?

I just had a similar thought. And why use ICU only for multibyte charsets?
If I use LATIN1, I still expect upper('ß') => SS, and I don't get it...
Same for the Turkish example.
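
To make the LATIN1 point concrete, here is a minimal sketch, assuming that
single-byte upper() boils down to one toupper() call per byte (the thread does
not spell out the actual implementation). The helper name bytewise_upper is
hypothetical. In LATIN1, 'ß' is the single byte 0xDF; a one-byte-in,
one-byte-out mapping can never expand it to the two characters "SS", whatever
the locale.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code: uppercase a single-byte-encoded
 * buffer in place, one toupper() call per byte. The output always has the
 * same length as the input, so 0xDF ('ß' in LATIN1) cannot become "SS".
 */
static void
bytewise_upper(unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = (unsigned char) toupper(s[i]);
}
```

Called on the three bytes 'a', 0xDF, 'z' in the default C locale, this leaves
0xDF untouched and only maps the ASCII letters. A full case mapping such as
the one ICU provides works on whole strings and may change their length.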

It does eat more memory, and may cost some performance. With
the current scheme, a strdup is often enough, or at least just one palloc.
With ICU, using UTF-16, you must allocate memory twice, one of the
allocations being for ICU's internal UTF-16 representation. That's not a
very strong objection, though, as this would be an option... :)
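
The extra allocation can be sketched roughly as follows. This is an
illustration under stated assumptions, not the patch's code: the helper names
utf16_units_needed and alloc_utf16_scratch are hypothetical, the input is
assumed to be well-formed UTF-8, and plain malloc stands in for palloc. It
only shows sizing and allocating the intermediate UTF-16 buffer that a
strdup-style approach never needs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: number of UTF-16 code units needed to hold a UTF-8
 * string of len bytes. Assumes well-formed UTF-8 input.
 */
static size_t
utf16_units_needed(const unsigned char *s, size_t len)
{
    size_t units = 0;

    for (size_t i = 0; i < len; i++)
    {
        if ((s[i] & 0xC0) == 0x80)
            continue;                      /* continuation byte, already counted */
        units += (s[i] >= 0xF0) ? 2 : 1;   /* 4-byte sequences need a surrogate pair */
    }
    return units;
}

/*
 * Hypothetical helper: allocate the UTF-16 scratch buffer (plus room for a
 * terminator). This is the additional allocation; the converted result
 * still needs a buffer of its own. The caller frees the returned pointer.
 */
static uint16_t *
alloc_utf16_scratch(const unsigned char *s, size_t len, size_t *units)
{
    *units = utf16_units_needed(s, len);
    return malloc((*units + 1) * sizeof(uint16_t));
}
```

For the six UTF-8 bytes of 'æøå' this asks for three code units, so the
scratch buffer is small, but it is still a second allocation per call.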

John, I have a hard time finding docs about what differs in ICU 2.8 from
3.2. Do you have any pointers?

It seems 3.2 has much better support and more bug fixes, so I'm not sure we
should really consider 2.8.

/Palle

>
> ... John
>
>> -----Original Message-----
>> From: John Hansen
>> Sent: Saturday, May 07, 2005 11:09 PM
>> To: 'Palle Girgensohn'; 'Bruce Momjian'
>> Cc: 'pgsql-hackers@postgresql.org'
>> Subject: RE: [HACKERS] Patch for collation using ICU
>>
>> > --On Saturday, May 07, 2005 22.53.46 +1000 John Hansen
>> > <john@geeknet.com.au>
>> > wrote:
>> >
>> > > Errm,... initdb --encoding UNICODE --locale C
>> >
>> > You mean that ICU *shall* be used even for the C locale, and not as
>> > Bruce suggested here:
>>
>> Yes, that's exactly what I mean.
>>
>> >
>> > >> I do have a few questions:
>> > >>
>> > >> Why don't you use the lc_ctype_is_c() part of this test?
>> > >>
>> > >>      if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>> > >
>> > > Um, well, I didn't think about that. :)  What would be the locale in
>> > > this case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use
>> > > a wide encoding, indeed. Then the strings will be handled like
>> > > byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks.
>> >
>> > John disagrees here, and I'm obliged to agree. Using the C locale, one
>> > will expect C collation, but upper/lower is better off still using
>> > ICU. Hence, the above stuff is *not* a bug. Do we agree?
>> >
>> > /Palle
>> >
>> >
>> > >
>> > >> -----Original Message-----
>> > >> From: pgsql-hackers-owner@postgresql.org
>> > >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen
>> > >> Sent: Saturday, May 07, 2005 10:23 PM
>> > >> To: Palle Girgensohn; Bruce Momjian
>> > >> Cc: pgsql-hackers@postgresql.org
>> > >> Subject: Re: [HACKERS] Patch for collation using ICU
>> > >>
>> > >> >
>> > >> > I use this patch in production on one FreeBSD 4.10 server at the
>> > >> > moment.
>> > >> > With the latest version, I've had no problems. Logging is switched on
>> > >> > for now, and it shows no signs of ICU complaining. I'd like more
>> > >> > reports on Linux, though.
>> > >>
>> > >> I currently use this on gentoo with ICU3.2 unmasked.
>> > >>
>> > >> Works a dream, even with locale C and UNICODE database.
>> > >>
>> > >> Small test:
>> > >>
>> > >> createdb --encoding UNICODE --locale C test
>> > >> psql test
>> > >> set client_encoding=iso88591;
>> > >> CREATE TABLE test (t text);
>> > >> INSERT INTO test (t) VALUES ('æøå');
>> > >> set client_encoding=unicode;
>> > >> INSERT INTO test (t) SELECT upper(t) FROM test;
>> > >> set client_encoding=iso88591;
>> > >> SELECT * FROM test;
>> > >>   t
>> > >> -----
>> > >>  æøå
>> > >>  ÆØÅ
>> > >> (2 rows)
>> > >>
>> > >> Just as I'd expect, as upper/lower/initcap are locale independent for
>> > >> these characters.
>> > >>
