Thread: [HACKERS] How can I find a specific collation in pg_collation when using ICU?

Hello,

I tried to find a particular collation name in pg_collation, but I
cannot understand the naming convention after reading the following
article.  Specifically, I want to find out whether there is some
collation equivalent to Japanese_CI_AS in SQL Server, which means
Japanese, case-insensitive, and accent sensitive.  Could you tell me
how to do this?  Is there any room to improve the PG manual?


https://www.postgresql.org/docs/devel/static/collation.html
--------------------------------------------------
23.2.2.2.2. ICU collations

With ICU, it is not sensible to enumerate all possible locale names.
ICU uses a particular naming system for locales, but there are many
more ways to name
a locale than there are actually distinct locales. (In fact, any
string will be accepted as a locale name.) See
http://userguide.icu-project.org/locale
for information on ICU locale naming.
--------------------------------------------------


Regards
MauMau




Re: [HACKERS] How can I find a specific collation in pg_collationwhen using ICU?

From
Peter Eisentraut
Date:
On 8/9/17 08:38, MauMau wrote:
> I tried to find a particular collation name in pg_collation, but I
> cannot understand the naming convention after reading the following
> article.  Specifically, I want to find out whether there is some
> collation equivalent to Japanese_CI_AS in SQL Server, which means
> Japanese, case-insensitive, and accent sensitive.

There are no case-insensitive collations in PostgreSQL (yet).

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] How can I find a specific collation in pg_collationwhen using ICU?

From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 5:38 AM, MauMau <maumau307@gmail.com> wrote:
> I tried to find a particular collation name in pg_collation, but I
> cannot understand the naming convention after reading the following
> article.  Specifically, I want to find out whether there is some
> collation equivalent to Japanese_CI_AS in SQL Server, which means
> Japanese, case-insensitive, and accent sensitive.  Could you tell me
> how to do this?  Is there any room to improve the PG manual?

This is not an answer to the question you asked, but it may interest
you to know that ICU uses JIS X 4061 for Japanese, unlike Glibc. This
will give more useful results when sorting Japanese.

The best explanation of the difference that I can understand is here,
under "Why do CJK strings sort incorrectly in Unicode?":

https://dev.mysql.com/doc/refman/5.5/en/faqs-cjk.html

-- 
Peter Geoghegan



Re: [HACKERS] How can I find a specific collation in pg_collationwhen using ICU?

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Eisentraut
> There are no case-insensitive collations in PostgreSQL (yet).

That's sad news, as I expected ICU to bring its various collation features to PostgreSQL.  I hope it will be easy to
addthem.
 


From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Geoghegan
> This is not an answer to the question you asked, but it may interest you
> to know that ICU uses JIS X 4061 for Japanese, unlike Glibc. This will give
> more useful results when sorting Japanese.
> 
> The best explanation of the difference that I can understand is here, under
> "Why do CJK strings sort incorrectly in Unicode?":
> 
> https://dev.mysql.com/doc/refman/5.5/en/faqs-cjk.html

Thanks a lot.  MysQL seems to have many collations, doesn't it?

Regards
Takayuki Tsunakawa


Re: [HACKERS] How can I find a specific collation in pg_collationwhen using ICU?

From
Peter Geoghegan
Date:
On Wed, Aug 9, 2017 at 6:19 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Eisentraut
>> There are no case-insensitive collations in PostgreSQL (yet).
>
> That's sad news, as I expected ICU to bring its various collation features to PostgreSQL.  I hope it will be easy to
addthem.
 

The reason it is not easy is that text equality is based on strict
binary equality. We would have to teach hash operator classes about
collations to fix this, and make text_eq() hash strxfrm() or
something. That requires special work. You can ask ICU for case
insensitivity with Postgres 10, but the strcmp() tie breaker within
varstr_cmp() will prevent it from being truly case insensitive.

>> The best explanation of the difference that I can understand is here, under
>> "Why do CJK strings sort incorrectly in Unicode?":
>>
>> https://dev.mysql.com/doc/refman/5.5/en/faqs-cjk.html
>
> Thanks a lot.  MysQL seems to have many collations, doesn't it?

Well, it depends on how you define collation, which actually gets
quite complicated with ICU. You can combine certain options together
with great flexibility, it seems (see my e-mail from earlier today --
"What users can do with custom ICU collations in Postgres 10"). Let's
assume that there are 10 distinct options for each locale that each
independently affect sort order for the base collation of the locale.
I believe that there are at least 10 such options that change things,
and maybe a lot more. Theoretically, this means that there are an
absolutely enormous number of possible collations with ICU.

Now, I wouldn't *actually* say that we have many thousands of
collations with ICU, because that doesn't seem like a sensible way to
explain ICU Postgres collations. The way that we think about this from
using libc doesn't work well for ICU. Instead, I would say that we
have hundreds of locales (not collations), each of which support some
variant options (e.g., traditional Spanish sort order, alternative
Japanese sort order, pictographic emoji sorting), with some further
generic options for varying how case it handled, how numbers are
handled, and other things like that.

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Aug 9, 2017 at 6:19 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> That's sad news, as I expected ICU to bring its various collation features to PostgreSQL.  I hope it will be easy to
addthem. 

> The reason it is not easy is that text equality is based on strict
> binary equality.

Yeah.  You can change sort order all you want, but you can't easily
change the system's notion of equality.  But when people ask for "case
insensitive" collation, they generally want that too.
        regards, tom lane