Re: UTF-8 and LIKE vs = - Mailing list pgsql-general

From Joel
Subject Re: UTF-8 and LIKE vs =
Date
Msg-id 20040831164908.5FFD.REES@ddcom.co.jp
Whole thread Raw
In response to UTF-8 and LIKE vs =  (David Wheeler <david@kineticode.com>)
List pgsql-general
On Mon, 30 Aug 2004 17:16:20 -0700 David Wheeler wrote

> On Aug 27, 2004, at 5:27 AM, Joel wrote:
>
> > I would expect to run into problems with collation. In that case, you
> > may end up setting up separate databases for each language, as I
> > mentioned before in the mail that I forgot to post to the list so
> > people
> > could correct me if I'm wrong.
>
> As far as I know, collation is essentially how an index is ordered,

Collation can be used when setting up an index (as Michael points out).

> correct? So that when I so an "ORDER BY" query, the order in which the
> rows are returned is determined by the collation. Is that correct?

Anything that is related to sort order will be effected by collation,
and it is sometimes surprising what is related to sort order. (Sorry to
be vague, it's been a while.)

I have no experience with Korean. All I know is by hearsay.

In Japanese, key fields will often be doubled. This is so that both the
Kanji (ideographic) and kana (pronunciation) can be indexed. Kanji are
not considered to have inherent (standard) ordering in most applications,
and space really isn't (usually) a delimiter. So straight kana order is
(usually) sufficient for the kana field, and codepoint order is usually
sufficient for the Kanji.

Some functions will require some special handling for the kana. One of
the issues is from legacy 8-bit katakana only encodings. Another is
derived from what we would call compositing problems.

There are occasions when codepoint ordering for the Kanji will produce
counter-intuitive results. This is because the natural orderings do
exist (however ambiguously) and both the traditional JIS lists and the
Unicode lists break up the Chinese ideographs in groups that cut cross
sections out of the natural orderings.

What I've heard of Korean, you may run into similar issues even though
legacy should not be so much of a problem.

> If so, then I'm happy with the 80% solution of defaulting to Unicode
> ordering (or "Unicodabetical").

Not knowing what your app is, it would be hard to say how far down the
road it will be before you hit problems with this. (Or if you will.) It
sounds like the only way for you to find out is to put it into
production and ask for feedback.

If you want to get a head start on something, you might want to look
into making or finding custom collation tables. (Maybe.)

> > Other than that, it depends on what functions the database will have.
> >
> > If what is being done with the CJKT is pretty basic stuff, I may be
> > just
> > another too-pessimistic voice.
>
> Frankly, I'm more concerned with the ability of queries to work than I
> am of ordering results. Ordering is strictly secondary.

For now, that's probably correct. But I've given you about a half of a
heads-up on it. Have fun.

--
Joel <rees@ddcom.co.jp>


pgsql-general by date:

Previous
From: Ulrich Wisser
Date:
Subject: Re: Hebrew support -- please help !
Next
From: Martijn van Oosterhout
Date:
Subject: Re: DB failure?