Re: UTF-8 and LIKE vs = - Mailing list pgsql-general
From | Joel |
---|---|
Subject | Re: UTF-8 and LIKE vs = |
Date | |
Msg-id | 20040831164908.5FFD.REES@ddcom.co.jp |
In response to | UTF-8 and LIKE vs = (David Wheeler <david@kineticode.com>) |
List | pgsql-general |
On Mon, 30 Aug 2004 17:16:20 -0700
David Wheeler wrote:

> On Aug 27, 2004, at 5:27 AM, Joel wrote:
>
> > I would expect to run into problems with collation. In that case, you
> > may end up setting up separate databases for each language, as I
> > mentioned before in the mail that I forgot to post to the list so
> > people could correct me if I'm wrong.
>
> As far as I know, collation is essentially how an index is ordered,

Collation can be used when setting up an index (as Michael points out).

> correct? So that when I do an "ORDER BY" query, the order in which the
> rows are returned is determined by the collation. Is that correct?

Anything that is related to sort order will be affected by collation,
and it is sometimes surprising what is related to sort order. (Sorry to
be vague, it's been a while.)

I have no experience with Korean. All I know is by hearsay.

In Japanese, key fields will often be doubled. This is so that both the
Kanji (ideographic) and kana (pronunciation) can be indexed. Kanji are
not considered to have inherent (standard) ordering in most
applications, and space really isn't (usually) a delimiter. So straight
kana order is (usually) sufficient for the kana field, and codepoint
order is usually sufficient for the Kanji.

Some functions will require some special handling for the kana. One of
the issues comes from legacy 8-bit katakana-only encodings. Another is
derived from what we would call compositing problems.

There are occasions when codepoint ordering for the Kanji will produce
counter-intuitive results. This is because the natural orderings do
exist (however ambiguously), and both the traditional JIS lists and the
Unicode lists break up the Chinese ideographs into groups that cut cross
sections out of the natural orderings.

From what I've heard of Korean, you may run into similar issues, even
though legacy encodings should not be so much of a problem.

> If so, then I'm happy with the 80% solution of defaulting to Unicode
> ordering (or "Unicodabetical").

Not knowing what your app is, it would be hard to say how far down the
road it will be before you hit problems with this. (Or if you will.) It
sounds like the only way for you to find out is to put it into
production and ask for feedback.

If you want to get a head start on something, you might want to look
into making or finding custom collation tables. (Maybe.)

> > Other than that, it depends on what functions the database will have.
> >
> > If what is being done with the CJKT is pretty basic stuff, I may be
> > just another too-pessimistic voice.
>
> Frankly, I'm more concerned with the ability of queries to work than I
> am with ordering results. Ordering is strictly secondary.

For now, that's probably correct. But I've given you about half of a
heads-up on it.

Have fun.

-- 
Joel <rees@ddcom.co.jp>
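
A rough illustration of the doubled key field idea from the message above, as a minimal Python sketch rather than anything taken from the thread. The sample names, the `people` list, and both helper functions are assumptions made up for the example. Python compares strings by code point, so sorting on the Kanji field gives "Unicodabetical" order, while sorting on the kana field gives an order by reading. Note that plain code-point order of hiragana only approximates dictionary order (voiced and small kana are interleaved), which is one reason the message says kana may need special handling.

```python
# Hypothetical records with a doubled key: (kanji_name, kana_reading).
# The names are illustrative only and do not come from the original mail.
people = [
    ("山田", "やまだ"),
    ("鈴木", "すずき"),
    ("佐藤", "さとう"),
    ("田中", "たなか"),
]


def sort_by_codepoint(rows):
    """Sort on the Kanji field; Python str comparison is by code point."""
    return sorted(rows, key=lambda row: row[0])


def sort_by_reading(rows):
    """Sort on the kana field, i.e. by pronunciation (code-point order of kana)."""
    return sorted(rows, key=lambda row: row[1])


by_codepoint = [kanji for kanji, _ in sort_by_codepoint(people)]
by_reading = [kanji for kanji, _ in sort_by_reading(people)]

print(by_codepoint)  # ['佐藤', '山田', '田中', '鈴木']
print(by_reading)    # ['佐藤', '鈴木', '田中', '山田']
```

The two orders differ even for four rows, which is the kind of surprise the message warns about when only code-point ordering of the Kanji field is available.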