On Sat, Sep 17, 2005 at 03:49:24PM -0400, Greg Stark wrote:
> Well, consider the case of two different Unicode encoded strings that
> actually represent the same series of characters. They may be byte-wise
> different but there's really no difference at all in the text they contain.
Strictly speaking, a valid Unicode string is the shortest possible
representation. So at least one of the two should be rejected as
invalid. Whether people do this or not is another issue entirely. It is
certainly recommended to reject non-optimally encoded strings, for
security purposes at least. You don't really want to accept multiple
ways of specifying things like '/' and '\' and other special chars.
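
To make the '/' case concrete, here is a minimal sketch in Python (not
PostgreSQL code). It assumes the strings arrive as UTF-8 and relies on
Python's strict decoder, which refuses non-shortest forms. The function
name decode_shortest_form is made up for the illustration.

```python
from typing import Optional


def decode_shortest_form(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        # Python's strict UTF-8 decoder rejects overlong (non-shortest) forms.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


SLASH = b"\x2f"               # '/' in its one-byte, shortest form
OVERLONG_SLASH = b"\xc0\xaf"  # the same code point padded out to two bytes

print(decode_shortest_form(SLASH))           # prints: /
print(decode_shortest_form(OVERLONG_SLASH))  # prints: None
```

A filter that only looks for the byte 0x2f would miss the second form,
which is why rejecting it at the decoding step is the safer choice.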
> Nonetheless, I may agree with you that the world would be a better place if
> collation orders never created this situation. But unless we can point to some
> spec or some solid reason why if that ever happened it would cause worse
> headaches than this I think it's necessary to protect the hashing function
> from being out of sync with the btree operators.
Well, the Unicode spec doesn't do it that way; does that count? On a
purely practical level though, we have to work with it until PostgreSQL
is using something like ICU, thus solving the problem completely.
Case-insensitivity is a large can of worms. The strings "quit" and
"QUIT" match case-insensitively in most languages, but not in Turkish,
where the uppercase of "i" is the dotted "İ" rather than "I".
And neither of:
toupper(tolower(a)) == toupper(a)
tolower(toupper(a)) == tolower(a)
can be assumed in the general case. In the end we may need to provide
ways of specifying what people mean by "case-insensitive": whether or
not to ignore accents, etc.
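
Both identities can be seen to fail with a quick Python sketch. Note the
assumptions: Python's str.upper() and str.lower() apply the
locale-independent Unicode default case mappings, not the
locale-dependent C toupper()/tolower(), so this only illustrates the
principle. The helper names are made up for the example.

```python
def upper_roundtrip_holds(a: str) -> bool:
    """Check toupper(tolower(a)) == toupper(a) for one string."""
    return a.lower().upper() == a.upper()


def lower_roundtrip_holds(a: str) -> bool:
    """Check tolower(toupper(a)) == tolower(a) for one string."""
    return a.upper().lower() == a.lower()


DOTTED_CAPITAL_I = "\u0130"  # Turkish capital I with dot above
DOTLESS_SMALL_I = "\u0131"   # Turkish small dotless i
SHARP_S = "\u00df"           # German sharp s, which uppercases to "SS"

# Plain ASCII behaves as expected.
print(upper_roundtrip_holds("quit"), lower_roundtrip_holds("QUIT"))  # True True

# Lowercasing U+0130 yields "i" plus a combining dot; uppercasing that
# gives "I" plus the combining dot, not U+0130 again.
print(upper_roundtrip_holds(DOTTED_CAPITAL_I))  # False

# Dotless i uppercases to plain "I", which lowercases to dotted "i".
print(lower_roundtrip_holds(DOTLESS_SMALL_I))   # False

# Sharp s becomes "SS" and comes back as "ss".
print(lower_roundtrip_holds(SHARP_S))           # False
```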
ICU provides a way of specifying transforms like 'drop accents', so
this can be solved...
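
As a rough idea of what a 'drop accents' transform does, here is a
sketch using only Python's standard unicodedata module. This is a
stand-in, not the ICU API: it decomposes the text, drops nonspacing
marks, and recomposes. The function name drop_accents is an assumption
for the example.

```python
import unicodedata


def drop_accents(text: str) -> str:
    """Remove combining accents by decomposing, filtering, and recomposing."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (Mark, nonspacing) covers the combining accents.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual composed form.
    return unicodedata.normalize("NFC", stripped)


print(drop_accents("caf\u00e9"))          # prints: cafe
print(drop_accents("\u00c5ngstr\u00f6m"))  # prints: Angstrom
```

Comparing drop_accents(a) with drop_accents(b) is one possible meaning
of "ignore accents"; letters that are distinct in some language without
being a base letter plus a mark are left untouched by this approach.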
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.