Peter Eisentraut <peter_e@gmx.net> writes:
> By the way, I have always been concerned about the feature of Unicode
> that you can write logically equivalent strings using different
> code-point sequences. Namely, you often have the option of writing an
> accented letter using the "legacy" single codepoint (like in ISO
> 8859-something) or alternatively using accent plus "base letter" as two
> code points. Collating systems should treat them the same, so hashing
> the byte values won't work anyway. This is a more extreme case of
> "tyty" vs. "tty" because using a proper rendering system, those Unicode
> strings should look the same to the naked eye. Therefore, I'm doubtful
> that using a binary comparison as tie-breaker is proper behavior.

Hm. Would you expect that these sequences generate identical strxfrm
output?
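
To make the two spellings concrete, here is a small Python sketch (an illustration only, not taken from any PostgreSQL code; the helper names same_bytes and canonically_equivalent are made up for this example). It shows that the single-codepoint and base-plus-combining-accent forms of the same letter have different byte values, so a byte-wise hash or comparison tells them apart, while Unicode canonical normalization (NFC, chosen here as an assumption) treats them as the same string. Whether strxfrm yields identical output for them depends on the platform and locale, so the sketch makes no claim about that.

```python
import unicodedata

# Two ways to write "e with acute accent".
composed = "\u00e9"      # single code point, as in ISO 8859-1
decomposed = "e\u0301"   # base letter "e" followed by a combining acute accent


def same_bytes(a, b, encoding="utf-8"):
    """True if the two strings have identical encoded byte values."""
    return a.encode(encoding) == b.encode(encoding)


def canonically_equivalent(a, b):
    """True if the two strings are equal after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed.encode("utf-8"))                      # b'\xc3\xa9'
    print(decomposed.encode("utf-8"))                    # b'e\xcc\x81'
    print(same_bytes(composed, decomposed))              # False
    print(canonically_equivalent(composed, decomposed))  # True
```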

The weight of opinion later in the thread seems to be leaning towards
the idea that we do not want to accept the word of strcoll/strxfrm about
whether two strings are equal: there are too many scenarios where lax
equality behavior would be a serious bug, and too few where it's
critical to have it. I'm still prepared to listen to argument though.

A possible compromise going forward would be to introduce an additional
comparison operator that tests for strcoll equality --- but I'd vote for
calling it something other than "=".
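
As a rough sketch of how the two notions could coexist (again in Python, with hypothetical function names; this is not a proposal for the actual implementation), ordering uses strcoll first and falls back to a byte-wise comparison only when strcoll reports a tie, so strict equality holds only for byte-identical strings. The looser "strcoll says equal" test is a separate function, standing in for the separately named operator.

```python
import locale


def collation_equal(a, b):
    """Lax test: equal as far as strcoll is concerned in the current locale.

    Stands in for the hypothetical extra operator, deliberately not "=".
    """
    return locale.strcoll(a, b) == 0


def compare_with_tiebreak(a, b):
    """Three-way comparison: strcoll order first, then byte values on a tie."""
    c = locale.strcoll(a, b)
    if c != 0:
        return -1 if c < 0 else 1
    # strcoll could not separate the strings; fall back to the encoded
    # bytes so that only byte-identical strings compare as equal.
    ab, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ab > bb) - (ab < bb)


def strict_equal(a, b):
    """Strict test: what "=" would mean under the tie-breaker rule."""
    return compare_with_tiebreak(a, b) == 0
```

With this arrangement strict_equal can never be true for strings that differ in their bytes, whatever the locale thinks of them, while collation_equal is free to report whatever strcoll says.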

regards, tom lane