Re: is this a bug or I am blind? - Mailing list pgsql-general

From Greg Stark
Subject Re: is this a bug or I am blind?
Msg-id 87psnvb2ww.fsf@stark.xeocode.com
In response to Re: is this a bug or I am blind?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > By the way, I have always been concerned about the feature of Unicode
> > that you can write logically equivalent strings using different
> > code-point sequences.  Namely, you often have the option of writing an
> > accented letter using the "legacy" single codepoint (like in ISO
> > 8859-something) or alternatively using accent plus "base letter" as two
> > code points.  Collating systems should treat them the same, so hashing
> > the byte values won't work anyway.  This is a more extreme case of
> > "tyty" vs. "tty" because using a proper rendering system, those Unicode
> > strings should look the same to the naked eye.  Therefore, I'm doubtful
> > that using a binary comparison as tie-breaker is proper behavior.
>
> Hm.  Would you expect that these sequences generate identical strxfrm
> output?

I think this is mixing up two different things.

Using iso-8859-1 to encode "é" as a single byte, versus using UTF8, which
takes two bytes to encode it, is a matter of using two *different* encodings.
The string of characters being encoded is precisely the same: the sequence of
bytes in the encoded string differs, but the sequence of characters those
bytes represent does not.
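
To make that concrete, here's a minimal C sketch. The byte values below are
the standard encodings of "é" (U+00E9) in each case; a naive byte-wise
comparison sees two different strings even though the character sequence is
identical:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char latin1[] = "\xE9";       /* "é" in iso-8859-1: one byte */
    const char utf8[]   = "\xC3\xA9";   /* "é" in UTF8: two bytes      */

    printf("iso-8859-1: %zu byte(s)\n", strlen(latin1));  /* prints 1 */
    printf("utf8:       %zu byte(s)\n", strlen(utf8));    /* prints 2 */

    /* Byte-wise comparison: different, though both encode "é". */
    printf("strcmp says: %s\n", strcmp(latin1, utf8) ? "different" : "equal");
    return 0;
}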

Postgres doesn't really face this issue currently since it only supports one
encoding at a time anyway. If Postgres supported multiple encodings and it
were necessary to compare two strings in two different encodings, they would
probably both have to be converted to a common encoding (presumably UTFx for
some value of x) before comparing.
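
Sketching that with POSIX iconv(3) -- the encoding names and the fixed-size
buffer here are illustrative assumptions, and error handling is minimal:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Convert src (encoded in from_enc) into dst as UTF-8; 0 on success. */
static int to_utf8(const char *from_enc, char *src, char *dst, size_t dstlen)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t) -1)
        return -1;
    size_t inleft = strlen(src), outleft = dstlen - 1;
    size_t rc = iconv(cd, &src, &inleft, &dst, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1)
        return -1;
    *dst = '\0';
    return 0;
}

int main(void)
{
    char latin1[] = "\xE9";        /* "é" in iso-8859-1 */
    char utf8[]   = "\xC3\xA9";    /* "é" in UTF8 */
    char buf[16];

    /* Convert the iso-8859-1 string into the common encoding first;
     * only then is a plain byte comparison meaningful. */
    if (to_utf8("ISO-8859-1", latin1, buf, sizeof buf) == 0)
        printf("equal after conversion: %s\n",
               strcmp(buf, utf8) == 0 ? "yes" : "no");
    return 0;
}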

There is a separate issue: some characters could theoretically have multiple
representations even within the same encoding. To my knowledge this doesn't
really happen in the usual non-UTF encodings (like iso-8859-x), but it can
happen in UTF8 because, for example, the variable-length form could use 2 or
even 4 bytes to spell characters that are really just plain ascii characters.
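
For example, here's a sketch of a deliberately naive decoder (handling only
1- and 2-byte sequences): without a shortest-form check, the overlong
two-byte sequence 0xC0 0xAF decodes to the same code point as the single
byte 0x2F, a plain '/':

#include <stdio.h>

/* Decode one UTF8 sequence of 1 or 2 bytes, with no validation at all. */
static unsigned naive_decode(const unsigned char *s)
{
    if (s[0] < 0x80)
        return s[0];
    return ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
}

int main(void)
{
    const unsigned char plain[]    = { 0x2F };        /* '/'          */
    const unsigned char overlong[] = { 0xC0, 0xAF };  /* overlong '/' */

    /* Both print U+002F: the byte strings differ, the character doesn't. */
    printf("U+%04X vs U+%04X\n", naive_decode(plain), naive_decode(overlong));
    return 0;
}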

However, there are canonicalization rules that rule out all but the shortest
representation, making the longer forms invalid unicode strings. I assume
these rules exist precisely to make it easier to compare or hash unicode
strings. I guess it's an open question whether the database should signal an
error on such invalid strings or silently treat them as equivalent to a
correct encoding of the same string.
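
If the database did want to signal an error, the check is cheap. A minimal
sketch covering just the two-byte overlong case (a real validator would check
every sequence length):

#include <stdbool.h>
#include <stdio.h>

/* A two-byte UTF8 sequence with lead byte 0xC0 or 0xC1 encodes a code
 * point below 0x80, i.e. an overlong spelling of a plain ascii
 * character, which the canonicalization rules forbid. */
static bool is_overlong_2byte(unsigned char b1, unsigned char b2)
{
    return (b1 == 0xC0 || b1 == 0xC1) && (b2 & 0xC0) == 0x80;
}

int main(void)
{
    printf("0xC0 0xAF overlong? %s\n",
           is_overlong_2byte(0xC0, 0xAF) ? "yes" : "no");   /* yes */
    return 0;
}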

On the original issue, I think the bottom line is that strings are sequences
of characters, and two sequences of characters should only compare equal if
they contain the same characters in the same order.

The encodings can be different and still represent the same string, but they
do have to represent the same sequence of characters. If they represent two
different sequences of characters -- even if the two sequences have the same
significance in the language of the reader -- then they're still not actually
the same string.

As long as both strings are encoded in the same encoding (whether that be
iso-8859-1 or utf8 or whatever), sorting by strcoll and then strcmp
effectively gives this set of semantics, with one exception: invalid,
non-canonical UTF encodings will silently be treated as distinct from the
correctly encoded form of the same string.
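
In code that comparison is just the following (the locale name is only an
example, and has to be one that's actually installed):

#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Collate first; fall back to a byte-wise tie-breaker only when the
 * locale considers the two strings equal. */
static int compare_with_tiebreak(const char *a, const char *b)
{
    int c = strcoll(a, b);
    if (c != 0)
        return c;
    return strcmp(a, b);
}

int main(void)
{
    if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
        return 1;   /* locale not installed */
    printf("%d\n", compare_with_tiebreak("cote", "cot\xC3\xA9"));
    return 0;
}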

One day, when it's possible for the two strings to be in two different
encodings, they will both have to be converted to an encoding that covers the
union of the two character sets covered by the two encodings.

--
greg
