Re: Duplicate Values or Not?! - Mailing list pgsql-general

From Greg Stark
Subject Re: Duplicate Values or Not?!
Date
Msg-id 87fys3r8vf.fsf@stark.xeocode.com
In response to Re: Duplicate Values or Not?!  (Greg Stark <gsstark@mit.edu>)
Responses Re: Duplicate Values or Not?!
List pgsql-general
Greg Stark <gsstark@MIT.EDU> writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > If that does change the results, it indicates you've got strings which
> > are bytewise different but compare equal according to strcoll().  We've
> > seen this and other misbehaviors from some locale definitions when faced
> > with data that is invalid per the encoding the locale expects.
>
> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash strxfrm or the
> original bytes?

Hm. Some experimentation shows that, at least with glibc's locale definitions,
the strings that I thought compared equal don't actually compare equal.
Capitalization, punctuation, and white space are basically ignored in general
in non-C locales, but they do seem to compare non-equal when they're the only
differentiating factor.
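
A minimal sketch of the kind of check involved, in C. The helper names
equal_but_different and use_collation are made up for illustration, and the
result depends entirely on the locale definitions installed on the machine, so
no particular outcome is claimed here:

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if a and b differ bytewise yet strcoll() reports them as
 * equal under the current LC_COLLATE locale, 0 otherwise. */
static int equal_but_different(const char *a, const char *b)
{
    return strcmp(a, b) != 0 && strcoll(a, b) == 0;
}

/* Switch collation to the named locale (e.g. "C", or whatever locale
 * names the system provides). Returns 0 if the locale is unavailable. */
static int use_collation(const char *name)
{
    return setlocale(LC_COLLATE, name) != NULL;
}
```

Calling use_collation() with a non-C locale name and then feeding
equal_but_different() pairs that differ only in case, punctuation, or white
space is one way to repeat the experiment.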

Is this guaranteed by any spec? Or is counting on this behaviour unsafe?

If it's legal for strcoll to compare as equal two byte-wise different strings
then the hash function really ought to be calling strxfrm before hashing or
else it will be inconsistent. It doesn't seem to be doing so currently.
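
To make the idea concrete, here is a rough sketch in C of hashing the strxfrm
image instead of the original bytes. This is not the Postgres code; the FNV-1a
hash and the names hash_bytes and hash_collated are stand-ins chosen for
illustration. The point is only that two strings strcoll calls equal get the
same transformed bytes and therefore the same hash:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a over a byte buffer; a placeholder for whatever hash is really used. */
static uint32_t hash_bytes(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hash the strxfrm() image of s under the current LC_COLLATE locale.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int hash_collated(const char *s, uint32_t *out)
{
    /* With n == 0 the destination may be NULL; the return value is the
     * length of the transformed string, not counting the terminator. */
    size_t need = strxfrm(NULL, s, 0);
    char *buf = malloc(need + 1);
    if (buf == NULL)
        return -1;
    strxfrm(buf, s, need + 1);
    *out = hash_bytes((const unsigned char *)buf, need);
    free(buf);
    return 0;
}
```

Note the cost: every hashed value needs a length query, an allocation, and a
transform, on top of the hash itself.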

I find it interesting that Perl has faced this same dilemma and chose to
override the locale definition in this case. If the locale definition
compares two strings as equal then Perl does a bytewise comparison and uses
that to break ties. This guarantees non-bytewise-identical strings don't
compare equal. I suspect they did it for a similar reason too, namely keeping
the semantics in sync with Perl hashes.
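
The tie-breaking scheme described above can be sketched in C as follows. The
function name collate_then_bytes is hypothetical, and strcmp is used as the
bytewise comparison on the assumption that the strings are NUL-terminated;
with explicit lengths it would be a memcmp instead:

```c
#include <string.h>

/* Compare by the current LC_COLLATE locale first; if the locale says the
 * strings are equal, fall back to a bytewise comparison to break the tie.
 * The result is 0 only for bytewise-identical strings. */
static int collate_then_bytes(const char *a, const char *b)
{
    int r = strcoll(a, b);
    if (r != 0)
        return r;
    return strcmp(a, b);
}
```

With this comparison, equality implies identical bytes, so a hash over the
original bytes stays consistent with it.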

Postgres could follow that model. I think it would solve any inconsistencies
just fine and not cause problems. However, it would be visible to users, which
might be considered a bug if the locale really does claim the strings are equal
but Postgres doesn't agree. On the other hand, I think it would perform better
than a lot of extra calls to strxfrm, since it would only rarely kick in with
an extra memcmp.

--
greg
