Thread: Duplicate Values or Not?!

Duplicate Values or Not?!

From
John Seberg
Date:
I recently tried to CREATE a UNIQUE INDEX and could
not, due to duplicate values:

CREATE UNIQUE INDEX usr_login ON usr (login);

To try to find the offending row(s), I then executed
the following:

SELECT count(*), login FROM usr GROUP BY login ORDER
BY 1 DESC;

The GROUP BY didn't group anything, indicating to me
that there were no duplicate values. There were the
same number of rows in this query as a simple SELECT
count(*) FROM usr.

This tells me that Postgresql is not using the same
method for determining duplicates when GROUPING and
INDEXing.

I dug a little deeper. The best candidates I found
for possible duplicates were caused by characters
that did not translate well. IIRC, the basis was the
name Pena, which looked like Pe?a. I'm thinking the
original data was not encoded properly, or my export
didn't handle encodings properly, etc. The two Penas
used different characters in the 3rd position,
neither of which was translated correctly.

I loaded up data from another database vendor (4th
Dimension) into an 8.0.3 PostgreSQL, which I had
compiled from source with the default configuration.
This was on Yellow Dog Linux 4.0.1.

I brought the same data into an 8.0.1 on Mac OS X
(binary from entropy.ch) and did NOT have this UNIQUE
INDEX failure.

I'm sure my problems are deeper than the INDEX
failure, involving the accuracy of the conversion,
but, short term, I would like to know what is
different. Both are SQL_ASCII databases. I tried
importing into a UNICODE database, but that produced
a real mess of errors (during COPY).

I realize I need to learn about encodings, my source
data, etc., but I'm looking for hints. Is anybody
experienced with exporting 4th Dimension data that
contains a certain amount of foreign language text?

Thanks,





Re: Duplicate Values or Not?!

From
Greg Stark
Date:
John Seberg <johnseberg@yahoo.com> writes:

> I recently tried to CREATE a UNIQUE INDEX and could
> not, due to duplicate values:
>
> CREATE UNIQUE INDEX usr_login ON usr (login);
>
> To try to find the offending row(s), I then executed
> the following:
>
> SELECT count(*), login FROM usr GROUP BY login ORDER
> BY 1 DESC;
>
> The GROUP BY didn't group anything, indicating to me
> that there were no duplicate values. There were the
> same number of rows in this query as a simple SELECT
> count(*) FROM usr.
>
> This tells me that Postgresql is not using the same
> method for determining duplicates when GROUPING and
> INDEXing.

You might try running the GROUP BY query after doing:

set enable_hashagg = false;
select ...

With that set to false, it would have to sort the results, which should
go through exactly the same comparison code the index uses. I think.


That doesn't really answer the rest of your questions. The short of it
is that setting the encoding doesn't magically make your data encoded in
that encoding. If your client sends data in one encoding but claims it's
Unicode, Postgres will happily store it in a UNICODE database and it'll
be garbage.

Maybe someone else will have more specific advice on that front.


--
greg

Re: Duplicate Values or Not?!

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> John Seberg <johnseberg@yahoo.com> writes:
>> This tells me that Postgresql is not using the same
>> method for determining duplicates when GROUPING and
>> INDEXing.

> You might try running the GROUP BY query after doing:
> set enable_hashagg = false;

> With that false it would have to sort the results which should be exactly the
> same code as the index is using. I think.

If that does change the results, it indicates you've got strings which
are bytewise different but compare equal according to strcoll().  We've
seen this and other misbehaviors from some locale definitions when faced
with data that is invalid per the encoding the locale expects.

So, yeah, the answer is to fix your encoding problems.  In particular,
don't ever use a locale like that with a SQL_ASCII database encoding,
because then Postgres won't prevent strcoll from seeing data it fails
on.  The only safe locale setting for a SQL_ASCII database is C,
I think.
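To see concretely why the C locale is safe here, the comparison can be sketched with Python's locale module, which wraps the C library's strcoll. This is an illustrative sketch, not from the thread; the "Pena" byte values are hypothetical stand-ins for the mis-translated characters in the original report:

```python
import locale

# In the C locale, collation degenerates to a plain code-point
# comparison, so strings that differ in even one position can never
# compare equal -- which is why C is the only safe locale for a
# SQL_ASCII database.
locale.setlocale(locale.LC_COLLATE, "C")

# Two "Pena" variants differing only in the 3rd character
# (Latin-1 n-tilde vs. N-tilde), standing in for the bad data.
a = "Pe\xf1a"
b = "Pe\xd1a"

print(locale.strcoll(a, b))  # nonzero: the C locale never calls these equal
print(locale.strcoll(a, a))  # 0: identical strings still compare equal
```

Under a non-C locale fed bytes that are invalid for its expected encoding, strcoll's result is exactly what becomes unreliable, per Tom's point above.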

            regards, tom lane

Re: Duplicate Values or Not?!

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> If that does change the results, it indicates you've got strings which
> are bytewise different but compare equal according to strcoll().  We've
> seen this and other misbehaviors from some locale definitions when faced
> with data that is invalid per the encoding the locale expects.

There are plenty of non-bytewise-identical strings that do legitimately
compare equal in various locales. Does the hash code hash strxfrm or the
original bytes?

--
greg

Re: Duplicate Values or Not?!

From
Greg Stark
Date:
Greg Stark <gsstark@MIT.EDU> writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > If that does change the results, it indicates you've got strings which
> > are bytewise different but compare equal according to strcoll().  We've
> > seen this and other misbehaviors from some locale definitions when faced
> > with data that is invalid per the encoding the locale expects.
>
> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash strxfrm or the
> original bytes?

Hm. Some experimentation shows that, at least with glibc's locale
definitions, the strings that I thought compared equal don't actually
compare equal. Capitalization, punctuation, and white space, while
basically ignored in non-C locales, do seem to compare non-equal when
they're the only differentiating factor.

Is this guaranteed by any spec? Or is counting on this behaviour unsafe?

If it's legal for strcoll to compare as equal two byte-wise different strings
then the hash function really ought to be calling strxfrm before hashing or
else it will be inconsistent. It doesn't seem to be doing so currently.

I find it interesting that Perl has faced this same dilemma and chose to
override the locale definition in this case. If the locale definition
compares two strings as equal, then Perl does a bytewise comparison and
uses that to break ties. This guarantees that non-bytewise-identical
strings never compare equal. I suspect they did it for a similar reason
too, namely keeping the semantics in sync with Perl hashes.

Postgres could follow that model; I think it would solve the
inconsistencies just fine and not cause problems. However, it would be
visible to users, which may be considered a bug if the locale really does
claim the strings are equal but Postgres doesn't agree. On the other
hand, I think it would perform better than a lot of extra calls to
strxfrm, since it would only rarely kick in with an extra memcmp.
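The Perl-style tie-break described above is easy to sketch in Python, using the locale module as a stand-in for the C library calls (the function name here is mine, not anything in Postgres or Perl):

```python
import locale
from functools import cmp_to_key

def coll_with_tiebreak(a: str, b: str) -> int:
    """Compare by the locale's collation; if the locale claims the
    strings are equal but the bytes differ, fall back to a bytewise
    comparison so only identical strings ever compare equal."""
    r = locale.strcoll(a, b)
    if r != 0:
        return r
    ab = a.encode("utf-8")
    bb = b.encode("utf-8")
    return (ab > bb) - (ab < bb)   # the rarely-reached extra memcmp

# Usage: a total order in which distinct strings can never tie,
# regardless of what the locale definition says.
locale.setlocale(locale.LC_COLLATE, "C")
words = sorted(["Sun", "sun", "quit"], key=cmp_to_key(coll_with_tiebreak))
```

The design point is the same one made above: the fallback only runs when strcoll returns 0, so the common path pays nothing extra.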

--
greg

Re: Duplicate Values or Not?!

From
Martijn van Oosterhout
Date:
On Sat, Sep 17, 2005 at 11:50:44AM -0400, Greg Stark wrote:
> Hm. Some experimentation shows that at least on glibc's locale definitions the
> strings that I thought compared equal don't actually compare equal.
> Capitalization, punctuation, white space, while they're basically ignored in
> general in non-C locales do seem to compare non-equal when they're the only
> differentiating factor.
>
> Is this guaranteed by any spec? Or is counting on this behaviour unsafe?

I don't know if it's guaranteed by spec, but it certainly seems silly
for strings to compare equal when they're not. Just because a locale
sorts ignoring case doesn't mean that "sun" and "Sun" are the same. The
only real sensible rule is that strcoll should return 0 only if strcmp
would also return zero...

If you actually use strxfrm on glibc you'll see the result comes out
approximately twice as long: the first n bytes are sort-of case-folded
versions of the original characters, and the second n bytes are some
kind of class identification.

I think all the spec guarantees is that strcoll(a,b) ==
strcmp(strxfrm(a),strxfrm(b)). If strcoll is returning zero for two
non-identical strings, they must strxfrm to the same thing, so that may
be a solution.
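That invariant (strcoll agreeing in sign with a plain comparison of the strxfrm'd strings) can be checked mechanically. A hedged sketch using Python's locale module, whose strcoll/strxfrm wrap the C functions; the checker function is mine:

```python
import locale

def coll_matches_xfrm(a: str, b: str) -> bool:
    """Check the invariant cited above: comparing strxfrm(a) and
    strxfrm(b) bytewise must order the pair the same way as
    strcoll(a, b). Only the signs need to agree, not the magnitudes."""
    coll = locale.strcoll(a, b)
    xa, xb = locale.strxfrm(a), locale.strxfrm(b)
    xcmp = (xa > xb) - (xa < xb)
    return (coll > 0) == (xcmp > 0) and (coll < 0) == (xcmp < 0)

# In the C locale the invariant trivially holds, since both sides
# reduce to a plain comparison.
locale.setlocale(locale.LC_COLLATE, "C")
print(coll_matches_xfrm("abc", "abd"))
```

Running the same check under a suspect locale, over the actual problem data, would be one way to detect the broken locale definitions discussed in this thread.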

Anyway, the long-term plan is to move to a cross-platform locale
library, so hopefully broken locale libraries will be a thing of the
past...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Re: Duplicate Values or Not?!

From
Mike Nolan
Date:
> I don't know if it's guaranteed by spec, but it certainly seems silly
> for strings to compare equal when they're not. Just because a locale
> sorts ignoring case doesn't mean that "sun" and "Sun" are the same. The
> only real sensible rule is that strcoll should return 0 only if strcmp
> would also return zero...

I disagree.  Someone who wants true case independence (for whatever reason)
needs all aspects of uniqueness (selects, indexes, and groups) to treat
data the same way.

This needs to be something the person who creates the instance or the
database can control.
--
Mike Nolan

Re: Duplicate Values or Not?!

From
Martijn van Oosterhout
Date:
On Sat, Sep 17, 2005 at 12:45:17PM -0500, Mike Nolan wrote:
> > I don't know if it's guaranteed by spec, but it certainly seems silly
> > for strings to compare equal when they're not. Just because a locale
> > sorts ignoring case doesn't mean that "sun" and "Sun" are the same. The
> > only real sensible rule is that strcoll should return 0 only if strcmp
> > would also return zero...
>
> I disagree.  Someone who wants true case independence (for whatever reason)
> needs all aspects of uniqueness such as selects, indexes and groups
> treating data the same way.
>
> This needs to be something the person who creates the instance or the
> database can control.

Such people need to be looking at citext [1]. My point is that the
*locale* should not be case-insensitive that way. Consider that if the
locale treats "sun" and "Sun" identically, then I can't have
case-sensitivity if I want it. If they are treated differently, I can
build case-insensitivity on top of it.

[1] http://gborg.postgresql.org/project/citext/projdisplay.php
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/


Re: Duplicate Values or Not?!

From
Greg Stark
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:

> On Sat, Sep 17, 2005 at 12:45:17PM -0500, Mike Nolan wrote:
> > > I don't know if it's guaranteed by spec, but it certainly seems silly
> > > for strings to compare equal when they're not. Just because a locale
> > > sorts ignoring case doesn't mean that "sun" and "Sun" are the same. The
> > > only real sensible rule is that strcoll should return 0 only if strcmp
> > > would also return zero...
> >
> > I disagree.  Someone who wants true case independence (for whatever reason)
> > needs all aspects of uniqueness such as selects, indexes and groups
> > treating data the same way.
> >
> > This needs to be something the person who creates the instance or the
> > database can control.
>
> Such people need to be looking at citext [1]. My point is that the
> *locale* should not be case-insensitive that way. Consider that if the
> locale treats "sun" and "Sun" identically, then I can't have
> case-sensitivity if I want it. If they are treated differently, I can
> build case-insensitivity on top of it.

Well, consider the case of two different Unicode-encoded strings that
actually represent the same sequence of characters. They may be byte-wise
different, but there's really no difference at all in the text they contain.

That's a bit different from a collation order that specifies two different
character strings that compare equal. But it would suffer from the same
problem.

Nonetheless, I may agree with you that the world would be a better place
if collation orders never created this situation. But unless we can point
to some spec, or some solid reason why allowing it would cause worse
headaches than this, I think it's necessary to protect the hashing
function from being out of sync with the btree operators.

--
greg

Re: Duplicate Values or Not?!

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> If that does change the results, it indicates you've got strings which
>> are bytewise different but compare equal according to strcoll().  We've
>> seen this and other misbehaviors from some locale definitions when faced
>> with data that is invalid per the encoding the locale expects.

> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash strxfrm or the
> original bytes?

I think you are jumping to conclusions.  I have not yet seen it
demonstrated that any locale definition in use in-the-wild intends to
compare nonidentical strings as equal.  On the other hand, we have seen
plenty of cases of strcoll simply failing (delivering results that are
not even self-consistent) when faced with data it considers invalid.

I notice that the SUS permits strcoll to set errno if given invalid
data:
http://www.opengroup.org/onlinepubs/007908799/xsh/strcoll.html
We are not currently checking for that, but probably we should be.

            regards, tom lane

Re: Duplicate Values or Not?!

From
Martijn van Oosterhout
Date:
On Sat, Sep 17, 2005 at 03:49:24PM -0400, Greg Stark wrote:
> Well, consider the case of a two different Unicode encoded strings that
> actually represent the same series of characters. They may be byte-wise
> different but there's really no difference at all in the text they contain.

Strictly speaking, a valid Unicode string uses the shortest possible
representation, so at least one of the two should be rejected as
invalid. Whether people do this or not is another issue entirely. It is
certainly recommended to reject non-optimally encoded strings, for
security purposes at least: you don't really want to accept multiple
ways of encoding things like '/' and '\' and other special characters.
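Modern strict UTF-8 decoders do reject these overlong (non-shortest-form) sequences; for instance, Python's decoder refuses the classic two-byte overlong encoding of '/'. A sketch, not part of the thread:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Strict UTF-8 validation: Python's decoder rejects overlong
    (non-shortest-form) sequences, which is exactly the security
    concern raised above."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"/"))          # shortest form of '/': accepted
print(is_valid_utf8(b"\xc0\xaf"))   # overlong 2-byte form of '/': rejected
```

Historically, decoders that accepted overlong forms enabled directory-traversal filters to be bypassed, which is why the Unicode conformance rules now forbid them outright.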

> Nonetheless, I may agree with you that the world would be a better place if
> collation orders never created this situation. But unless we can point to some
> spec or some solid reason why if that ever happened it would cause worse
> headaches than this I think it's necessary to protect the hashing function
> from being out of sync with the btree operators.

Well, the Unicode spec doesn't do it that way; does that count? On a
purely practical level, though, we have to work with it until PostgreSQL
is using something like ICU, thus solving the problem completely.

Case-insensitivity is a large can of worms. The strings "quit" and
"QUIT" match case-insensitively in most languages, but not in Turkish.
And neither of:

 toupper(tolower(a)) == toupper(a)
 tolower(toupper(a)) == tolower(a)

can be assumed in the general case. In the end we may need to provide
ways of specifying what people mean by "case-insensitive": whether or
not to ignore accents, etc.
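The non-round-tripping mentioned above is easy to exhibit; German sharp s is a standard example (shown here in Python, whose str.upper/str.lower follow the Unicode case mappings):

```python
# tolower(toupper(a)) == tolower(a) holds here, but the naive
# expectation that upper-then-lower returns the original does not:
# German sharp s uppercases to "SS", which lowercases to "ss",
# not back to the original character.
s = "stra\u00dfe"              # "straße"
print(s.upper())               # "STRASSE"
print(s.upper().lower())       # "strasse" -- the round trip loses the sharp s

# Whereas for plain ASCII the round trip is stable:
print("QUIT".lower().upper())  # "QUIT"
```

The Turkish dotted/dotless i needs locale-aware (e.g. ICU) case mappings and is not shown here, but it breaks the "quit"/"QUIT" match the same way.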

ICU provides a way of specifying transforms like 'drop accents', so
this can be solved...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
