Thread: Duplicate Values or Not?!
I recently tried to CREATE a UNIQUE INDEX and could not, due to duplicate values:

    CREATE UNIQUE INDEX usr_login ON usr (login);

To try to find the offending row(s), I then executed the following:

    SELECT count(*), login FROM usr GROUP BY login ORDER BY 1 DESC;

The GROUP BY didn't group anything, indicating to me that there were no duplicate values. There were the same number of rows in this query as in a simple SELECT count(*) FROM usr. This tells me that PostgreSQL is not using the same method for determining duplicates when GROUPing and INDEXing.

Digging a little deeper, the best candidates I can find for a possible duplicate are caused by characters that did not translate well. IIRC, the basis was the name Pena, which looked like Pe?a. I'm thinking the original data was not encoded properly, or my export didn't handle encodings properly, etc. The two Penas used different characters in the 3rd position, neither of which was translated correctly.

I loaded data from another database vendor (4th Dimension) into an 8.0.3 PostgreSQL, which I had compiled from source with the default configuration, on Yellow Dog Linux 4.0.1. I brought the same data into an 8.0.1 on Mac OS X (binary from entropy.ch) and did NOT have this UNIQUE INDEX failure.

I'm sure my problems are deeper than the INDEX failure, involving the accuracy of the conversion, but, short term, I would like to know what is different. Both are SQL_ASCII databases. I tried importing into a UNICODE database, but that produced a real mess of errors (during COPY).

I realize I need to learn about encodings, my source data, etc., but I'm looking for hints. Anybody experienced in exporting 4th Dimension data containing a certain amount of foreign language text?

Thanks,
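A hedged diagnostic sketch, using only the usr and login names from the post above: list the rows whose login contains any byte outside the printable-ASCII range, since those are the likely victims of a bad encoding conversion.

    -- Sketch only; assumes byte-per-character semantics (SQL_ASCII) so the
    -- regex range [ -~] covers exactly the printable-ASCII bytes 0x20-0x7E.
    SELECT login
    FROM usr
    WHERE login !~ '^[ -~]*$';   -- rows returned contain high-bit or control bytes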
John Seberg <johnseberg@yahoo.com> writes:

> I recently tried to CREATE a UNIQUE INDEX and could not, due to duplicate
> values:
>
>     CREATE UNIQUE INDEX usr_login ON usr (login);
>
> To try to find the offending row(s), I then executed the following:
>
>     SELECT count(*), login FROM usr GROUP BY login ORDER BY 1 DESC;
>
> The GROUP BY didn't group anything, indicating to me that there were no
> duplicate values. There were the same number of rows in this query as in a
> simple SELECT count(*) FROM usr.
>
> This tells me that PostgreSQL is not using the same method for determining
> duplicates when GROUPing and INDEXing.

You might try running the GROUP BY query after doing:

    set enable_hashagg = false;
    select ...

With that set to false it would have to sort the results, which should be exactly the same code as the index is using. I think.

That doesn't really answer the rest of your questions. The short of it is that setting the encoding doesn't magically make your data encoded in that encoding. If your client sends data in one encoding but claims it's Unicode, then Postgres will happily store it in a UNICODE database and it'll be garbage. Maybe someone else will have more specific advice on that front.

--
greg
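Spelled out, the experiment suggested above looks roughly like this (a sketch; the HAVING clause is just an added convenience so only the groups that actually collapse are shown):

    SET enable_hashagg = false;
    SELECT count(*), login
    FROM usr
    GROUP BY login
    HAVING count(*) > 1
    ORDER BY 1 DESC;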
Greg Stark <gsstark@mit.edu> writes:

> John Seberg <johnseberg@yahoo.com> writes:
>> This tells me that PostgreSQL is not using the same method for determining
>> duplicates when GROUPing and INDEXing.
>
> You might try running the GROUP BY query after doing:
>
>     set enable_hashagg = false;
>
> With that set to false it would have to sort the results, which should be
> exactly the same code as the index is using. I think.

If that does change the results, it indicates you've got strings which are bytewise different but compare equal according to strcoll(). We've seen this and other misbehaviors from some locale definitions when faced with data that is invalid per the encoding the locale expects.

So, yeah, the answer is to fix your encoding problems. In particular, don't ever use a locale like that with a SQL_ASCII database encoding, because then Postgres won't prevent strcoll from seeing data it fails on. The only safe locale setting for a SQL_ASCII database is C, I think.

    regards, tom lane
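One quick way (hedged; nothing here is specific to this report) to see what actually differs between the two installations Tom is talking about is to compare their locale and encoding settings on each server:

    SHOW lc_collate;        -- locale strcoll() uses for string comparison
    SHOW lc_ctype;          -- locale used for character classification
    SHOW server_encoding;   -- SQL_ASCII, UNICODE, LATIN1, ...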
Tom Lane <tgl@sss.pgh.pa.us> writes:

> If that does change the results, it indicates you've got strings which are
> bytewise different but compare equal according to strcoll(). We've seen this
> and other misbehaviors from some locale definitions when faced with data
> that is invalid per the encoding the locale expects.

There are plenty of non-bytewise-identical strings that do legitimately compare equal in various locales. Does the hash code hash the strxfrm output or the original bytes?

--
greg
Greg Stark <gsstark@MIT.EDU> writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>> If that does change the results, it indicates you've got strings which are
>> bytewise different but compare equal according to strcoll(). We've seen
>> this and other misbehaviors from some locale definitions when faced with
>> data that is invalid per the encoding the locale expects.
>
> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash the strxfrm output
> or the original bytes?

Hm. Some experimentation shows that, at least with glibc's locale definitions, the strings that I thought compared equal don't actually compare equal. Capitalization, punctuation, and white space, while they're basically ignored in general in non-C locales, do seem to compare non-equal when they're the only differentiating factor.

Is this guaranteed by any spec? Or is counting on this behaviour unsafe? If it's legal for strcoll to compare as equal two byte-wise different strings, then the hash function really ought to be calling strxfrm before hashing, or else it will be inconsistent. It doesn't seem to be doing so currently.

I find it interesting that Perl has faced this same dilemma and chose to override the locale definition in this case. If the locale definition compares two strings as equal, then Perl does a bytewise comparison and uses that to break ties. This guarantees that non-bytewise-identical strings don't compare equal. I suspect they did it for a similar reason too, namely keeping the semantics in sync with Perl hashes.

Postgres could follow that model. I think it would solve any inconsistencies just fine and not cause problems. However, it would be visible to users, which may be considered a bug if the locale really does claim the strings are equal but Postgres doesn't agree. On the other hand, I think it would perform better than a lot of extra calls to strxfrm, since it would only rarely kick in with an extra memcmp.

--
greg
On Sat, Sep 17, 2005 at 11:50:44AM -0400, Greg Stark wrote:

> Hm. Some experimentation shows that, at least with glibc's locale
> definitions, the strings that I thought compared equal don't actually
> compare equal. Capitalization, punctuation, and white space, while they're
> basically ignored in general in non-C locales, do seem to compare non-equal
> when they're the only differentiating factor.
>
> Is this guaranteed by any spec? Or is counting on this behaviour unsafe?

I don't know if it's guaranteed by spec, but it certainly seems silly for strings to compare equal when they're not. Just because a locale sorts ignoring case doesn't mean that "sun" and "Sun" are the same. The only real sensible rule is that strcoll should return 0 only if strcmp would also return zero...

If you actually use strxfrm on glibc you'll see the result comes out approximately twice as long: the first n bytes are sort of case-folded versions of the original characters, and the second n bytes are some kind of class identification. I think all the spec guarantees is that strcoll(a,b) == strcmp(strxfrm(a),strxfrm(b)). If strcoll is returning zero for two non-identical strings, they must strxfrm to the same thing, so that may be a solution.

Anyway, the long-term plan is to move to a cross-platform locale library, so hopefully broken locale libraries will be a thing of the past...

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
> I don't know if it's guaranteed by spec, but it certainly seems silly for
> strings to compare equal when they're not. Just because a locale sorts
> ignoring case doesn't mean that "sun" and "Sun" are the same. The only real
> sensible rule is that strcoll should return 0 only if strcmp would also
> return zero...

I disagree. Someone who wants true case independence (for whatever reason) needs all aspects of uniqueness, such as selects, indexes, and groups, to treat the data the same way.

This needs to be something the person who creates the instance or the database can control.

--
Mike Nolan
On Sat, Sep 17, 2005 at 12:45:17PM -0500, Mike Nolan wrote:

> > I don't know if it's guaranteed by spec, but it certainly seems silly for
> > strings to compare equal when they're not. Just because a locale sorts
> > ignoring case doesn't mean that "sun" and "Sun" are the same. The only
> > real sensible rule is that strcoll should return 0 only if strcmp would
> > also return zero...
>
> I disagree. Someone who wants true case independence (for whatever reason)
> needs all aspects of uniqueness, such as selects, indexes, and groups, to
> treat the data the same way.
>
> This needs to be something the person who creates the instance or the
> database can control.

Such people need to be looking at citext [1]. My point is that the *locale* should not be case-insensitive in that way. Consider that if the locale treats "sun" and "Sun" identically, then I can't have case-sensitivity if I want it. If they are treated differently, I can build case-insensitivity on top of it.

[1] http://gborg.postgresql.org/project/citext/projdisplay.php

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
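A minimal sketch of what "build case-insensitivity on top of it" can look like without citext, reusing the usr and login names from the original post: a unique expression index on lower(login) gives case-insensitive uniqueness over a case-sensitive locale.

    -- Sketch only; index name is hypothetical.
    CREATE UNIQUE INDEX usr_login_ci ON usr (lower(login));

    -- Lookups spell out the same expression so the index can be used.
    SELECT * FROM usr WHERE lower(login) = lower('Sun');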
Martijn van Oosterhout <kleptog@svana.org> writes:

> On Sat, Sep 17, 2005 at 12:45:17PM -0500, Mike Nolan wrote:
>
>>> I don't know if it's guaranteed by spec, but it certainly seems silly for
>>> strings to compare equal when they're not. Just because a locale sorts
>>> ignoring case doesn't mean that "sun" and "Sun" are the same. The only
>>> real sensible rule is that strcoll should return 0 only if strcmp would
>>> also return zero...
>>
>> I disagree. Someone who wants true case independence (for whatever reason)
>> needs all aspects of uniqueness, such as selects, indexes, and groups, to
>> treat the data the same way.
>>
>> This needs to be something the person who creates the instance or the
>> database can control.
>
> Such people need to be looking at citext [1]. My point is that the *locale*
> should not be case-insensitive in that way. Consider that if the locale
> treats "sun" and "Sun" identically, then I can't have case-sensitivity if I
> want it. If they are treated differently, I can build case-insensitivity on
> top of it.

Well, consider the case of two different Unicode-encoded strings that actually represent the same series of characters. They may be bytewise different, but there's really no difference at all in the text they contain. That's a bit different from a collation order that declares two different character strings equal, but it would suffer from the same problem.

Nonetheless, I may agree with you that the world would be a better place if collation orders never created this situation. But unless we can point to some spec, or some solid reason why a collation like that would cause worse headaches than this, I think it's necessary to protect the hashing function from getting out of sync with the btree operators.

--
greg
Greg Stark <gsstark@mit.edu> writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>> If that does change the results, it indicates you've got strings which are
>> bytewise different but compare equal according to strcoll(). We've seen
>> this and other misbehaviors from some locale definitions when faced with
>> data that is invalid per the encoding the locale expects.
>
> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash the strxfrm output
> or the original bytes?

I think you are jumping to conclusions. I have not yet seen it demonstrated that any locale definition in use in the wild intends to compare nonidentical strings as equal. On the other hand, we have seen plenty of cases of strcoll simply failing (delivering results that are not even self-consistent) when faced with data it considers invalid.

I notice that the SUS permits strcoll to set errno if given invalid data:
http://www.opengroup.org/onlinepubs/007908799/xsh/strcoll.html
We are not currently checking for that, but probably we should be.

    regards, tom lane
On Sat, Sep 17, 2005 at 03:49:24PM -0400, Greg Stark wrote:

> Well, consider the case of two different Unicode-encoded strings that
> actually represent the same series of characters. They may be bytewise
> different, but there's really no difference at all in the text they contain.

Strictly speaking, a valid Unicode string uses the shortest possible representation, so at least one of the two should be rejected as invalid. Whether people actually do this is another issue entirely. It is certainly recommended to reject non-optimally encoded strings, for security purposes at least: you don't really want to accept multiple ways of specifying things like '/' and '\' and other special chars.

> Nonetheless, I may agree with you that the world would be a better place if
> collation orders never created this situation. But unless we can point to
> some spec, or some solid reason why a collation like that would cause worse
> headaches than this, I think it's necessary to protect the hashing function
> from getting out of sync with the btree operators.

Well, the Unicode spec doesn't do it that way; does that count? On a purely practical level, though, we have to work with it until PostgreSQL is using something like ICU, thus solving the problem completely.

Case-insensitivity is a large can of worms. The strings "quit" and "QUIT" match case-insensitively in most languages, but not in Turkish. And neither of:

    toupper(tolower(a)) == toupper(a)
    tolower(toupper(a)) == tolower(a)

can be assumed in the general case. In the end we may need to provide ways of specifying what people mean by "case-insensitive": whether or not to ignore accents, etc. ICU provides a way of specifying transforms like 'drop accents', so this can be solved...

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
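A hedged illustration of the Turkish point above: under a tr_TR locale, the lower-case form of 'I' is the dotless 'ı', so lower() no longer maps 'QUIT' onto 'quit' the way it does under en_US or C.

    -- Result depends on the database's lc_ctype; shown values are the
    -- expected behaviour, not verified against any particular installation.
    SELECT lower('QUIT');   -- 'quit' under en_US or C, but 'quıt' under tr_TR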