Re: is this a bug or I am blind? - Mailing list pgsql-general
From | Tom Lane |
---|---|
Subject | Re: is this a bug or I am blind? |
Date | |
Msg-id | 27064.1134753128@sss.pgh.pa.us Whole thread Raw |
In response to | is this a bug or I am blind? (Mage <mage@mage.hu>) |
Responses |
Re: is this a bug or I am blind?
Re: is this a bug or I am blind? Re: is this a bug or I am blind? Re: is this a bug or I am blind? Re: is this a bug or I am blind? |
List | pgsql-general |
Mage <mage@mage.hu> writes: > Tom, I can send you the data I dumped and you can try it out with same > settings. I trust you. Thanks. After digging through it, I can exhibit the problem: in hu_HU locale and LATIN2 encoding, on a Linux box (Fedora Core 4, but probably all glibc installations will do this), I get mage=# select 'potyty'::varchar = 'potty'::varchar; ?column? ---------- f (1 row) mage=# select 'potyty'::varchar < 'potty'::varchar; ?column? ---------- f (1 row) mage=# select 'potyty'::varchar > 'potty'::varchar; ?column? ---------- f (1 row) Needless to say, this makes sorting and btree indexing very unhappy, as they take the trichotomy law as an article of faith ;-) I don't know anything about hu_HU comparison rules, but it appears that strcoll() believes that these two strings should be considered equal. Is that sane? The immediate cause of the problem is that texteq() and textne() have a "fast path" for unequal-length inputs: /* fast path for different-length inputs */ if (VARSIZE(arg1) != VARSIZE(arg2)) result = false; else result = (text_cmp(arg1, arg2) == 0); (text_cmp is what invokes strcoll.) Thus the = operator returns false, while the other two go to strcoll() and then return false. Perhaps the fast-path check is a bad idea, but fixing this is not just a matter of removing that. If we subscribe to strcoll's worldview then we have to conclude that *text strings are not hashable*, because strings that should be "equal" may have different hash codes. And at least in the current PG code, that's not something we can flip on and off depending on the locale --- texteq would have to be marked non hashable in the system catalogs, meaning a big performance hit for *everybody* even if their locale is not this weird. The other approach we could take is to define text comparison as yielding equality only for bitwise-equal strings. If strcoll() returns zero then ask strcmp() for a second opinion. This would mean that we'd sort according to strcoll in the main, but strings that strcoll considers equal but aren't physically identical would sort in codeset order. I can't see that this would do any harm in the context of sorting rules, but the question of what equality means is something for users to answer. Do you think that these two queries ought to yield the same rows in hu_HU locale, or not? select * from common_logins where username = 'potyty'; select * from common_logins where username = 'potty'; regards, tom lane
pgsql-general by date: