Thread: is this a bug or I am blind?
online=# select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum -----+----------+----------+-----------+--------+----------+---------- (0 rows) online=# select * from common_logins where username like 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum --------+----------+----------+----------------------------+--------+----------+---------- 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S | 3 (3 rows) online=# select username, username = 'potyty' from common_logins where username like 'potyty'; username | ?column? ----------+---------- potyty | t potyty | t potyty | t (3 rows) psql 8.0.3, Debian. Mage
am 15.12.2005, um 14:12:23 +0100 mailte Mage folgendes: > online=# select * from common_logins where username = 'potyty'; > uid | username | password | lastlogin | status | usertype | loginnum > -----+----------+----------+-----------+--------+----------+---------- > (0 rows) > > online=# select * from common_logins where username like 'potyty'; > uid | username | password | lastlogin | status | > usertype | loginnum > --------+----------+----------+----------------------------+--------+----------+---------- > 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S > | 1 > 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S > | 3 > 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S > | 3 > (3 rows) Try: select *, length(username), length('potyty') from common_logins where username like 'potyty'; My guess: select length(username) from common_logins where username like 'potyty'; is _NOT_ 6, there is a SPACE like 'potyty '. HTH, Andreas -- Andreas Kretschmer (Kontakt: siehe Header) Heynitz: 035242/47212, D1: 0160/7141639 GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net === Schollglas Unternehmensgruppe ===
On 12/15/05, A. Kretschmer <andreas.kretschmer@schollglas.com> wrote: > am 15.12.2005, um 14:12:23 +0100 mailte Mage folgendes: > > online=# select * from common_logins where username = 'potyty'; > > uid | username | password | lastlogin | status | usertype | loginnum > > -----+----------+----------+-----------+--------+----------+---------- > > (0 rows) > > > > online=# select * from common_logins where username like 'potyty'; > > uid | username | password | lastlogin | status | > > usertype | loginnum > > --------+----------+----------+----------------------------+--------+----------+---------- > > 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S > > | 1 > > 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S > > | 3 > > 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S > > | 3 > > (3 rows) > > Try: > > select *, length(username), length('potyty') from common_logins where username like 'potyty'; > > > My guess: > > select length(username) from common_logins where username like 'potyty'; > > is _NOT_ 6, there is a SPACE like 'potyty '. > > even is that is true i think you need the "comodin characters" (ie: %. _) to make "like" behave different from simple comparisons > HTH, Andreas > -- > Andreas Kretschmer (Kontakt: siehe Header) > Heynitz: 035242/47212, D1: 0160/7141639 > GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net > === Schollglas Unternehmensgruppe === > -- regards, Jaime Casanova (DBA: DataBase Aniquilator ;)
[snip] > even is that is true i think you need the "comodin characters" (ie: %. > _) to make "like" behave different from simple comparisons Not entirely true, if the database was initialized in a different locale than C, then the direct comparison will probably go for an index on username, while "like" will not. Which points to a possible index corruption... which might be interesting for the developers to investigate, but I would guess a reindex will solve the problem for the OP if he has it urgent... Cheers, Csaba.
Mage wrote: > online=# select * from common_logins where username = 'potyty'; > online=# select * from common_logins where username like 'potyty'; It's probably worth seeing whether these have different plans (EXPLAIN ANALYSE...) and if the = is using an index but like isn't. If so, try issuing "set enable_indexscan=false" first and see what happens then. If that makes a difference then I'd guess you have one of two things: 1. A corrupt index (check the REINDEX command) 2. (perhaps more likely) Some localisation issues. What encoding/locale settings are you using? -- Richard Huxton Archonet Ltd
On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote: > [snip] > > even is that is true i think you need the "comodin characters" (ie: %. > > _) to make "like" behave different from simple comparisons > > Not entirely true, if the database was initialized in a different locale > than C, then the direct comparison will probably go for an index on > username, while "like" will not. Which points to a possible index > corruption... which might be interesting for the developers to > investigate, but I would guess a reindex will solve the problem for the > OP if he has it urgent... > > Cheers, > Csaba. > > Mage if it's not urgent maybe you can make a post in -hackers and follow instructions about how to get more info to see what happened here... -- regards, Jaime Casanova (DBA: DataBase Aniquilator ;)
[snip] > If that makes a difference then I'd guess you have one of two things: > 1. A corrupt index (check the REINDEX command) > 2. (perhaps more likely) Some localisation issues. > What encoding/locale settings are you using? Based on the 3rd query of the OP, where the direct comparison results in "true" for all the rows which matched the "like", I would exclude the localisation issues variant... unless = is not equals in all cases ;-) Cheers, Csaba.
A. Kretschmer wrote: > >select *, length(username), length('potyty') from common_logins where username like 'potyty'; > > >My guess: > >select length(username) from common_logins where username like 'potyty'; > >is _NOT_ 6, there is a SPACE like 'potyty '. > > > If you look my 3rd query, you will see that there are no spaces, however: select *, length(username), length('potyty') from common_logins where username like 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum | length | length --------+----------+----------+----------------------------+--------+----------+----------+--------+-------- 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1 | 6 | 6 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3 | 6 | 6 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S | 3 | 6 | 6 (3 rows) Mage
Csaba Nagy wrote: >[snip] > > >>even is that is true i think you need the "comodin characters" (ie: %. >>_) to make "like" behave different from simple comparisons >> >> > >Not entirely true, if the database was initialized in a different locale >than C, then the direct comparison will probably go for an index on >username, while "like" will not. Which points to a possible index >corruption... which might be interesting for the developers to >investigate, but I would guess a reindex will solve the problem for the >OP if he has it urgent... > > I thought that it may be a locale problem, but: - look at my 3rd query - potyty doesn't contain special chars # EXPLAIN ANALYZE select * from common_logins where username = 'potyty'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using common_logins_username_idx on common_logins (cost=0.00..4.30 rows=1 width=47) (actual time=0.056..0.056 rows=0 loops=1) Index Cond: ((username)::text = 'potyty'::text) Total runtime: 0.109 ms (3 rows) online=# EXPLAIN ANALYZE select * from common_logins where username like 'potyty'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------- Seq Scan on common_logins (cost=0.00..63833.88 rows=1 width=47) (actual time=180.333..262.492 rows=3 loops=1) Filter: ((username)::text ~~ 'potyty'::text) Total runtime: 262.551 ms (3 rows) I tried it in two databases (dump and load to another one), so I don't think that we have corrupted indexes. I can try on a newer version of postgresql on another server. By the way, if this is a bug then it's a serious one. We have it in production environment. Mage
> > If you look my 3rd query, you will see that there are no spaces, however: > > select *, length(username), length('potyty') from common_logins where username like 'potyty'; > uid | username | password | lastlogin | status | usertype | loginnum | length | length > --------+----------+----------+----------------------------+--------+----------+----------+--------+-------- > 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1 | 6 | 6 > 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3 | 6 | 6 > 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S | 3 | 6 | 6 > (3 rows) > > Mage I am puzzled by the lack of a "%" in the LIKE query. When I try this on postgres 7.4 and 8.0 I get no rows when I am missingit; including it works as expected. The names have been changed to protect the guilty ;-} but the core of it is true -- no "%" means wierdnesses, I think. gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home'; gex_clientname ---------------- (0 rows) gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home%'; gex_clientname -------------------------------------------------------------- HomeHappinesses HomeMorgageValues, Inc. (2 rows) Could you try your query again with the wild card ? HTH Greg Williamson DBA GlobeXplorer LLC
> > I tried it in two databases (dump and load to another one), so I don't > think that we have corrupted indexes. > the problem persist after a dump a reload? then there is not because an index corruption... can you send part of the data that reproduces the bug? > I can try on a newer version of postgresql on another server. > > By the way, if this is a bug then it's a serious one. We have it in > production environment. > > Mage > what locales do you have? encoding? -- regards, Jaime Casanova (DBA: DataBase Aniquilator ;)
Csaba Nagy wrote: > [snip] >> If that makes a difference then I'd guess you have one of two things: >> 1. A corrupt index (check the REINDEX command) >> 2. (perhaps more likely) Some localisation issues. >> What encoding/locale settings are you using? > > Based on the 3rd query of the OP, where the direct comparison results in > "true" for all the rows which matched the "like", I would exclude the > localisation issues variant... unless = is not equals in all cases ;-) Well spotted Csaba - that _would_ seem to point to the index. -- Richard Huxton Archonet Ltd
Gregory S. Williamson wrote: > > I am puzzled by the lack of a "%" in the LIKE query. When I try this on postgres 7.4 and 8.0 I get no rows when I am missingit; including it works as expected. > > The names have been changed to protect the guilty ;-} but the core of it is true -- no "%" means wierdnesses, I think. > > gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home'; > gex_clientname > ---------------- > (0 rows) > > gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home%'; > gex_clientname > -------------------------------------------------------------- > HomeHappinesses > HomeMorgageValues, Inc. > (2 rows) > > Could you try your query again with the wild card ? But "like" without any wildcards should be the same as "=", but it isn't in the original post. -- Richard Huxton Archonet Ltd
Jaime Casanova wrote: >>I tried it in two databases (dump and load to another one), so I don't >>think that we have corrupted indexes. >> >> >> > >the problem persist after a dump a reload? then there is not because >an index corruption... can you send part of the data that reproduces >the bug? > > > >>I can try on a newer version of postgresql on another server. >> >>By the way, if this is a bug then it's a serious one. We have it in >>production environment. >> >> Mage >> >> >> > >what locales do you have? encoding? > > > The problem can be reproduced on pgsql 8.0.3, compiled from source. This is a third machine where the bug persists. I dumped the table and loaded in. I cannot send the table to you because it contains user data and passwords. I will try to create a fake one with the same problem. ./configure --prefix=/usr/local/pgsql --with-python client_encoding | LATIN2 lc_collate | hu_HU lc_ctype | hu_HU lc_messages | en_US lc_monetary | en_US lc_numeric | en_US lc_time | en_US server_encoding | LATIN2 server_version | 8.0.3 Mage
Well, then I have the disease. The database is UNICODE: gex_runtime=# select gex_clientname from gex_clients where gex_clientname = 'HomeHappinesses'; gex_clientname -------------------------------------------------------------- HomeGain (1 row) gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'HomeHappinesses'; gex_clientname ---------------- (0 rows) > > Gregory S. Williamson wrote: > > > > I am puzzled by the lack of a "%" in the LIKE query. When I try this on > > postgres 7.4 and 8.0 I get no rows when I am missing it; including it > > works as expected. > > > > The names have been changed to protect the guilty ;-} but the core of it > > is true -- no "%" means wierdnesses, I think. > > > > gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home'; > > gex_clientname > > ---------------- > > (0 rows) > > > > gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home%'; > > gex_clientname > > -------------------------------------------------------------- > > HomeHappinesses > > HomeMorgageValues, Inc. > > (2 rows) > > > > Could you try your query again with the wild card ? > > But "like" without any wildcards should be the same as "=", but it isn't > in the original post. > Well then I have the same behavior -- the database is UNICODE: gex_runtime=# select gex_clientname from gex_clients where gex_clientname = 'HomeHappinesses'; gex_clientname -------------------------------------------------------------- HomeHappinesses (1 row) gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'HomeHappinesses'; gex_clientname ---------------- (0 rows) An "=" is not equivalent to LIKE with no wildcard. I never really thought of this as a bug, but if it is ... ring one up for Mage as a good catch.
Mage wrote: > ./configure --prefix=/usr/local/pgsql --with-python > client_encoding | LATIN2 > lc_collate | hu_HU > lc_ctype | hu_HU > lc_messages | en_US > lc_monetary | en_US > lc_numeric | en_US > lc_time | en_US > server_encoding | LATIN2 > server_version | 8.0.3 > I have created a table that can be sent to you to examine the bug. I am actually vacuuming it for further testing. Where can I upload it? Sorry, I cannot host it. Mage
Richard Huxton <dev@archonet.com> writes: > Csaba Nagy wrote: >> Based on the 3rd query of the OP, where the direct comparison results in >> "true" for all the rows which matched the "like", I would exclude the >> localisation issues variant... unless = is not equals in all cases ;-) > Well spotted Csaba - that _would_ seem to point to the index. No, localization issues should be real high on your list. In particular I wonder whether this is the old bugaboo of using a database encoding that's incompatible with the postmaster's locale setting. We've seen that on some platforms strcoll() gets completely confused by this and returns comparison results that are not even self-consistent. Non-self-consistent comparison results can lead to an index that is either actually or effectively corrupt (because index searches proceed down the wrong tree path and thus fail to find items that should be found). So the observation that only index searches fail is consistent with this idea. regards, tom lane
Richard Huxton <dev@archonet.com> writes: > But "like" without any wildcards should be the same as "=", but it isn't > in the original post. I'm too lazy to go look at the code right now, but I think that the reduction of "x LIKE constant-pattern" to "x = constant-pattern" is part of the LIKE index optimization code, which means it'd only get done in C locale. We're missing a bet there perhaps. regards, tom lane
Ok, that explains then the problem... but the index is arguably corrupt in this case, with the important difference that it's not even fixable by reindex... I guess what the OP really wants is a solution to his problem. If the table is not too big, a quick fix would be to just drop the index. Then figure out an encoding+locale combination which can be used to properly host the same sample data which leads to failure here and still satisfies the OP's string sorting and other needs, and then dump reload... or is there other better way to fix things ? I guess a note in the docs about not using the same encoding/locale/(postgres version?)/(OS?) combination as the OP would make sense too ? Cheers, Csaba. On Thu, 2005-12-15 at 16:20, Tom Lane wrote: > Richard Huxton <dev@archonet.com> writes: > > Csaba Nagy wrote: > >> Based on the 3rd query of the OP, where the direct comparison results in > >> "true" for all the rows which matched the "like", I would exclude the > >> localisation issues variant... unless = is not equals in all cases ;-) > > > Well spotted Csaba - that _would_ seem to point to the index. > > No, localization issues should be real high on your list. In particular > I wonder whether this is the old bugaboo of using a database encoding > that's incompatible with the postmaster's locale setting. We've seen > that on some platforms strcoll() gets completely confused by this and > returns comparison results that are not even self-consistent. > > Non-self-consistent comparison results can lead to an index that is > either actually or effectively corrupt (because index searches proceed > down the wrong tree path and thus fail to find items that should be > found). So the observation that only index searches fail is consistent > with this idea. > > regards, tom lane
Mage <mage@mage.hu> writes: >> lc_collate | hu_HU >> lc_ctype | hu_HU >> server_encoding | LATIN2 Hm, are those settings actually compatible? You need to check your system documentation to find out what encoding "hu_HU" expects. regards, tom lane
On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote: > Ok, that explains then the problem... but the index is arguably corrupt > in this case, with the important difference that it's not even fixable > by reindex... > > I guess what the OP really wants is a solution to his problem. MAGE was reproducing the problem in a little table that can be send but now tolds me that the problem in the test table disappear when a VACUUM was executed... is this consistent with the idea of locale problem? -- regards, Jaime Casanova (DBA: DataBase Aniquilator ;)
Jaime Casanova <systemguards@gmail.com> writes: > On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote: >> Ok, that explains then the problem... but the index is arguably corrupt >> in this case, with the important difference that it's not even fixable >> by reindex... >> >> I guess what the OP really wants is a solution to his problem. > MAGE was reproducing the problem in a little table that can be send > but now tolds me that the problem in the test table disappear when a > VACUUM was executed... is this consistent with the idea of locale > problem? The VACUUM might have caused the planner not to use the index anymore; check EXPLAIN. regards, tom lane
UPDATE ---------------- I was trying to create a demo table, because I cannot send our confidental data. I have found weird result. # drop table common_logins; DROP TABLE $ psql < ../cl.sql SET SET SET SET SET SET CREATE TABLE setval -------- 203650 (1 row) ALTER TABLE CREATE INDEX CREATE INDEX # select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum -----+----------+----------+-----------+--------+----------+---------- (0 rows) # VACUUM FULL analyze; VACUUM # select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum -----+----------+----------+-----------+--------+----------+---------- (0 rows) # select count(1) from common_logins; count -------- 203361 (1 row) # delete from common_logins where uid in (select uid from common_logins where username not ilike 'potyty' limit 100000); DELETE 100000 mage=# select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum -----+----------+----------+-----------+--------+----------+---------- (0 rows) # VACUUM FULL analyze; VACUUM # select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum --------+----------+----------+----------------------------+--------+----------+---------- 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3 (2 rows) # delete from common_logins where uid in (select uid from common_logins where username not ilike 'potyty' limit 80000); DELETE 80000 # VACUUM FULL analyze; VACUUM # select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum --------+----------+----------+----------------------------+--------+----------+---------- 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S | 3 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1 (3 rows) The 2 rows part seems to be (for me) a non-locale-related, but serious problem. I have the data file, it is confidental, but I can send it to official pg developers if needed. Mage Tom Lane wrote: >Jaime Casanova <systemguards@gmail.com> writes: > > >>On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote: >> >> >>>Ok, that explains then the problem... but the index is arguably corrupt >>>in this case, with the important difference that it's not even fixable >>>by reindex... >>> >>>I guess what the OP really wants is a solution to his problem. >>> >>> > > > >>MAGE was reproducing the problem in a little table that can be send >>but now tolds me that the problem in the test table disappear when a >>VACUUM was executed... is this consistent with the idea of locale >>problem? >> >> > >The VACUUM might have caused the planner not to use the index anymore; >check EXPLAIN. > > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > >
Tom Lane <tgl@sss.pgh.pa.us> writes: > Non-self-consistent comparison results can lead to an index that is > either actually or effectively corrupt (because index searches proceed > down the wrong tree path and thus fail to find items that should be > found). So the observation that only index searches fail is consistent > with this idea. I wondered if there were some simple tests Postgres could do to notice the problem and report it. Checking to make sure strcoll(a,b) = -strcoll(b,a) for example. But then I thought of another idea. What if Postgres just used strxfrm() instead of strcoll everywhere? Then it ought to never produce inconsistent results. At least if strxfrm() doesn't just return randomly varying results for the same inputs. I suspect the worst case in practice is that strxfrm() will return the same data for just about every input string, which would hopefully be noticed by the user. But at least wouldn't cause corrupted indexes. -- greg
Greg Stark <gsstark@mit.edu> writes: > But then I thought of another idea. What if Postgres just used strxfrm() > instead of strcoll everywhere? Then it ought to never produce inconsistent > results. At least if strxfrm() doesn't just return randomly varying results > for the same inputs. AFAICS the most that could accomplish is to make failures more obvious; it wouldn't actually fix anything. I don't think that's worth the runtime penalty it would incur. What we *ought* to be doing is trying to figure a way to detect and disallow inconsistent locale/encoding combinations. We've avoided that because there seems no general platform-independent way to find out the encoding expected by a locale. But surely we could manage to make it work at least on glibc and Windows, which would be a step ahead of doing nothing. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > But then I thought of another idea. What if Postgres just used strxfrm() > > instead of strcoll everywhere? Then it ought to never produce inconsistent > > results. At least if strxfrm() doesn't just return randomly varying results > > for the same inputs. > > AFAICS the most that could accomplish is to make failures more obvious; > it wouldn't actually fix anything. I don't think that's worth the > runtime penalty it would incur. What runtime penalty? It seems likely that strcoll is implemented by the equivalent of calling strxfrm twice internally anyways. I don't see how that could produce inconsistent results unless the internal interface has some error handling codepath that returns an error that isn't being checked. In fact calling strxfrm() has the potential to open up some optimizations. Like caching the transformed string for the duration of a single sort operation instead of repeatedly transforming it. > What we *ought* to be doing is trying to figure a way to detect and > disallow inconsistent locale/encoding combinations. We've avoided that > because there seems no general platform-independent way to find out the > encoding expected by a locale. But surely we could manage to make it > work at least on glibc and Windows, which would be a step ahead of doing > nothing. Harumph. When I suggested having a strxfrm() function like the ones three different people have independently developed and posted that would work though slowly on all platforms, used only standard libc functions and perform fine on at least glibc you complained it wasn't portable enough. -- greg
Greg Stark <gsstark@mit.edu> writes: > What runtime penalty? It seems likely that strcoll is implemented by the > equivalent of calling strxfrm twice internally anyways. Only by a very incompetent implementor. regards, tom lane
Mage <mage@mage.hu> writes: > Tom, I can send you the data I dumped and you can try it out with same > settings. I trust you. Thanks. After digging through it, I can exhibit the problem: in hu_HU locale and LATIN2 encoding, on a Linux box (Fedora Core 4, but probably all glibc installations will do this), I get mage=# select 'potyty'::varchar = 'potty'::varchar; ?column? ---------- f (1 row) mage=# select 'potyty'::varchar < 'potty'::varchar; ?column? ---------- f (1 row) mage=# select 'potyty'::varchar > 'potty'::varchar; ?column? ---------- f (1 row) Needless to say, this makes sorting and btree indexing very unhappy, as they take the trichotomy law as an article of faith ;-) I don't know anything about hu_HU comparison rules, but it appears that strcoll() believes that these two strings should be considered equal. Is that sane? The immediate cause of the problem is that texteq() and textne() have a "fast path" for unequal-length inputs: /* fast path for different-length inputs */ if (VARSIZE(arg1) != VARSIZE(arg2)) result = false; else result = (text_cmp(arg1, arg2) == 0); (text_cmp is what invokes strcoll.) Thus the = operator returns false, while the other two go to strcoll() and then return false. Perhaps the fast-path check is a bad idea, but fixing this is not just a matter of removing that. If we subscribe to strcoll's worldview then we have to conclude that *text strings are not hashable*, because strings that should be "equal" may have different hash codes. And at least in the current PG code, that's not something we can flip on and off depending on the locale --- texteq would have to be marked non hashable in the system catalogs, meaning a big performance hit for *everybody* even if their locale is not this weird. The other approach we could take is to define text comparison as yielding equality only for bitwise-equal strings. If strcoll() returns zero then ask strcmp() for a second opinion. This would mean that we'd sort according to strcoll in the main, but strings that strcoll considers equal but aren't physically identical would sort in codeset order. I can't see that this would do any harm in the context of sorting rules, but the question of what equality means is something for users to answer. Do you think that these two queries ought to yield the same rows in hu_HU locale, or not? select * from common_logins where username = 'potyty'; select * from common_logins where username = 'potty'; regards, tom lane
On Fri, Dec 16, 2005 at 12:12:08PM -0500, Tom Lane wrote: > Thanks. After digging through it, I can exhibit the problem: in hu_HU > locale and LATIN2 encoding, on a Linux box (Fedora Core 4, but probably > all glibc installations will do this), I get I don't know if this is related or not, but did the following issue from a year ago ever get resolved? We were wondering then if the Hungarian locale on some platforms might be causing problems. http://archives.postgresql.org/pgsql-bugs/2004-12/msg00206.php http://archives.postgresql.org/pgsql-bugs/2004-12/msg00228.php -- Michael Fuhr
On Fri, 2005-12-16 at 18:12, Tom Lane wrote: [snip] > I don't know anything about hu_HU comparison rules, but it appears that > strcoll() believes that these two strings should be considered equal. > Is that sane? It is sane in a way, as the "ty" combination is pronounced together as a single consonant in hungarian, and "tty" is the 'strong' version of it. The usual way of making a consonant strong in Hungarian is to double it, which works well for simple consonants, but for "ty" is normally written "tty". So "tyty" and "tty" could be arguably both taken as double "ty", except that the official form is "tty"... but from a pronunciation point of view they ARE equivalent in hungarian. Cheers, Csaba.
Michael Fuhr <mike@fuhr.org> writes: > I don't know if this is related or not, but did the following issue > from a year ago ever get resolved? We were wondering then if the > Hungarian locale on some platforms might be causing problems. AFAIR we never did figure that one out. I wasn't able to reproduce it using RHEL, but the complainant was using SLES which might possibly have had different Hungarian locale rules at the time. I'm not sure how much those have changed across different releases of glibc. Now that we see the mechanism for the problem, it might well explain various odd reports we've gotten from people using a number of different locales. I've tended to write these off as locale-vs-encoding pilot error, but maybe they weren't all that. regards, tom lane
Csaba Nagy <nagy@ecircle-ag.com> writes: > ... So "tyty" and "tty" could be arguably both taken as double "ty", > except that the official form is "tty"... but from a pronunciation point > of view they ARE equivalent in hungarian. That's fair enough, but the question is should they be taken as equivalent for string-comparison purposes? (English has plenty of cases where different letter combinations sound alike, but we don't consider them equal because of that. That may not be a good analogy though. Also, if there are cases in other locales where strcoll considers non-identical strings equal, the reasoning for it might be quite different.) regards, tom lane
On Fri, Dec 16, 2005 at 12:12:08PM -0500, Tom Lane wrote: > Perhaps the fast-path check is a bad idea, but fixing this is not just > a matter of removing that. If we subscribe to strcoll's worldview then > we have to conclude that *text strings are not hashable*, because > strings that should be "equal" may have different hash codes. And at > least in the current PG code, that's not something we can flip on and off > depending on the locale --- texteq would have to be marked non hashable > in the system catalogs, meaning a big performance hit for *everybody* > even if their locale is not this weird. That's true, in the sense that unconverted strings are not hashable. This is what strxfrm was created for, to return the sorting key for a string. A quick C program demonstrates that indeed in that locale these two strings are equal, whereas in en_AU they are not. $ LC_ALL=hu_HU ./strxfrm potyty potty String 1: potyty Strxfrm 1: " ((\x01\x02\x02\x02\x02\x01\x02\x02\x02\x02 String 2: potty Strxfrm 2: " ((\x01\x02\x02\x02\x02\x01\x02\x02\x02\x02 $ LC_ALL=en_AU ./strxfrm potyty potty String 1: potyty Strxfrm 1: \x1B\x1A\x1F$\x1F$\x01\x02\x02\x02\x02\x02\x02\x01\x02\x02\x02\x02\x02\x02 String 2: potty Strxfrm 2: \x1B\x1A\x1F\x1F$\x01\x02\x02\x02\x02\x02\x01\x02\x02\x02\x02\x02 I think the only way to make indexes properly locale sensetive would be to either use strcoll() in all cases, or store the result from strxfrm() in the index. Anything else will break somewhere. In any case, we first need to determine which answer is correct, before we run off trying to fix it. This is Glibc 2.3.2 on a Debian Linux system. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On Fri, 2005-12-16 at 18:52, Tom Lane wrote: > Csaba Nagy <nagy@ecircle-ag.com> writes: > > ... So "tyty" and "tty" could be arguably both taken as double "ty", > > except that the official form is "tty"... but from a pronunciation point > > of view they ARE equivalent in hungarian. > > That's fair enough, but the question is should they be taken as > equivalent for string-comparison purposes? (English has plenty of > cases where different letter combinations sound alike, but we don't > consider them equal because of that. That may not be a good analogy > though. Also, if there are cases in other locales where strcoll > considers non-identical strings equal, the reasoning for it might be > quite different.) Well, I'm not an expert on this one. In any case, hungarian has phonetical writing as opposed to the etymological writing English has. So in hungarian there is a 1 to 1 mapping between the sounds and the signs used to depict them... so pronunciation is somewhat more relevant in sorting I guess. But I'm not a linguist so I won't know for sure. Cheers, Csaba.
Martijn van Oosterhout <kleptog@svana.org> writes: > On Fri, Dec 16, 2005 at 12:12:08PM -0500, Tom Lane wrote: >> Perhaps the fast-path check is a bad idea, but fixing this is not just >> a matter of removing that. If we subscribe to strcoll's worldview then >> we have to conclude that *text strings are not hashable*, because >> strings that should be "equal" may have different hash codes. > This is what strxfrm was created for, to return the sorting key for a > string. Ah. So we could redefine hashtext() to return the hash of the strxfrm value. Slow, but a lot better than giving up hash join and hash aggregation altogether... > In any case, we first need to determine which answer is correct, before > we run off trying to fix it. Agreed. regards, tom lane
On Fri, Dec 16, 2005 at 06:54:15PM +0100, Martijn van Oosterhout wrote: > That's true, in the sense that unconverted strings are not hashable. > This is what strxfrm was created for, to return the sorting key for a > string. A quick C program demonstrates that indeed in that locale these > two strings are equal, whereas in en_AU they are not. FWIW, here's some links to Microsoft and MySQL dealing with the same issue, so we're not alone here. Hungarian seems to be a complex language to sort, but it seems that glibc is right in this case. http://blogs.msdn.com/michkap/archive/2005/11/13/491646.aspx http://bugs.mysql.com/bug.php?id=12519 Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On Fri, Dec 16, 2005 at 01:06:58PM -0500, Tom Lane wrote: > Ah. So we could redefine hashtext() to return the hash of the strxfrm > value. Slow, but a lot better than giving up hash join and hash > aggregation altogether... Not to put too fine a point on it, but either you want locale-sensetive sorting or you don't. If you do, you need to realise the cost associated with it. Correctness above speed after all. Which reminds me, I should probably finish that COLLATE patch. Then you could choose between: 'putty' = 'putyty' COLLATE C (false) 'putty' = 'putyty' COLLATE 'hu_HU' (true) Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On Fri, 16 Dec 2005 11:59:48 -0600, Csaba Nagy <nagy@ecircle-ag.com> wrote: > On Fri, 2005-12-16 at 18:52, Tom Lane wrote: >> Csaba Nagy <nagy@ecircle-ag.com> writes: >> > ... So "tyty" and "tty" could be arguably both taken as double "ty", >> > except that the official form is "tty"... but from a pronunciation >> point >> > of view they ARE equivalent in hungarian. >> >> That's fair enough, but the question is should they be taken as >> equivalent for string-comparison purposes? (English has plenty of >> cases where different letter combinations sound alike, but we don't >> consider them equal because of that. That may not be a good analogy >> though. Also, if there are cases in other locales where strcoll >> considers non-identical strings equal, the reasoning for it might be >> quite different.) > > Well, I'm not an expert on this one. In any case, hungarian has > phonetical writing as opposed to the etymological writing English has. > So in hungarian there is a 1 to 1 mapping between the sounds and the > signs used to depict them... so pronunciation is somewhat more relevant > in sorting I guess. But I'm not a linguist so I won't know for sure. > Trouble is, you can never guarantee that you're dealing with actual words. What of you're comparing someone's password that happens to contain combination of letters that act in this way? > Cheers, > Csaba. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Russ
Martijn van Oosterhout <kleptog@svana.org> writes: > FWIW, here's some links to Microsoft and MySQL dealing with the same > issue, so we're not alone here. Hungarian seems to be a complex > language to sort, but it seems that glibc is right in this case. The mysql bug link has a fairly detailed description, but it dodges the question that we need to answer here: do we want to make a finer-grain distinction than glibc does? In the test data that I got from Mage, the first clue I got was from looking at the results of an ORDER BY versus an index scan: potyos potyty potty potyty potyty potty potty6 potyos potty potyty potyty potty potyty potty6 Actually, the relative order of the "potyty"s and "potty"s is completely random at the moment. You've got to admit that this looks weird: you'd expect a database's ORDER BY output to impose at least a cosmetic ordering on these strings. Per what we've heard, it wouldn't matter much to a Hungarian speaker whether the "potyty"s come before or after the "potty"s, but it seems like it should be consistently one or the other. This argument doesn't really answer the question about whether WHERE username = 'potyty' should match a stored 'potty', however. My inclination is to say "no it shouldn't directly match --- apply a normalization function to your data if you think that tyty should be canonically spelled tty". If we had per-column locales there would be a stronger argument for allowing them to be equal, but right now this folding would occur for all text in a database ... and surely this would be considered a bug for any text that happened not to be Hungarian words. But perhaps my view is overly influenced by performance considerations. regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes: > On Fri, Dec 16, 2005 at 01:06:58PM -0500, Tom Lane wrote: >> Ah. So we could redefine hashtext() to return the hash of the strxfrm >> value. Slow, but a lot better than giving up hash join and hash >> aggregation altogether... > Not to put too fine a point on it, but either you want locale-sensetive > sorting or you don't. Nobody's said anything about giving up locale-sensitive sorting. The question is about locale-sensitive equality: does it really make sense that 'tty' = 'tyty'? Would your answer change in the context '/dev/tty' = '/dev/tyty'? Are you willing to *not have access* to a text comparison operator that will make the distinction? I'm inclined to think that this is more like the occasional need for accent-insensitive comparisons. It seems generally agreed that you want something like smash('ab') = smash('áb') rather than making the strings equal in all contexts. Of course, not being a native speaker of any of the affected languages, my opinion shouldn't be taken too seriously ... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > The other approach we could take is to define text comparison as > yielding equality only for bitwise-equal strings. If strcoll() returns > zero then ask strcmp() for a second opinion. Fwiw this has come up before on this list and it was discovered this is effectively what Perl does, probably for similar motivations wrt to hashes. I think it's probably the least bad solution, even if it's not really the right thing. -- greg
At 01:40 PM 12/16/2005 -0500, Tom Lane wrote: >Nobody's said anything about giving up locale-sensitive sorting. The >question is about locale-sensitive equality: does it really make sense >that 'tty' = 'tyty'? Would your answer change in the context >'/dev/tty' = '/dev/tyty'? Are you willing to *not have access* to a >text comparison operator that will make the distinction? > >I'm inclined to think that this is more like the occasional need for >accent-insensitive comparisons. It seems generally agreed that you want >something like smash('ab') = smash('áb') rather than making the >strings equal in all contexts. I agree. I would prefer for everything to be compared without any collation/corruption by default, and for there to be a function to pick the desired comparison behaviour ( Can all that functionality be done with the collate clause?). Because most databases are multi-locale whether the humans are aware of it or not: The Computer "locale", human locale #1, unknown/international locale, human locale #2, ... In a column for license keys, "tty" should rarely be the same as "tyty". In a column for base64 data (crypto hashes, etc) "tty" should NEVER be the same as "tyty". In a column for domain names, I doubt it is clear whether you want to match tty.ibm.hu just because tyty.ibm.hu exists. But in a column for license owner names, one might want "tty" and "tyty" to be the same - one might have to have a multicolumn index depending on the owner's locale of choice. I recommend that for these reasons initdb should always pick "no mangled" text by default, no matter what the locale setting is. And that users should be advised of the potential consequences of mangling or I would even say corrupting all text in their databases by default. Regards, Link.
> At 01:40 PM 12/16/2005 -0500, Tom Lane wrote: > >Nobody's said anything about giving up locale-sensitive sorting. The > >question is about locale-sensitive equality: does it really make sense > >that 'tty' = 'tyty'? Would your answer change in the context > >'/dev/tty' = '/dev/tyty'? Are you willing to *not have access* to a > >text comparison operator that will make the distinction? On Sat, Dec 17, 2005 at 11:49:48AM +0800, Lincoln Yeoh wrote: > I would prefer for everything to be compared without any > collation/corruption by default, and for there to be a function to pick the > desired comparison behaviour ( Can all that functionality be done with the > collate clause?). > In a column for license keys, "tty" should rarely be the same as "tyty". > In a column for base64 data (crypto hashes, etc) "tty" should NEVER be the > same as "tyty". > > In a column for domain names, I doubt it is clear whether you want to match > tty.ibm.hu just because tyty.ibm.hu exists. > > But in a column for license owner names, one might want "tty" and "tyty" to > be the same - one might have to have a multicolumn index depending on the > owner's locale of choice. > > I recommend that for these reasons initdb should always pick "no mangled" > text by default, no matter what the locale setting is. Tom, as a speaker of German I absolutely agree on the above. The database shouldn't be second guessing on the user intentions. If the user thinks she wants mangling of *all* text in the database by default she is wrong in most cases. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
At 11:49 AM 12/17/2005 +0800, Lincoln Yeoh wrote: >But in a column for license owner names, one might want "tty" and "tyty" >to be the same - one might have to have a multicolumn index depending on >the owner's locale of choice. To make myself clear, one might want to store a person's name in one column and the locale it belongs to in another column. I wondering whether it is possible to have something like this: create table ppl_people ( id serial, locale text, name text ); create index idx_ppl_locale_name on ppl_people ( locale,smash(locale,name)) Then one could do: select * from ppl_people where locale=$locale and smash(locale,name) between smash($locale,$start) and smash($locale,$end) order by smash(locale,name) Not sure if that is correct, but hope you can understand me anyway. Would something like this be possible using the standard SQL syntax and the COLLATE feature? Or am I thinking of doing things the wrong way, and there's a better or more standard way? Regards, Link.
Where are we on this? Given the original report: online=# select * from common_logins where username = 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum -----+----------+----------+-----------+--------+----------+---------- (0 rows) online=# select * from common_logins where username like 'potyty'; uid | username | password | lastlogin | status | usertype | loginnum --------+----------+----------+----------------------------+--------+----------+---------- 155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1 60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3 174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S | 3 (3 rows) online=# select username, username = 'potyty' from common_logins where username like 'potyty'; username | ?column? ----------+---------- potyty | t potyty | t potyty | t (3 rows) I don't think we can state that our current behavior is correct. I realize we are being hit by the length comparison optimization, but ultimiately the issue is that the Hungarian-specific locale considers "tyty" and "tty" as the same string, which confuses our indexing comparisons. Is our fix going to be a Hungarian-specific one? --------------------------------------------------------------------------- Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > On Fri, Dec 16, 2005 at 01:06:58PM -0500, Tom Lane wrote: > >> Ah. So we could redefine hashtext() to return the hash of the strxfrm > >> value. Slow, but a lot better than giving up hash join and hash > >> aggregation altogether... > > > Not to put too fine a point on it, but either you want locale-sensetive > > sorting or you don't. > > Nobody's said anything about giving up locale-sensitive sorting. The > question is about locale-sensitive equality: does it really make sense > that 'tty' = 'tyty'? Would your answer change in the context > '/dev/tty' = '/dev/tyty'? Are you willing to *not have access* to a > text comparison operator that will make the distinction? > > I'm inclined to think that this is more like the occasional need for > accent-insensitive comparisons. It seems generally agreed that you want > something like smash('ab') = smash('áb') rather than making the > strings equal in all contexts. > > Of course, not being a native speaker of any of the affected languages, > my opinion shouldn't be taken too seriously ... > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Where are we on this? Waiting to see if there's any input on what the behavior needs to be. regards, tom lane
Tom Lane wrote: > Perhaps the fast-path check is a bad idea, but fixing this is not > just a matter of removing that. If we subscribe to strcoll's > worldview then we have to conclude that *text strings are not > hashable*, because strings that should be "equal" may have different > hash codes. By the way, I have always been concerned about the feature of Unicode that you can write logically equivalent strings using different code-point sequences. Namely, you often have the option of writing an accented letter using the "legacy" single codepoint (like in ISO 8859-something) or alternatively using accept plus "base letter" as two code points. Collating systems should treat them the same, so hashing the byte values won't work anyway. This is a more extreme case of "tyty" vs. "tty" because using a proper rendering system, those Unicode strings should look the same to the naked eye. Therefore, I'm doubtful that using a binary comparison as tie-breaker is proper behavior. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes: > By the way, I have always been concerned about the feature of Unicode > that you can write logically equivalent strings using different > code-point sequences. Namely, you often have the option of writing an > accented letter using the "legacy" single codepoint (like in ISO > 8859-something) or alternatively using accept plus "base letter" as two > code points. Collating systems should treat them the same, so hashing > the byte values won't work anyway. This is a more extreme case of > "tyty" vs. "tty" because using a proper rendering system, those Unicode > strings should look the same to the naked eye. Therefore, I'm doubtful > that using a binary comparison as tie-breaker is proper behavior. Hm. Would you expect that these sequences generate identical strxfrm output? The weight of opinion later in the thread seems to be leaning towards the idea that we do not want to accept the word of strcoll/strxfrm about whether two strings are equal: there are too many scenarios where lax equality behavior would be a serious bug, and too few where it's critical to have it. I'm still prepared to listen to argument though. A possible compromise going forward would be to introduce an additional comparison operator that tests for strcoll equality --- but I'd vote for calling it something other than "=". regards, tom lane
Tom Lane wrote: >Bruce Momjian <pgman@candle.pha.pa.us> writes: > > >>Where are we on this? >> >> > >Waiting to see if there's any input on what the behavior needs to be. > > > Actually, "potyty" and "potty" are not equal in Hungarian language. We use the "tyty" form in complex words. Only data sorting requires the knowledge that they are neighbours. I am very afraid of queries when a condition has different meanings in the select and in the where clause. It is the worst. Even if the "tyty" and "tty" were same, I couldn't accept the actual behavior of the indexes, but they are not same. Mage
On Sat, Dec 17, 2005 at 02:08:06PM -0500, Tom Lane wrote: > The weight of opinion later in the thread seems to be leaning towards > the idea that we do not want to accept the word of strcoll/strxfrm about > whether two strings are equal: there are too many scenarios where lax > equality behavior would be a serious bug, and too few where it's > critical to have it. I'm still prepared to listen to argument though. Well, it seems to me that the problem is that there are several differing ideas of equal, depending on what you're trying to do. The real problem is that we are only allowed to choose one to be "=" and we're not sure people will understand the issues. > A possible compromise going forward would be to introduce an additional > comparison operator that tests for strcoll equality --- but I'd vote for > calling it something other than "=". I think the real solution is to implement COLLATE support. Then we can define all sorts of collations, like: C/POSIX C with case-insensetivity hu_HU with binary tie-break hu_HU with maximum munging UTF-8 with accent insenstivity. Then we can simply default everything to C/POSIX which would be a straight binary match (priciple of least surprise) and if people want anything else they need to specify the COLLATE they want either on the column or in the query. My patch was halfway there (the grammer was sorted, as were the SQL propgation rules) but there was some work to go. I'll see if I can get it to a working state. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Martijn van Oosterhout <kleptog@svana.org> writes: > I think the real solution is to implement COLLATE support. Maybe so, but we still need to figure out what we're doing for the back branches, and that won't be it ... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Peter Eisentraut <peter_e@gmx.net> writes: > > By the way, I have always been concerned about the feature of Unicode > > that you can write logically equivalent strings using different > > code-point sequences. Namely, you often have the option of writing an > > accented letter using the "legacy" single codepoint (like in ISO > > 8859-something) or alternatively using accept plus "base letter" as two > > code points. Collating systems should treat them the same, so hashing > > the byte values won't work anyway. This is a more extreme case of > > "tyty" vs. "tty" because using a proper rendering system, those Unicode > > strings should look the same to the naked eye. Therefore, I'm doubtful > > that using a binary comparison as tie-breaker is proper behavior. > > Hm. Would you expect that these sequences generate identical strxfrm > output? I think this is mixing up two different things. Using iso-8859-1 to encode "é" as a single byte versus using UTF8 which would take two bytes to encode it is an issue of using two *different* encodings. The actual string of characters being encoded is precisely the same string. That is, while the sequence of bytes in the encoded string is different the sequence of characters being encoded is precisely the same. Postgres doesn't really face this issue currently since it only supports one encoding at a time anyways. If Postgres supported multiple encodings and it was necessary to compare two strings in two different encodings then they would probably both have to be converted to a common encoding (presumably UTFx for some value of x) before comparing. There is a separate issue that some characters could theoretically have multiple representations even within the same encoding. This doesn't really happen in the usual non UTF encodings (like iso-8859-x) to my knowledge, but it can happen in UTF8 or UTF16 because, for example, you could use the variable length form that takes 2 bytes or even 4 bytes for characters that are really just plain ascii characters. However there are canonicalization rules that basically rule all but the shortest representation invalid unicode strings. I assume these exist precisely to make it easier to compare or hash unicode strings. I guess it's an open question whether the database should signal an error on such invalid strings or silently treat them as equivalent to a correct encoding of the same string. On the original issue I think the bottom line is that strings are sequences of characters and two sequences of characters should only compare equal if they contain the same characters in the same order. The encodings can be different and still represent the same string, but they do have to represent the same sequence of characters. If they represent two different sequences of characters -- even if the two sequences have the same significance in the language of the reader, they're still not actually the same sequence of characters. As long as both strings are encoded in the same encoding (whether that be iso-8859-1 or utf8 or whatever) sorting by strcoll and then strcmp will effectively give this set of semantics with one exception, the case of invalid UTF encodings that are not canonicalized where it will silently treat them as distinct strings from the correctly encoded string. One day when it's possible for the two strings to be in two different encodings then they will have to both be converted to an encoding that includes the union of the two character sets covered by the two encodings. -- greg
Greg Stark wrote: > Using iso-8859-1 to encode "é" as a single byte versus using UTF8 > which would take two bytes to encode it is an issue of using two > *different* encodings. But that's not what we are discussing. > There is a separate issue that some characters could theoretically > have multiple representations even within the same encoding. That is what we are discussing. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes: > Greg Stark wrote: > > Using iso-8859-1 to encode "é" as a single byte versus using UTF8 > > which would take two bytes to encode it is an issue of using two > > *different* encodings. > > But that's not what we are discussing. The poster to which Tom was responding was bringing it up as an issue. I was explaining how it was different. > > There is a separate issue that some characters could theoretically > > have multiple representations even within the same encoding. > > That is what we are discussing. Well the original discussion about the hungarian strings was about yet a third case entirely. Two different sequences of characters that have the same semantic significance. -- greg
On Sat, Dec 17, 2005 at 10:30:04PM +0100, Martijn van Oosterhout wrote: > I think the real solution is to implement COLLATE support. Then we can > define all sorts of collations, like: ... > My patch was halfway there (the grammer was sorted, as were the SQL > propgation rules) but there was some work to go. I'll see if I can get > it to a working state. That'd be truly excellent. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
[snip] > Trouble is, you can never guarantee that you're dealing with actual words. > What of you're comparing someone's password that happens to contain > combination of letters that act in this way? Well, in this case why would you care about how passwords are sorted ? :-) I think if the strings are linguistically meaningless, then it's also more or less meaningless how you sort them. The best thing would be to be able to specify the locale of each column, then this problem would go away, as you could sort text differently then passwords... Cheers, Csaba.
On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > I think the real solution is to implement COLLATE support. > > Maybe so, but we still need to figure out what we're doing for the back > branches, and that won't be it ... To be honest, there are really only a handful of locales that suffer from this issue, so perhaps we should document it and move on. Perhaps we should add some sanity checks to initdb to ensure that the locale is sane. Basically things like: FOO != foo (case-sensetivity) tty != tyty (wierd hungarian case) aei != äëï (accent-sensetivity) Until COLLATE support is working, I'm not sure if we should put that much effort into dealing with these cases. Even after we have COLLATE support we'll probably have to start using strxfrm to make these locales usable for indexes. Using a binary string compares for tie-breaks seems a waste of cycles for the 99% of locales that don't need it. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Csaba Nagy <nagy@ecircle-ag.com> writes: >> Trouble is, you can never guarantee that you're dealing with actual words. >> What of you're comparing someone's password that happens to contain >> combination of letters that act in this way? > Well, in this case why would you care about how passwords are sorted ? You don't. But you do care about what equality means, and the discussion in this thread is really about changing that, not about changing the behavior of sorting. (The side-effects on sorting will be so minor that it's unlikely anyone would notice, with either solution.) regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes: > On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote: >> Maybe so, but we still need to figure out what we're doing for the back >> branches, and that won't be it ... > To be honest, there are really only a handful of locales that suffer > from this issue, so perhaps we should document it and move on. "Let's not fix it" is really not an acceptable answer, because the behavior in the affected locales is entirely broken (inconsistent, etc). And how do you know which locales are affected, anyway? (Also, to be blunt, the COLLATE feature is pie in the sky until it is done and supports indexes --- which I'm not sure you even know how to do yet. I'm not prepared to assume it will be in 8.2.) regards, tom lane
On Mon, Dec 19, 2005 at 10:08:12AM -0500, Tom Lane wrote: > "Let's not fix it" is really not an acceptable answer, because the > behavior in the affected locales is entirely broken (inconsistent, > etc). And how do you know which locales are affected, anyway? I don't know which locales are affected. It just can't be that widespread because we're not getting similar reports for 99% of the locales out there. I'm not saying not fixing it is an option, it's just that a second compare on equality is going to be a waste in >99% of cases. > (Also, to be blunt, the COLLATE feature is pie in the sky until it > is done and supports indexes --- which I'm not sure you even know > how to do yet. I'm not prepared to assume it will be in 8.2.) Oh, getting indexes to use COLLATE is (relatively) the easy part. Where I'm mostly stuck on is teaching the planner how to decide which index returns the right order for what it's looking for. For example, should I be returning a seperate path for each possible collation order. Obviously not, but there will certainly be several orders returned depending on indexes, maybe others depending on any ORDER BY clauses in the query. Getting *that* right is harder and my planner knowledge isn't that great. Right now I'm updating the code for CVS tip. pg_indent has a great way of making large patches cease to work :(. Some other things seem to have changed also. Oh well... It's not a huge amount of work, it's just lots of interconnected pieces. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On Mon, Dec 19, 2005 at 08:37:26PM +0100, Martijn van Oosterhout wrote: > I don't know which locales are affected. It just can't be that > widespread because we're not getting similar reports for 99% of the > locales out there. Not getting reports doesn't mean the problem is rare. Perhaps people moved to another database. Perhaps people decided to solve their problem in client code. Perhaps 50% of all potential locales haven't been put to use with PostgreSQL such that the problem showed up. It's definitely worth doing something about. Had I the skills would I myself help with it. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes: > On Mon, Dec 19, 2005 at 08:37:26PM +0100, Martijn van Oosterhout wrote: >> I don't know which locales are affected. It just can't be that >> widespread because we're not getting similar reports for 99% of the >> locales out there. > Not getting reports doesn't mean the problem is rare. I'm not sure that we can say we're not getting reports, either. We've seen *plenty* of reports of strange comparison misbehavior. Up to now I've always written them off as pilot error (ie, incompatible locale and encoding selections) but now I suspect some of them were due to this issue. regards, tom lane
Martijn van Oosterhout wrote: >On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote: > > >>Martijn van Oosterhout <kleptog@svana.org> writes: >> >> >>>I think the real solution is to implement COLLATE support. >>> >>> >>Maybe so, but we still need to figure out what we're doing for the back >>branches, and that won't be it ... >> >> > >To be honest, there are really only a handful of locales that suffer >from this issue, so perhaps we should document it and move on. > I don't agree. Usually I read the whole documentation of the software I use, but you cannot presume that every user even with good sql skills will check the documentation for a thing he wouldn't imagine. With knowing the background it is understandable locale problem, but in the user's point of view it's a weird and serious bug which shouldn't be there. Using hu_HU with latin2 is a normal marrying. Some users (including me) don't always read the "known issues" chapter, even for a good quality software. Mage
Tom has applied a patch to fix this and backpatched it to all relivant branches. He might be preparing a summary email about this. --------------------------------------------------------------------------- Mage wrote: > Martijn van Oosterhout wrote: > > >On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote: > > > > > >>Martijn van Oosterhout <kleptog@svana.org> writes: > >> > >> > >>>I think the real solution is to implement COLLATE support. > >>> > >>> > >>Maybe so, but we still need to figure out what we're doing for the back > >>branches, and that won't be it ... > >> > >> > > > >To be honest, there are really only a handful of locales that suffer > >from this issue, so perhaps we should document it and move on. > > > I don't agree. Usually I read the whole documentation of the software I > use, but you cannot presume that every user even with good sql skills > will check the documentation for a thing he wouldn't imagine. > > With knowing the background it is understandable locale problem, but in > the user's point of view it's a weird and serious bug which shouldn't be > there. Using hu_HU with latin2 is a normal marrying. > > Some users (including me) don't always read the "known issues" chapter, > even for a good quality software. > > Mage > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073