Thread: is this a bug or I am blind?

is this a bug or I am blind?

From

Mage

Date:

15 December 2005, 09:12:20

online=# select * from common_logins where username = 'potyty';
 uid | username | password | lastlogin | status | usertype | loginnum
-----+----------+----------+-----------+--------+----------+----------
(0 rows)

online=# select * from common_logins where username like 'potyty';
  uid   | username | password |         lastlogin          | status |
usertype | loginnum
--------+----------+----------+----------------------------+--------+----------+----------
 155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      |
S        |        1
  60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      |
S        |        3
 174041 | potyty   | board    | 2005-02-17 00:00:13.706144 | A      |
S        |        3
(3 rows)

online=# select username, username = 'potyty' from common_logins where
username like 'potyty';
 username | ?column?
----------+----------
 potyty   | t
 potyty   | t
 potyty   | t
(3 rows)


psql 8.0.3, Debian.

       Mage

Re: is this a bug or I am blind?

From

"A. Kretschmer"

Date:

15 December 2005, 09:23:06

am  15.12.2005, um 14:12:23 +0100 mailte Mage folgendes:
> online=# select * from common_logins where username = 'potyty';
>  uid | username | password | lastlogin | status | usertype | loginnum
> -----+----------+----------+-----------+--------+----------+----------
> (0 rows)
>
> online=# select * from common_logins where username like 'potyty';
>   uid   | username | password |         lastlogin          | status |
> usertype | loginnum
> --------+----------+----------+----------------------------+--------+----------+----------
>  155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      | S
>    |        1
>   60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      | S
>    |        3
>  174041 | potyty   | board    | 2005-02-17 00:00:13.706144 | A      | S
>    |        3
> (3 rows)

Try:

select *, length(username), length('potyty') from common_logins where username like 'potyty';


My guess:

select length(username) from common_logins where username like 'potyty';

is _NOT_ 6, there is a SPACE like 'potyty '.


HTH, Andreas
--
Andreas Kretschmer    (Kontakt: siehe Header)
Heynitz:  035242/47212,      D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
 ===    Schollglas Unternehmensgruppe    ===

Re: is this a bug or I am blind?

From

Jaime Casanova

Date:

15 December 2005, 09:28:20

On 12/15/05, A. Kretschmer <andreas.kretschmer@schollglas.com> wrote:
> am  15.12.2005, um 14:12:23 +0100 mailte Mage folgendes:
> > online=# select * from common_logins where username = 'potyty';
> >  uid | username | password | lastlogin | status | usertype | loginnum
> > -----+----------+----------+-----------+--------+----------+----------
> > (0 rows)
> >
> > online=# select * from common_logins where username like 'potyty';
> >   uid   | username | password |         lastlogin          | status |
> > usertype | loginnum
> > --------+----------+----------+----------------------------+--------+----------+----------
> >  155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      | S
> >    |        1
> >   60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      | S
> >    |        3
> >  174041 | potyty   | board    | 2005-02-17 00:00:13.706144 | A      | S
> >    |        3
> > (3 rows)
>
> Try:
>
> select *, length(username), length('potyty') from common_logins where username like 'potyty';
>
>
> My guess:
>
> select length(username) from common_logins where username like 'potyty';
>
> is _NOT_ 6, there is a SPACE like 'potyty '.
>
>

even is that is true i think you need the "comodin characters" (ie: %.
_) to make "like" behave different from simple comparisons



> HTH, Andreas
> --
> Andreas Kretschmer    (Kontakt: siehe Header)
> Heynitz:  035242/47212,      D1: 0160/7141639
> GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
>  ===    Schollglas Unternehmensgruppe    ===
>

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: is this a bug or I am blind?

From

Csaba Nagy

Date:

15 December 2005, 09:33:12

[snip]
> even is that is true i think you need the "comodin characters" (ie: %.
> _) to make "like" behave different from simple comparisons

Not entirely true, if the database was initialized in a different locale
than C, then the direct comparison will probably go for an index on
username, while "like" will not. Which points to a possible index
corruption... which might be interesting for the developers to
investigate, but I would guess a reindex will solve the problem for the
OP if he has it urgent...

Cheers,
Csaba.

Re: is this a bug or I am blind?

From

Richard Huxton

Date:

15 December 2005, 09:35:07

Mage wrote:
> online=# select * from common_logins where username = 'potyty';
> online=# select * from common_logins where username like 'potyty';

It's probably worth seeing whether these have different plans (EXPLAIN
ANALYSE...) and if the = is using an index but like isn't.
If so, try issuing "set enable_indexscan=false" first and see what
happens then.

If that makes a difference then I'd guess you have one of two things:
1. A corrupt index (check the REINDEX command)
2. (perhaps more likely) Some localisation issues.
    What encoding/locale settings are you using?

--
   Richard Huxton
   Archonet Ltd

Re: is this a bug or I am blind?

From

Jaime Casanova

Date:

15 December 2005, 09:40:33

On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote:
> [snip]
> > even is that is true i think you need the "comodin characters" (ie: %.
> > _) to make "like" behave different from simple comparisons
>
> Not entirely true, if the database was initialized in a different locale
> than C, then the direct comparison will probably go for an index on
> username, while "like" will not. Which points to a possible index
> corruption... which might be interesting for the developers to
> investigate, but I would guess a reindex will solve the problem for the
> OP if he has it urgent...
>
> Cheers,
> Csaba.
>
>

Mage if it's not urgent maybe you can make a post in -hackers and
follow instructions about how to get more info to see what happened
here...

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: is this a bug or I am blind?

From

Csaba Nagy

Date:

15 December 2005, 09:43:26

[snip]
> If that makes a difference then I'd guess you have one of two things:
> 1. A corrupt index (check the REINDEX command)
> 2. (perhaps more likely) Some localisation issues.
>     What encoding/locale settings are you using?

Based on the 3rd query of the OP, where the direct comparison results in
"true" for all the rows which matched the "like", I would exclude the
localisation issues variant... unless = is not equals in all cases ;-)

Cheers,
Csaba.

Re: is this a bug or I am blind?

From

Mage

Date:

15 December 2005, 09:50:22

A. Kretschmer wrote:

>
>select *, length(username), length('potyty') from common_logins where username like 'potyty';
>
>
>My guess:
>
>select length(username) from common_logins where username like 'potyty';
>
>is _NOT_ 6, there is a SPACE like 'potyty '.
>
>
>
If you look my 3rd query, you will see that there are no spaces, however:

select *, length(username), length('potyty') from common_logins where
username like 'potyty';
  uid   | username | password |         lastlogin          | status |
usertype | loginnum | length | length
--------+----------+----------+----------------------------+--------+----------+----------+--------+--------
 155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      |
S        |        1 |      6 |      6
  60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      |
S        |        3 |      6 |      6
 174041 | potyty   | board    | 2005-02-17 00:00:13.706144 | A      |
S        |        3 |      6 |      6
(3 rows)

       Mage

Re: is this a bug or I am blind?

From

Mage

Date:

15 December 2005, 09:57:23

Csaba Nagy wrote:

>[snip]
>
>
>>even is that is true i think you need the "comodin characters" (ie: %.
>>_) to make "like" behave different from simple comparisons
>>
>>
>
>Not entirely true, if the database was initialized in a different locale
>than C, then the direct comparison will probably go for an index on
>username, while "like" will not. Which points to a possible index
>corruption... which might be interesting for the developers to
>investigate, but I would guess a reindex will solve the problem for the
>OP if he has it urgent...
>
>
I thought that it may be a locale problem, but:
- look at my 3rd query
- potyty doesn't contain special chars

# EXPLAIN ANALYZE select * from common_logins where username = 'potyty';
                                                                QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using common_logins_username_idx on common_logins
(cost=0.00..4.30 rows=1 width=47) (actual time=0.056..0.056 rows=0 loops=1)
   Index Cond: ((username)::text = 'potyty'::text)
 Total runtime: 0.109 ms
(3 rows)

online=# EXPLAIN ANALYZE select * from common_logins where username like
'potyty';
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on common_logins  (cost=0.00..63833.88 rows=1 width=47)
(actual time=180.333..262.492 rows=3 loops=1)
   Filter: ((username)::text ~~ 'potyty'::text)
 Total runtime: 262.551 ms
(3 rows)

I tried it in two databases (dump and load to another one), so I don't
think that we have corrupted indexes.

I can try on a newer version of postgresql on another server.

By the way, if this is a bug then it's a serious one. We have it in
production environment.

       Mage

Re: is this a bug or I am blind?

From

"Gregory S. Williamson"

Date:

15 December 2005, 10:30:34

>
> If you look my 3rd query, you will see that there are no spaces, however:
>
> select *, length(username), length('potyty') from common_logins where
username like 'potyty';
>   uid   | username | password |         lastlogin          | status |
usertype | loginnum | length | length
> --------+----------+----------+----------------------------+--------+----------+----------+--------+--------
>  155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      |
S        |        1 |      6 |      6
>   60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      |
S        |        3 |      6 |      6
>  174041 | potyty   | board    | 2005-02-17 00:00:13.706144 | A      |
S        |        3 |      6 |      6
> (3 rows)
>
>        Mage

I am puzzled by the lack of a "%" in the LIKE query. When I try this on postgres 7.4 and 8.0 I get no rows when I am
missingit; including it works as expected. 

The names have been changed to protect the guilty ;-} but the core of it is true -- no "%" means wierdnesses, I think.

gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home';
 gex_clientname
----------------
(0 rows)

gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home%';
                        gex_clientname
--------------------------------------------------------------
 HomeHappinesses
 HomeMorgageValues, Inc.
(2 rows)

Could you try your query again with the wild card ?

HTH

Greg Williamson
DBA
GlobeXplorer LLC

Re: is this a bug or I am blind?

From

Jaime Casanova

Date:

15 December 2005, 10:38:33

>
> I tried it in two databases (dump and load to another one), so I don't
> think that we have corrupted indexes.
>

the problem persist after a dump a reload? then there is not because
an index corruption... can you send part of the data that reproduces
the bug?

> I can try on a newer version of postgresql on another server.
>
> By the way, if this is a bug then it's a serious one. We have it in
> production environment.
>
>       Mage
>

what locales do you have? encoding?

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: is this a bug or I am blind?

From

Richard Huxton

Date:

15 December 2005, 10:40:33

Csaba Nagy wrote:
> [snip]
>> If that makes a difference then I'd guess you have one of two things:
>> 1. A corrupt index (check the REINDEX command)
>> 2. (perhaps more likely) Some localisation issues.
>>     What encoding/locale settings are you using?
>
> Based on the 3rd query of the OP, where the direct comparison results in
> "true" for all the rows which matched the "like", I would exclude the
> localisation issues variant... unless = is not equals in all cases ;-)

Well spotted Csaba - that _would_ seem to point to the index.

--
   Richard Huxton
   Archonet Ltd

Re: is this a bug or I am blind?

From

Richard Huxton

Date:

15 December 2005, 10:41:54

Gregory S. Williamson wrote:
>
> I am puzzled by the lack of a "%" in the LIKE query. When I try this on postgres 7.4 and 8.0 I get no rows when I am
missingit; including it works as expected. 
>
> The names have been changed to protect the guilty ;-} but the core of it is true -- no "%" means wierdnesses, I
think.
>
> gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home';
>  gex_clientname
> ----------------
> (0 rows)
>
> gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home%';
>                         gex_clientname
> --------------------------------------------------------------
>  HomeHappinesses
>  HomeMorgageValues, Inc.
> (2 rows)
>
> Could you try your query again with the wild card ?

But "like" without any wildcards should be the same as "=", but it isn't
in the original post.

--
   Richard Huxton
   Archonet Ltd

Re: is this a bug or I am blind?

From

Mage

Date:

15 December 2005, 10:56:05

Jaime Casanova wrote:

>>I tried it in two databases (dump and load to another one), so I don't
>>think that we have corrupted indexes.
>>
>>
>>
>
>the problem persist after a dump a reload? then there is not because
>an index corruption... can you send part of the data that reproduces
>the bug?
>
>
>
>>I can try on a newer version of postgresql on another server.
>>
>>By the way, if this is a bug then it's a serious one. We have it in
>>production environment.
>>
>>      Mage
>>
>>
>>
>
>what locales do you have? encoding?
>
>
>
The problem can be reproduced on pgsql 8.0.3, compiled from source.
This is a third machine where the bug persists. I dumped the table and
loaded in.
I cannot send the table to you because it contains user data and
passwords. I will try to create a fake one with the same problem.

./configure --prefix=/usr/local/pgsql --with-python
client_encoding                | LATIN2
 lc_collate                     | hu_HU
 lc_ctype                       | hu_HU
 lc_messages                    | en_US
 lc_monetary                    | en_US
 lc_numeric                     | en_US
 lc_time                        | en_US
 server_encoding                | LATIN2
 server_version                 | 8.0.3

          Mage

Re: is this a bug or I am blind?

From

"Gregory S. Williamson"

Date:

15 December 2005, 11:02:44

Well, then I have the disease. The database is UNICODE:
gex_runtime=# select gex_clientname from gex_clients where gex_clientname = 'HomeHappinesses';
                        gex_clientname
--------------------------------------------------------------
 HomeGain
(1 row)

gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'HomeHappinesses';
 gex_clientname
----------------
(0 rows)

>
> Gregory S. Williamson wrote:
> >
> > I am puzzled by the lack of a "%" in the LIKE query. When I try this on
> > postgres 7.4 and 8.0 I get no rows when I am missing it; including it
> > works as expected.
> >
> > The names have been changed to protect the guilty ;-} but the core of it
> > is true -- no "%" means wierdnesses, I think.
> >
> > gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home';
> >  gex_clientname
> > ----------------
> > (0 rows)
> >
> > gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'Home%';
> >                         gex_clientname
> > --------------------------------------------------------------
> >  HomeHappinesses
> >  HomeMorgageValues, Inc.
> > (2 rows)
> >
> > Could you try your query again with the wild card ?
>
> But "like" without any wildcards should be the same as "=", but it isn't
> in the original post.
>

Well then I have the same behavior -- the database is UNICODE:

gex_runtime=# select gex_clientname from gex_clients where gex_clientname = 'HomeHappinesses';
                        gex_clientname
--------------------------------------------------------------
 HomeHappinesses
(1 row)

gex_runtime=# select gex_clientname from gex_clients where gex_clientname like 'HomeHappinesses';
 gex_clientname
----------------
(0 rows)

An "=" is not equivalent to LIKE with no wildcard.

I never really thought of this as a bug, but if it is ... ring one up for Mage as a good catch.

Re: is this a bug or I am blind?

From

Mage

Date:

15 December 2005, 11:08:13

Mage wrote:

> ./configure --prefix=/usr/local/pgsql --with-python
> client_encoding                | LATIN2
> lc_collate                     | hu_HU
> lc_ctype                       | hu_HU
> lc_messages                    | en_US
> lc_monetary                    | en_US
> lc_numeric                     | en_US
> lc_time                        | en_US
> server_encoding                | LATIN2
> server_version                 | 8.0.3
>
I have created a table that can be sent to you to examine the bug.
I am actually vacuuming it for further testing.
Where can I upload it? Sorry, I cannot host it.

       Mage

Re: is this a bug or I am blind?

From

Tom Lane

Date:

15 December 2005, 11:20:29

Richard Huxton <dev@archonet.com> writes:
> Csaba Nagy wrote:
>> Based on the 3rd query of the OP, where the direct comparison results in
>> "true" for all the rows which matched the "like", I would exclude the
>> localisation issues variant... unless = is not equals in all cases ;-)

> Well spotted Csaba - that _would_ seem to point to the index.

No, localization issues should be real high on your list.  In particular
I wonder whether this is the old bugaboo of using a database encoding
that's incompatible with the postmaster's locale setting.  We've seen
that on some platforms strcoll() gets completely confused by this and
returns comparison results that are not even self-consistent.

Non-self-consistent comparison results can lead to an index that is
either actually or effectively corrupt (because index searches proceed
down the wrong tree path and thus fail to find items that should be
found).  So the observation that only index searches fail is consistent
with this idea.

            regards, tom lane

Re: is this a bug or I am blind?

From

Tom Lane

Date:

15 December 2005, 11:22:48

Richard Huxton <dev@archonet.com> writes:
> But "like" without any wildcards should be the same as "=", but it isn't
> in the original post.

I'm too lazy to go look at the code right now, but I think that the
reduction of "x LIKE constant-pattern" to "x = constant-pattern" is
part of the LIKE index optimization code, which means it'd only get
done in C locale.  We're missing a bet there perhaps.

            regards, tom lane

Re: is this a bug or I am blind?

From

Csaba Nagy

Date:

15 December 2005, 11:43:28

Ok, that explains then the problem... but the index is arguably corrupt
in this case, with the important difference that it's not even fixable
by reindex...

I guess what the OP really wants is a solution to his problem. If the
table is not too big, a quick fix would be to just drop the index.
Then figure out an encoding+locale combination which can be used to
properly host the same sample data which leads to failure here and still
satisfies the OP's string sorting and other needs, and then dump
reload... or is there other better way to fix things ?

I guess a note in the docs about not using the same
encoding/locale/(postgres version?)/(OS?) combination as the OP would
make sense too ?

Cheers,
Csaba.

On Thu, 2005-12-15 at 16:20, Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
> > Csaba Nagy wrote:
> >> Based on the 3rd query of the OP, where the direct comparison results in
> >> "true" for all the rows which matched the "like", I would exclude the
> >> localisation issues variant... unless = is not equals in all cases ;-)
>
> > Well spotted Csaba - that _would_ seem to point to the index.
>
> No, localization issues should be real high on your list.  In particular
> I wonder whether this is the old bugaboo of using a database encoding
> that's incompatible with the postmaster's locale setting.  We've seen
> that on some platforms strcoll() gets completely confused by this and
> returns comparison results that are not even self-consistent.
>
> Non-self-consistent comparison results can lead to an index that is
> either actually or effectively corrupt (because index searches proceed
> down the wrong tree path and thus fail to find items that should be
> found).  So the observation that only index searches fail is consistent
> with this idea.
>
>             regards, tom lane

Re: is this a bug or I am blind?

From

Tom Lane

Date:

15 December 2005, 11:46:23

Mage <mage@mage.hu> writes:
>> lc_collate                     | hu_HU
>> lc_ctype                       | hu_HU
>> server_encoding                | LATIN2

Hm, are those settings actually compatible?  You need to check your
system documentation to find out what encoding "hu_HU" expects.

            regards, tom lane

Re: is this a bug or I am blind?

From

Jaime Casanova

Date:

15 December 2005, 12:12:27

On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote:
> Ok, that explains then the problem... but the index is arguably corrupt
> in this case, with the important difference that it's not even fixable
> by reindex...
>
> I guess what the OP really wants is a solution to his problem.

MAGE was reproducing the problem in a little table that can be send
but now tolds me that the problem in the test table disappear when a
VACUUM was executed... is this consistent with the idea of locale
problem?

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: is this a bug or I am blind?

From

Tom Lane

Date:

15 December 2005, 12:23:20

Jaime Casanova <systemguards@gmail.com> writes:
> On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote:
>> Ok, that explains then the problem... but the index is arguably corrupt
>> in this case, with the important difference that it's not even fixable
>> by reindex...
>>
>> I guess what the OP really wants is a solution to his problem.

> MAGE was reproducing the problem in a little table that can be send
> but now tolds me that the problem in the test table disappear when a
> VACUUM was executed... is this consistent with the idea of locale
> problem?

The VACUUM might have caused the planner not to use the index anymore;
check EXPLAIN.

            regards, tom lane

Re: is this a bug or I am blind?

From

Mage

Date:

15 December 2005, 13:15:56

UPDATE
----------------

I was trying to create a demo table, because I cannot send our
confidental data. I have found weird result.

# drop table common_logins;
DROP TABLE

$ psql < ../cl.sql
SET
SET
SET
SET
SET
SET
CREATE TABLE
 setval
--------
 203650
(1 row)

ALTER TABLE
CREATE INDEX
CREATE INDEX


# select * from common_logins where username = 'potyty';
 uid | username | password | lastlogin | status | usertype | loginnum
-----+----------+----------+-----------+--------+----------+----------
(0 rows)

# VACUUM FULL analyze;
VACUUM

# select * from common_logins where username = 'potyty';
 uid | username | password | lastlogin | status | usertype | loginnum
-----+----------+----------+-----------+--------+----------+----------
(0 rows)


# select count(1) from common_logins;
 count
--------
 203361
(1 row)

# delete from common_logins where uid in (select uid from common_logins
where username not ilike 'potyty' limit 100000);
DELETE 100000

mage=# select * from common_logins where username = 'potyty';
 uid | username | password | lastlogin | status | usertype | loginnum
-----+----------+----------+-----------+--------+----------+----------
(0 rows)


# VACUUM FULL analyze;
VACUUM

# select * from common_logins where username = 'potyty';
  uid   | username | password |         lastlogin          | status |
usertype | loginnum
--------+----------+----------+----------------------------+--------+----------+----------
 155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      |
S        |        1
  60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      |
S        |        3
(2 rows)


# delete from common_logins where uid in (select uid from common_logins
where username not ilike 'potyty' limit 80000);
DELETE 80000

# VACUUM FULL analyze;
VACUUM

# select * from common_logins where username = 'potyty';
  uid   | username | password |         lastlogin          | status |
usertype | loginnum
--------+----------+----------+----------------------------+--------+----------+----------
 174041 | potyty   | board    | 2005-02-17 00:00:13.706144 | A      |
S        |        3
  60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      |
S        |        3
 155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      |
S        |        1
(3 rows)


The 2 rows part seems to be (for me) a non-locale-related, but serious
problem.
I have the data file, it is confidental, but I can send it to official
pg developers if needed.

          Mage



Tom Lane wrote:

>Jaime Casanova <systemguards@gmail.com> writes:
>
>
>>On 12/15/05, Csaba Nagy <nagy@ecircle-ag.com> wrote:
>>
>>
>>>Ok, that explains then the problem... but the index is arguably corrupt
>>>in this case, with the important difference that it's not even fixable
>>>by reindex...
>>>
>>>I guess what the OP really wants is a solution to his problem.
>>>
>>>
>
>
>
>>MAGE was reproducing the problem in a little table that can be send
>>but now tolds me that the problem in the test table disappear when a
>>VACUUM was executed... is this consistent with the idea of locale
>>problem?
>>
>>
>
>The VACUUM might have caused the planner not to use the index anymore;
>check EXPLAIN.
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>
>

Re: is this a bug or I am blind?

From

Greg Stark

Date:

15 December 2005, 17:55:32

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Non-self-consistent comparison results can lead to an index that is
> either actually or effectively corrupt (because index searches proceed
> down the wrong tree path and thus fail to find items that should be
> found).  So the observation that only index searches fail is consistent
> with this idea.

I wondered if there were some simple tests Postgres could do to notice the
problem and report it. Checking to make sure strcoll(a,b) = -strcoll(b,a) for
example.

But then I thought of another idea. What if Postgres just used strxfrm()
instead of strcoll everywhere? Then it ought to never produce inconsistent
results. At least if strxfrm() doesn't just return randomly varying results
for the same inputs. I suspect the worst case in practice is that strxfrm()
will return the same data for just about every input string, which would
hopefully be noticed by the user. But at least wouldn't cause corrupted
indexes.

--
greg

Re: is this a bug or I am blind?

From

Tom Lane

Date:

15 December 2005, 18:18:34

Greg Stark <gsstark@mit.edu> writes:
> But then I thought of another idea. What if Postgres just used strxfrm()
> instead of strcoll everywhere? Then it ought to never produce inconsistent
> results. At least if strxfrm() doesn't just return randomly varying results
> for the same inputs.

AFAICS the most that could accomplish is to make failures more obvious;
it wouldn't actually fix anything.  I don't think that's worth the
runtime penalty it would incur.

What we *ought* to be doing is trying to figure a way to detect and
disallow inconsistent locale/encoding combinations.  We've avoided that
because there seems no general platform-independent way to find out the
encoding expected by a locale.  But surely we could manage to make it
work at least on glibc and Windows, which would be a step ahead of doing
nothing.

            regards, tom lane

Re: is this a bug or I am blind?

From

Greg Stark

Date:

15 December 2005, 19:14:12

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > But then I thought of another idea. What if Postgres just used strxfrm()
> > instead of strcoll everywhere? Then it ought to never produce inconsistent
> > results. At least if strxfrm() doesn't just return randomly varying results
> > for the same inputs.
>
> AFAICS the most that could accomplish is to make failures more obvious;
> it wouldn't actually fix anything.  I don't think that's worth the
> runtime penalty it would incur.

What runtime penalty? It seems likely that strcoll is implemented by the
equivalent of calling strxfrm twice internally anyways. I don't see how that
could produce inconsistent results unless the internal interface has some
error handling codepath that returns an error that isn't being checked.

In fact calling strxfrm() has the potential to open up some optimizations.
Like caching the transformed string for the duration of a single sort
operation instead of repeatedly transforming it.

> What we *ought* to be doing is trying to figure a way to detect and
> disallow inconsistent locale/encoding combinations.  We've avoided that
> because there seems no general platform-independent way to find out the
> encoding expected by a locale.  But surely we could manage to make it
> work at least on glibc and Windows, which would be a step ahead of doing
> nothing.

Harumph. When I suggested having a strxfrm() function like the ones three
different people have independently developed and posted that would work
though slowly on all platforms, used only standard libc functions and perform
fine on at least glibc you complained it wasn't portable enough.

--
greg

Re: is this a bug or I am blind?

From

Tom Lane

Date:

15 December 2005, 19:55:51

Greg Stark <gsstark@mit.edu> writes:
> What runtime penalty? It seems likely that strcoll is implemented by the
> equivalent of calling strxfrm twice internally anyways.

Only by a very incompetent implementor.

            regards, tom lane

Re: is this a bug or I am blind?

From

Tom Lane

Date:

16 December 2005, 13:12:14

Mage <mage@mage.hu> writes:
> Tom, I can send you the data I dumped and you can try it out with same
> settings. I trust you.

Thanks.  After digging through it, I can exhibit the problem: in hu_HU
locale and LATIN2 encoding, on a Linux box (Fedora Core 4, but probably
all glibc installations will do this), I get

mage=# select 'potyty'::varchar = 'potty'::varchar;
 ?column?
----------
 f
(1 row)

mage=# select 'potyty'::varchar < 'potty'::varchar;
 ?column?
----------
 f
(1 row)

mage=# select 'potyty'::varchar > 'potty'::varchar;
 ?column?
----------
 f
(1 row)

Needless to say, this makes sorting and btree indexing very unhappy,
as they take the trichotomy law as an article of faith ;-)

I don't know anything about hu_HU comparison rules, but it appears that
strcoll() believes that these two strings should be considered equal.
Is that sane?  The immediate cause of the problem is that texteq() and
textne() have a "fast path" for unequal-length inputs:

    /* fast path for different-length inputs */
    if (VARSIZE(arg1) != VARSIZE(arg2))
        result = false;
    else
        result = (text_cmp(arg1, arg2) == 0);

(text_cmp is what invokes strcoll.)  Thus the = operator returns false,
while the other two go to strcoll() and then return false.

Perhaps the fast-path check is a bad idea, but fixing this is not just
a matter of removing that.  If we subscribe to strcoll's worldview then
we have to conclude that *text strings are not hashable*, because
strings that should be "equal" may have different hash codes.  And at
least in the current PG code, that's not something we can flip on and off
depending on the locale --- texteq would have to be marked non hashable
in the system catalogs, meaning a big performance hit for *everybody*
even if their locale is not this weird.

The other approach we could take is to define text comparison as
yielding equality only for bitwise-equal strings.  If strcoll() returns
zero then ask strcmp() for a second opinion.  This would mean that we'd
sort according to strcoll in the main, but strings that strcoll
considers equal but aren't physically identical would sort in codeset
order.  I can't see that this would do any harm in the context of
sorting rules, but the question of what equality means is something for
users to answer.  Do you think that these two queries ought to yield the
same rows in hu_HU locale, or not?
    select * from common_logins where username = 'potyty';
    select * from common_logins where username = 'potty';

            regards, tom lane

Re: is this a bug or I am blind?

From

Michael Fuhr

Date:

16 December 2005, 13:33:16

On Fri, Dec 16, 2005 at 12:12:08PM -0500, Tom Lane wrote:
> Thanks.  After digging through it, I can exhibit the problem: in hu_HU
> locale and LATIN2 encoding, on a Linux box (Fedora Core 4, but probably
> all glibc installations will do this), I get

I don't know if this is related or not, but did the following issue
from a year ago ever get resolved?  We were wondering then if the
Hungarian locale on some platforms might be causing problems.

http://archives.postgresql.org/pgsql-bugs/2004-12/msg00206.php
http://archives.postgresql.org/pgsql-bugs/2004-12/msg00228.php

--
Michael Fuhr

Re: is this a bug or I am blind?

From

Csaba Nagy

Date:

16 December 2005, 13:33:55

On Fri, 2005-12-16 at 18:12, Tom Lane wrote:
[snip]
> I don't know anything about hu_HU comparison rules, but it appears that
> strcoll() believes that these two strings should be considered equal.
> Is that sane?

It is sane in a way, as the "ty" combination is pronounced together as a
single consonant in hungarian, and "tty" is the 'strong' version of it.
The usual way of making a consonant strong in Hungarian is to double it,
which works well for simple consonants, but for "ty" is normally written
"tty". So "tyty" and "tty" could be arguably both taken as double "ty",
except that the official form is "tty"... but from a pronunciation point
of view they ARE equivalent in hungarian.

Cheers,
Csaba.

Re: is this a bug or I am blind?

From

Tom Lane

Date:

16 December 2005, 13:40:29

Michael Fuhr <mike@fuhr.org> writes:
> I don't know if this is related or not, but did the following issue
> from a year ago ever get resolved?  We were wondering then if the
> Hungarian locale on some platforms might be causing problems.

AFAIR we never did figure that one out.  I wasn't able to reproduce
it using RHEL, but the complainant was using SLES which might possibly
have had different Hungarian locale rules at the time.  I'm not sure
how much those have changed across different releases of glibc.

Now that we see the mechanism for the problem, it might well explain
various odd reports we've gotten from people using a number of different
locales.  I've tended to write these off as locale-vs-encoding pilot
error, but maybe they weren't all that.

            regards, tom lane

Re: is this a bug or I am blind?

From

Tom Lane

Date:

16 December 2005, 13:52:41

Csaba Nagy <nagy@ecircle-ag.com> writes:
> ... So "tyty" and "tty" could be arguably both taken as double "ty",
> except that the official form is "tty"... but from a pronunciation point
> of view they ARE equivalent in hungarian.

That's fair enough, but the question is should they be taken as
equivalent for string-comparison purposes?  (English has plenty of
cases where different letter combinations sound alike, but we don't
consider them equal because of that.  That may not be a good analogy
though.  Also, if there are cases in other locales where strcoll
considers non-identical strings equal, the reasoning for it might be
quite different.)

            regards, tom lane

Re: is this a bug or I am blind?

From

Martijn van Oosterhout

Date:

16 December 2005, 13:54:30

On Fri, Dec 16, 2005 at 12:12:08PM -0500, Tom Lane wrote:
> Perhaps the fast-path check is a bad idea, but fixing this is not just
> a matter of removing that.  If we subscribe to strcoll's worldview then
> we have to conclude that *text strings are not hashable*, because
> strings that should be "equal" may have different hash codes.  And at
> least in the current PG code, that's not something we can flip on and off
> depending on the locale --- texteq would have to be marked non hashable
> in the system catalogs, meaning a big performance hit for *everybody*
> even if their locale is not this weird.

That's true, in the sense that unconverted strings are not hashable.
This is what strxfrm was created for, to return the sorting key for a
string. A quick C program demonstrates that indeed in that locale these
two strings are equal, whereas in en_AU they are not.

$ LC_ALL=hu_HU ./strxfrm potyty potty
String  1: potyty
Strxfrm 1: " ((\x01\x02\x02\x02\x02\x01\x02\x02\x02\x02
String  2: potty
Strxfrm 2: " ((\x01\x02\x02\x02\x02\x01\x02\x02\x02\x02
$ LC_ALL=en_AU ./strxfrm potyty potty
String  1: potyty
Strxfrm 1: \x1B\x1A\x1F$\x1F$\x01\x02\x02\x02\x02\x02\x02\x01\x02\x02\x02\x02\x02\x02
String  2: potty
Strxfrm 2: \x1B\x1A\x1F\x1F$\x01\x02\x02\x02\x02\x02\x01\x02\x02\x02\x02\x02

I think the only way to make indexes properly locale sensetive would be
to either use strcoll() in all cases, or store the result from
strxfrm() in the index. Anything else will break somewhere.

In any case, we first need to determine which answer is correct, before
we run off trying to fix it.

This is Glibc 2.3.2 on a Debian Linux system.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: is this a bug or I am blind?

From

Csaba Nagy

Date:

16 December 2005, 14:03:52

On Fri, 2005-12-16 at 18:52, Tom Lane wrote:
> Csaba Nagy <nagy@ecircle-ag.com> writes:
> > ... So "tyty" and "tty" could be arguably both taken as double "ty",
> > except that the official form is "tty"... but from a pronunciation point
> > of view they ARE equivalent in hungarian.
>
> That's fair enough, but the question is should they be taken as
> equivalent for string-comparison purposes?  (English has plenty of
> cases where different letter combinations sound alike, but we don't
> consider them equal because of that.  That may not be a good analogy
> though.  Also, if there are cases in other locales where strcoll
> considers non-identical strings equal, the reasoning for it might be
> quite different.)

Well, I'm not an expert on this one. In any case, hungarian has
phonetical writing as opposed to the etymological writing English has.
So in hungarian there is a 1 to 1 mapping between the sounds and the
signs used to depict them... so pronunciation is somewhat more relevant
in sorting I guess. But I'm not a linguist so I won't know for sure.

Cheers,
Csaba.

Re: is this a bug or I am blind?

From

Tom Lane

Date:

16 December 2005, 14:07:02

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Dec 16, 2005 at 12:12:08PM -0500, Tom Lane wrote:
>> Perhaps the fast-path check is a bad idea, but fixing this is not just
>> a matter of removing that.  If we subscribe to strcoll's worldview then
>> we have to conclude that *text strings are not hashable*, because
>> strings that should be "equal" may have different hash codes.

> This is what strxfrm was created for, to return the sorting key for a
> string.

Ah.  So we could redefine hashtext() to return the hash of the strxfrm
value.  Slow, but a lot better than giving up hash join and hash
aggregation altogether...

> In any case, we first need to determine which answer is correct, before
> we run off trying to fix it.

Agreed.

            regards, tom lane

Re: is this a bug or I am blind?

From

Martijn van Oosterhout

Date:

16 December 2005, 14:09:42

On Fri, Dec 16, 2005 at 06:54:15PM +0100, Martijn van Oosterhout wrote:
> That's true, in the sense that unconverted strings are not hashable.
> This is what strxfrm was created for, to return the sorting key for a
> string. A quick C program demonstrates that indeed in that locale these
> two strings are equal, whereas in en_AU they are not.

FWIW, here's some links to Microsoft and MySQL dealing with the same
issue, so we're not alone here. Hungarian seems to be a complex
language to sort, but it seems that glibc is right in this case.

http://blogs.msdn.com/michkap/archive/2005/11/13/491646.aspx
http://bugs.mysql.com/bug.php?id=12519

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-109200.dat

Re: is this a bug or I am blind?

From

Martijn van Oosterhout

Date:

16 December 2005, 14:19:26

On Fri, Dec 16, 2005 at 01:06:58PM -0500, Tom Lane wrote:
> Ah.  So we could redefine hashtext() to return the hash of the strxfrm
> value.  Slow, but a lot better than giving up hash join and hash
> aggregation altogether...

Not to put too fine a point on it, but either you want locale-sensetive
sorting or you don't. If you do, you need to realise the cost
associated with it. Correctness above speed after all.

Which reminds me, I should probably finish that COLLATE patch. Then you
could choose between:

'putty' = 'putyty' COLLATE C       (false)
'putty' = 'putyty' COLLATE 'hu_HU' (true)

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-109202.dat

Re: is this a bug or I am blind?

From

"Russ Brown"

Date:

16 December 2005, 14:26:24

On Fri, 16 Dec 2005 11:59:48 -0600, Csaba Nagy <nagy@ecircle-ag.com> wrote:

> On Fri, 2005-12-16 at 18:52, Tom Lane wrote:
>> Csaba Nagy <nagy@ecircle-ag.com> writes:
>> > ... So "tyty" and "tty" could be arguably both taken as double "ty",
>> > except that the official form is "tty"... but from a pronunciation
>> point
>> > of view they ARE equivalent in hungarian.
>>
>> That's fair enough, but the question is should they be taken as
>> equivalent for string-comparison purposes?  (English has plenty of
>> cases where different letter combinations sound alike, but we don't
>> consider them equal because of that.  That may not be a good analogy
>> though.  Also, if there are cases in other locales where strcoll
>> considers non-identical strings equal, the reasoning for it might be
>> quite different.)
>
> Well, I'm not an expert on this one. In any case, hungarian has
> phonetical writing as opposed to the etymological writing English has.
> So in hungarian there is a 1 to 1 mapping between the sounds and the
> signs used to depict them... so pronunciation is somewhat more relevant
> in sorting I guess. But I'm not a linguist so I won't know for sure.
>

Trouble is, you can never guarantee that you're dealing with actual words.
What of you're comparing someone's password that happens to contain
combination of letters that act in this way?

> Cheers,
> Csaba.
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq



--

Russ

Re: is this a bug or I am blind?

From

Tom Lane

Date:

16 December 2005, 14:29:03

Martijn van Oosterhout <kleptog@svana.org> writes:
> FWIW, here's some links to Microsoft and MySQL dealing with the same
> issue, so we're not alone here. Hungarian seems to be a complex
> language to sort, but it seems that glibc is right in this case.

The mysql bug link has a fairly detailed description, but it dodges the
question that we need to answer here: do we want to make a finer-grain
distinction than glibc does?  In the test data that I got from Mage,
the first clue I got was from looking at the results of an ORDER BY
versus an index scan:

 potyos
 potyty
 potty
 potyty
 potyty
 potty
 potty6

 potyos
 potty
 potyty
 potyty
 potty
 potyty
 potty6

Actually, the relative order of the "potyty"s and "potty"s is completely
random at the moment.  You've got to admit that this looks weird: you'd
expect a database's ORDER BY output to impose at least a cosmetic
ordering on these strings.  Per what we've heard, it wouldn't matter
much to a Hungarian speaker whether the "potyty"s come before or after
the "potty"s, but it seems like it should be consistently one or the
other.

This argument doesn't really answer the question about whether
WHERE username = 'potyty' should match a stored 'potty', however.
My inclination is to say "no it shouldn't directly match --- apply a
normalization function to your data if you think that tyty should be
canonically spelled tty".  If we had per-column locales there would
be a stronger argument for allowing them to be equal, but right now
this folding would occur for all text in a database ... and surely
this would be considered a bug for any text that happened not to be
Hungarian words.  But perhaps my view is overly influenced by
performance considerations.

            regards, tom lane

Re: is this a bug or I am blind?

From

Tom Lane

Date:

16 December 2005, 14:40:24

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Dec 16, 2005 at 01:06:58PM -0500, Tom Lane wrote:
>> Ah.  So we could redefine hashtext() to return the hash of the strxfrm
>> value.  Slow, but a lot better than giving up hash join and hash
>> aggregation altogether...

> Not to put too fine a point on it, but either you want locale-sensetive
> sorting or you don't.

Nobody's said anything about giving up locale-sensitive sorting.  The
question is about locale-sensitive equality: does it really make sense
that 'tty' = 'tyty'?  Would your answer change in the context
'/dev/tty' = '/dev/tyty'?  Are you willing to *not have access* to a
text comparison operator that will make the distinction?

I'm inclined to think that this is more like the occasional need for
accent-insensitive comparisons.  It seems generally agreed that you want
something like smash('ab') = smash('áb') rather than making the
strings equal in all contexts.

Of course, not being a native speaker of any of the affected languages,
my opinion shouldn't be taken too seriously ...

            regards, tom lane

Re: is this a bug or I am blind?

From

Greg Stark

Date:

16 December 2005, 17:57:51

Tom Lane <tgl@sss.pgh.pa.us> writes:

> The other approach we could take is to define text comparison as
> yielding equality only for bitwise-equal strings.  If strcoll() returns
> zero then ask strcmp() for a second opinion.

Fwiw this has come up before on this list and it was discovered this is
effectively what Perl does, probably for similar motivations wrt to hashes.

I think it's probably the least bad solution, even if it's not really the
right thing.

--
greg

Re: is this a bug or I am blind?

From

Lincoln Yeoh

Date:

16 December 2005, 23:55:20

At 01:40 PM 12/16/2005 -0500, Tom Lane wrote:

>Nobody's said anything about giving up locale-sensitive sorting.  The
>question is about locale-sensitive equality: does it really make sense
>that 'tty' = 'tyty'?  Would your answer change in the context
>'/dev/tty' = '/dev/tyty'?  Are you willing to *not have access* to a
>text comparison operator that will make the distinction?
>
>I'm inclined to think that this is more like the occasional need for
>accent-insensitive comparisons.  It seems generally agreed that you want
>something like smash('ab') = smash('Ã¡b') rather than making the
>strings equal in all contexts.

I agree.

I would prefer for everything to be compared without any
collation/corruption by default, and for there to be a function to pick the
desired comparison behaviour ( Can all that functionality be done with the
collate clause?).

Because most databases are multi-locale whether the humans are aware of it
or not:

The Computer "locale", human locale #1, unknown/international locale, human
locale #2, ...

In a column for license keys, "tty" should rarely be the same as "tyty".
In a column for base64 data (crypto hashes, etc) "tty" should NEVER be the
same as "tyty".
In a column for domain names, I doubt it is clear whether you want to match
tty.ibm.hu just because tyty.ibm.hu exists.

But in a column for license owner names, one might want "tty" and "tyty" to
be the same - one might have to have a multicolumn index depending on the
owner's locale of choice.

I recommend that for these reasons initdb should always pick "no mangled"
text by default, no matter what the locale setting is. And that users
should be advised of the potential consequences of mangling or I would even
say corrupting all text in their databases by default.

Regards,
Link.

Re: is this a bug or I am blind?

From

Karsten Hilbert

Date:

17 December 2005, 04:57:51

> At 01:40 PM 12/16/2005 -0500, Tom Lane wrote:
> >Nobody's said anything about giving up locale-sensitive sorting.  The
> >question is about locale-sensitive equality: does it really make sense
> >that 'tty' = 'tyty'?  Would your answer change in the context
> >'/dev/tty' = '/dev/tyty'?  Are you willing to *not have access* to a
> >text comparison operator that will make the distinction?

On Sat, Dec 17, 2005 at 11:49:48AM +0800, Lincoln Yeoh wrote:
> I would prefer for everything to be compared without any
> collation/corruption by default, and for there to be a function to pick the
> desired comparison behaviour ( Can all that functionality be done with the
> collate clause?).

> In a column for license keys, "tty" should rarely be the same as "tyty".
> In a column for base64 data (crypto hashes, etc) "tty" should NEVER be the
> same as "tyty".
>
> In a column for domain names, I doubt it is clear whether you want to match
> tty.ibm.hu just because tyty.ibm.hu exists.
>
> But in a column for license owner names, one might want "tty" and "tyty" to
> be the same - one might have to have a multicolumn index depending on the
> owner's locale of choice.
>
> I recommend that for these reasons initdb should always pick "no mangled"
> text by default, no matter what the locale setting is.
Tom,

as a speaker of German I absolutely agree on the above. The
database shouldn't be second guessing on the user
intentions. If the user thinks she wants mangling of *all*
text in the database by default she is wrong in most cases.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: is this a bug or I am blind?

From

Lincoln Yeoh

Date:

17 December 2005, 09:56:23

At 11:49 AM 12/17/2005 +0800, Lincoln Yeoh wrote:
>But in a column for license owner names, one might want "tty" and "tyty"
>to be the same - one might have to have a multicolumn index depending on
>the owner's locale of choice.

To make myself clear, one might want to store a person's name in one column
and the locale it belongs to in another column.

I wondering whether it is possible to have something like this:

create table ppl_people (
id serial,
locale text,
name text
);

create index idx_ppl_locale_name on ppl_people
( locale,smash(locale,name))

Then one could do:

select * from ppl_people where
locale=$locale
and smash(locale,name) between smash($locale,$start) and smash($locale,$end)
order by smash(locale,name)

Not sure if that is correct, but hope you can understand me anyway.

Would something like this be possible using the standard SQL syntax and the
COLLATE feature?

Or am I thinking of doing things the wrong way, and there's a better or
more standard way?

Regards,
Link.

Re: is this a bug or I am blind?

From

Bruce Momjian

Date:

17 December 2005, 13:26:59

Where are we on this?  Given the original report:

    online=# select * from common_logins where username = 'potyty';
     uid | username | password | lastlogin | status | usertype | loginnum
    -----+----------+----------+-----------+--------+----------+----------
    (0 rows)

    online=# select * from common_logins where username like 'potyty';
      uid   | username | password |         lastlogin          | status |
    usertype | loginnum
    --------+----------+----------+----------------------------+--------+----------+----------
     155505 | potyty   | board    | 2004-08-16 17:45:55.723829 | A      |
    S        |        1
      60067 | potyty   | board    | 2004-07-07 20:22:17.68699  | A      |
    S        |        3
     174041 | potyty   | board    | 2005-02-17 00:00:13.706144 | A      |
    S        |        3
    (3 rows)

    online=# select username, username = 'potyty' from common_logins where
    username like 'potyty';
     username | ?column?
    ----------+----------
     potyty   | t
     potyty   | t
     potyty   | t
    (3 rows)

I don't think we can state that our current behavior is correct. I
realize we are being hit by the length comparison optimization, but
ultimiately the issue is that the Hungarian-specific locale considers
"tyty" and "tty" as the same string, which confuses our indexing
comparisons.

Is our fix going to be a Hungarian-specific one?

---------------------------------------------------------------------------

Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > On Fri, Dec 16, 2005 at 01:06:58PM -0500, Tom Lane wrote:
> >> Ah.  So we could redefine hashtext() to return the hash of the strxfrm
> >> value.  Slow, but a lot better than giving up hash join and hash
> >> aggregation altogether...
>
> > Not to put too fine a point on it, but either you want locale-sensetive
> > sorting or you don't.
>
> Nobody's said anything about giving up locale-sensitive sorting.  The
> question is about locale-sensitive equality: does it really make sense
> that 'tty' = 'tyty'?  Would your answer change in the context
> '/dev/tty' = '/dev/tyty'?  Are you willing to *not have access* to a
> text comparison operator that will make the distinction?
>
> I'm inclined to think that this is more like the occasional need for
> accent-insensitive comparisons.  It seems generally agreed that you want
> something like smash('ab') = smash('áb') rather than making the
> strings equal in all contexts.
>
> Of course, not being a native speaker of any of the affected languages,
> my opinion shouldn't be taken too seriously ...
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: is this a bug or I am blind?

From

Tom Lane

Date:

17 December 2005, 13:45:41

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Where are we on this?

Waiting to see if there's any input on what the behavior needs to be.

            regards, tom lane

Re: is this a bug or I am blind?

From

Peter Eisentraut

Date:

17 December 2005, 14:52:31

Tom Lane wrote:
> Perhaps the fast-path check is a bad idea, but fixing this is not
> just a matter of removing that.  If we subscribe to strcoll's
> worldview then we have to conclude that *text strings are not
> hashable*, because strings that should be "equal" may have different
> hash codes.

By the way, I have always been concerned about the feature of Unicode
that you can write logically equivalent strings using different
code-point sequences.  Namely, you often have the option of writing an
accented letter using the "legacy" single codepoint (like in ISO
8859-something) or alternatively using accept plus "base letter" as two
code points.  Collating systems should treat them the same, so hashing
the byte values won't work anyway.  This is a more extreme case of
"tyty" vs. "tty" because using a proper rendering system, those Unicode
strings should look the same to the naked eye.  Therefore, I'm doubtful
that using a binary comparison as tie-breaker is proper behavior.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: is this a bug or I am blind?

From

Tom Lane

Date:

17 December 2005, 15:08:11

Peter Eisentraut <peter_e@gmx.net> writes:
> By the way, I have always been concerned about the feature of Unicode
> that you can write logically equivalent strings using different
> code-point sequences.  Namely, you often have the option of writing an
> accented letter using the "legacy" single codepoint (like in ISO
> 8859-something) or alternatively using accept plus "base letter" as two
> code points.  Collating systems should treat them the same, so hashing
> the byte values won't work anyway.  This is a more extreme case of
> "tyty" vs. "tty" because using a proper rendering system, those Unicode
> strings should look the same to the naked eye.  Therefore, I'm doubtful
> that using a binary comparison as tie-breaker is proper behavior.

Hm.  Would you expect that these sequences generate identical strxfrm
output?

The weight of opinion later in the thread seems to be leaning towards
the idea that we do not want to accept the word of strcoll/strxfrm about
whether two strings are equal: there are too many scenarios where lax
equality behavior would be a serious bug, and too few where it's
critical to have it.  I'm still prepared to listen to argument though.

A possible compromise going forward would be to introduce an additional
comparison operator that tests for strcoll equality --- but I'd vote for
calling it something other than "=".

            regards, tom lane

Re: is this a bug or I am blind?

From

Mage

Date:

17 December 2005, 15:12:15

Tom Lane wrote:

>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>
>>Where are we on this?
>>
>>
>
>Waiting to see if there's any input on what the behavior needs to be.
>
>
>
Actually, "potyty" and "potty" are not equal in Hungarian language. We
use the "tyty" form in complex words. Only data sorting requires the
knowledge that they are neighbours.

I am very afraid of queries when a condition has different meanings in
the select and in the where clause. It is the worst. Even if the "tyty"
and "tty" were same, I couldn't accept the actual behavior of the
indexes, but they are not same.

       Mage

Re: is this a bug or I am blind?

From

Martijn van Oosterhout

Date:

17 December 2005, 17:30:21

On Sat, Dec 17, 2005 at 02:08:06PM -0500, Tom Lane wrote:
> The weight of opinion later in the thread seems to be leaning towards
> the idea that we do not want to accept the word of strcoll/strxfrm about
> whether two strings are equal: there are too many scenarios where lax
> equality behavior would be a serious bug, and too few where it's
> critical to have it.  I'm still prepared to listen to argument though.

Well, it seems to me that the problem is that there are several
differing ideas of equal, depending on what you're trying to do. The
real problem is that we are only allowed to choose one to be "=" and
we're not sure people will understand the issues.

> A possible compromise going forward would be to introduce an additional
> comparison operator that tests for strcoll equality --- but I'd vote for
> calling it something other than "=".

I think the real solution is to implement COLLATE support. Then we can
define all sorts of collations, like:

C/POSIX
C with case-insensetivity
hu_HU with binary tie-break
hu_HU with maximum munging
UTF-8 with accent insenstivity.

Then we can simply default everything to C/POSIX which would be a
straight binary match (priciple of least surprise) and if people want
anything else they need to specify the COLLATE they want either on the
column or in the query.

My patch was halfway there (the grammer was sorted, as were the SQL
propgation rules) but there was some work to go. I'll see if I can get
it to a working state.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-109295.dat

Re: is this a bug or I am blind?

From

Tom Lane

Date:

17 December 2005, 18:01:20

Martijn van Oosterhout <kleptog@svana.org> writes:
> I think the real solution is to implement COLLATE support.

Maybe so, but we still need to figure out what we're doing for the back
branches, and that won't be it ...

            regards, tom lane

Re: is this a bug or I am blind?

From

Greg Stark

Date:

17 December 2005, 21:38:21

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > By the way, I have always been concerned about the feature of Unicode
> > that you can write logically equivalent strings using different
> > code-point sequences.  Namely, you often have the option of writing an
> > accented letter using the "legacy" single codepoint (like in ISO
> > 8859-something) or alternatively using accept plus "base letter" as two
> > code points.  Collating systems should treat them the same, so hashing
> > the byte values won't work anyway.  This is a more extreme case of
> > "tyty" vs. "tty" because using a proper rendering system, those Unicode
> > strings should look the same to the naked eye.  Therefore, I'm doubtful
> > that using a binary comparison as tie-breaker is proper behavior.
>
> Hm.  Would you expect that these sequences generate identical strxfrm
> output?

I think this is mixing up two different things.

Using iso-8859-1 to encode "é" as a single byte versus using UTF8 which would
take two bytes to encode it is an issue of using two *different* encodings.
The actual string of characters being encoded is precisely the same string.
That is, while the sequence of bytes in the encoded string is different the
sequence of characters being encoded is precisely the same.

Postgres doesn't really face this issue currently since it only supports one
encoding at a time anyways. If Postgres supported multiple encodings and it
was necessary to compare two strings in two different encodings then they
would probably both have to be converted to a common encoding (presumably UTFx
for some value of x) before comparing.

There is a separate issue that some characters could theoretically have
multiple representations even within the same encoding. This doesn't really
happen in the usual non UTF encodings (like iso-8859-x) to my knowledge, but
it can happen in UTF8 or UTF16 because, for example, you could use the
variable length form that takes 2 bytes or even 4 bytes for characters that
are really just plain ascii characters.

However there are canonicalization rules that basically rule all but the
shortest representation invalid unicode strings. I assume these exist
precisely to make it easier to compare or hash unicode strings. I guess it's
an open question whether the database should signal an error on such invalid
strings or silently treat them as equivalent to a correct encoding of the same
string.

On the original issue I think the bottom line is that strings are sequences of
characters and two sequences of characters should only compare equal if they
contain the same characters in the same order.

The encodings can be different and still represent the same string, but they
do have to represent the same sequence of characters. If they represent two
different sequences of characters -- even if the two sequences have the same
significance in the language of the reader, they're still not actually the
same sequence of characters.

As long as both strings are encoded in the same encoding (whether that be
iso-8859-1 or utf8 or whatever) sorting by strcoll and then strcmp will
effectively give this set of semantics with one exception, the case of invalid
UTF encodings that are not canonicalized where it will silently treat them as
distinct strings from the correctly encoded string.

One day when it's possible for the two strings to be in two different
encodings then they will have to both be converted to an encoding that
includes the union of the two character sets covered by the two encodings.

--
greg

Re: is this a bug or I am blind?

From

Peter Eisentraut

Date:

17 December 2005, 22:14:23

Greg Stark wrote:
> Using iso-8859-1 to encode "é" as a single byte versus using UTF8
> which would take two bytes to encode it is an issue of using two
> *different* encodings.

But that's not what we are discussing.

> There is a separate issue that some characters could theoretically
> have multiple representations even within the same encoding.

That is what we are discussing.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: is this a bug or I am blind?

From

Greg Stark

Date:

17 December 2005, 22:33:18

Peter Eisentraut <peter_e@gmx.net> writes:

> Greg Stark wrote:
> > Using iso-8859-1 to encode "é" as a single byte versus using UTF8
> > which would take two bytes to encode it is an issue of using two
> > *different* encodings.
>
> But that's not what we are discussing.

The poster to which Tom was responding was bringing it up as an issue.
I was explaining how it was different.

> > There is a separate issue that some characters could theoretically
> > have multiple representations even within the same encoding.
>
> That is what we are discussing.

Well the original discussion about the hungarian strings was about yet a third
case entirely. Two different sequences of characters that have the same
semantic significance.

--
greg

Re: is this a bug or I am blind?

From

Karsten Hilbert

Date:

18 December 2005, 12:41:06

On Sat, Dec 17, 2005 at 10:30:04PM +0100, Martijn van Oosterhout wrote:

> I think the real solution is to implement COLLATE support. Then we can
> define all sorts of collations, like:
...
> My patch was halfway there (the grammer was sorted, as were the SQL
> propgation rules) but there was some work to go. I'll see if I can get
> it to a working state.
That'd be truly excellent.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: is this a bug or I am blind?

From

Csaba Nagy

Date:

19 December 2005, 06:45:39

[snip]
> Trouble is, you can never guarantee that you're dealing with actual words.
> What of you're comparing someone's password that happens to contain
> combination of letters that act in this way?

Well, in this case why would you care about how passwords are sorted ?
:-)

I think if the strings are linguistically meaningless, then it's also
more or less meaningless how you sort them. The best thing would be to
be able to specify the locale of each column, then this problem would go
away, as you could sort text differently then passwords...

Cheers,
Csaba.

Re: is this a bug or I am blind?

From

Martijn van Oosterhout

Date:

19 December 2005, 08:28:12

On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > I think the real solution is to implement COLLATE support.
>
> Maybe so, but we still need to figure out what we're doing for the back
> branches, and that won't be it ...

To be honest, there are really only a handful of locales that suffer
from this issue, so perhaps we should document it and move on. Perhaps
we should add some sanity checks to initdb to ensure that the locale is
sane. Basically things like:

FOO != foo   (case-sensetivity)
tty != tyty  (wierd hungarian case)
aei != äëï   (accent-sensetivity)

Until COLLATE support is working, I'm not sure if we should put that
much effort into dealing with these cases. Even after we have COLLATE
support we'll probably have to start using strxfrm to make these
locales usable for indexes.

Using a binary string compares for tie-breaks seems a waste of cycles
for the 99% of locales that don't need it.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-109344.dat

Re: is this a bug or I am blind?

From

Tom Lane

Date:

19 December 2005, 11:02:09

Csaba Nagy <nagy@ecircle-ag.com> writes:
>> Trouble is, you can never guarantee that you're dealing with actual words.
>> What of you're comparing someone's password that happens to contain
>> combination of letters that act in this way?

> Well, in this case why would you care about how passwords are sorted ?

You don't.  But you do care about what equality means, and the
discussion in this thread is really about changing that, not about
changing the behavior of sorting.  (The side-effects on sorting will
be so minor that it's unlikely anyone would notice, with either
solution.)

            regards, tom lane

Re: is this a bug or I am blind?

From

Tom Lane

Date:

19 December 2005, 11:08:13

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote:
>> Maybe so, but we still need to figure out what we're doing for the back
>> branches, and that won't be it ...

> To be honest, there are really only a handful of locales that suffer
> from this issue, so perhaps we should document it and move on.

"Let's not fix it" is really not an acceptable answer, because the
behavior in the affected locales is entirely broken (inconsistent,
etc).  And how do you know which locales are affected, anyway?

(Also, to be blunt, the COLLATE feature is pie in the sky until it
is done and supports indexes --- which I'm not sure you even know
how to do yet.  I'm not prepared to assume it will be in 8.2.)

            regards, tom lane

Re: is this a bug or I am blind?

From

Martijn van Oosterhout

Date:

19 December 2005, 15:37:42

On Mon, Dec 19, 2005 at 10:08:12AM -0500, Tom Lane wrote:
> "Let's not fix it" is really not an acceptable answer, because the
> behavior in the affected locales is entirely broken (inconsistent,
> etc).  And how do you know which locales are affected, anyway?

I don't know which locales are affected. It just can't be that
widespread because we're not getting similar reports for 99% of the
locales out there. I'm not saying not fixing it is an option, it's just
that a second compare on equality is going to be a waste in >99% of
cases.

> (Also, to be blunt, the COLLATE feature is pie in the sky until it
> is done and supports indexes --- which I'm not sure you even know
> how to do yet.  I'm not prepared to assume it will be in 8.2.)

Oh, getting indexes to use COLLATE is (relatively) the easy part. Where
I'm mostly stuck on is teaching the planner how to decide which index
returns the right order for what it's looking for.

For example, should I be returning a seperate path for each possible
collation order. Obviously not, but there will certainly be several
orders returned depending on indexes, maybe others depending on any
ORDER BY clauses in the query. Getting *that* right is harder and my
planner knowledge isn't that great.

Right now I'm updating the code for CVS tip. pg_indent has a great way
of making large patches cease to work :(. Some other things seem to
have changed also. Oh well... It's not a huge amount of work, it's just
lots of interconnected pieces.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-109369.dat

Re: is this a bug or I am blind?

From

Karsten Hilbert

Date:

20 December 2005, 12:02:34

On Mon, Dec 19, 2005 at 08:37:26PM +0100, Martijn van Oosterhout wrote:

> I don't know which locales are affected. It just can't be that
> widespread because we're not getting similar reports for 99% of the
> locales out there.
Not getting reports doesn't mean the problem is rare.
Perhaps people moved to another database. Perhaps people
decided to solve their problem in client code. Perhaps 50%
of all potential locales haven't been put to use with
PostgreSQL such that the problem showed up.

It's definitely worth doing something about. Had I the
skills would I myself help with it.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: is this a bug or I am blind?

From

Tom Lane

Date:

20 December 2005, 12:21:51

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> On Mon, Dec 19, 2005 at 08:37:26PM +0100, Martijn van Oosterhout wrote:
>> I don't know which locales are affected. It just can't be that
>> widespread because we're not getting similar reports for 99% of the
>> locales out there.

> Not getting reports doesn't mean the problem is rare.

I'm not sure that we can say we're not getting reports, either.  We've
seen *plenty* of reports of strange comparison misbehavior.  Up to now
I've always written them off as pilot error (ie, incompatible locale and
encoding selections) but now I suspect some of them were due to this
issue.

            regards, tom lane

Re: is this a bug or I am blind?

From

Mage

Date:

21 December 2005, 16:59:32

Martijn van Oosterhout wrote:

>On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote:
>
>
>>Martijn van Oosterhout <kleptog@svana.org> writes:
>>
>>
>>>I think the real solution is to implement COLLATE support.
>>>
>>>
>>Maybe so, but we still need to figure out what we're doing for the back
>>branches, and that won't be it ...
>>
>>
>
>To be honest, there are really only a handful of locales that suffer
>from this issue, so perhaps we should document it and move on.
>
I don't agree. Usually I read the whole documentation of the software I
use, but you cannot presume that every user even with good sql skills
will check the documentation for a thing he wouldn't imagine.

With knowing the background it is understandable locale problem, but in
the user's point of view it's a weird and serious bug which shouldn't be
there. Using hu_HU with latin2 is a normal marrying.

Some users (including me) don't always read the "known issues" chapter,
even for a good quality software.

       Mage

Re: is this a bug or I am blind?

From

Bruce Momjian

Date:

22 December 2005, 20:40:26

Tom has applied a patch to fix this and backpatched it to all relivant
branches.  He might be preparing a summary email about this.

---------------------------------------------------------------------------

Mage wrote:
> Martijn van Oosterhout wrote:
>
> >On Sat, Dec 17, 2005 at 05:01:15PM -0500, Tom Lane wrote:
> >
> >
> >>Martijn van Oosterhout <kleptog@svana.org> writes:
> >>
> >>
> >>>I think the real solution is to implement COLLATE support.
> >>>
> >>>
> >>Maybe so, but we still need to figure out what we're doing for the back
> >>branches, and that won't be it ...
> >>
> >>
> >
> >To be honest, there are really only a handful of locales that suffer
> >from this issue, so perhaps we should document it and move on.
> >
> I don't agree. Usually I read the whole documentation of the software I
> use, but you cannot presume that every user even with good sql skills
> will check the documentation for a thing he wouldn't imagine.
>
> With knowing the background it is understandable locale problem, but in
> the user's point of view it's a weird and serious bug which shouldn't be
> there. Using hu_HU with latin2 is a normal marrying.
>
> Some users (including me) don't always read the "known issues" chapter,
> even for a good quality software.
>
>        Mage
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073