Re: is this a bug or I am blind? - Mailing list pgsql-general

Mage <mage@mage.hu> writes:
> Tom, I can send you the data I dumped and you can try it out with same
> settings. I trust you.

Thanks.  After digging through it, I can exhibit the problem: in hu_HU
locale and LATIN2 encoding, on a Linux box (Fedora Core 4, but probably
all glibc installations will do this), I get

mage=# select 'potyty'::varchar = 'potty'::varchar;
 ?column?
----------
 f
(1 row)

mage=# select 'potyty'::varchar < 'potty'::varchar;
 ?column?
----------
 f
(1 row)

mage=# select 'potyty'::varchar > 'potty'::varchar;
 ?column?
----------
 f
(1 row)

Needless to say, this makes sorting and btree indexing very unhappy,
as they take the trichotomy law as an article of faith ;-)

I don't know anything about hu_HU comparison rules, but it appears that
strcoll() believes that these two strings should be considered equal.
Is that sane?  The immediate cause of the problem is that texteq() and
textne() have a "fast path" for unequal-length inputs:

    /* fast path for different-length inputs */
    if (VARSIZE(arg1) != VARSIZE(arg2))
        result = false;
    else
        result = (text_cmp(arg1, arg2) == 0);

(text_cmp is what invokes strcoll.)  Thus the = operator returns false,
while the other two go to strcoll() and then return false.

Perhaps the fast-path check is a bad idea, but fixing this is not just
a matter of removing that.  If we subscribe to strcoll's worldview then
we have to conclude that *text strings are not hashable*, because
strings that should be "equal" may have different hash codes.  And at
least in the current PG code, that's not something we can flip on and off
depending on the locale --- texteq would have to be marked non hashable
in the system catalogs, meaning a big performance hit for *everybody*
even if their locale is not this weird.

The other approach we could take is to define text comparison as
yielding equality only for bitwise-equal strings.  If strcoll() returns
zero then ask strcmp() for a second opinion.  This would mean that we'd
sort according to strcoll in the main, but strings that strcoll
considers equal but aren't physically identical would sort in codeset
order.  I can't see that this would do any harm in the context of
sorting rules, but the question of what equality means is something for
users to answer.  Do you think that these two queries ought to yield the
same rows in hu_HU locale, or not?
    select * from common_logins where username = 'potyty';
    select * from common_logins where username = 'potty';


            regards, tom lane

pgsql-general by date:

Previous
From: "Andrew B. Young"
Date:
Subject: How to store the time zone with a timestamp
Next
From: Michael Fuhr
Date:
Subject: Re: is this a bug or I am blind?