Thread: how is text-equality handled in postgresql?

how is text-equality handled in postgresql?

From

Gábor Farkas

Date:

15 January 2014, 09:10:32

hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

thanks,
gabor

Re: how is text-equality handled in postgresql?

From

Ivan Voras

Date:

15 January 2014, 10:40:02

On 15/01/2014 10:10, Gábor Farkas wrote:
> hi,
>
> when i create an unique-constraint on a varchar field, how exactly
> does postgresql compare the texts? i'm asking because in UNICODE there
> are a lot of complexities about this..
>
> or in other words, when are two varchars equal in postgres? when their
> bytes are? or some algorithm is applied?

By default, it is "whatever the operating system thinks it's right".
PostgreSQL doesn't have its own collation code, it uses the OS's locale
support for this.

(which breaks on certain systems which don't have complete UTF-8 support
- I'm in favour of importing ICU at least as an optional dependancy,
similar to what the FreeBSD's patch does:
http://people.freebsd.org/~girgen/postgresql-icu/).

Attachment

signature.asc

Re: how is text-equality handled in postgresql?

From

Amit Langote

Date:

15 January 2014, 11:37:23

On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 15/01/2014 10:10, Gábor Farkas wrote:
>> hi,
>>
>> when i create an unique-constraint on a varchar field, how exactly
>> does postgresql compare the texts? i'm asking because in UNICODE there
>> are a lot of complexities about this..
>>
>> or in other words, when are two varchars equal in postgres? when their
>> bytes are? or some algorithm is applied?
>
> By default, it is "whatever the operating system thinks it's right".
> PostgreSQL doesn't have its own collation code, it uses the OS's locale
> support for this.
>

Just to add to this, whenever strcoll() (a locale aware comparator)
says two strings are equal, postgres re-compares them using strcmp().
See following code snippet off
src/backend/utils/adt/varlena.c:varstr_cmp() -

#ifdef HAVE_LOCALE_T
                if (mylocale)
                        result = strcoll_l(a1p, a2p, mylocale);
                else
#endif
                        result = strcoll(a1p, a2p);

                /*
                 * In some locales strcoll() can claim that
nonidentical strings are
                 * equal.  Believing that would be bad news for a
number of reasons,
                 * so we follow Perl's lead and sort "equal" strings
according to
                 * strcmp().
                 */
                if (result == 0)
                        result = strcmp(a1p, a2p);

--
Amit Langote

Re: how is text-equality handled in postgresql?

From

Ivan Voras

Date:

15 January 2014, 12:02:21

On 15/01/2014 12:36, Amit Langote wrote:
> On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 15/01/2014 10:10, Gábor Farkas wrote:
>>> hi,
>>>
>>> when i create an unique-constraint on a varchar field, how exactly
>>> does postgresql compare the texts? i'm asking because in UNICODE there
>>> are a lot of complexities about this..
>>>
>>> or in other words, when are two varchars equal in postgres? when their
>>> bytes are? or some algorithm is applied?
>>
>> By default, it is "whatever the operating system thinks it's right".
>> PostgreSQL doesn't have its own collation code, it uses the OS's locale
>> support for this.
>>
>
> Just to add to this, whenever strcoll() (a locale aware comparator)
> says two strings are equal, postgres re-compares them using strcmp().
> See following code snippet off
> src/backend/utils/adt/varlena.c:varstr_cmp() -

>                 /*
>                  * In some locales strcoll() can claim that
> nonidentical strings are
>                  * equal.  Believing that would be bad news for a
> number of reasons,
>                  * so we follow Perl's lead and sort "equal" strings
> according to
>                  * strcmp().
>                  */
>                 if (result == 0)
>                         result = strcmp(a1p, a2p);

That seems odd and inefficient. Why would it be necessary? I would think
indexing (and other collation-sensitive operations) don't care what the
actual collation result is for arbitrary blobs of strings, as long as
they are stable?

Attachment

signature.asc

Re: how is text-equality handled in postgresql?

From

Amit Langote

Date:

15 January 2014, 12:29:29

On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 15/01/2014 12:36, Amit Langote wrote:
>> On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>>> On 15/01/2014 10:10, Gábor Farkas wrote:
>>>> hi,
>>>>
>>>> when i create an unique-constraint on a varchar field, how exactly
>>>> does postgresql compare the texts? i'm asking because in UNICODE there
>>>> are a lot of complexities about this..
>>>>
>>>> or in other words, when are two varchars equal in postgres? when their
>>>> bytes are? or some algorithm is applied?
>>>
>>> By default, it is "whatever the operating system thinks it's right".
>>> PostgreSQL doesn't have its own collation code, it uses the OS's locale
>>> support for this.
>>>
>>
>> Just to add to this, whenever strcoll() (a locale aware comparator)
>> says two strings are equal, postgres re-compares them using strcmp().
>> See following code snippet off
>> src/backend/utils/adt/varlena.c:varstr_cmp() -
>
>>                 /*
>>                  * In some locales strcoll() can claim that
>> nonidentical strings are
>>                  * equal.  Believing that would be bad news for a
>> number of reasons,
>>                  * so we follow Perl's lead and sort "equal" strings
>> according to
>>                  * strcmp().
>>                  */
>>                 if (result == 0)
>>                         result = strcmp(a1p, a2p);
>
> That seems odd and inefficient. Why would it be necessary? I would think
> indexing (and other collation-sensitive operations) don't care what the
> actual collation result is for arbitrary blobs of strings, as long as
> they are stable?
>

This is the behavior since quite some time introduced by this commit

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

--
Amit Langote

Re: how is text-equality handled in postgresql?

From

Ivan Voras

Date:

15 January 2014, 13:01:17

On 15/01/2014 13:29, Amit Langote wrote:
> On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 15/01/2014 12:36, Amit Langote wrote:

>>>                  * In some locales strcoll() can claim that
>>> nonidentical strings are
>>>                  * equal.  Believing that would be bad news for a
>>> number of reasons,
>>>                  * so we follow Perl's lead and sort "equal" strings
>>> according to
>>>                  * strcmp().
>>>                  */
>>>                 if (result == 0)
>>>                         result = strcmp(a1p, a2p);
>>
>> That seems odd and inefficient. Why would it be necessary? I would think
>> indexing (and other collation-sensitive operations) don't care what the
>> actual collation result is for arbitrary blobs of strings, as long as
>> they are stable?
>>
>
> This is the behavior since quite some time introduced by this commit
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

Ok, the commit comment is:

"Adjust string comparison so that only bitwise-equal strings are considered
equal: if strcoll claims two strings are equal, check it with strcmp, and
sort according to strcmp if not identical.  This fixes inconsistent
behavior under glibc's hu_HU locale, and probably under some other locales
as well.  Also, take advantage of the now-well-defined behavior to speed up
texteq, textne, bpchareq, bpcharne: they may as well just do a bitwise
comparison and not bother with strcoll at all."

... so it's just another workaround for OS specific locale issues - to
me it looks like just another reason to use ICU.

Attachment

signature.asc

Re: how is text-equality handled in postgresql?

From

Tom Lane

Date:

15 January 2014, 14:43:52

Ivan Voras <ivoras@freebsd.org> writes:
> On 15/01/2014 12:36, Amit Langote wrote:
>> Just to add to this, whenever strcoll() (a locale aware comparator)
>> says two strings are equal, postgres re-compares them using strcmp().

> That seems odd and inefficient. Why would it be necessary? I would think
> indexing (and other collation-sensitive operations) don't care what the
> actual collation result is for arbitrary blobs of strings, as long as
> they are stable?

If we didn't do it like this, we could not use hashing techniques for
text --- at least not unless we could find a hash function guaranteed
to yield the same values for any two strings that strcoll() claims are
equal.

            regards, tom lane

Re: how is text-equality handled in postgresql?

From

Vick Khera

Date:

15 January 2014, 16:41:25

On Wed, Jan 15, 2014 at 4:10 AM, Gábor Farkas <gabor.farkas@gmail.com> wrote:

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

On this topic, when I write my strings to the DB and search from the DB, should I canonicalize them first as NKFC (or some other), or just let the DB figure it out? In my specific case I use perl DBI with place holders to submit my queries.