Thread: String comparison and the SQL standard

String comparison and the SQL standard

From
Albe Laurenz
Date:
While researching a problem with a different database system,
I came across the following in the SQL standard ISO/IEC 9075-2:2003,
Section 8.2 (<comparison predicate>), General Rules:

3) The comparison of two character strings is determined as follows:
   a) Let CS be the collation as determined by Subclause 9.13,
      "Collation determination", for the declared types of the
      two character strings.
   b) If the length in characters of X is not equal to the length
      in characters of Y, then the shorter string is effectively
      replaced, for the purposes of comparison, with a copy of itself
      that has been extended to the length of the longer string by
      concatenation on the right of one or more pad characters,
      where the pad character is chosen based on CS.
      If CS has the NO PAD characteristic, then the pad character is
      an implementation-dependent character different from
      any character in the character set of X and Y that collates
      less than any string under CS.
      Otherwise, the pad character is a <space>.

That would effectively mean that 'a'='a   ' is TRUE for
all character string types.

Of the DBMS I tested, Microsoft SQL Server and MySQL gave me
that very result, while PostgreSQL and Oracle gave me FALSE.

Does anybody know if we deviate from the standard on purpose
in this case?  I searched the archives, but all I could find was
http://www.postgresql.org/message-id/20051019154026.X995@ganymede.hub.org

Yours,
Laurenz Albe


Re: String comparison and the SQL standard

From
Albe Laurenz
Date:
I wrote:
> While researching a problem with a different database system,
> I came across the following in the SQL standard ISO/IEC 9075-2:2003,
> Section 8.2 (<comparison predicate>), General Rules:
>
> 3) [...]
>
> That would effectively mean that 'a'='a   ' is TRUE for
> all character string types.

Searching further, I found ISO/IEC 9075-2:2003,
Section 9.8 (Determination of identical values),
General Rules:

2) Case:
   a) If V1 and V2 are both null, then V1 is identical to V2.
   b) If V1 is null and V2 is not null, or if V1 is not null and
      V2 is null, then V1 is not identical to V2.
   c) If V1 and V2 are of comparable predefined types, then
      Case:
      i) If V1 and V2 are character strings, then let L be
         CHARACTER_LENGTH(V1).
         Case:
         1) If CHARACTER_LENGTH(V2) equals L, and if for all i,
            1 (one) ≤ i ≤ L, the i-th character of V1 corresponds
            to the same character position of ISO/IEC 10646 as
            the i-th character of V2, then V1 is identical to V2.
         2) Otherwise, V1 is not identical to V2.

That seems slightly contradictory to the above; I can only resolve
this to mean that 'a' and 'a   ' are not identical in SQL, but
the equality comparison operator should still treat them as equal.

Does anybody have deeper insight into this?

Yours,
Laurenz Albe


Re: String comparison and the SQL standard

From
Tom Lane
Date:
Albe Laurenz <laurenz.albe@wien.gv.at> writes:
> While researching a problem with a different database system,
> I came across the following in the SQL standard ISO/IEC 9075-2:2003,
> Section 8.2 (<comparison predicate>), General Rules:

> 3) The comparison of two character strings is determined as follows:
>    a) Let CS be the collation as determined by Subclause 9.13,
>       "Collation determination", for the declared types of the
>       two character strings.
>    b) If the length in characters of X is not equal to the length
>       in characters of Y, then the shorter string is effectively
>       replaced, for the purposes of comparison, with a copy of itself
>       that has been extended to the length of the longer string by
>       concatenation on the right of one or more pad characters,
>       where the pad character is chosen based on CS.
>       If CS has the NO PAD characteristic, then the pad character is
>       an implementation-dependent character different from
>       any character in the character set of X and Y that collates
>       less than any string under CS.
>       Otherwise, the pad character is a <space>.

The PAD case is specifying the way that CHAR(n) comparison should work.
(We don't expose the PAD/NO PAD distinction in any other way than
CHAR vs VARCHAR/TEXT types.)

AFAICS, the NO PAD case is ignorable BS: they are basically specifying
implementation not semantics there, and in a way that is totally
brain-dead.  There isn't necessarily any such character as the one they
blithely posit.  Moreover, the whole description seems to assume that
string comparison is single-pass left-to-right, which has little to do
with any modern collation specification.  We just rely on strcmp to
decide that shorter strings are "less" than longer ones, which is the
point of this spec AFAICT.

Note that we don't actually do CHAR(n) comparison like that either,
but instead choose to strip trailing spaces before the comparison.

In any case, the most significant word in that whole paragraph is
"effectively", which means you can do it however you want as long
as you get an equivalent comparison result.

> That would effectively mean that 'a'='a   ' is TRUE for
> all character string types.

In the PAD case, yes.  Else no.

> Of the DBMS I tested, Microsoft SQL Server and MySQL gave me
> that very result, while PostgreSQL and Oracle gave me FALSE.

This probably has more to do with what these systems think the
data type of an undecorated literal is, than with whether they do
trailing-space-insensitive comparison all the time.

            regards, tom lane


Re: String comparison and the SQL standard

From
"Kevin Grittner"
Date:
Tom Lane wrote:

> This probably has more to do with what these systems think the
> data type of an undecorated literal is, than with whether they do
> trailing-space-insensitive comparison all the time.

I suspect so. Keep in mind that PostgreSQL does not comply with the
standard in this regard, but this has been discussed before and it
is a concious decision to deviate.  See section section 5.3
<literal> syntax 17:

| The declared type of a <character string literal> is fixed-length
| character string.

... so according to the standard 'a' is CHAR(1) and 'a   ' is
CHAR(4), and they should compare as equal. To make literals for
user-defined types more convenient to write, PostgreSQL treats
those literals as type UNKOWN until forced to resolve them to a
type, and in the absence of any other clues uses text instead of
char(n). The reason for that is that the community finds the
semantics of char(n) strange and generally encourages use of text
or varchar(n) for character-based types.

I'm not sure that this is as clearly and prominently documented as
it should be.

-Kevin


Re: String comparison and the SQL standard

From
Albe Laurenz
Date:
Tom Lane wrote:
>>    b) If the length in characters of X is not equal to the length
>>       in characters of Y, then the shorter string is effectively
>>       replaced, for the purposes of comparison, with a copy of itself
>>       that has been extended to the length of the longer string by
>>       concatenation on the right of one or more pad characters,
>>       where the pad character is chosen based on CS.
>>       If CS has the NO PAD characteristic, then the pad character is
>>       an implementation-dependent character different from
>>       any character in the character set of X and Y that collates
>>       less than any string under CS.
>>       Otherwise, the pad character is a <space>.
>
> The PAD case is specifying the way that CHAR(n) comparison should work.
> (We don't expose the PAD/NO PAD distinction in any other way than
> CHAR vs VARCHAR/TEXT types.)
>
> AFAICS, the NO PAD case is ignorable BS: [...]

> In any case, the most significant word in that whole paragraph is
> "effectively", which means you can do it however you want as long
> as you get an equivalent comparison result.
>
>> That would effectively mean that 'a'='a   ' is TRUE for
>> all character string types.
>
> In the PAD case, yes.  Else no.

Thanks for the clarification.

>> Of the DBMS I tested, Microsoft SQL Server and MySQL gave me
>> that very result, while PostgreSQL and Oracle gave me FALSE.
>
> This probably has more to do with what these systems think the
> data type of an undecorated literal is, than with whether they do
> trailing-space-insensitive comparison all the time.

I tested not only with string literals, but also comparing
table columns of the respective types.

I came up with the following table of semantics used for
comparisons:

           | CHAR(n)=CHAR(n) | VARCHAR(n)=VARCHAR(n) | CHAR(n)=VARCHAR(n) |
-----------+-----------------+-----------------------+--------------------+
Oracle     |    PAD SPACE    |        NO PAD         |      NO PAD        |
-----------+-----------------+-----------------------+--------------------+
PostgreSQL |    PAD SPACE    |        NO PAD         |     PAD SPACE      |
-----------+-----------------+-----------------------+--------------------+
MySQL      |    PAD SPACE    |       PAD SPACE       |     PAD SPACE      |
-----------+-----------------+-----------------------+--------------------+
SQL Server |    PAD SPACE    |       PAD SPACE       |     PAD SPACE      |
-----------+-----------------+-----------------------+--------------------+

Yours,
Laurenz Albe


Re: String comparison and the SQL standard

From
Tom Lane
Date:
Albe Laurenz <laurenz.albe@wien.gv.at> writes:
> I tested not only with string literals, but also comparing
> table columns of the respective types.

> I came up with the following table of semantics used for
> comparisons:

>            | CHAR(n)=CHAR(n) | VARCHAR(n)=VARCHAR(n) | CHAR(n)=VARCHAR(n) |
> -----------+-----------------+-----------------------+--------------------+
> Oracle     |    PAD SPACE    |        NO PAD         |      NO PAD        |
> -----------+-----------------+-----------------------+--------------------+
> PostgreSQL |    PAD SPACE    |        NO PAD         |     PAD SPACE      |
> -----------+-----------------+-----------------------+--------------------+
> MySQL      |    PAD SPACE    |       PAD SPACE       |     PAD SPACE      |
> -----------+-----------------+-----------------------+--------------------+
> SQL Server |    PAD SPACE    |       PAD SPACE       |     PAD SPACE      |
> -----------+-----------------+-----------------------+--------------------+

Interesting.  Did you determine which type is assigned to an
unmarked literal string by each system?

BTW, the last entry for PG surprised me a bit, because I would've
expected the varchar semantics to "win".  Some experimentation shows
that you're correct about comparing char and varchar: we coerce the
varchar to char and use bpchareq, which ignores trailing blanks.
But if you compare char and text, we coerce the char to text (stripping
any trailing blanks as we do so) and then apply texteq.  So in that
scenario, trailing blanks in the char datum are ignored, but trailing
blanks in the text datum are not, which is the behavior I was
remembering.  It's a bit surprising that the two cases are resolved
differently --- I think that's coming out of the "most exact matches"
rule for ambiguous-operator resolution, because the available operators
are declared char = char and text = text.

            regards, tom lane


Re: String comparison and the SQL standard

From
Albe Laurenz
Date:
Tom Lane wrote:
>> I tested not only with string literals, but also comparing
>> table columns of the respective types.
>
>> I came up with the following table of semantics used for
>> comparisons:
>
>>            | CHAR(n)=CHAR(n) | VARCHAR(n)=VARCHAR(n) | CHAR(n)=VARCHAR(n) |
>> -----------+-----------------+-----------------------+--------------------+
>> Oracle     |    PAD SPACE    |        NO PAD         |      NO PAD        |
>> -----------+-----------------+-----------------------+--------------------+
>> PostgreSQL |    PAD SPACE    |        NO PAD         |     PAD SPACE      |
>> -----------+-----------------+-----------------------+--------------------+
>> MySQL      |    PAD SPACE    |       PAD SPACE       |     PAD SPACE      |
>> -----------+-----------------+-----------------------+--------------------+
>> SQL Server |    PAD SPACE    |       PAD SPACE       |     PAD SPACE      |
>> -----------+-----------------+-----------------------+--------------------+
>
> Interesting.  Did you determine which type is assigned to an
> unmarked literal string by each system?

In Oracle it is treated like a CHAR(n):
http://docs.oracle.com/cd/E11882_01/server.112/e26088/sql_elements002.htm#SQLRF51039

"Oracle uses blank-padded comparison semantics only when
 both values in the comparison are either expressions of
 data type CHAR, NCHAR, text literals, or values returned
 by the USER function."

I don't know about MySQL and SQL Server, but since they pad
strings with space for comparison in all cases, it probably
does not make a difference.

Yours,
Laurenz Albe