Re: [SQL] Comparison semantics of CHAR data type - Mailing list pgsql-hackers

From Thomas Fanghaenel
Subject Re: [SQL] Comparison semantics of CHAR data type
Date
Msg-id CAK+WP1x5NOXXoh36cEcT=JRZ=_a7=53cb66TfAzMM9DnzFqURw@mail.gmail.com
Whole thread Raw
In response to Re: [SQL] Comparison semantics of CHAR data type  (Kevin Grittner <kgrittn@ymail.com>)
Responses Re: [SQL] Comparison semantics of CHAR data type  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Fri, Oct 11, 2013 at 3:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
> You have some good questions here, though there are two interrelated
> things going on here.  First is collation, and the second is the
> trimming of spaces from char() comparisons.

Sorry, I should have probably mentioned more explicitly that I'm only
concerned about the latter.  I might get yelled at for saying this, but
I think the rather ubiquitous right-trimming of bpchar values is wrong,
and does - in combination with memcmp() for comparison - lead to some
non-standard behavior.

> Now, on to the trailing space issue using the default TEXT value for
> strings, first in UTF8:
>
>
>         test=> SHOW lc_collate;
>          lc_collate
>         -------------
>          en_US.UTF-8
>         (1 row)
>
>         test=> select E'ab\n' < E'ab ';
>          ?column?
>         ----------
>          f
>         (1 row)
>
> then in "C":
>
>         test2=> SHOW lc_collate;
>          lc_collate
>         ------------
>          C
>         (1 row)
>
>         test2=> select E'ab\n' < E'ab ';
>          ?column?
>         ----------
>          t
>         (1 row)
>
> This matches the \n/space issue we saw above.  Now, here is where
> CHAR() starts to show the unusual behavior you saw, first in UTF8:
>
>         test=> SHOW lc_collate;
>          lc_collate
>         -------------
>          en_US.UTF-8
>         (1 row)
>
>         test=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
>          ?column?
>         ----------
>          f
>         (1 row)
>
> then in C:
>
>         test2=> SHOW lc_collate;
>          lc_collate
>         ------------
>          C
>         (1 row)
>
>         test2=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
>          ?column?
>         ----------
>          f
>         (1 row)
>
> You can see the UTF8 case is fine because \n is considered greater
> than space, but in the C locale, where \n is less than space, the
> false return value shows the problem with
> internal_bpchar_pattern_compare() trimming the string and first
> comparing on lengths.  This is exactly the problem you outline, where
> space trimming assumes everything is less than a space.

For collations other than C some of those issues that have to do with
string comparisons might simply be hidden, depending on how strcoll()
handles inputs off different lengths: If strcoll() applies implicit
space padding to the shorter value, there won't be any visible
difference in ordering between bpchar and varchar values.  If strcoll()
does not apply such space padding, the right-trimming of bpchar values
causes very similar issues even in a en_US collation.

For example, this seems to be the case on OS X:
       select 'ab '::char(10) collate "en_US" < E'ab\n'::char(10)
collate "en_US";        ?column?       ----------        t       (1 row)
       select 'ab '::char(10) collate "C" < E'ab\n'::char(10) collate "C";        ?column?       ----------        t
  (1 row)
 
       select 'ab '::varchar(10) collate "en_US" <
E'ab\n'::varchar(10) collate "en_US";        ?column?       ----------        f       (1 row)
       select 'ab '::varchar(10) collate "C" < E'ab\n'::varchar(10)
collate "C";        ?column?       ----------        f       (1 row)

So here there's actually not only the same \n/space issue as in the C
collation (which would go away if the bpchar value weren't trimmed).
It also shows that there might be slight differences in behavior,
depending which platform you're running on.

On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> What matters in general isn't where the characters fall when comparing
> individual bytes, but how the strings containing them sort according
> to the applicable collation.  That said, my recollection of the spec
> is that when two CHAR(n) values are compared, the shorter should be
> blank-padded before making the comparison.

Not necessarily.  The SQL Standard actually ties this to the collation
sequence that is in use.  Without a lot of context, this is from
Subclause 8.2, "<comparison predicate>", General Rule 3)b):
 b) If the length in characters of X is not equal to the length in    characters of Y, then the shorter string is
effectivelyreplaced,    for the purposes of comparison, with a copy of itself that has been    extended to the length
ofthe longer string by concatenation on the    right of one or more pad characters, where the pad character is
chosenbased on CS. If CS has the NO PAD characteristic, then the    pad character is an implementation-dependent
characterdifferent    from any character in the character set of X and Y that collates    less than any string under
CS.Otherwise, the pad character is a    <space>.
 

In my opinion, that's just a lot of handwaving, to the extent that in
practice different vendors interpret this clause differently.  It seems
that SQLServer and DB2 do PAD semantics across the board, whereas Oracle
has uses NO PAD semantics whenever there's a VARCHAR type involved in
the comparison.

But all that is actually a whole different can of worms, and slightly
besides the point of my original question.  How to properly compare
strings with different lentgths has been discussed before, see for
instance the thread in [1].  My intention was not to get that started
again.  As far as I can see, the consensus seems to be that when using
the C locale, string comparisons should be done using NO PAD semantics.
(It sure gives some strange semantics if you have varchars with trailing
spaces, but it's perfectly legal.)

The point is that my testcase deals with strings of the same length.
Thus, the above clause doesn't really apply.  The standard, to my
understanding, says that fixed-length character values are padded when
the row is constructed.  And once that happens, those spaces become part
of the value.  It's invalid to strip them, unless done explicitly.

> Since we only have the CHAR(n) type to improve compliance with the SQL
> specification, and we don't generally encourage its use, I think we
> should fix any non-compliant behavior.  That seems to mean that if you
> take two CHAR values and compare them, it should give the same result
> as comparing the same two values as VARCHAR using the same collation
> with the shorter value padded with spaces.
>
> So this is correct:
>
> test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
>  ?column?
> ----------
>  t
> (1 row)
>
> ... because it matches:
>
> test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
>  ?column?
> ----------
>  t
> (1 row)

Again, this touches on the same point as Bruce's example above.  Right
now these two queries might produce identical results on Linux, because
of the way strcoll() behaves.  On OS X you get different results:
       select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";        ?column?       ----------
t      (1 row)
 
       select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3)
collate "en_US";        ?column?       ----------        f       (1 row)

I have no idea who's right.  But doesn't this count as evidence that
right-trimming pbchars is not even safe with the en_US collation?

-- Thomas


[1]
http://www.postgresql.org/message-id/flat/A737B7A37273E048B164557ADEF4A58B0579A7AB@ntex2010a.host.magwien.gv.at#A737B7A37273E048B164557ADEF4A58B0579A7AB@ntex2010a.host.magwien.gv.at

On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>> Thomas Fanghaenel wrote:
>
>>> I was wondering about the proper semantics of CHAR comparisons in some corner
>>> cases that involve control characters with values that are less than 0x20
>>> (space).
>
> What matters in general isn't where the characters fall when
> comparing individual bytes, but how the strings containing them
> sort according to the applicable collation.  That said, my
> recollection of the spec is that when two CHAR(n) values are
> compared, the shorter should be blank-padded before making the
> comparison.  *That* said, I think the general advice is to stay
> away from CHAR(n) in favor or VARCHAR(n) or TEXT, and I think that
> is good advice.
>
>> I am sorry for this long email, but I would be interested to see what
>> other hackers think about this issue.
>
> Since we only have the CHAR(n) type to improve compliance with the
> SQL specification, and we don't generally encourage its use, I
> think we should fix any non-compliant behavior.  That seems to mean
> that if you take two CHAR values and compare them, it should give
> the same result as comparing the same two values as VARCHAR using
> the same collation with the shorter value padded with spaces.
>
> So this is correct:
>
> test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
>  ?column?
> ----------
>  t
> (1 row)
>
> ... because it matches:
>
> test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
>  ?column?
> ----------
>  t
> (1 row)
>
> But this is incorrect:
>
> test=# select 'ab'::char(3) collate "C" < E'ab\n'::char(3) collate "C";
>  ?column?
> ----------
>  t
> (1 row)
>
> ... because it doesn't match:
>
> test=# select 'ab '::varchar(3) collate "C" < E'ab\n'::varchar(3) collate "C";
>  ?column?
> ----------
>  f
> (1 row)
>
> Of course, I have no skin in the game, because it took me about two
> weeks after my first time converting a database with CHAR columns
> to PostgreSQL to change them all to VARCHAR, and do that as part of
> all future conversions.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: buildfarm failures on smew and anole
Next
From: Andres Freund
Date:
Subject: Re: Patch for reserved connections for replication users