Re: [SQL] Comparison semantics of CHAR data type - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [SQL] Comparison semantics of CHAR data type
Date
Msg-id 20131016181711.GC18048@momjian.us
Whole thread Raw
In response to Re: [SQL] Comparison semantics of CHAR data type  (Thomas Fanghaenel <tfanghaenel@salesforce.com>)
Responses Re: [SQL] Comparison semantics of CHAR data type
List pgsql-hackers
> > You can see the UTF8 case is fine because \n is considered greater
> > than space, but in the C locale, where \n is less than space, the
> > false return value shows the problem with
> > internal_bpchar_pattern_compare() trimming the string and first
> > comparing on lengths.  This is exactly the problem you outline, where
> > space trimming assumes everything is less than a space.
> 
> For collations other than C some of those issues that have to do with
> string comparisons might simply be hidden, depending on how strcoll()
> handles inputs off different lengths: If strcoll() applies implicit
> space padding to the shorter value, there won't be any visible
> difference in ordering between bpchar and varchar values.  If strcoll()
> does not apply such space padding, the right-trimming of bpchar values
> causes very similar issues even in a en_US collation.
> 
> For example, this seems to be the case on OS X:
> 
>         select 'ab '::char(10) collate "en_US" < E'ab\n'::char(10)
> collate "en_US";
>          ?column?
>         ----------
>          t
>         (1 row)
> 
>         select 'ab '::char(10) collate "C" < E'ab\n'::char(10) collate "C";
>          ?column?
>         ----------
>          t
>         (1 row)
> 
>         select 'ab '::varchar(10) collate "en_US" <
> E'ab\n'::varchar(10) collate "en_US";
>          ?column?
>         ----------
>          f
>         (1 row)

The above query returns true on Linux, so there certainly is a
platform-specific difference there.  The others are the same.

>         select 'ab '::varchar(10) collate "C" < E'ab\n'::varchar(10)
> collate "C";
>          ?column?
>         ----------
>          f
>         (1 row)
> 
> So here there's actually not only the same \n/space issue as in the C
> collation (which would go away if the bpchar value weren't trimmed).
> It also shows that there might be slight differences in behavior,
> depending which platform you're running on.
> 
> On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > What matters in general isn't where the characters fall when comparing
> > individual bytes, but how the strings containing them sort according
> > to the applicable collation.  That said, my recollection of the spec
> > is that when two CHAR(n) values are compared, the shorter should be
> > blank-padded before making the comparison.
> 
> Not necessarily.  The SQL Standard actually ties this to the collation
> sequence that is in use.  Without a lot of context, this is from
> Subclause 8.2, "<comparison predicate>", General Rule 3)b):
> 
>   b) If the length in characters of X is not equal to the length in
>      characters of Y, then the shorter string is effectively replaced,
>      for the purposes of comparison, with a copy of itself that has been
>      extended to the length of the longer string by concatenation on the
>      right of one or more pad characters, where the pad character is
>      chosen based on CS. If CS has the NO PAD characteristic, then the
>      pad character is an implementation-dependent character different
>      from any character in the character set of X and Y that collates
>      less than any string under CS. Otherwise, the pad character is a
>      <space>.
> 
> In my opinion, that's just a lot of handwaving, to the extent that in
> practice different vendors interpret this clause differently.  It seems
> that SQLServer and DB2 do PAD semantics across the board, whereas Oracle
> has uses NO PAD semantics whenever there's a VARCHAR type involved in
> the comparison.
> 
> But all that is actually a whole different can of worms, and slightly
> besides the point of my original question.  How to properly compare
> strings with different lentgths has been discussed before, see for
> instance the thread in [1].  My intention was not to get that started
> again.  As far as I can see, the consensus seems to be that when using
> the C locale, string comparisons should be done using NO PAD semantics.
> (It sure gives some strange semantics if you have varchars with trailing
> spaces, but it's perfectly legal.)
> 
> The point is that my testcase deals with strings of the same length.
> Thus, the above clause doesn't really apply.  The standard, to my
> understanding, says that fixed-length character values are padded when
> the row is constructed.  And once that happens, those spaces become part
> of the value.  It's invalid to strip them, unless done explicitly.

Yes, there are three types of comparisons that are important here:
1. 'a'::CHAR(3) < 'a'::CHAR(3)2. 'a '::CHAR(3) < E'a\n'::CHAR(3)3. 'a'::CHAR(3) < 'a'::CHAR(4)

You are saying it is only #3 where we can substitute the special
always-lower pad character, while it appears that Postgres does this in
cases #2 and #3.

> > Since we only have the CHAR(n) type to improve compliance with the SQL
> > specification, and we don't generally encourage its use, I think we
> > should fix any non-compliant behavior.  That seems to mean that if you
> > take two CHAR values and compare them, it should give the same result
> > as comparing the same two values as VARCHAR using the same collation
> > with the shorter value padded with spaces.
> >
> > So this is correct:
> >
> > test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
> >  ?column?
> > ----------
> >  t
> > (1 row)
> >
> > ... because it matches:
> >
> > test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
> >  ?column?
> > ----------
> >  t
> > (1 row)
> 
> Again, this touches on the same point as Bruce's example above.  Right
> now these two queries might produce identical results on Linux, because
> of the way strcoll() behaves.  On OS X you get different results:
> 
>         select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
>          ?column?
>         ----------
>          t
>         (1 row)
> 
>         select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3)
> collate "en_US";
>          ?column?
>         ----------
>          f
>         (1 row)

The above returns true on Linux.

> I have no idea who's right.  But doesn't this count as evidence that
> right-trimming pbchars is not even safe with the en_US collation?

Well, one problem in fixing this is that there are existing CHAR()
indexes that embed this behavior, and unless we want to break pg_upgrade
for CHAR() indexes, I am not sure what options we have except to
document this.

Let me also add that we don't propogate the CHAR() length through
any/most/all? function calls and operators, so the length comparison
wouldn't work in all cases anyway, e.g.
SELECT 'a'::CHAR(4) || 'b'::CHAR(4); ?column?---------- ab(1 row)

While space truncation is odd, it is predictable.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: removing old ports and architectures
Next
From: Kohei KaiGai
Date:
Subject: Re: Triggers on foreign tables