Re: [SQL] Comparison semantics of CHAR data type - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [SQL] Comparison semantics of CHAR data type
Date
Msg-id 20131011194437.GA3614@momjian.us
Whole thread Raw
Responses Re: [SQL] Comparison semantics of CHAR data type
List pgsql-hackers
On Sun, Sep 22, 2013 at 08:51:26PM -0400, Thomas Fanghaenel wrote:
> I was wondering about the proper semantics of CHAR comparisons in some corner
> cases that involve control characters with values that are less than 0x20
> (space).
> 
> Consider the following testcase:
> 
> ===
> create table t (a int, b char(10));
> 
> insert into t values (1, 'foo');
> insert into t values (2, 'foo ');
> insert into t values (3, E'foo\t');
> insert into t values (4, E'foo\n');
> insert into t values (5, E'foo \n');
> insert into t values (6, 'foobar');
> 
> select * from t order by b;
> ===
> 
> What's the proper order of these string values in the CHAR domain?  The way I
> interpret the SQL Standard (and assuming that \t and \n collate lower than a
> space), it's supposed to be this:
> 
>   (3) < (4) < (5) < (1) <= (2) < (6)
> 
> Postgres comes up with this:
> 
>   (1) <= (2) < (3) < (4) < (5) < (6)
> 
> The reason is that the bpchar functions that implement the relative comparison
> operators for CHAR(n) effectively strip trailing whitespaces before doing the
> comparison.  One might argue that doing this is not correct.  The standard
> seems to mandate that all CHAR(n) values are actually considered to be of width
> n, and that trailing spaces are indeed relevant for comparison.  In other
> words, stripping them would only be possible if it can be guaranteed that there
> are no characters in the character set that collate lower than a space.
> 
> Any thoughts on this?  I searched the mailing list archives, but couldn't find
> any relevant discussion.  There were plenty of threads that argue whether or
> not it's semantically correct to strip trailing spaces from CHAR(n) values, but
> the issue of characters collating below a space does not seem to have brought
> up in any of those discussions before.

[I am moving this thread to hackers because I think it needs internals
review.]

You have some good questions here, though there are two interrelated
things going on here.  First is collation, and the second is the
trimming of spaces from char() comparisons.  Let's look at collation
first:
test=> SHOW lc_collate; lc_collate------------- en_US.UTF-8(1 row)test=> SELECT 'a c' UNION ALL SELECT 'ab' ORDER BY 1;
?column?----------ab a c(2 rows)
 

You will notice spaces are not considered important in a UTF8 collation.
If we do this in the C collation, we get a different result:
test=> CREATE DATABASE test2 WITH LC_COLLATE = 'C' TEMPLATE template0;CREATE DATABASEtest=> \c test2You are now
connectedto database "test2" as user "postgres".test2=> SELECT 'a c' UNION ALL SELECT 'ab' ORDER BY 1;
?column?----------a c ab(2 rows)
 

Also, when using ORDER BY, it isn't clear if the values are ordered that
way due to being greater/less-than, or just randomly.   For example, I
found your example above gave different ordering if I inserted the
values differently, using a UTF8 collation.

Let me use comparisons instead using UTF8 for clarity:
test=> SHOW lc_collate; lc_collate------------- en_US.UTF-8(1 row)test=> select 'a c' < 'ab'; ?column?---------- f(1
row)

and "C":
test2=> SHOW lc_collate; lc_collate------------ C(1 row)test2=> select 'a c' < 'ab'; ?column?---------- t(1 row)

Now, let's look at ASCII characters less than space, first in UTF8:
test=> SHOW lc_collate; lc_collate------------- en_US.UTF-8(1 row)test=> select E'a\nb' < E'a b'; ?column?----------
f(1row)
 

and in C:
test2=> SHOW lc_collate; lc_collate------------ C(1 row)test2=> select E'a\nb' < E'a b'; ?column?---------- t(1 row)

You can see that newline is greater than space in UTF8, but not in C.

Now, on to the trailing space issue using the default TEXT value for
strings, first in UTF8:

test=> SHOW lc_collate; lc_collate------------- en_US.UTF-8(1 row)test=> select E'ab\n' < E'ab '; ?column?----------
f(1row)
 

then in "C":
test2=> SHOW lc_collate; lc_collate------------ C(1 row)test2=> select E'ab\n' < E'ab '; ?column?---------- t(1 row)
This matches the \n/space issue we saw above.  Now, here is where CHAR()
starts to show the unusual behavior you saw, first in UTF8:
test=> SHOW lc_collate; lc_collate------------- en_US.UTF-8(1 row)test=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?----------f(1 row)
 

then in C:
test2=> SHOW lc_collate; lc_collate------------ C(1 row)test2=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?----------f(1 row)
 

You can see the UTF8 case is fine because \n is considered greater than
space, but in the C locale, where \n is less than space, the false
return value shows the problem with internal_bpchar_pattern_compare()
trimming the string and first comparing on lengths.  This is exactly the
problem you outline, where space trimming assumes everything is less
than a space.

I am sorry for this long email, but I would be interested to see what
other hackers think about this issue.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: buildfarm failures on smew and anole
Next
From: Andrew Dunstan
Date:
Subject: Re: buildfarm failures on smew and anole