Re: [SQL] Comparison semantics of CHAR data type - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: [SQL] Comparison semantics of CHAR data type |
Date | |
Msg-id | 20131016181711.GC18048@momjian.us Whole thread Raw |
In response to | Re: [SQL] Comparison semantics of CHAR data type (Thomas Fanghaenel <tfanghaenel@salesforce.com>) |
Responses |
Re: [SQL] Comparison semantics of CHAR data type
|
List | pgsql-hackers |
> > You can see the UTF8 case is fine because \n is considered greater > > than space, but in the C locale, where \n is less than space, the > > false return value shows the problem with > > internal_bpchar_pattern_compare() trimming the string and first > > comparing on lengths. This is exactly the problem you outline, where > > space trimming assumes everything is less than a space. > > For collations other than C some of those issues that have to do with > string comparisons might simply be hidden, depending on how strcoll() > handles inputs off different lengths: If strcoll() applies implicit > space padding to the shorter value, there won't be any visible > difference in ordering between bpchar and varchar values. If strcoll() > does not apply such space padding, the right-trimming of bpchar values > causes very similar issues even in a en_US collation. > > For example, this seems to be the case on OS X: > > select 'ab '::char(10) collate "en_US" < E'ab\n'::char(10) > collate "en_US"; > ?column? > ---------- > t > (1 row) > > select 'ab '::char(10) collate "C" < E'ab\n'::char(10) collate "C"; > ?column? > ---------- > t > (1 row) > > select 'ab '::varchar(10) collate "en_US" < > E'ab\n'::varchar(10) collate "en_US"; > ?column? > ---------- > f > (1 row) The above query returns true on Linux, so there certainly is a platform-specific difference there. The others are the same. > select 'ab '::varchar(10) collate "C" < E'ab\n'::varchar(10) > collate "C"; > ?column? > ---------- > f > (1 row) > > So here there's actually not only the same \n/space issue as in the C > collation (which would go away if the bpchar value weren't trimmed). > It also shows that there might be slight differences in behavior, > depending which platform you're running on. > > On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > > What matters in general isn't where the characters fall when comparing > > individual bytes, but how the strings containing them sort according > > to the applicable collation. That said, my recollection of the spec > > is that when two CHAR(n) values are compared, the shorter should be > > blank-padded before making the comparison. > > Not necessarily. The SQL Standard actually ties this to the collation > sequence that is in use. Without a lot of context, this is from > Subclause 8.2, "<comparison predicate>", General Rule 3)b): > > b) If the length in characters of X is not equal to the length in > characters of Y, then the shorter string is effectively replaced, > for the purposes of comparison, with a copy of itself that has been > extended to the length of the longer string by concatenation on the > right of one or more pad characters, where the pad character is > chosen based on CS. If CS has the NO PAD characteristic, then the > pad character is an implementation-dependent character different > from any character in the character set of X and Y that collates > less than any string under CS. Otherwise, the pad character is a > <space>. > > In my opinion, that's just a lot of handwaving, to the extent that in > practice different vendors interpret this clause differently. It seems > that SQLServer and DB2 do PAD semantics across the board, whereas Oracle > has uses NO PAD semantics whenever there's a VARCHAR type involved in > the comparison. > > But all that is actually a whole different can of worms, and slightly > besides the point of my original question. How to properly compare > strings with different lentgths has been discussed before, see for > instance the thread in [1]. My intention was not to get that started > again. As far as I can see, the consensus seems to be that when using > the C locale, string comparisons should be done using NO PAD semantics. > (It sure gives some strange semantics if you have varchars with trailing > spaces, but it's perfectly legal.) > > The point is that my testcase deals with strings of the same length. > Thus, the above clause doesn't really apply. The standard, to my > understanding, says that fixed-length character values are padded when > the row is constructed. And once that happens, those spaces become part > of the value. It's invalid to strip them, unless done explicitly. Yes, there are three types of comparisons that are important here: 1. 'a'::CHAR(3) < 'a'::CHAR(3)2. 'a '::CHAR(3) < E'a\n'::CHAR(3)3. 'a'::CHAR(3) < 'a'::CHAR(4) You are saying it is only #3 where we can substitute the special always-lower pad character, while it appears that Postgres does this in cases #2 and #3. > > Since we only have the CHAR(n) type to improve compliance with the SQL > > specification, and we don't generally encourage its use, I think we > > should fix any non-compliant behavior. That seems to mean that if you > > take two CHAR values and compare them, it should give the same result > > as comparing the same two values as VARCHAR using the same collation > > with the shorter value padded with spaces. > > > > So this is correct: > > > > test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US"; > > ?column? > > ---------- > > t > > (1 row) > > > > ... because it matches: > > > > test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US"; > > ?column? > > ---------- > > t > > (1 row) > > Again, this touches on the same point as Bruce's example above. Right > now these two queries might produce identical results on Linux, because > of the way strcoll() behaves. On OS X you get different results: > > select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US"; > ?column? > ---------- > t > (1 row) > > select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) > collate "en_US"; > ?column? > ---------- > f > (1 row) The above returns true on Linux. > I have no idea who's right. But doesn't this count as evidence that > right-trimming pbchars is not even safe with the en_US collation? Well, one problem in fixing this is that there are existing CHAR() indexes that embed this behavior, and unless we want to break pg_upgrade for CHAR() indexes, I am not sure what options we have except to document this. Let me also add that we don't propogate the CHAR() length through any/most/all? function calls and operators, so the length comparison wouldn't work in all cases anyway, e.g. SELECT 'a'::CHAR(4) || 'b'::CHAR(4); ?column?---------- ab(1 row) While space truncation is odd, it is predictable. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
pgsql-hackers by date: