Thread: char(n) to varchar or text conversion should strip trailing spaces
I've gotten really tired of explaining to newbies why stuff involving char(n) fields doesn't work like they expect. Our current behavior is not valid per SQL92 anyway, I believe. I think there is a pretty simple solution now that we have pg_cast: we could stop treating char(n) as binary-equivalent to varchar/text, and instead define it as requiring a runtime conversion (which would be essentially the rtrim() function). The cast in the other direction would be assignment-only, so that any expression that involves mixed char(n) and varchar/text operations would be evaluated in varchar rules after stripping char's insignificant trailing blanks. If we did this, then operations like WHERE UPPER(charcolumn) = 'FOO' would work as a newbie expects. I believe that we'd come a lot closer to spec compliance on the behavior of char(n), too. Comments? regards, tom lane
Re: char(n) to varchar or text conversion should strip trailing spaces
From
"Zeugswetter Andreas SB SD"
Date:
> I've gotten really tired of explaining to newbies why stuff involving > char(n) fields doesn't work like they expect. Our current behavior is > not valid per SQL92 anyway, I believe. > > I think there is a pretty simple solution now that we have pg_cast: > we could stop treating char(n) as binary-equivalent to varchar/text, > and instead define it as requiring a runtime conversion (which would > be essentially the rtrim() function). The cast in the other direction > would be assignment-only, so that any expression that involves mixed > char(n) and varchar/text operations would be evaluated in varchar > rules after stripping char's insignificant trailing blanks. > > If we did this, then operations like > WHERE UPPER(charcolumn) = 'FOO' > would work as a newbie expects. I believe that we'd come a lot closer > to spec compliance on the behavior of char(n), too. I am all for it. That would much more closely match what I would expect. One alternate possible approach would maybe be to change the on-disk representation to really be binary compatible and change the input output and operator functions ? IIRC fixed width optimizations do not gain as much as in earlier versions anyway. Then char(n) would have the benefit of beeing trailing blank insensitive and having the optimal storage format. Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > One alternate possible approach would maybe be to change the on-disk > representation to really be binary compatible and change the input > output and operator functions ? Hmm ... now that's an interesting thought. So the input converter would actively strip trailing blanks, output would add them back, and only in a few char(n)-specific functions would we need to pretend they are there. This would mean that char(n)-to-text is binary compatible but the reverse direction is not (it must strip trailing blanks) --- but that coercion could be folded into the existing length-limit checker for char(n) columns. > IIRC fixed width optimizations do not gain as > much as in earlier versions anyway. There are no fixed-width optimizations for char(n) at all, now that we have multibyte enabled all the time. Seems like a great idea to me. regards, tom lane
Tom Lane writes: > Hmm ... now that's an interesting thought. So the input converter would > actively strip trailing blanks, output would add them back, But how would the output know how many to put back? -- Peter Eisentraut peter_e@gmx.net
I said: > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: >> One alternate possible approach would maybe be to change the on-disk >> representation to really be binary compatible and change the input >> output and operator functions ? > Seems like a great idea to me. On further thought I've got some reservations about it. The main problem is that the physical contents of a char(n) Datum wouldn't provide the complete semantic meaning: you must know the typmod as well to know the number of padding spaces that are supposed to be there. We have special provisions to allow input and output functions to know the typmod, but in more general cases functions will not receive a typmod, and in any case they cannot deliver a typmod. So for any operation that wanted to behave as though the padding spaces are real, there'd be a problem. We could maybe play some games with having char(n) values be expanded (space-padded) during expression evaluation and then trimmed for storage, but this strikes me as awfully messy, not to mention redundant with the compression done by TOAST. Also: I've been reading through the spec in more detail, and what I now realize is that they expect trailing blanks to be ignored by default in both char(n) and varchar(n) comparisons! In fact, it's not really the datatype that determines this, but the collation attribute, and they specify that the default collation must have the PAD SPACE attribute (which essentially means that trailing spaces are not significant). What's more, AFAICT padding spaces in char(n) are treated as real data by every operation except comparison --- for example, they are real data in concatenation. I don't think we really want to meet the letter of the spec here :-( It seems quite schizoid to treat pad spaces as real data for everything except comparison. Certainly I do not want to do that for varchar or text datatypes. The idea of having char(n)-to-text conversion strip trailing blanks still appeals to me, but I have to withdraw the claim that it'd improve our spec compliance; it wouldn't. I'm now wondering whether it wouldn't be better to leave the data representation as-is (padding spaces are stored), and still allow binary compatibility both ways, but add a collection of duplicate pg_proc and pg_operator entries so that char-ness is preserved where appropriate. For example, we'd need both these pg_proc entries for UPPER():upper(text) returns textupper(character) returns character They could point at the same C routine, but the parser would select the first when the input is text or varchar, and the second when the input is character. This would solve the original complaint aboutupper('foo '::char(6)) = 'FOO' needing to yield TRUE. The extra pg_proc entries would be a tad tedious, but I think we'd only need a couple dozen to satisfy the spec's requirements. Some (perhaps not all) of these functions would need to be duplicated: btrim(text) btrim(text,text) convert(text,name) convert(text,name,name) convert_using(text,text) initcap(text) lower(text) lpad(text,integer) lpad(text,integer,text) ltrim(text) ltrim(text,text) max(text) min(text) overlay(text,text,integer) overlay(text,text,integer,integer) repeat(text,integer) replace(text,text,text) rpad(text,integer) rpad(text,integer,text) rtrim(text) rtrim(text,text) split_part(text,text,integer) substr(text,integer) substr(text,integer,integer) substring(text,integer) substring(text,integer,integer) text_larger(text,text) text_smaller(text,text) textcat(text,text) translate(text,text,text) upper(text) regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes: > Tom Lane writes: >> Hmm ... now that's an interesting thought. So the input converter would >> actively strip trailing blanks, output would add them back, > But how would the output know how many to put back? The output routine would need access to the column typmod. Which it would have, in simple "SELECT columnname" cases, but this is a serious weakness of the scheme in general. See my followup post of a few minutes ago. regards, tom lane