Thread: char(n) to varchar or text conversion should strip trailing spaces

char(n) to varchar or text conversion should strip trailing spaces

From
Tom Lane
Date:
I've gotten really tired of explaining to newbies why stuff involving
char(n) fields doesn't work like they expect.  Our current behavior is
not valid per SQL92 anyway, I believe.

I think there is a pretty simple solution now that we have pg_cast:
we could stop treating char(n) as binary-equivalent to varchar/text,
and instead define it as requiring a runtime conversion (which would
be essentially the rtrim() function).  The cast in the other direction
would be assignment-only, so that any expression that involves mixed
char(n) and varchar/text operations would be evaluated in varchar
rules after stripping char's insignificant trailing blanks.

If we did this, then operations like    WHERE UPPER(charcolumn) = 'FOO'
would work as a newbie expects.  I believe that we'd come a lot closer
to spec compliance on the behavior of char(n), too.

Comments?
        regards, tom lane


Re: char(n) to varchar or text conversion should strip trailing spaces

From
"Zeugswetter Andreas SB SD"
Date:
> I've gotten really tired of explaining to newbies why stuff involving
> char(n) fields doesn't work like they expect.  Our current behavior is
> not valid per SQL92 anyway, I believe.
>
> I think there is a pretty simple solution now that we have pg_cast:
> we could stop treating char(n) as binary-equivalent to varchar/text,
> and instead define it as requiring a runtime conversion (which would
> be essentially the rtrim() function).  The cast in the other direction
> would be assignment-only, so that any expression that involves mixed
> char(n) and varchar/text operations would be evaluated in varchar
> rules after stripping char's insignificant trailing blanks.
>
> If we did this, then operations like
>         WHERE UPPER(charcolumn) = 'FOO'
> would work as a newbie expects.  I believe that we'd come a lot closer
> to spec compliance on the behavior of char(n), too.

I am all for it. That would much more closely match what I would expect.

One alternate possible approach would maybe be to change the on-disk
representation to really be binary compatible and change the input
output and operator functions ? IIRC fixed width optimizations do not gain as
much as in earlier versions anyway. Then char(n) would have the benefit of
beeing trailing blank insensitive and having the optimal storage format.

Andreas


Re: char(n) to varchar or text conversion should strip trailing spaces

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> One alternate possible approach would maybe be to change the on-disk
> representation to really be binary compatible and change the input 
> output and operator functions ?

Hmm ... now that's an interesting thought.  So the input converter would
actively strip trailing blanks, output would add them back, and only in
a few char(n)-specific functions would we need to pretend they are there.

This would mean that char(n)-to-text is binary compatible but the
reverse direction is not (it must strip trailing blanks) --- but that
coercion could be folded into the existing length-limit checker for
char(n) columns.

> IIRC fixed width optimizations do not gain as 
> much as in earlier versions anyway.

There are no fixed-width optimizations for char(n) at all, now that we
have multibyte enabled all the time.

Seems like a great idea to me.
        regards, tom lane


Re: char(n) to varchar or text conversion should strip

From
Peter Eisentraut
Date:
Tom Lane writes:

> Hmm ... now that's an interesting thought.  So the input converter would
> actively strip trailing blanks, output would add them back,

But how would the output know how many to put back?

-- 
Peter Eisentraut   peter_e@gmx.net



Re: char(n) to varchar or text conversion should strip trailing spaces

From
Tom Lane
Date:
I said:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>> One alternate possible approach would maybe be to change the on-disk
>> representation to really be binary compatible and change the input 
>> output and operator functions ?

> Seems like a great idea to me.

On further thought I've got some reservations about it.  The main
problem is that the physical contents of a char(n) Datum wouldn't
provide the complete semantic meaning: you must know the typmod as well
to know the number of padding spaces that are supposed to be there.
We have special provisions to allow input and output functions to know
the typmod, but in more general cases functions will not receive a
typmod, and in any case they cannot deliver a typmod.  So for any
operation that wanted to behave as though the padding spaces are real,
there'd be a problem.

We could maybe play some games with having char(n) values be expanded
(space-padded) during expression evaluation and then trimmed for
storage, but this strikes me as awfully messy, not to mention redundant
with the compression done by TOAST.


Also: I've been reading through the spec in more detail, and what I now
realize is that they expect trailing blanks to be ignored by default in
both char(n) and varchar(n) comparisons!  In fact, it's not really the
datatype that determines this, but the collation attribute, and they
specify that the default collation must have the PAD SPACE attribute
(which essentially means that trailing spaces are not significant).
What's more, AFAICT padding spaces in char(n) are treated as real data
by every operation except comparison --- for example, they are real
data in concatenation.

I don't think we really want to meet the letter of the spec here :-(
It seems quite schizoid to treat pad spaces as real data for everything
except comparison.  Certainly I do not want to do that for varchar or
text datatypes.

The idea of having char(n)-to-text conversion strip trailing blanks
still appeals to me, but I have to withdraw the claim that it'd improve
our spec compliance; it wouldn't.


I'm now wondering whether it wouldn't be better to leave the data
representation as-is (padding spaces are stored), and still allow binary
compatibility both ways, but add a collection of duplicate pg_proc and
pg_operator entries so that char-ness is preserved where appropriate.
For example, we'd need both these pg_proc entries for UPPER():upper(text) returns textupper(character) returns
character
They could point at the same C routine, but the parser would select the
first when the input is text or varchar, and the second when the input
is character.  This would solve the original complaint aboutupper('foo   '::char(6)) = 'FOO'
needing to yield TRUE.  The extra pg_proc entries would be a tad
tedious, but I think we'd only need a couple dozen to satisfy the spec's
requirements.  Some (perhaps not all) of these functions would need to
be duplicated:

btrim(text)
btrim(text,text)
convert(text,name)
convert(text,name,name)
convert_using(text,text)
initcap(text)
lower(text)
lpad(text,integer)
lpad(text,integer,text)
ltrim(text)
ltrim(text,text)
max(text)
min(text)
overlay(text,text,integer)
overlay(text,text,integer,integer)
repeat(text,integer)
replace(text,text,text)
rpad(text,integer)
rpad(text,integer,text)
rtrim(text)
rtrim(text,text)
split_part(text,text,integer)
substr(text,integer)
substr(text,integer,integer)
substring(text,integer)
substring(text,integer,integer)
text_larger(text,text)
text_smaller(text,text)
textcat(text,text)
translate(text,text,text)
upper(text)
        regards, tom lane


Re: char(n) to varchar or text conversion should strip

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> Hmm ... now that's an interesting thought.  So the input converter would
>> actively strip trailing blanks, output would add them back,

> But how would the output know how many to put back?

The output routine would need access to the column typmod.  Which it
would have, in simple "SELECT columnname" cases, but this is a serious
weakness of the scheme in general.  See my followup post of a few
minutes ago.
        regards, tom lane