Thread: charin(), text_char() should return something else for empty input

charin(), text_char() should return something else for empty input

From
Tom Lane
Date:
I have been chasing Domingo Alvarez Duarte's report of funny behavior
when assigning an empty string to a "char" variable in plpgsql.  What
it comes down to is that text-to-char conversion does not behave very
well for zero-length input.  charin() returns a null character, leading
to the following bizarreness:

regression=# select 'z' || (''::"char") || 'q';
 ?column?
----------
 z
(1 row)

regression=# select length('z' || (''::"char") || 'q');
 length
--------
      3
(1 row)

The concatenation result is 'z\0q', which doesn't print nicely :-(.

text_char() produces a completely random result, eg:

regression=# select ''::text::"char";
 ?column?
----------
 ~
(1 row)

and could even coredump in the worst case, since it tries to fetch the
first character of the text input no matter whether there is one or not.
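
The failure mode described above can be sketched in standalone C. This is only an illustration, not the PostgreSQL source: the SketchText struct and sketch_text_to_char() are hypothetical names, and the struct is a simplified stand-in for a length-counted text value. The point is that the length is checked before the first byte is read, and the caller decides what an empty input maps to.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a length-counted text value.
 * This is NOT PostgreSQL's actual varlena layout.
 */
typedef struct
{
    size_t      len;    /* number of payload bytes; may be zero */
    const char *data;   /* payload bytes, not null-terminated */
} SketchText;

/*
 * Convert text to a one-byte char.
 *
 * The problematic pattern is an unconditional "return t->data[0]",
 * which reads past the payload when len == 0.  Here the length is
 * checked first, and empty_result selects what an empty input becomes
 * (' ' or '\0', the two options under discussion).
 */
static char
sketch_text_to_char(const SketchText *t, char empty_result)
{
    if (t->len == 0)
        return empty_result;
    return t->data[0];
}
```

With an empty SketchText, sketch_text_to_char(&t, ' ') yields a space and sketch_text_to_char(&t, '\0') yields a null character; in neither case is t->data dereferenced.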

I propose that both of these operations should return a space character
for an empty input string.  This is by analogy to space-padding as you'd
get with char(1).  Any objections?
        regards, tom lane


I wrote:
> I propose that both of these operations should return a space character
> for an empty input string.  This is by analogy to space-padding as you'd
> get with char(1).  Any objections?

An alternative approach is to make charin and text_char map empty
strings to the null character (\0), and conversely make charout and
char_text map the null character to empty strings.  charout already
acts that way, in effect, since it has to produce a null-terminated
C string.  This way would have the advantage that there would still
be a reversible dump and reload representation for a "char" field
containing '\0', whereas space-padding would cause such a field to
become ' ' after reload.  But it's a little strange if you think that
"char" ought to behave the same as char(1).

Comments?
        regards, tom lane
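
The dump-and-reload argument in the message above can be illustrated with a small C sketch. The function names (sketch_charin_nul, sketch_charin_pad, sketch_charout) are hypothetical and operate on plain C strings; they are not the real charin()/charout() implementations, only a model of the two proposed mappings.

```c
#include <string.h>

/* Alternative proposal: an empty C string maps to '\0'. */
static char
sketch_charin_nul(const char *s)
{
    return s[0];        /* s[0] is already '\0' when s is "" */
}

/* Original proposal: an empty C string maps to a space. */
static char
sketch_charin_pad(const char *s)
{
    return (s[0] == '\0') ? ' ' : s[0];
}

/*
 * Output side: one byte plus the terminator.  A '\0' value therefore
 * comes out as the empty string, because the terminator ends it at once.
 */
static void
sketch_charout(char c, char buf[2])
{
    buf[0] = c;
    buf[1] = '\0';
}
```

Writing '\0' through sketch_charout() gives "", which sketch_charin_nul() reads back as '\0' (a reversible round trip), while sketch_charin_pad() reads it back as ' ' and the original value is lost.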


At 02:37 PM 5/28/01 -0400, Tom Lane wrote:
>I wrote:
>> I propose that both of these operations should return a space character
>> for an empty input string.  This is by analogy to space-padding as you'd
>> get with char(1).  Any objections?
>
>An alternative approach is to make charin and text_char map empty
>strings to the null character (\0), and conversely make charout and
>char_text map the null character to empty strings.  charout already
>acts that way, in effect, since it has to produce a null-terminated
>C string.  This way would have the advantage that there would still
>be a reversible dump and reload representation for a "char" field
>containing '\0', whereas space-padding would cause such a field to
>become ' ' after reload.  But it's a little strange if you think that
>"char" ought to behave the same as char(1).
>
>Comments?

I personally wouldn't expect "char" to behave exactly as "char(1)",
because I understand it to be a one-byte variable which holds a single
(not zero or one) character.

Mapping '' to ' ' doesn't make a lot of sense to me.  It isn't what
I'd expect.

I think the behavior you describe in this note is better.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.
 


Don Baccus <dhogaza@pacifier.com> writes:
> Mapping '' to ' ' doesn't make a lot of sense to me.  It isn't what
> I'd expect.
> I think the behavior you describe in this note is better.

I'm coming to that conclusion as well.  If you look closely, both
charin() and charout() act that way already; so the second proposal
boils down to making the text <=> char conversion functions act in
accordance with the way that char's I/O conversions already act.
That seems a less drastic change than altering both I/O and conversion
behavior.
        regards, tom lane


Re: Re: charin(), text_char() should return something else for empty input

From
ncm@zembu.com (Nathan Myers)
Date:
On Mon, May 28, 2001 at 02:37:32PM -0400, Tom Lane wrote:
> I wrote:
> > I propose that both of these operations should return a space character
> > for an empty input string.  This is by analogy to space-padding as you'd
> > get with char(1).  Any objections?
> 
> An alternative approach is to make charin and text_char map empty
> strings to the null character (\0), and conversely make charout and
> char_text map the null character to empty strings.  charout already
> acts that way, in effect, since it has to produce a null-terminated
> C string.  This way would have the advantage that there would still
> be a reversible dump and reload representation for a "char" field
> containing '\0', whereas space-padding would cause such a field to
> become ' ' after reload.  But it's a little strange if you think that
> "char" ought to behave the same as char(1).

Does the standard require any particular behavior with NUL
characters?  I'd like to see PG move toward treating them as ordinary 
control characters.  I realize that at best it will take a long time 
to get there.  C is irretrievably mired in the "NUL is a terminator"
swamp, but SQL isn't C.

Nathan Myers
ncm@zembu.com


Re: Re: charin(), text_char() should return something else for empty input

From
Peter Eisentraut
Date:
Nathan Myers writes:

> Does the standard require any particular behavior with NUL
> characters?

The standard describes the behaviour of the character types in terms of
character sets.  This decouples glyphs, encoding, and storage.  So
theoretically you could (AFAICT) define a character set that encodes some
meaningful character with code zero, but the implementation is not
required to handle this zero byte internally, it could catch it during
input and represent it with an escape code.

The standard also defines some possible "built-in" character sets, such as
LATIN1 and UTF16.  Most of these do not naturally contain a character that
is encoded with the zero byte.  In the case of the ISO8BIT/ASCII_FULL
charset, the standard explicitly says that the zero byte is not contained
in the character set.

In general, I don't see a point in accepting a zero byte in character
strings.  If you want to store binary data there are binary data types (or
effort could be invested in them).

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Peter Eisentraut <peter_e@gmx.net> writes:
> In general, I don't see a point in accepting a zero byte in character
> strings.  If you want to store binary data there are binary data types (or
> effort could be invested in them).

If we were starting in a green field then I'd think it worthwhile to
maintain null-byte-cleanness for the textual datatypes.  At this point,
though, the amount of pain involved seems to vastly outweigh the value.
The major problem is that I/O conventions not based on null-terminated
strings would break all existing user-defined datatypes.  (Changing our
own code is one thing, breaking users' code is something else.)  There
are minor-by-comparison problems like not being able to use strcoll()
for locale-sensitive comparisons anymore...
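
The strcoll() point can be seen in a short C sketch. The buffers and helper names below are made up for illustration. strcoll() takes null-terminated strings, so it stops at the first zero byte and cannot see anything that follows it, whereas a length-aware byte comparison such as memcmp() can (but is not locale-sensitive).

```c
#include <string.h>

/* Two 4-byte buffers that differ only after an embedded zero byte. */
static const char sketch_a[4] = {'a', 'b', '\0', 'x'};
static const char sketch_b[4] = {'a', 'b', '\0', 'y'};

/* Nonzero when strcoll() considers the two buffers equal. */
static int
sketch_strcoll_sees_equal(void)
{
    /* Both arguments look like the string "ab" to strcoll(). */
    return strcoll(sketch_a, sketch_b) == 0;
}

/* Nonzero when a length-aware byte comparison considers them equal. */
static int
sketch_memcmp_sees_equal(void)
{
    return memcmp(sketch_a, sketch_b, sizeof(sketch_a)) == 0;
}
```

Here sketch_strcoll_sees_equal() returns nonzero while sketch_memcmp_sees_equal() returns zero, which is the kind of mismatch that null-byte-clean text types would have to work around.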

I agree with Peter that spending some extra effort on bytea and/or
similar types is probably a more profitable answer.
        regards, tom lane