Thread: Array access to type "name"

Array access to type "name"

From
Peter Eisentraut
Date:
The type "name" can be subscripted like an array to access the individual
"char" elements.  But since a character stored in a "name" value isn't
necessarily one byte, this is incorrect.  Does anything rely on this
facility, or would it be better to remove it for type "name"?

Here's an example that produces a failure:

$ export PGCLIENTENCODING=latin1
$ createdb -E UNICODE test
$ psql test
=> create table åland (a int);
=> create table überschall (b int);
=> select relname[0] from pg_class;
ERROR:  Could not convert UTF-8 to ISO8859-1
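
A minimal sketch of the general hazard, not a diagnosis of the error above: if a name is stored as valid UTF-8, its first byte can be the lead byte of a multibyte sequence, and that byte alone cannot be converted to ISO 8859-1. Python's codecs stand in here for the server's conversion routines, and the helper `utf8_to_latin1` is hypothetical.

```python
# Illustration only: Python codecs stand in for the server-side conversion.
name = "\u00e5land"                 # the table name "åland"
utf8 = name.encode("utf-8")         # stored form in a UNICODE database

# What a byte-wise subscript would hand back: the first byte only.
first_byte = utf8[0:1]


def utf8_to_latin1(raw: bytes) -> bytes:
    """Convert UTF-8 bytes to ISO 8859-1, raising on invalid input."""
    return raw.decode("utf-8").encode("latin-1")


# The whole name converts cleanly.
whole = utf8_to_latin1(utf8)

# The lone lead byte is not valid UTF-8, so the conversion fails.
try:
    utf8_to_latin1(first_byte)
    first_ok = True
except UnicodeDecodeError:
    first_ok = False
```

Here `first_byte` is the single byte 0xC3, which is meaningful only together with the byte that follows it.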

--
Peter Eisentraut   peter_e@gmx.net



Re: Array access to type "name"

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> The type "name" can be subscripted like an array to access the individual
> "char" elements.  But since a character stored in a "name" value isn't
> necessarily one byte, this is incorrect.  Does anything rely on this
> facility, or would it be better to remove it for type "name"?

The fact that it isn't very useful for multibyte character sets doesn't
seem to me to be reason to remove it for everyone ...

> Here's an example that produces a failure:
> $ export PGCLIENTENCODING=latin1
> $ createdb -E UNICODE test
> $ psql test
> => create table åland (a int);
> => create table überschall (b int);
> => select relname[0] from pg_class;
> ERROR:  Could not convert UTF-8 to ISO8859-1

I'm not having any luck duplicating that here, but in any case what the
above suggests to me is lack of robustness in the output conversion
chain for type "char".  Or do you want to legislate that byte values
corresponding to the first bytes of multibyte character sequences are
illegal values for type "char"?  I'd have a problem with that ...
        regards, tom lane



Re: Array access to type "name"

From
Peter Eisentraut
Date:
Tom Lane writes:

> I'm not having any luck duplicating that here, but in any case what the
> above suggests to me is lack of robustness in the output conversion
> chain for type "char".  Or do you want to legislate that byte values
> corresponding to the first bytes of multibyte character sequences are
> illegal values for type "char"?  I'd have a problem with that ...

I think it comes down to defining what we really want.  Clearly, "char" is
a byte, not a character, much like in C.  Perhaps we should adopt the
bytea escape mechanism for "char" values above 127.  Otherwise, what gets
stored and what gets printed out both depend on character set conversion
issues, which seems yucky.
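
As a rough sketch of what escaping "char" values above 127 could look like, assuming the backslash-plus-three-octal-digits form used by bytea escape output: the function name `escape_char_byte` is hypothetical, and the sketch covers only the high-byte case, not the backslash or control-character handling that bytea output also involves.

```python
def escape_char_byte(b: int) -> str:
    """Render one byte; values above 127 become a backslash and three octal digits."""
    if not 0 <= b <= 255:
        raise ValueError("not a single byte")
    if b > 127:
        # Assumed format: backslash followed by the zero-padded octal value.
        return "\\%03o" % b
    # ASCII bytes are passed through unchanged in this sketch.
    return chr(b)


plain = escape_char_byte(ord("a"))      # ASCII byte, unchanged
latin1_aring = escape_char_byte(0xE5)   # the Latin-1 byte for "å"
utf8_lead = escape_char_byte(0xC3)      # a UTF-8 lead byte
```

With this scheme the printed form of a "char" no longer depends on any character set conversion, since the output is pure ASCII.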

Now you can define name[x] to be the x'th *byte* of name, but that seems
contrived and inconsistent with the original purpose, because whether you
get useful or garbage values depends on the character set encoding.  If
you want to select the x'th character, use substring(), if you want access
to bytes, use bytea.  The character set encoding is an internal matter
that should not be accessible to users.
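
The distinction between the x'th character and the x'th byte can be sketched as follows. This is an illustration in Python rather than SQL; string indexing plays the role of substring(), and indexing the encoded bytes plays the role of a byte-wise subscript.

```python
name = "\u00fcberschall"            # the table name "überschall"

utf8 = name.encode("utf-8")         # multibyte encoding
latin1 = name.encode("latin-1")     # single-byte encoding

# Character access: independent of how the name is encoded.
first_char = name[0]

# Byte access: the result depends on the encoding.
first_byte_utf8 = utf8[0]           # lead byte of a two-byte sequence
first_byte_latin1 = latin1[0]       # the complete character in Latin-1

lengths = (len(name), len(utf8), len(latin1))
```

In the single-byte encoding the byte subscript and the character subscript coincide, which is why the facility works for those users; in UTF-8 the same subscript yields only a fragment of the character.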

Btw., the issue is even a bit more serious than the example I posted:

$ dropdb test
$ createdb -E UNICODE test
$ psql test
=> create table åland (a int);
=> \d
ERROR:  Could not convert UTF-8 to ISO8859-1

(Latest sources.)

--
Peter Eisentraut   peter_e@gmx.net



Re: Array access to type "name"

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I think it comes down to defining what we really want.  Clearly, "char" is
> a byte, not a character, much like in C.  Perhaps we should adopt the
> bytea escape mechanism for "char" values above 127.  Otherwise, what gets
> stored and what gets printed out both depend on character set conversion
> issues, which seems yucky.

That would be okay with me.

> Now you can define name[x] to be the x'th *byte* of name, but that seems
> contrived and inconsistent with the original purpose, because whether you
> get useful or garbage values depends on the character set encoding.

"Original purpose" is in the eye of the beholder, maybe.  It's not
practical to fix name subscripting to make it be multibyte-aware
(not least because the output type couldn't be "char").  So the only
alternative to leaving it alone is to remove the capability entirely.
Which strikes me as overly rigid.  It is a useful facility for people
using 1-byte character sets, and I see no reason to take it away from
them just because it isn't very useful in multibyte character sets.


> Btw., the issue is even a bit more serious than the example I posted:

> $ dropdb test
> $ createdb -E UNICODE test
> $ psql test
> => create table åland (a int);
> => \d
> ERROR:  Could not convert UTF-8 to ISO8859-1

I suspected that that issue didn't really have anything to do with
subscripting, and I guess I was right.  But I still can't duplicate
the error.  I take it you are using client_encoding ISO8859-1 ...
but what exactly is the funny character involved?  It comes across
here as \345 but I bet something munged it in transmission, because
what I see is

test=# create table åland (a int);
CREATE TABLE
test=# \d
       List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | ,and | table | postgres
(1 row)

        regards, tom lane



Re: Array access to type "name"

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Btw., the issue is even a bit more serious than the example I posted:
> $ dropdb test
> $ createdb -E UNICODE test
> $ psql test
> => create table åland (a int);
> => \d
> ERROR:  Could not convert UTF-8 to ISO8859-1
> (Latest sources.)

I still can't duplicate this, but I see different behavior now that
I've fixed that silliness of not having any encoding conversion on
incoming queries.  Would you check again with CVS tip?
        regards, tom lane