Thread: Array access to type "name"
The type "name" can be subscripted like an array to access the individual
"char" elements.  But since a character stored in a "name" value isn't
necessarily one byte, this is incorrect.  Does anything rely on this
facility, or would it be better to remove it for type "name"?

Here's an example that produces a failure:

$ export PGCLIENTENCODING=latin1
$ createdb -E UNICODE test
$ psql test
=> create table åland (a int);
=> create table überschall (b int);
=> select relname[0] from pg_class;
ERROR:  Could not convert UTF-8 to ISO8859-1

-- 
Peter Eisentraut   peter_e@gmx.net
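The failure above comes from slicing a multibyte encoding at byte granularity. As a minimal sketch (Python standing in for what a byte-wise name[0] does; not PostgreSQL code), the lead byte of a multibyte UTF-8 sequence is not a complete character, so converting it alone to the client's Latin-1 encoding has to fail:

```python
# Illustration only: byte-wise subscripting of a UTF-8 string.
name = "åland"

utf8 = name.encode("utf-8")      # stored server-side as UTF-8: b'\xc3\xa5land'
first_byte = utf8[0:1]           # what a byte-wise name[0] would hand back

# b'\xc3' is only the lead byte of a two-byte sequence; on its own it is
# not valid UTF-8, so no conversion to ISO8859-1 is possible.
try:
    first_byte.decode("utf-8")
    convertible = True
except UnicodeDecodeError:
    convertible = False

print(len(utf8), len(name))      # 6 bytes vs. 5 characters
print(convertible)               # False: mirrors the conversion error above
```

The byte/character mismatch (6 vs. 5 here) is exactly why name[0] cannot reliably return a character once the server encoding is multibyte.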
Peter Eisentraut <peter_e@gmx.net> writes:
> The type "name" can be subscripted like an array to access the individual
> "char" elements.  But since a character stored in a "name" value isn't
> necessarily one byte, this is incorrect.  Does anything rely on this
> facility, or would it be better to remove it for type "name"?

The fact that it isn't very useful for multibyte character sets doesn't
seem to me to be reason to remove it for everyone ...

> Here's an example that produces a failure:
> $ export PGCLIENTENCODING=latin1
> $ createdb -E UNICODE test
> $ psql test
> => create table åland (a int);
> => create table überschall (b int);
> => select relname[0] from pg_class;
> ERROR:  Could not convert UTF-8 to ISO8859-1

I'm not having any luck duplicating that here, but in any case what the
above suggests to me is lack of robustness in the output conversion
chain for type "char".  Or do you want to legislate that byte values
corresponding to the first bytes of multibyte character sequences are
illegal values for type "char"?  I'd have a problem with that ...

			regards, tom lane
Tom Lane writes:
> I'm not having any luck duplicating that here, but in any case what the
> above suggests to me is lack of robustness in the output conversion
> chain for type "char".  Or do you want to legislate that byte values
> corresponding to the first bytes of multibyte character sequences are
> illegal values for type "char"?  I'd have a problem with that ...

I think it comes down to defining what we really want.  Clearly, "char"
is a byte, not a character, much like in C.  Perhaps we should adopt the
bytea escape mechanism for "char" values above 127.  Otherwise, what
gets stored and what gets printed out both depend on character set
conversion issues, which seems yucky.

Now you can define name[x] to be the x'th *byte* of name, but that seems
contrived and inconsistent with the original purpose, because whether
you get useful or garbage values depends on the character set encoding.
If you want to select the x'th character, use substring(); if you want
access to bytes, use bytea.  The character set encoding is an internal
matter that should not be accessible to users.

Btw., the issue is even a bit more serious than the example I posted:

$ dropdb test
$ createdb -E UNICODE test
$ psql test
=> create table åland (a int);
=> \d
ERROR:  Could not convert UTF-8 to ISO8859-1

(Latest sources.)

-- 
Peter Eisentraut   peter_e@gmx.net
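Peter's suggestion is to print "char" values above 127 with bytea-style octal escapes instead of passing them through encoding conversion. A rough sketch of what that output rule would look like (Python for illustration; `escape_char_byte` is a hypothetical name, not a PostgreSQL function):

```python
def escape_char_byte(b: int) -> str:
    """Render one "char" byte per the proposal: printable ASCII as-is,
    backslash doubled, anything else (including all bytes above 127) as a
    bytea-style \\nnn octal escape.  Sketch only, not PostgreSQL source."""
    if b == 0x5C:                    # backslash is doubled in bytea escape format
        return "\\\\"
    if 32 <= b <= 126:               # printable ASCII passes through unchanged
        return chr(b)
    return "\\%03o" % b              # e.g. 0xE5 -> \345

print(escape_char_byte(ord("a")))    # a
print(escape_char_byte(0xE5))        # \345  (Latin-1 å, shown without any conversion)
```

This sidesteps conversion entirely: the client sees an unambiguous byte value no matter what encodings are in play.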
Peter Eisentraut <peter_e@gmx.net> writes:
> I think it comes down to defining what we really want.  Clearly, "char"
> is a byte, not a character, much like in C.  Perhaps we should adopt
> the bytea escape mechanism for "char" values above 127.  Otherwise,
> what gets stored and what gets printed out both depend on character set
> conversion issues, which seems yucky.

That would be okay with me.

> Now you can define name[x] to be the x'th *byte* of name, but that
> seems contrived and inconsistent with the original purpose, because
> whether you get useful or garbage values depends on the character set
> encoding.

"Original purpose" is in the eye of the beholder, maybe.  It's not
practical to fix name subscripting to make it be multibyte-aware (not
least because the output type couldn't be "char").  So the only
alternative to leaving it alone is to remove the capability entirely.
Which strikes me as overly rigid.  It is a useful facility for people
using 1-byte character sets, and I see no reason to take it away from
them just because it isn't very useful in multibyte character sets.

> Btw., the issue is even a bit more serious than the example I posted:
> $ dropdb test
> $ createdb -E UNICODE test
> $ psql test
> => create table åland (a int);
> => \d
> ERROR:  Could not convert UTF-8 to ISO8859-1

I suspected that that issue didn't really have anything to do with
subscripting, and I guess I was right.  But I still can't duplicate the
error.  I take it you are using client_encoding ISO8859-1 ... but what
exactly is the funny character involved?  It comes across here as \345,
but I bet something munged it in transmission, because what I see is

test=# create table åland (a int);
CREATE TABLE
test=# \d
        List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | ,and | table | postgres
(1 row)

			regards, tom lane
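The \345 Tom mentions is the crux: octal 345 is byte 0xE5, which is a complete character ('å') in single-byte Latin-1 but an invalid standalone byte in UTF-8. A short check (Python for illustration) of why the reproduction depends on the database having been created with a multibyte encoding:

```python
# Octal \345 = 0xE5.  In ISO8859-1 (Latin-1) this single byte is 'å';
# in UTF-8 the bit pattern 1110 0101 marks the lead byte of a three-byte
# sequence, so by itself it is invalid input.
b = bytes([0o345])

print(b.decode("latin-1"))           # å -- a single-byte encoding is happy

try:
    b.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)                    # False
```

So in a Latin-1 database the byte round-trips harmlessly, while in a UNICODE database it must first be converted on input and back on output, which is where the reported error can appear.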
Peter Eisentraut <peter_e@gmx.net> writes:
> Btw., the issue is even a bit more serious than the example I posted:
> $ dropdb test
> $ createdb -E UNICODE test
> $ psql test
> => create table åland (a int);
> => \d
> ERROR:  Could not convert UTF-8 to ISO8859-1
> (Latest sources.)

I still can't duplicate this, but I see different behavior now that I've
fixed that silliness of not having any encoding conversion on incoming
queries.  Would you check again with CVS tip?

			regards, tom lane