Thread: text and bytea

text and bytea

From
"hernan gonzalez"
Date:
(After dealing a while with this, and learning a little, I though of
post this as comment in the docs, but perhaps someone who knows better
can correct or clarify)

=====================================================================================================

The issues of charset encodings and the distinction between "text" and
"bytea" can led to some confusions. Say I have a database with UTF-8
default encoding.

db=# \encoding
UTF8

db=# create table chartest ( c text);         -- just one text field
CREATE TABLE

db=# insert into chartest (c) values ('á');    --  just one non ascii
string: acute a
INSERT 0 1

db=# select c from chartest;  --  to check our terminal works ok
 c
---
 á
(1 row)

db=# select ascii(c) from chartest; --  to check the "real" stored value
 ascii
-------
   225
(1 row)

OK, now let's try to use the to_ascii() function, useful for
accent-insentive queries...

db=# select to_ascii(c) from chartest;
ERROR:  encoding conversion from UTF8 to ASCII not supported

That's right, the docs say that to_ascii() requires a string in LATIN9
encoding (and a few others)
How do we convert a string from UTF8 to LATIN9 ? Ah, here's is the
convert_to() function. Lets try:

db=# select to_ascii(convert_to(c,'LATIN9')) from chartest;
ERROR:  function to_ascii(bytea) does not exist
LINE 1: select to_ascii(convert_to(c,'LATIN9')) from chartest;
               ^
HINT:  No function matches the given name and argument types. You
might need to add explicit type casts.

Oops. convert_to() returns a bytea (byte array), to_char() expects a
text, not quite the same thing.
Should we try an explicit cast ?

db=# select to_ascii(convert_to(c,'LATIN9')::text) from chartest;
ERROR:  encoding conversion from UTF8 to ASCII not supported

Mmmm.... does not work, the casting assumes the default encoding (UTF8).
How do we convince postgresql to interpret a byte array (mind it:
already representing a text codification)
as a text? Looking at the string related functions, it seems that
encode() might do the trick.
We try the 'escape' option (not quite satisfactory, we are not trying
to escape anything but...)

db=# select to_ascii(encode(convert_to(c,'LATIN9'),'escape')) from chartest;
ERROR:  encoding conversion from UTF8 to ASCII not supported

Damn, it seems that the encoding worked but I need to convince
to_ascii() that the text it receives is not UTF8.
Lets see:

db=# select to_ascii(encode(convert_to(c,'LATIN9'),'escape'),'LATIN9')
from chartest;
 to_ascii
----------
 a
(1 row)


Wow! It worked!

I'm not sure if this is the "correct" way of doing it. If so, it does
not seems very satisfaying to me. Besides being a little too
convoluted, the distintion of "text" vs "bytea" seems half baked... I
would have expect one of these alternatives:
1) "bytea" is just a bunch of arbitrary bytes, a "text" is the
codification of a string in a fixed encoding (may be that of the
server installation, or the database one, or even a postgresql
internally choosen). In this case, "texts" only exist in this
encoding, the conversions to-from LATIN1, etc are only to-from bytes
arrays. This is how Java works (and is fairly clean for me).
2) "bytea" is just a bunch of arbitrary bytes, a "text" is the
codification of a string in a arbitrary encoding, and the chosen
encoding is part of the text content. That is, postgresql knows (at
'runtime') the encoding of the string.
Instead, in Postgresql  implementation (the gurus might correct me) a
"text" very similar to a bytea, except for some restrictions (null
bytes and invalid sequences). This IMHO results in some confusion and
clumsiness in the conversions and text functions.

=============================================

Hernán

Re: text and bytea

From
Martijn van Oosterhout
Date:
On Thu, Feb 21, 2008 at 02:34:15PM -0200, hernan gonzalez wrote:
> (After dealing a while with this, and learning a little, I though of
> post this as comment in the docs, but perhaps someone who knows better
> can correct or clarify)

It seems to me that postgres is trying to do as you suggest: text is
characters and bytea is bytes, like in Java. You don't indicate what
version you are using, this area was rejigged recently.

The most surprising this is that to_ascii won't accept a bytea. TBH the
whole to_ascii function seems somewhat half-baked. If what you're
trying to do is remove accents, there are perl functions around that do
that. Basically, the switch to a different normal form then drop all
the accent characters.

Essentially, Postgres supports UTF-8, but doesn't understand Unicode
characters much at all.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Attachment

Re: text and bytea

From
Alvaro Herrera
Date:
Martijn van Oosterhout escribió:

> The most surprising this is that to_ascii won't accept a bytea. TBH the
> whole to_ascii function seems somewhat half-baked. If what you're
> trying to do is remove accents, there are perl functions around that do
> that. Basically, the switch to a different normal form then drop all
> the accent characters.

Yeah, it's been a common suggestion to use convert() in combination with
to_ascii on UTF-8 databases, and I didn't notice that the convert()
shuffling would take that ability away :-(  I don't think requiring
plperl is nice however.  Perhaps we could get around the problem by
using byteaout/textin.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: text and bytea

From
"hernan gonzalez"
Date:
>  It seems to me that postgres is trying to do as you suggest: text is
>  characters and bytea is bytes, like in Java.

But the big difference is that, for text type, postgresql knows "this
is a text"
but doesnt know the encoding, as my example showed. This goes against
the concept of "text vs bytes" distintion, which per se is very useful
and powerful
(specially in this Unicode world) and leads to a dubious/clumsy string api
(IMHO, as always).

> You don't indicate what
>  version you are using, this area was rejigged recently.
>

Sorry, I forget to say that my examples are for last version (8.3)

Cheers

--
Hernán J. González

Re: text and bytea

From
Martijn van Oosterhout
Date:
On Fri, Feb 22, 2008 at 01:54:46PM -0200, hernan gonzalez wrote:
> >  It seems to me that postgres is trying to do as you suggest: text is
> >  characters and bytea is bytes, like in Java.
>
> But the big difference is that, for text type, postgresql knows "this
> is a text" but doesnt know the encoding, as my example showed. This
> goes against the concept of "text vs bytes" distintion, which per se
> is very useful and powerful (specially in this Unicode world) and
> leads to a dubious/clumsy string api (IMHO, as always).

Umm, I think all you showed was that the to_ascii() function was
broken. Postgres knows exactly what encoding the string is in, the
backend encoding: in your case UTF-8.

Now, it would be nice if postgres could handle other encodings in the
backend, but there's no agreement on how to implement that feature so
it isn't implemented.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Attachment

Re: text and bytea

From
"hernan gonzalez"
Date:
>  Umm, I think all you showed was that the to_ascii() function was
>  broken. Postgres knows exactly what encoding the string is in, the
>  backend encoding: in your case UTF-8.

That would be fine, if it were true; then, one could assume that every
postgresql function that returns a text gets ALWAYS the standard
backend encoding (again: as in Java). But consider the result
postgresql gets from this (from my example):
     encode(convert_to(c,'LATIN9'),'escape')
That's something of type text (a strign), postgresql believes it's
UTF8, but it's not  (it probably woud not even validate as a valid
utf8 sequence).
IMHO, the semantics of encode() and decode() are correct (the bridge
between bytea and text ... in the backend encoding; they should be the
only bridge), convert() is also ok (deals with bytes), but
convert_to() and convert_from() are dubious if not broken: they imply
texts in arbitrary encodings (for output or input) , lead to
anomalities and shouldnt be necessary at all.

Cheers

Hernán J. González
http://hjg.com.ar/

Re: text and bytea

From
Gregory Stark
Date:
"hernan gonzalez" <hgonzalez@gmail.com> writes:

> IMHO, the semantics of encode() and decode() are correct (the bridge
> between bytea and text ... in the backend encoding; they should be the
> only bridge), convert() is also ok (deals with bytes), but
> convert_to() and convert_from() are dubious if not broken: they imply
> texts in arbitrary encodings (for output or input) , lead to
> anomalities and shouldnt be necessary at all.

postgres=# \df convert_from
                         List of functions
   Schema   |     Name     | Result data type | Argument data types
------------+--------------+------------------+---------------------
 pg_catalog | convert_from | text             | bytea, name
(1 row)

postgres=# \df convert_to
                        List of functions
   Schema   |    Name    | Result data type | Argument data types
------------+------------+------------------+---------------------
 pg_catalog | convert_to | bytea            | text, name
(1 row)


Looks like they produce and consume byteas to me.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: text and bytea

From
"hernan gonzalez"
Date:
> IMHO, the semantics of encode() and decode() are correct (the bridge
> between bytea and text ... in the backend encoding; they should be the
> only bridge), convert() is also ok (deals with bytes), but
> convert_to() and convert_from() are dubious if not broken: they imply
> texts in arbitrary encodings (for output or input) , lead to
> anomalities and shouldnt be necessary at all.

Sorry, my mistake. I meant the opposite:

convert_to() and convert_from()  are the "correct" bridge (text <=>
bytea) functions.
Also convert() is ok.
The objetionable ones IMHO are decode()/encode(), which can
consume/produce a "non-utf8 string" (I mean, not the backend encoding)

Going back to the line:

encode(convert_to(c,'LATIN9'),'escape')

Here we have:
 c => text  (ut8)
 convert_to(..). => bytea (represents a char sequence in latin9 encoding)
 encode(...) => text (in latin9 encoding?)

Cheers

Hernán J. González
http://hjg.com.ar/

Re: text and bytea

From
"hernan gonzalez"
Date:
Another example  (Psotgresql 8.3.0, UTF-8  server/client encoding)

test=# create table chartest ( c text);
test=# insert into chartest (c) values ('¡Hasta mañana!');
test=# create view vchartest as
 select encode(convert_to(c,'LATIN9'),'escape') as c1 from chartest;

test=# select c,octet_length(c) from chartest ;
       c        | octet_length
----------------+--------------
 ¡Hasta mañana! |           16

test=# select c1,octet_length(c1) from vchartest ;
      c1      | octet_length
--------------+--------------
 Hasta maana! |           14

(the field is seen as as text by postgresql, with the default
encoding.. UTF8; it is actually not)

test=# select * from vchartest where c1 like '%a%';
      c1
--------------
 Hasta maana!
(1 row)

test=# select * from vchartest where c1 ilike '%a%';
ERROR:  invalid byte sequence for encoding "UTF8": 0xa1
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

That "ilike" breaks and "like" is rather random, it seems that the
later has some optimization does not check the validty of the utf8
stream. But thats not the point. The point is that IMO postgresql
should always handle text in the backend encoding, there should no
exists funcions that are designed to produce/consume texts in other
encodings. Perhaps the "encode" function is ill defined, and should be
rethinked. Two alternatives:
1. For special binary-to-ascii encodings (base64,hex). Keep its
present signature but remove/deprecate the "escape" type. It returns a
text in the backend encoding.
2  For arbitrary binary encodings. Change its signature so that it
returns bytea.
Of course, all this applies symmetrically to decode().

Appart, and in accordance with this, I think to_ascii() should accept
only one text argument.

Re: text and bytea

From
Tom Lane
Date:
"hernan gonzalez" <hgonzalez@gmail.com> writes:
> The objetionable ones IMHO are decode()/encode(), which can
> consume/produce a "non-utf8 string" (I mean, not the backend encoding)

Huh?  Those deal with bytea too --- in fact, they've got nothing at
all to do with multibyte character representations.  They're for
handling hex and base64 and suchlike representations of binary data.

            regards, tom lane

Re: text and bytea

From
Tom Lane
Date:
"hernan gonzalez" <hgonzalez@gmail.com> writes:
> test=# create view vchartest as
>  select encode(convert_to(c,'LATIN9'),'escape') as c1 from chartest;

Hmm.  This isn't a very sensible combination that you've written here,
but I see the point: encode(..., 'escape') is broken in that it fails
to convert high-bit-set bytes into \nnn sequences.  At least in
multibyte backend encodings, we *must* do that to produce valid textual
output.  I suspect that for consistency we should do it regardless of
backend encoding.

            regards, tom lane

Re: text and bytea

From
Bruce Momjian
Date:
Tom Lane wrote:
> "hernan gonzalez" <hgonzalez@gmail.com> writes:
> > test=# create view vchartest as
> >  select encode(convert_to(c,'LATIN9'),'escape') as c1 from chartest;
>
> Hmm.  This isn't a very sensible combination that you've written here,
> but I see the point: encode(..., 'escape') is broken in that it fails
> to convert high-bit-set bytes into \nnn sequences.  At least in
> multibyte backend encodings, we *must* do that to produce valid textual
> output.  I suspect that for consistency we should do it regardless of
> backend encoding.

With Tom's encoding() patch applied I assume there is no TODO item here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +