Thread: plperl.c patch to correctly support bytea inputs and output to functions and triggers.

I've found a bug with the way plperl/plperlu handles bytea types.  It
fails to correctly handle bytea binary inputs and outputs.  Perl
scalars are wrongly treated as C strings when a the source or
destination tuple in postgres is of type bytea.

The attached patch alters the code path for inputs and outputs with
the BYTEAOID type.

"Works for me"

==== begin plperl_bytea.sql ====
create or replace function pl_octet_length(bytea) returns integer as $$
return length($_[0]);
$$ language plperl;
create or replace function pl_produce_bytea() returns bytea as $$
return "\0\0\0\0\0\0see me?";
$$ language plperl;
==== end plperl_bytea.sql ===

PROBLEM:

pltest=# select pl_octet_length(E'\012\015'), octet_length(E'\012\015');
pl_octet_length | octet_length
-----------------+--------------
                8 |            2
(1 row)
pltest=# select pl_produce_bytea(), octet_length(pl_produce_bytea());
pl_produce_bytea | octet_length
------------------+--------------
                   |            0
(1 row)

AFTER PATCH:

pltest=# select pl_octet_length(E'\012\015'), octet_length(E'\012\015');
pl_octet_length | octet_length
-----------------+--------------
                2 |            2
(1 row)

pltest=# select pl_produce_bytea(), octet_length(pl_produce_bytea());
         pl_produce_bytea         | octet_length
---------------------------------+--------------
\000\000\000\000\000\000see me? |           13
(1 row)


// Theo Schlossnagle
// Principal@OmniTI: http://omniti.com
// Esoteric Curio: http://www.lethargy.org/~jesus/



Attachment
Theo Schlossnagle <jesus@omniti.com> writes:
> I've found a bug with the way plperl/plperlu handles bytea types.  It
> fails to correctly handle bytea binary inputs and outputs.

Define "correctly".  The proposed patch seems to be "let's handle
bytea differently from every other data type", and that sure doesn't
sound like a path I want to tread.

            regards, tom lane
On Apr 28, 2007, at 1:26 PM, Tom Lane wrote:

> Theo Schlossnagle <jesus@omniti.com> writes:
>> I've found a bug with the way plperl/plperlu handles bytea types.  It
>> fails to correctly handle bytea binary inputs and outputs.
>
> Define "correctly".  The proposed patch seems to be "let's handle
> bytea differently from every other data type", and that sure doesn't
> sound like a path I want to tread.

As far as I can tell, bytea is the only datatype now that suffers
from data loss.  In this I could be mistaken.  I took my cues form
the way postgres handles inputing records, it switches on whether
they were received in a binary fashion or not.  Since we're inside
and have a Datum (or are making one) already, everything is just
memory chunks and some characteristic of the Oid should be used to
determine whether the data should be treated as binary.  As is clear
from the patch, I used "if(Oid == BYTEAOID)" as the characteristic
and perhaps there is a more robust way.

If I return a bytes from perl that looks like: "hello\0there",
postgres sees a 5 byte string "hello".  That's data loss and makes it
useless as a datatype as I cannot return things like images and other
binary data.

When passing the string E'hello there\015\012' into a bytea receiving
perl function, there is no way for me to get at the actual data
passed to me.  Instead I get the Cstring: "hello there\\015\\012"
which is 19 characters long instead of the 13 bytes of "bytea" data.
Worse? E'hello\000there' will be materialized as a 5 bytes "bytea" in
perl actually loosing the remainder of the data.  This also makes it
impossible to work with bytes data in the plperl language; not hard,
impossible.

In a lot of ways, bytea is different from every other data type, it
is one that isn't suitable for chatacter set conversion, doesn't
trivially cast to other varying size data types (like text, varchar,
etc.).  It also is the only one (of its friends text, varchar, etc.)
that suffers from data loss if used with InputFunctionCall and
OutpuFunctionCall and not handled correctly with ReceiveFunctionCall
and SendFunctionCall.

If bytea is instead a class of datatypes that represent arbitrary
binary data, I'd agree that the patch should be changed to switch on
that sort of identifier instead of the BYTEAOID Oid.  If you'd clue
me into how one would go about identifying if the datatype Oid is to
be treated as an arbitrary length octet sequence not subject to
characterset conversion, then I'd happy revise the patch to be more
correct.

Best regards,

Theo

// Theo Schlossnagle
// Principal@OmniTI: http://omniti.com
// Esoteric Curio: http://www.lethargy.org/~jesus/
Theo Schlossnagle <jesus@omniti.com> writes:
> If I return a bytes from perl that looks like: "hello\0there",
> postgres sees a 5 byte string "hello".

You have failed to pay any attention to the escaping rules for bytea if
you do that.

            regards, tom lane
Tom Lane wrote:
> Theo Schlossnagle <jesus@omniti.com> writes:
>> If I return a bytes from perl that looks like: "hello\0there",
>> postgres sees a 5 byte string "hello".
>
> You have failed to pay any attention to the escaping rules for bytea if
> you do that.

As a note, the return from the function is a bytea, so no decoding is
done.  If I return a string with a \0 in it, then it will be clipped
short.  If I return a string with a "\\000" in it, I will get the
four-character string \000.  There is no way to correctly return a
binary string that contains a \0 from plperl as it stands.

There should be no escaping required to return a bytea from within a pl
in postgres.  It would mean that when I have a 20MB binary column, I
would have to use up to four times that space just to return it into the
callers context.  That would be a bug in its own right.  I certainly
don't want to have to convert a binary blob into an escaped version to
hand it back from within the database -- that couldn't possible be a
"desired" use pattern.

As I see it, bytea *are* special.  They are treated specially in other
portions of postgres (where SendFunctionCall is used over
OuputFunctionCall).  If I have a binary passthrough function in plperl,
it should just work:

create function pl_passthru(bytea) returns bytea as $$
   return $_[0];
$$ language plperl;

That function should give me back what I passed it.  At least from a
type defined equality standpoint.  I'd argue that any function in any pl
should satisfied that reflexive property.  Simply returning the passed
argument should not alter it assuming the input and output types are the
same.

While I realize that each pl has its nuances a type conversion from:

              pg2pl(A)     pl2pg(A)
postgrestype(A) -> pltype(A) -> postgrestype(A)

I would make the assumption when using a pl that: pl2pg(pg2pl(A)) === A

If there was escaping required to make this so, I would hope that the pl
implementation would do that for me.

I'm not sure I follow the argument against the patch.  As it stands,
bytea type is supported by plperl (I am allowed to create functions with
both input and output types of bytea.  Both input and output of bytea
are broken.  The patch fixes that.  The only argument I could see
against the patch would be if there were other types that also required
this special treatment.  Are there?

--
// Theo Schlossnagle
// Principal@OmniTI: http://omniti.com
// Esoteric Curio: http://www.lethargy.org/~jesus/
Theo Schlossnagle <jesus@omniti.com> writes:
> As a note, the return from the function is a bytea, so no decoding is
> done.  If I return a string with a \0 in it, then it will be clipped
> short.  If I return a string with a "\\000" in it, I will get the
> four-character string \000.  There is no way to correctly return a
> binary string that contains a \0 from plperl as it stands.

That is simply incorrect.

regression=# create or replace function foo() returns bytea as $$
  return 'foo\000bar';
$$ language plperl;
CREATE FUNCTION
regression=# select foo();
    foo
------------
 foo\000bar
(1 row)

regression=# select length(foo());
 length
--------
      7
(1 row)

> There should be no escaping required to return a bytea from within a pl
> in postgres.

That's not a bug report, that's a proposal for a non-backward-compatible
behavioral change.  pgsql-bugs is not the forum for such proposals.
If you want to take it up in pgsql-hackers, be prepared to answer
questions about why bytea should behave differently from all other
types, and what about bytea members of arrays and records, and what
about consistency with the other PLs?

> If I have a binary passthrough function in plperl,
> it should just work:

> create function pl_passthru(bytea) returns bytea as $$
>    return $_[0];
> $$ language plperl;

Worksforme.

regression=# select pl_passthru(E'foo\\000bar'::bytea);
 pl_passthru
-------------
 foo\000bar
(1 row)

regression=# select length(pl_passthru(E'foo\\000bar'::bytea));
 length
--------
      7
(1 row)

            regards, tom lane
On Apr 30, 2007, at 11:32 AM, Tom Lane wrote:

> Theo Schlossnagle <jesus@omniti.com> writes:
>> As a note, the return from the function is a bytea, so no decoding is
>> done.  If I return a string with a \0 in it, then it will be clipped
>> short.  If I return a string with a "\\000" in it, I will get the
>> four-character string \000.  There is no way to correctly return a
>> binary string that contains a \0 from plperl as it stands.
>
> That is simply incorrect.
>
> regression=# create or replace function foo() returns bytea as $$
>   return 'foo\000bar';
> $$ language plperl;
> CREATE FUNCTION
> regression=# select foo();
>     foo
> ------------
>  foo\000bar
> (1 row)
>
> regression=# select length(foo());
>  length
> --------
>       7
> (1 row)

I stand corrected.

>> There should be no escaping required to return a bytea from within
>> a pl
>> in postgres.
>
> That's not a bug report, that's a proposal for a non-backward-
> compatible
> behavioral change.  pgsql-bugs is not the forum for such proposals.
> If you want to take it up in pgsql-hackers, be prepared to answer
> questions about why bytea should behave differently from all other
> types, and what about bytea members of arrays and records, and what
> about consistency with the other PLs?
>
>> If I have a binary passthrough function in plperl,
>> it should just work:
>
>> create function pl_passthru(bytea) returns bytea as $$
>>    return $_[0];
>> $$ language plperl;
>
> Worksforme.
>
> regression=# select pl_passthru(E'foo\\000bar'::bytea);
>  pl_passthru
> -------------
>  foo\000bar
> (1 row)
>
> regression=# select length(pl_passthru(E'foo\\000bar'::bytea));
>  length
> --------
>       7
> (1 row)

Okay.  Not a bug.

I think that requiring a user to \ decode binary input is very very
expensive as well as having to \ encode it.  It seems wrong to me.

Are you interested in moving this patch to the feature queue?  I get
the impression you don't like the idea of making the bytea type work
more seamlessly in perl.  Perhaps I'm mistaken.

// Theo Schlossnagle
// Principal@OmniTI: http://omniti.com
// Esoteric Curio: http://www.lethargy.org/~jesus/