Thread: Bytea as C string in pg_convert?

Bytea as C string in pg_convert?

From

"Brendan Jurd"

Date:

24 September 2007, 06:33:29

Hi hackers,

In the process of trying to unify the various text/cstring conversions
in the backend, I came across some stuff that seemed weird in
pg_convert().

From src/backend/utils/mb/mbutils.c:345:

Datum
pg_convert(PG_FUNCTION_ARGS)
{       bytea      *string = PG_GETARG_TEXT_P(0);

Is this a typo?  Seems this should be PG_GETARG_BYTEA_P.

Moving on from that, the function takes the bytea argument and
converts it into a C string (using the exact the same technique as
textout, which is why I noticed it).

The documentation is very clear that bytea values "specifically allow
storing octets of value zero and other "non-printable" octets".  That
being the case, is it sane to convert a bytea to a cstring at all?
The possibility of having valid nulls in the value renders the whole
point of a null-terminated character array ... well, null.

Before putting it into a cstring, the function does put the bytea
value through pg_verify_mbstr(), so basically the issue goes away if
we accept the premise that we will never allow a character encoding
where the null byte is valid.  However, if we reject that premise
there's a problem.

pg_convert() does pass the length of the bytea along to
pg_do_encoding_conversion(), so either
a) the encoding functions properly respect length and ignore nulls in
the string, in which case the null at the end is worthless and we may
as well just operate on the VARDATA of the bytea, orb) the encoding functions treat a null byte as the end of the
string,
in which case they are broken w.r.t. to bytea input.

Regards,
BJ

Re: Bytea as C string in pg_convert?

From

Andrew Dunstan

Date:

24 September 2007, 11:50:06

Brendan Jurd wrote:
> Hi hackers,
>
> In the process of trying to unify the various text/cstring conversions
> in the backend, I came across some stuff that seemed weird in
> pg_convert().
>
> >From src/backend/utils/mb/mbutils.c:345:
>
> Datum
> pg_convert(PG_FUNCTION_ARGS)
> {
>         bytea      *string = PG_GETARG_TEXT_P(0);
>
> Is this a typo?  Seems this should be PG_GETARG_BYTEA_P.
>
> Moving on from that, the function takes the bytea argument and
> converts it into a C string (using the exact the same technique as
> textout, which is why I noticed it).
>
> The documentation is very clear that bytea values "specifically allow
> storing octets of value zero and other "non-printable" octets".  That
> being the case, is it sane to convert a bytea to a cstring at all?
> The possibility of having valid nulls in the value renders the whole
> point of a null-terminated character array ... well, null.
>
> Before putting it into a cstring, the function does put the bytea
> value through pg_verify_mbstr(), so basically the issue goes away if
> we accept the premise that we will never allow a character encoding
> where the null byte is valid.  However, if we reject that premise
> there's a problem.
>
> pg_convert() does pass the length of the bytea along to
> pg_do_encoding_conversion(), so either
>
>  a) the encoding functions properly respect length and ignore nulls in
> the string, in which case the null at the end is worthless and we may
> as well just operate on the VARDATA of the bytea, or
>  b) the encoding functions treat a null byte as the end of the string,
> in which case they are broken w.r.t. to bytea input.
>
>
>   

Please read the recent discussions about encoding issues. convert() now 
returns a bytea precisely because we cannot be sure that the data 
returned will be valid in the database encoding. The behaviour here is 
entirely intentional. We have just closed every hole we are aware of 
whereby data that isn't valid in the database encoding can enter the 
database. We're not about to reopen them.

And yes, we will not accept an encoding with a null byte. We don't even 
accept nulls in Unicode. If we do accept such an encoding then this 
would be among the least of our problems, I suspect.

We can and possibly should change the GETARG call, but the varlena types 
are structurally equivalent, so it's not a mortal sin being committed here.

cheers

andrew

Re: Bytea as C string in pg_convert?

From

Tom Lane

Date:

24 September 2007, 14:33:06

Andrew Dunstan <andrew@dunslane.net> writes:
> We can and possibly should change the GETARG call, but the varlena types 
> are structurally equivalent, so it's not a mortal sin being committed here.

We *definitely* should change it --- the reason for having all those
variant macros in the first place was to help document the argument
types of V1 functions.  It's too bad the compiler doesn't warn about
the type mismatch here ...
        regards, tom lane

Re: Bytea as C string in pg_convert?

From

Andrew Dunstan

Date:

24 September 2007, 15:07:24

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> We can and possibly should change the GETARG call, but the varlena types 
>> are structurally equivalent, so it's not a mortal sin being committed here.
>>     
>
> We *definitely* should change it --- the reason for having all those
> variant macros in the first place was to help document the argument
> types of V1 functions.  It's too bad the compiler doesn't warn about
> the type mismatch here ...
>
>             
>   

I have changed it. The thing is, though, that this function not only 
performs the convert() function but acts as the engine for convert_to() 
and convert_from(). Those functions do some silent transformations, in 
one case passing a text Datum as the first argument and in the other 
case the returning the result as text. If there's a better way to do 
this I'll be happy to learn, but it seems to me it would involve some 
duplication - I tried to avoid that where possible.

Maybe I should just put a note in the code saying what we're doing, and 
why it's OK.

cheers

andrew

Re: Bytea as C string in pg_convert?

From

Tom Lane

Date:

24 September 2007, 15:17:54

Andrew Dunstan <andrew@dunslane.net> writes:
> I have changed it. The thing is, though, that this function not only 
> performs the convert() function but acts as the engine for convert_to() 
> and convert_from(). Those functions do some silent transformations, in 
> one case passing a text Datum as the first argument and in the other 
> case the returning the result as text. If there's a better way to do 
> this I'll be happy to learn, but it seems to me it would involve some 
> duplication - I tried to avoid that where possible.

Hmm.  One suggestion would be to have an internal function declared
as taking and returning "struct varlena *", with a comment saying that
we depend on text and bytea both being compatible with this.  All three
SQL-visible functions are then thin wrappers around that.
        regards, tom lane

Re: Bytea as C string in pg_convert?

From

Andrew Dunstan

Date:

24 September 2007, 16:01:54


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> I have changed it. The thing is, though, that this function not only 
>> performs the convert() function but acts as the engine for convert_to() 
>> and convert_from(). Those functions do some silent transformations, in 
>> one case passing a text Datum as the first argument and in the other 
>> case the returning the result as text. If there's a better way to do 
>> this I'll be happy to learn, but it seems to me it would involve some 
>> duplication - I tried to avoid that where possible.
>>     
>
> Hmm.  One suggestion would be to have an internal function declared
> as taking and returning "struct varlena *", with a comment saying that
> we depend on text and bytea both being compatible with this.  All three
> SQL-visible functions are then thin wrappers around that.
>
>     
>   

Doesn't strike me as much of an advance, to be honest.

My current top priority is fixing the MSVC build .bat files like Magnus 
wants, which will take a bit of time.

cheers

andrew

Re: Bytea as C string in pg_convert?

From

Tom Lane

Date:

24 September 2007, 16:13:07

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Hmm.  One suggestion would be to have an internal function declared
>> as taking and returning "struct varlena *", with a comment saying that
>> we depend on text and bytea both being compatible with this.  All three
>> SQL-visible functions are then thin wrappers around that.

> Doesn't strike me as much of an advance, to be honest.

As you wish, but at least a comment noting that these functions are
relying on binary compatibility of text and bytea would be a good idea.

> My current top priority is fixing the MSVC build .bat files like Magnus 
> wants, which will take a bit of time.

If you are intending to make the buildfarm use the perl scripts, then
this should definitely be high priority --- we don't want the farm
testing some other build process than what ordinary users will use.
        regards, tom lane