Thread: invalid UTF-8 via pl/perl

invalid UTF-8 via pl/perl

From
Hannu Krosing
Date:
It is possible to get an invalid byte sequence into a text field via pl,
in this case pl/perl :

---8<------8<------8<------8<------8<------8<---
CREATE TABLE utf_test
( id serial PRIMARY KEY, data character varying
);

CREATE OR REPLACE FUNCTION invalid_utf_seq() RETURNS character varying AS
$BODY$
return "\xd0";
$BODY$
LANGUAGE 'plperlu' VOLATILE STRICT;

insert into utf_test(data) values(invalid_utf_seq());
---8<------8<------8<------8<------8<------8<---

This results in a table, which has invalid utf sequence in it and
consequently does not pass dump/load

What would be the best place to fix this ?

Should there be checks in all text types ? 
(probably too expensive)

Or should pl/perl check it's return values for compliance with
server_encoding ?

Or should postgresql itself check that pl-s return what they promise to
return ?


-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Hannu Krosing wrote:

[plperl can return data that is not valid in the server encoding and it 
is not caught]

> This results in a table, which has invalid utf sequence in it and
> consequently does not pass dump/load
>
> What would be the best place to fix this ?
>
> Should there be checks in all text types ? 
> (probably too expensive)
>   

The plperl code has no type-specific checks, and in any case limiting it 
to "text" types would defeat third party and contrib types of which it 
knows nothing (think citext). We should check all strings returned by 
plperl.
> Or should pl/perl check it's return values for compliance with
> server_encoding ?
>   

I think the plperl glue code should check returned strings using 
pg_verifymbstr().

> Or should postgresql itself check that pl-s return what they promise to
> return ?
>
>   


There is no central place for it to check. The pl glue code is the right 
place, I think.

cheers

andrew




Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Andrew Dunstan wrote:
>
> I think the plperl glue code should check returned strings using
> pg_verifymbstr().
>
>

Please test this patch. I think we'd probably want to trap the encoding
error and issue a customised error message, but this plugs all the holes
I can see with the possible exception of values inserted via SPI calls.
I'll check that out.

cheers

andrew
Index: src/pl/plperl/plperl.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plperl/plperl.c,v
retrieving revision 1.157
diff -c -r1.157 plperl.c
*** src/pl/plperl/plperl.c    31 Dec 2009 19:41:37 -0000    1.157
--- src/pl/plperl/plperl.c    3 Jan 2010 01:37:33 -0000
***************
*** 630,636 ****
                       errmsg("Perl hash contains nonexistent column \"%s\"",
                              key)));
          if (SvOK(val))
!             values[attn - 1] = SvPV(val, PL_na);
      }
      hv_iterinit(perlhash);

--- 630,642 ----
                       errmsg("Perl hash contains nonexistent column \"%s\"",
                              key)));
          if (SvOK(val))
!         {
!             char * aval;
!
!             aval = SvPV(val, PL_na);
!             pg_verifymbstr(aval, strlen(aval), false);
!             values[attn - 1] = aval;
!         }
      }
      hv_iterinit(perlhash);

***************
*** 829,836 ****
          atttypmod = tupdesc->attrs[attn - 1]->atttypmod;
          if (SvOK(val))
          {
              modvalues[slotsused] = InputFunctionCall(&finfo,
!                                                      SvPV(val, PL_na),
                                                       typioparam,
                                                       atttypmod);
              modnulls[slotsused] = ' ';
--- 835,846 ----
          atttypmod = tupdesc->attrs[attn - 1]->atttypmod;
          if (SvOK(val))
          {
+             char * aval;
+
+             aval = SvPV(val, PL_na);
+             pg_verifymbstr(aval,strlen(aval), false);
              modvalues[slotsused] = InputFunctionCall(&finfo,
!                                                      aval,
                                                       typioparam,
                                                       atttypmod);
              modnulls[slotsused] = ' ';
***************
*** 1468,1474 ****
          }

          val = SvPV(perlret, PL_na);
!
          retval = InputFunctionCall(&prodesc->result_in_func, val,
                                     prodesc->result_typioparam, -1);
      }
--- 1478,1484 ----
          }

          val = SvPV(perlret, PL_na);
!         pg_verifymbstr(val, strlen(val), false);
          retval = InputFunctionCall(&prodesc->result_in_func, val,
                                     prodesc->result_typioparam, -1);
      }
***************
*** 2125,2131 ****
              }

              val = SvPV(sv, PL_na);
!
              ret = InputFunctionCall(&prodesc->result_in_func, val,
                                      prodesc->result_typioparam, -1);
              isNull = false;
--- 2135,2141 ----
              }

              val = SvPV(sv, PL_na);
!             pg_verifymbstr(val, strlen(val), false);
              ret = InputFunctionCall(&prodesc->result_in_func, val,
                                      prodesc->result_typioparam, -1);
              isNull = false;

Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Andrew Dunstan wrote:
>
>
> Andrew Dunstan wrote:
>>
>> I think the plperl glue code should check returned strings using
>> pg_verifymbstr().
>>
>>
>
> Please test this patch. I think we'd probably want to trap the
> encoding error and issue a customised error message, but this plugs
> all the holes I can see with the possible exception of values inserted
> via SPI calls. I'll check that out.
>
>

I think the attached patch plugs the direct SPI holes as well.

One thing that I am pondering is: how does SPI handle things if the
client encoding and server encoding are not the same? Won't the strings
it passes the parser be interpreted in the client encoding? If so, that
doesn't seem right at all, since these strings come from a server side
call and not from the client at all. It looks to me like the call to
pg_parse_query() in spi.c should possibly be surrounded by code to
temporarily set the client encoding to the server encoding and then
restore it afterwards.

cheers

andrew


Index: src/pl/plperl/plperl.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plperl/plperl.c,v
retrieving revision 1.157
diff -c -r1.157 plperl.c
*** src/pl/plperl/plperl.c    31 Dec 2009 19:41:37 -0000    1.157
--- src/pl/plperl/plperl.c    3 Jan 2010 14:12:25 -0000
***************
*** 630,636 ****
                       errmsg("Perl hash contains nonexistent column \"%s\"",
                              key)));
          if (SvOK(val))
!             values[attn - 1] = SvPV(val, PL_na);
      }
      hv_iterinit(perlhash);

--- 630,642 ----
                       errmsg("Perl hash contains nonexistent column \"%s\"",
                              key)));
          if (SvOK(val))
!         {
!             char * aval;
!
!             aval = SvPV(val, PL_na);
!             pg_verifymbstr(aval, strlen(aval), false);
!             values[attn - 1] = aval;
!         }
      }
      hv_iterinit(perlhash);

***************
*** 829,836 ****
          atttypmod = tupdesc->attrs[attn - 1]->atttypmod;
          if (SvOK(val))
          {
              modvalues[slotsused] = InputFunctionCall(&finfo,
!                                                      SvPV(val, PL_na),
                                                       typioparam,
                                                       atttypmod);
              modnulls[slotsused] = ' ';
--- 835,846 ----
          atttypmod = tupdesc->attrs[attn - 1]->atttypmod;
          if (SvOK(val))
          {
+             char * aval;
+
+             aval = SvPV(val, PL_na);
+             pg_verifymbstr(aval,strlen(aval), false);
              modvalues[slotsused] = InputFunctionCall(&finfo,
!                                                      aval,
                                                       typioparam,
                                                       atttypmod);
              modnulls[slotsused] = ' ';
***************
*** 1468,1474 ****
          }

          val = SvPV(perlret, PL_na);
!
          retval = InputFunctionCall(&prodesc->result_in_func, val,
                                     prodesc->result_typioparam, -1);
      }
--- 1478,1484 ----
          }

          val = SvPV(perlret, PL_na);
!         pg_verifymbstr(val, strlen(val), false);
          retval = InputFunctionCall(&prodesc->result_in_func, val,
                                     prodesc->result_typioparam, -1);
      }
***************
*** 2125,2131 ****
              }

              val = SvPV(sv, PL_na);
!
              ret = InputFunctionCall(&prodesc->result_in_func, val,
                                      prodesc->result_typioparam, -1);
              isNull = false;
--- 2135,2141 ----
              }

              val = SvPV(sv, PL_na);
!             pg_verifymbstr(val, strlen(val), false);
              ret = InputFunctionCall(&prodesc->result_in_func, val,
                                      prodesc->result_typioparam, -1);
              isNull = false;
***************
*** 2516,2523 ****
          {
              if (SvOK(argv[i]))
              {
                  argvalues[i] = InputFunctionCall(&qdesc->arginfuncs[i],
!                                                  SvPV(argv[i], PL_na),
                                                   qdesc->argtypioparams[i],
                                                   -1);
                  nulls[i] = ' ';
--- 2526,2537 ----
          {
              if (SvOK(argv[i]))
              {
+                 char *val;
+
+                 val = SvPV(argv[i], PL_na);
+                 pg_verifymbstr(val, strlen(val), false);
                  argvalues[i] = InputFunctionCall(&qdesc->arginfuncs[i],
!                                                  val,
                                                   qdesc->argtypioparams[i],
                                                   -1);
                  nulls[i] = ' ';
***************
*** 2647,2652 ****
--- 2661,2669 ----
          {
              if (SvOK(argv[i]))
              {
+                 char *val;
+
+                 pg_verifymbstr(val, strlen(val), false);
                  argvalues[i] = InputFunctionCall(&qdesc->arginfuncs[i],
                                                   SvPV(argv[i], PL_na),
                                                   qdesc->argtypioparams[i],

Re: invalid UTF-8 via pl/perl

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> One thing that I am pondering is: how does SPI handle things if the 
> client encoding and server encoding are not the same?

What?  client_encoding is not used anywhere within the backend.
Everything should be server_encoding.
        regards, tom lane


Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> One thing that I am pondering is: how does SPI handle things if the 
>> client encoding and server encoding are not the same?
>>     
>
> What?  client_encoding is not used anywhere within the backend.
> Everything should be server_encoding.
>
>             
>   

Oh, for some reason I thought the translation was done in the scanner. 
Sorry for the noise.

cheers

andrew


Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

I wrote:
>
> I think the attached patch plugs the direct SPI holes as well.

There are two issues with this patch. First, how far if at all should it 
be backpatched? All the way, or 8.3, where we tightened the encoding 
rules, or not at all?

Second, It produces errors like this:
   andrew=# select 'a' || invalid_utf_seq() || 'b';   ERROR:  invalid byte sequence for encoding "UTF8": 0xd0   HINT:
Thiserror can also happen if the byte sequence does not   match the encoding expected by the server, which is
controlledby   "client_encoding".   CONTEXT:  PL/Perl function "invalid_utf_seq"   andrew=#
 


That hint seems rather misleading. I'm not sure what we can do about it 
though. If we set the noError param on pg_verifymbstr() we would miss 
the error message that actually identified the bad data, so that doesn't 
seem like a good plan.

cheers

andrew



Re: invalid UTF-8 via pl/perl

From
"David E. Wheeler"
Date:
On Jan 3, 2010, at 11:54 AM, Andrew Dunstan wrote:

> There are two issues with this patch. First, how far if at all should it be backpatched? All the way, or 8.3, where
wetightened the encoding rules, or not at all? 

8.3 seems reasonable.

> Second, It produces errors like this:
>
>   andrew=# select 'a' || invalid_utf_seq() || 'b';
>   ERROR:  invalid byte sequence for encoding "UTF8": 0xd0
>   HINT:  This error can also happen if the byte sequence does not
>   match the encoding expected by the server, which is controlled by
>   "client_encoding".
>   CONTEXT:  PL/Perl function "invalid_utf_seq"
>   andrew=#
>
>
> That hint seems rather misleading. I'm not sure what we can do about it though. If we set the noError param on
pg_verifymbstr()we would miss the error message that actually identified the bad data, so that doesn't seem like a good
plan.

I'm sure I'm just revealing my ignorance here, but how is the hint misleading?

Best,

David



Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

David E. Wheeler wrote:
>> Second, It produces errors like this:
>>
>>   andrew=# select 'a' || invalid_utf_seq() || 'b';
>>   ERROR:  invalid byte sequence for encoding "UTF8": 0xd0
>>   HINT:  This error can also happen if the byte sequence does not
>>   match the encoding expected by the server, which is controlled by
>>   "client_encoding".
>>   CONTEXT:  PL/Perl function "invalid_utf_seq"
>>   andrew=#
>>
>>
>> That hint seems rather misleading. I'm not sure what we can do about it though. If we set the noError param on
pg_verifymbstr()we would miss the error message that actually identified the bad data, so that doesn't seem like a good
plan.
>>     
>
> I'm sure I'm just revealing my ignorance here, but how is the hint misleading?
>
>
>   

The string that causes the trouble does not come from the client and has 
nothing to do with client_encoding.

cheers

andrew



Re: invalid UTF-8 via pl/perl

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
>     andrew=# select 'a' || invalid_utf_seq() || 'b';
>     ERROR:  invalid byte sequence for encoding "UTF8": 0xd0
>     HINT:  This error can also happen if the byte sequence does not
>     match the encoding expected by the server, which is controlled by
>     "client_encoding".
>     CONTEXT:  PL/Perl function "invalid_utf_seq"

> That hint seems rather misleading. I'm not sure what we can do about it 
> though. If we set the noError param on pg_verifymbstr() we would miss 
> the error message that actually identified the bad data, so that doesn't 
> seem like a good plan.

Yeah, we want the detailed error info.  The problem is that the hint is
targeted to the case where we are checking data coming from the client.
We could add another parameter to pg_verifymbstr to indicate the
context, perhaps.  I'm not sure how to do it exactly --- just a bool
that suppresses the hint, or do we want to make a provision for some
other hint or detail message?
        regards, tom lane


Re: invalid UTF-8 via pl/perl

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> There are two issues with this patch. First, how far if at all should it 
> be backpatched? All the way, or 8.3, where we tightened the encoding 
> rules, or not at all?

Forgot to mention --- I'm not in favor of backpatching.  First because
tightening encoding verification has been a process over multiple
releases; it's not a bug fix in the normal sense of the word, and might
break things that people had been doing without trouble.  Second because
I think we'll have to change pg_verifymbstr's API, and that's not
something to back-patch if we can avoid it.
        regards, tom lane


Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>>     andrew=# select 'a' || invalid_utf_seq() || 'b';
>>     ERROR:  invalid byte sequence for encoding "UTF8": 0xd0
>>     HINT:  This error can also happen if the byte sequence does not
>>     match the encoding expected by the server, which is controlled by
>>     "client_encoding".
>>     CONTEXT:  PL/Perl function "invalid_utf_seq"
>>     
>
>   
>> That hint seems rather misleading. I'm not sure what we can do about it 
>> though. If we set the noError param on pg_verifymbstr() we would miss 
>> the error message that actually identified the bad data, so that doesn't 
>> seem like a good plan.
>>     
>
> Yeah, we want the detailed error info.  The problem is that the hint is
> targeted to the case where we are checking data coming from the client.
> We could add another parameter to pg_verifymbstr to indicate the
> context, perhaps.  I'm not sure how to do it exactly --- just a bool
> that suppresses the hint, or do we want to make a provision for some
> other hint or detail message?
>
>             
>   

Or instead of another param we could change the third param to be one of 
(NO_ERROR, CLIENT_ERROR, SERVER_ERROR) or some such.

Or we could just add another verify func. I don't have terribly strong 
opinions about it.

Incidentally, I guess we need to look at plpython and pltcl for similar 
issues.

cheers

andrew






Re: invalid UTF-8 via pl/perl

From
Peter Eisentraut
Date:
On sön, 2010-01-03 at 18:40 -0500, Andrew Dunstan wrote:
> Incidentally, I guess we need to look at plpython and pltcl for
> similar issues.

I confirm that the same issue exists in plpython.




Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>>     andrew=# select 'a' || invalid_utf_seq() || 'b';
>>     ERROR:  invalid byte sequence for encoding "UTF8": 0xd0
>>     HINT:  This error can also happen if the byte sequence does not
>>     match the encoding expected by the server, which is controlled by
>>     "client_encoding".
>>     CONTEXT:  PL/Perl function "invalid_utf_seq"
>>     
>
>   
>> That hint seems rather misleading. I'm not sure what we can do about it 
>> though. If we set the noError param on pg_verifymbstr() we would miss 
>> the error message that actually identified the bad data, so that doesn't 
>> seem like a good plan.
>>     
>
> Yeah, we want the detailed error info.  The problem is that the hint is
> targeted to the case where we are checking data coming from the client.
> We could add another parameter to pg_verifymbstr to indicate the
> context, perhaps.  I'm not sure how to do it exactly --- just a bool
> that suppresses the hint, or do we want to make a provision for some
> other hint or detail message?
>
>             
>   

This is a mess. It affects four or five levels of visible functions that 
are called in about 18 files.

How about we just change the hint so it also refers to the possibility 
that the data comes from a PL? That would save lots of trouble.

cheers

andrew


Re: invalid UTF-8 via pl/perl

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> How about we just change the hint so it also refers to the possibility 
> that the data comes from a PL? That would save lots of trouble.

Maybe just lose the hint altogether.  It's not adding that much,
and I seem to recall that there have already been complaints about
other cases where it's misleading.
        regards, tom lane


Re: invalid UTF-8 via pl/perl

From
Hannu Krosing
Date:
On Sat, 2010-01-02 at 20:51 -0500, Andrew Dunstan wrote:
> 
> Andrew Dunstan wrote:
> >  
> > I think the plperl glue code should check returned strings using 
> > pg_verifymbstr().
> >
> >
> 
> Please test this patch. I think we'd probably want to trap the encoding 
> error and issue a customised error message, but this plugs all the holes 
> I can see with the possible exception of values inserted via SPI calls. 
> I'll check that out.

I got a report, that the patch fixes one case but leaves open another:

CREATE TABLE utf_test
( id serial PRIMARY KEY, data character varying
);

CREATE OR REPLACE FUNCTION utf_test() RETURNS character varying AS
$BODY$
return "\xd0";
$BODY$ LANGUAGE 'plperlu' VOLATILE STRICT;

CREATE OR REPLACE FUNCTION utf_test2() RETURNS character varying AS
$BODY$
spi_exec_query("insert into utf_test (data) values('\xd0');");
return "VIGA";
$BODY$ LANGUAGE 'plperlu' VOLATILE STRICT;


The report siad, that patch fixes case
       insert into utf_test (data) values(utf_test());

so that it return an error, but the second function
       select utf_test2();

still enters wrong data to the table

So SPI interface should also be fixed, either from perl side, or maybe
from inside SPI ?




-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: invalid UTF-8 via pl/perl

From
Tom Lane
Date:
Hannu Krosing <hannu@2ndquadrant.com> writes:
> So SPI interface should also be fixed, either from perl side, or maybe
> from inside SPI ?

SPI has every right to assume that data it's given is already in the
database encoding.
        regards, tom lane


Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Hannu Krosing <hannu@2ndquadrant.com> writes:
>   
>> So SPI interface should also be fixed, either from perl side, or maybe
>> from inside SPI ?
>>     
>
> SPI has every right to assume that data it's given is already in the
> database encoding.
>
>             
>   

Yeah, looks like we missed a few spots. I have added three more checks 
that I think plug the remaining holes in plperl.

Hannu, please test again against CVS HEAD.

cheers

andrew


Re: invalid UTF-8 via pl/perl

From
Hannu Krosing
Date:
On Mon, 2010-03-08 at 21:52 -0500, Andrew Dunstan wrote:
> 
> Tom Lane wrote:
> > Hannu Krosing <hannu@2ndquadrant.com> writes:
> >   
> >> So SPI interface should also be fixed, either from perl side, or maybe
> >> from inside SPI ?
> >>     
> >
> > SPI has every right to assume that data it's given is already in the
> > database encoding.
> >
> >             
> >   
> 
> Yeah, looks like we missed a few spots. I have added three more checks 
> that I think plug the remaining holes in plperl.
> 
> Hannu, please test again against CVS HEAD.

Seems to work now

Do you plan to back-port this ?

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: invalid UTF-8 via pl/perl

From
Andrew Dunstan
Date:

Hannu Krosing wrote:
> On Mon, 2010-03-08 at 21:52 -0500, Andrew Dunstan wrote:
>   
>> Tom Lane wrote:
>>     
>>> Hannu Krosing <hannu@2ndquadrant.com> writes:
>>>   
>>>       
>>>> So SPI interface should also be fixed, either from perl side, or maybe
>>>> from inside SPI ?
>>>>     
>>>>         
>>> SPI has every right to assume that data it's given is already in the
>>> database encoding.
>>>
>>>             
>>>   
>>>       
>> Yeah, looks like we missed a few spots. I have added three more checks 
>> that I think plug the remaining holes in plperl.
>>
>> Hannu, please test again against CVS HEAD.
>>     
>
> Seems to work now
>
> Do you plan to back-port this ?
>   

I wasn't going to. The previous fixes weren't backpatched either, and in 
general when we have plugged encoding holes the changes have not been 
backpatched, on the grounds that it would be a behaviour change, e.g. 
when we tightened things a lot for 8.3.

I think there an outstanding TODO to plug the other PLs, however. It's a 
pity it has to be done over and over for each PL. Maybe we need some new 
versions of some of the SPI calls that would do the checking so it could 
be centralized.

cheers

andrew