Thread: invalid UTF-8 via pl/perl
It is possible to get an invalid byte sequence into a text field via pl, in this case pl/perl : ---8<------8<------8<------8<------8<------8<--- CREATE TABLE utf_test ( id serial PRIMARY KEY, data character varying ); CREATE OR REPLACE FUNCTION invalid_utf_seq() RETURNS character varying AS $BODY$ return "\xd0"; $BODY$ LANGUAGE 'plperlu' VOLATILE STRICT; insert into utf_test(data) values(invalid_utf_seq()); ---8<------8<------8<------8<------8<------8<--- This results in a table, which has invalid utf sequence in it and consequently does not pass dump/load What would be the best place to fix this ? Should there be checks in all text types ? (probably too expensive) Or should pl/perl check it's return values for compliance with server_encoding ? Or should postgresql itself check that pl-s return what they promise to return ? -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Hannu Krosing wrote: [plperl can return data that is not valid in the server encoding and it is not caught] > This results in a table, which has invalid utf sequence in it and > consequently does not pass dump/load > > What would be the best place to fix this ? > > Should there be checks in all text types ? > (probably too expensive) > The plperl code has no type-specific checks, and in any case limiting it to "text" types would defeat third party and contrib types of which it knows nothing (think citext). We should check all strings returned by plperl. > Or should pl/perl check it's return values for compliance with > server_encoding ? > I think the plperl glue code should check returned strings using pg_verifymbstr(). > Or should postgresql itself check that pl-s return what they promise to > return ? > > There is no central place for it to check. The pl glue code is the right place, I think. cheers andrew
Andrew Dunstan wrote: > > I think the plperl glue code should check returned strings using > pg_verifymbstr(). > > Please test this patch. I think we'd probably want to trap the encoding error and issue a customised error message, but this plugs all the holes I can see with the possible exception of values inserted via SPI calls. I'll check that out. cheers andrew Index: src/pl/plperl/plperl.c =================================================================== RCS file: /cvsroot/pgsql/src/pl/plperl/plperl.c,v retrieving revision 1.157 diff -c -r1.157 plperl.c *** src/pl/plperl/plperl.c 31 Dec 2009 19:41:37 -0000 1.157 --- src/pl/plperl/plperl.c 3 Jan 2010 01:37:33 -0000 *************** *** 630,636 **** errmsg("Perl hash contains nonexistent column \"%s\"", key))); if (SvOK(val)) ! values[attn - 1] = SvPV(val, PL_na); } hv_iterinit(perlhash); --- 630,642 ---- errmsg("Perl hash contains nonexistent column \"%s\"", key))); if (SvOK(val)) ! { ! char * aval; ! ! aval = SvPV(val, PL_na); ! pg_verifymbstr(aval, strlen(aval), false); ! values[attn - 1] = aval; ! } } hv_iterinit(perlhash); *************** *** 829,836 **** atttypmod = tupdesc->attrs[attn - 1]->atttypmod; if (SvOK(val)) { modvalues[slotsused] = InputFunctionCall(&finfo, ! SvPV(val, PL_na), typioparam, atttypmod); modnulls[slotsused] = ' '; --- 835,846 ---- atttypmod = tupdesc->attrs[attn - 1]->atttypmod; if (SvOK(val)) { + char * aval; + + aval = SvPV(val, PL_na); + pg_verifymbstr(aval,strlen(aval), false); modvalues[slotsused] = InputFunctionCall(&finfo, ! aval, typioparam, atttypmod); modnulls[slotsused] = ' '; *************** *** 1468,1474 **** } val = SvPV(perlret, PL_na); ! retval = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); } --- 1478,1484 ---- } val = SvPV(perlret, PL_na); ! pg_verifymbstr(val, strlen(val), false); retval = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); } *************** *** 2125,2131 **** } val = SvPV(sv, PL_na); ! ret = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); isNull = false; --- 2135,2141 ---- } val = SvPV(sv, PL_na); ! pg_verifymbstr(val, strlen(val), false); ret = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); isNull = false;
Andrew Dunstan wrote: > > > Andrew Dunstan wrote: >> >> I think the plperl glue code should check returned strings using >> pg_verifymbstr(). >> >> > > Please test this patch. I think we'd probably want to trap the > encoding error and issue a customised error message, but this plugs > all the holes I can see with the possible exception of values inserted > via SPI calls. I'll check that out. > > I think the attached patch plugs the direct SPI holes as well. One thing that I am pondering is: how does SPI handle things if the client encoding and server encoding are not the same? Won't the strings it passes the parser be interpreted in the client encoding? If so, that doesn't seem right at all, since these strings come from a server side call and not from the client at all. It looks to me like the call to pg_parse_query() in spi.c should possibly be surrounded by code to temporarily set the client encoding to the server encoding and then restore it afterwards. cheers andrew Index: src/pl/plperl/plperl.c =================================================================== RCS file: /cvsroot/pgsql/src/pl/plperl/plperl.c,v retrieving revision 1.157 diff -c -r1.157 plperl.c *** src/pl/plperl/plperl.c 31 Dec 2009 19:41:37 -0000 1.157 --- src/pl/plperl/plperl.c 3 Jan 2010 14:12:25 -0000 *************** *** 630,636 **** errmsg("Perl hash contains nonexistent column \"%s\"", key))); if (SvOK(val)) ! values[attn - 1] = SvPV(val, PL_na); } hv_iterinit(perlhash); --- 630,642 ---- errmsg("Perl hash contains nonexistent column \"%s\"", key))); if (SvOK(val)) ! { ! char * aval; ! ! aval = SvPV(val, PL_na); ! pg_verifymbstr(aval, strlen(aval), false); ! values[attn - 1] = aval; ! } } hv_iterinit(perlhash); *************** *** 829,836 **** atttypmod = tupdesc->attrs[attn - 1]->atttypmod; if (SvOK(val)) { modvalues[slotsused] = InputFunctionCall(&finfo, ! SvPV(val, PL_na), typioparam, atttypmod); modnulls[slotsused] = ' '; --- 835,846 ---- atttypmod = tupdesc->attrs[attn - 1]->atttypmod; if (SvOK(val)) { + char * aval; + + aval = SvPV(val, PL_na); + pg_verifymbstr(aval,strlen(aval), false); modvalues[slotsused] = InputFunctionCall(&finfo, ! aval, typioparam, atttypmod); modnulls[slotsused] = ' '; *************** *** 1468,1474 **** } val = SvPV(perlret, PL_na); ! retval = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); } --- 1478,1484 ---- } val = SvPV(perlret, PL_na); ! pg_verifymbstr(val, strlen(val), false); retval = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); } *************** *** 2125,2131 **** } val = SvPV(sv, PL_na); ! ret = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); isNull = false; --- 2135,2141 ---- } val = SvPV(sv, PL_na); ! pg_verifymbstr(val, strlen(val), false); ret = InputFunctionCall(&prodesc->result_in_func, val, prodesc->result_typioparam, -1); isNull = false; *************** *** 2516,2523 **** { if (SvOK(argv[i])) { argvalues[i] = InputFunctionCall(&qdesc->arginfuncs[i], ! SvPV(argv[i], PL_na), qdesc->argtypioparams[i], -1); nulls[i] = ' '; --- 2526,2537 ---- { if (SvOK(argv[i])) { + char *val; + + val = SvPV(argv[i], PL_na); + pg_verifymbstr(val, strlen(val), false); argvalues[i] = InputFunctionCall(&qdesc->arginfuncs[i], ! val, qdesc->argtypioparams[i], -1); nulls[i] = ' '; *************** *** 2647,2652 **** --- 2661,2669 ---- { if (SvOK(argv[i])) { + char *val; + + pg_verifymbstr(val, strlen(val), false); argvalues[i] = InputFunctionCall(&qdesc->arginfuncs[i], SvPV(argv[i], PL_na), qdesc->argtypioparams[i],
Andrew Dunstan <andrew@dunslane.net> writes: > One thing that I am pondering is: how does SPI handle things if the > client encoding and server encoding are not the same? What? client_encoding is not used anywhere within the backend. Everything should be server_encoding. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> One thing that I am pondering is: how does SPI handle things if the >> client encoding and server encoding are not the same? >> > > What? client_encoding is not used anywhere within the backend. > Everything should be server_encoding. > > > Oh, for some reason I thought the translation was done in the scanner. Sorry for the noise. cheers andrew
I wrote: > > I think the attached patch plugs the direct SPI holes as well. There are two issues with this patch. First, how far if at all should it be backpatched? All the way, or 8.3, where we tightened the encoding rules, or not at all? Second, It produces errors like this: andrew=# select 'a' || invalid_utf_seq() || 'b'; ERROR: invalid byte sequence for encoding "UTF8": 0xd0 HINT: Thiserror can also happen if the byte sequence does not match the encoding expected by the server, which is controlledby "client_encoding". CONTEXT: PL/Perl function "invalid_utf_seq" andrew=# That hint seems rather misleading. I'm not sure what we can do about it though. If we set the noError param on pg_verifymbstr() we would miss the error message that actually identified the bad data, so that doesn't seem like a good plan. cheers andrew
On Jan 3, 2010, at 11:54 AM, Andrew Dunstan wrote: > There are two issues with this patch. First, how far if at all should it be backpatched? All the way, or 8.3, where wetightened the encoding rules, or not at all? 8.3 seems reasonable. > Second, It produces errors like this: > > andrew=# select 'a' || invalid_utf_seq() || 'b'; > ERROR: invalid byte sequence for encoding "UTF8": 0xd0 > HINT: This error can also happen if the byte sequence does not > match the encoding expected by the server, which is controlled by > "client_encoding". > CONTEXT: PL/Perl function "invalid_utf_seq" > andrew=# > > > That hint seems rather misleading. I'm not sure what we can do about it though. If we set the noError param on pg_verifymbstr()we would miss the error message that actually identified the bad data, so that doesn't seem like a good plan. I'm sure I'm just revealing my ignorance here, but how is the hint misleading? Best, David
David E. Wheeler wrote: >> Second, It produces errors like this: >> >> andrew=# select 'a' || invalid_utf_seq() || 'b'; >> ERROR: invalid byte sequence for encoding "UTF8": 0xd0 >> HINT: This error can also happen if the byte sequence does not >> match the encoding expected by the server, which is controlled by >> "client_encoding". >> CONTEXT: PL/Perl function "invalid_utf_seq" >> andrew=# >> >> >> That hint seems rather misleading. I'm not sure what we can do about it though. If we set the noError param on pg_verifymbstr()we would miss the error message that actually identified the bad data, so that doesn't seem like a good plan. >> > > I'm sure I'm just revealing my ignorance here, but how is the hint misleading? > > > The string that causes the trouble does not come from the client and has nothing to do with client_encoding. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > andrew=# select 'a' || invalid_utf_seq() || 'b'; > ERROR: invalid byte sequence for encoding "UTF8": 0xd0 > HINT: This error can also happen if the byte sequence does not > match the encoding expected by the server, which is controlled by > "client_encoding". > CONTEXT: PL/Perl function "invalid_utf_seq" > That hint seems rather misleading. I'm not sure what we can do about it > though. If we set the noError param on pg_verifymbstr() we would miss > the error message that actually identified the bad data, so that doesn't > seem like a good plan. Yeah, we want the detailed error info. The problem is that the hint is targeted to the case where we are checking data coming from the client. We could add another parameter to pg_verifymbstr to indicate the context, perhaps. I'm not sure how to do it exactly --- just a bool that suppresses the hint, or do we want to make a provision for some other hint or detail message? regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > There are two issues with this patch. First, how far if at all should it > be backpatched? All the way, or 8.3, where we tightened the encoding > rules, or not at all? Forgot to mention --- I'm not in favor of backpatching. First because tightening encoding verification has been a process over multiple releases; it's not a bug fix in the normal sense of the word, and might break things that people had been doing without trouble. Second because I think we'll have to change pg_verifymbstr's API, and that's not something to back-patch if we can avoid it. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> andrew=# select 'a' || invalid_utf_seq() || 'b'; >> ERROR: invalid byte sequence for encoding "UTF8": 0xd0 >> HINT: This error can also happen if the byte sequence does not >> match the encoding expected by the server, which is controlled by >> "client_encoding". >> CONTEXT: PL/Perl function "invalid_utf_seq" >> > > >> That hint seems rather misleading. I'm not sure what we can do about it >> though. If we set the noError param on pg_verifymbstr() we would miss >> the error message that actually identified the bad data, so that doesn't >> seem like a good plan. >> > > Yeah, we want the detailed error info. The problem is that the hint is > targeted to the case where we are checking data coming from the client. > We could add another parameter to pg_verifymbstr to indicate the > context, perhaps. I'm not sure how to do it exactly --- just a bool > that suppresses the hint, or do we want to make a provision for some > other hint or detail message? > > > Or instead of another param we could change the third param to be one of (NO_ERROR, CLIENT_ERROR, SERVER_ERROR) or some such. Or we could just add another verify func. I don't have terribly strong opinions about it. Incidentally, I guess we need to look at plpython and pltcl for similar issues. cheers andrew
On sön, 2010-01-03 at 18:40 -0500, Andrew Dunstan wrote: > Incidentally, I guess we need to look at plpython and pltcl for > similar issues. I confirm that the same issue exists in plpython.
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> andrew=# select 'a' || invalid_utf_seq() || 'b'; >> ERROR: invalid byte sequence for encoding "UTF8": 0xd0 >> HINT: This error can also happen if the byte sequence does not >> match the encoding expected by the server, which is controlled by >> "client_encoding". >> CONTEXT: PL/Perl function "invalid_utf_seq" >> > > >> That hint seems rather misleading. I'm not sure what we can do about it >> though. If we set the noError param on pg_verifymbstr() we would miss >> the error message that actually identified the bad data, so that doesn't >> seem like a good plan. >> > > Yeah, we want the detailed error info. The problem is that the hint is > targeted to the case where we are checking data coming from the client. > We could add another parameter to pg_verifymbstr to indicate the > context, perhaps. I'm not sure how to do it exactly --- just a bool > that suppresses the hint, or do we want to make a provision for some > other hint or detail message? > > > This is a mess. It affects four or five levels of visible functions that are called in about 18 files. How about we just change the hint so it also refers to the possibility that the data comes from a PL? That would save lots of trouble. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > How about we just change the hint so it also refers to the possibility > that the data comes from a PL? That would save lots of trouble. Maybe just lose the hint altogether. It's not adding that much, and I seem to recall that there have already been complaints about other cases where it's misleading. regards, tom lane
On Sat, 2010-01-02 at 20:51 -0500, Andrew Dunstan wrote: > > Andrew Dunstan wrote: > > > > I think the plperl glue code should check returned strings using > > pg_verifymbstr(). > > > > > > Please test this patch. I think we'd probably want to trap the encoding > error and issue a customised error message, but this plugs all the holes > I can see with the possible exception of values inserted via SPI calls. > I'll check that out. I got a report, that the patch fixes one case but leaves open another: CREATE TABLE utf_test ( id serial PRIMARY KEY, data character varying ); CREATE OR REPLACE FUNCTION utf_test() RETURNS character varying AS $BODY$ return "\xd0"; $BODY$ LANGUAGE 'plperlu' VOLATILE STRICT; CREATE OR REPLACE FUNCTION utf_test2() RETURNS character varying AS $BODY$ spi_exec_query("insert into utf_test (data) values('\xd0');"); return "VIGA"; $BODY$ LANGUAGE 'plperlu' VOLATILE STRICT; The report siad, that patch fixes case insert into utf_test (data) values(utf_test()); so that it return an error, but the second function select utf_test2(); still enters wrong data to the table So SPI interface should also be fixed, either from perl side, or maybe from inside SPI ? -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Hannu Krosing <hannu@2ndquadrant.com> writes: > So SPI interface should also be fixed, either from perl side, or maybe > from inside SPI ? SPI has every right to assume that data it's given is already in the database encoding. regards, tom lane
Tom Lane wrote: > Hannu Krosing <hannu@2ndquadrant.com> writes: > >> So SPI interface should also be fixed, either from perl side, or maybe >> from inside SPI ? >> > > SPI has every right to assume that data it's given is already in the > database encoding. > > > Yeah, looks like we missed a few spots. I have added three more checks that I think plug the remaining holes in plperl. Hannu, please test again against CVS HEAD. cheers andrew
On Mon, 2010-03-08 at 21:52 -0500, Andrew Dunstan wrote: > > Tom Lane wrote: > > Hannu Krosing <hannu@2ndquadrant.com> writes: > > > >> So SPI interface should also be fixed, either from perl side, or maybe > >> from inside SPI ? > >> > > > > SPI has every right to assume that data it's given is already in the > > database encoding. > > > > > > > > Yeah, looks like we missed a few spots. I have added three more checks > that I think plug the remaining holes in plperl. > > Hannu, please test again against CVS HEAD. Seems to work now Do you plan to back-port this ? -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Hannu Krosing wrote: > On Mon, 2010-03-08 at 21:52 -0500, Andrew Dunstan wrote: > >> Tom Lane wrote: >> >>> Hannu Krosing <hannu@2ndquadrant.com> writes: >>> >>> >>>> So SPI interface should also be fixed, either from perl side, or maybe >>>> from inside SPI ? >>>> >>>> >>> SPI has every right to assume that data it's given is already in the >>> database encoding. >>> >>> >>> >>> >> Yeah, looks like we missed a few spots. I have added three more checks >> that I think plug the remaining holes in plperl. >> >> Hannu, please test again against CVS HEAD. >> > > Seems to work now > > Do you plan to back-port this ? > I wasn't going to. The previous fixes weren't backpatched either, and in general when we have plugged encoding holes the changes have not been backpatched, on the grounds that it would be a behaviour change, e.g. when we tightened things a lot for 8.3. I think there an outstanding TODO to plug the other PLs, however. It's a pity it has to be done over and over for each PL. Maybe we need some new versions of some of the SPI calls that would do the checking so it could be centralized. cheers andrew