Thread: Re: [COMMITTERS] pgsql: Prevent the injection of invalidly encoded strings by PL/Python

On Fri, 2010-03-19 at 11:50 -0400, Andrew Dunstan wrote:
> Peter Eisentraut wrote:
> > Log Message:
> > -----------
> > Prevent the injection of invalidly encoded strings by PL/Python into PostgreSQL
> > with a few strategically placed pg_verifymbstr calls.

> Awesome. Do we need to fix pltcl too?

Short answer: yes.

I have never used Tcl before just now, and the documentation is sketchy,
but it looks like the behavior of Tcl is kind of mixed in this area.

Escapes such as "\xd0" are apparently converted to Unicode code points
rather than bytes when the appropriate OS locale is set.  So that is
safe.  Except that it doesn't work in some locale/charset setups, such
as EUC_JP.  To adapt Hannu's original example:

CREATE TABLE utf_test
( id serial PRIMARY KEY, data character varying
);

CREATE OR REPLACE FUNCTION invalid_utf_seq() RETURNS character varying AS
$BODY$
return "\xd0";
$BODY$
LANGUAGE 'pltclu' VOLATILE STRICT;

insert into utf_test(data) values(invalid_utf_seq());

-- This works in UTF8 and LATIN1 with the right locales, but in an EUC_JP database ...

select invalid_utf_seq();
ERROR:  22021: invalid byte sequence for encoding "EUC_JP": 0xc390
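
As a rough illustration of where the 0xc390 in that message comes from, here
is a small Python sketch. It only mimics the byte-level story and says nothing
about what pltcl or the backend do internally; the helper names are made up
for this example. The assumption is that "\xd0" is read as code point U+00D0
and handed over as UTF-8, which is what the error message suggests.

```python
# Sketch only: reproduces the byte arithmetic, not pltcl or PostgreSQL code.

def escape_as_utf8(code_point: int) -> bytes:
    """Hypothetical helper: treat \\xNN as a Unicode code point, emit UTF-8."""
    return chr(code_point).encode("utf-8")


def is_valid_in(encoding: str, raw: bytes) -> bool:
    """Return True if raw decodes cleanly in the given Python codec."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


payload = escape_as_utf8(0xD0)
print(payload.hex())  # c390, the same bytes as in the error message

# Valid as UTF-8, but 0x90 is not a legal trail byte after 0xc3 in EUC_JP.
for enc in ("utf-8", "euc_jp"):
    print(enc, is_valid_in(enc, payload))
```

So the string is fine as long as the server encoding is UTF8, and the same
two bytes are rejected once they are interpreted as EUC_JP.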



Peter Eisentraut <peter_e@gmx.net> writes:
> I have never used Tcl before just now, and the documentation is sketchy,
> but it looks like the behavior of Tcl is kind of mixed in this area.

> Escapes such as "\xd0" are apparently converted to Unicode code points
> rather than bytes when the appropriate OS locale is set.  So that is
> safe.  Except that it doesn't work in some locale/charset setups, such
> as EUC_JP.  To adapt Hannu's original example:

The pltcl code special-cases Unicode IIRC.
        regards, tom lane


On Mon, 2010-03-22 at 19:29 -0400, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > I have never used Tcl before just now, and the documentation is sketchy,
> > but it looks like the behavior of Tcl is kind of mixed in this area.
> 
> > Escapes such as "\xd0" are apparently converted to Unicode code points
> > rather than bytes when the appropriate OS locale is set.  So that is
> > safe.  Except that it doesn't work in some locale/charset setups, such
> > as EUC_JP.  To adapt Hannu's original example:
> 
> The pltcl code special-cases Unicode IIRC.

You can observe the equivalent behavior in tclsh, so this isn't pltcl at
work here.

One might argue that the leak is really somewhere in Tcl, since it
allows this kind of thing while claiming to use Unicode.  But that
doesn't really help us ...