Re: plperlu problem with utf8 - Mailing list pgsql-hackers

From David Christensen
Subject Re: plperlu problem with utf8
Msg-id A52064D4-3467-49B3-AE9E-14FF22B7A57D@endpoint.com
In response to Re: plperlu problem with utf8  (Alex Hunsaker <badalex@gmail.com>)
Responses Re: plperlu problem with utf8
List pgsql-hackers
On Dec 19, 2010, at 2:20 AM, Alex Hunsaker wrote:

> On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david@kineticode.com> wrote:
>> ...
>> I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1
>> before passing the text to the Perl function.
>
> Yeah, I don't think you will find anyone who disagrees :)  PL/TCL and
> PL/Python get this right FWIW.  Anyway find attached a patch that does
> just this.

Cool, thanks for taking this on.

> With the attached we:
> - for function arguments, convert (using pg_do_encoding_conversion) to
> utf8 from the current database encoding.  We also turn on the utf8
> flag so perl knows it was given utf8.  Pre patch things only really
> worked for SQL_ASCII or PG_UTF8 databases.  In practice everything
> worked fine for single byte charsets.  However things like uc() or
> lc() on bytes with high bits set were probably broken.

How does this deal with input records of composite type?
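
On the uc()/lc() point quoted above, here is a rough illustration of why decoding the argument first matters. It uses Python rather than Perl purely as a stand-in, assumes a LATIN1 database, and is not taken from the patch:

```python
# Illustration only: Python stands in for Perl; nothing here comes from the patch.
raw = "\u00e9".encode("latin-1")   # "é" as the single byte 0xE9, as a LATIN1 database stores it

# Treated as undecoded bytes, case mapping only touches ASCII letters,
# so the high-bit byte is left alone.
upper_bytes = raw.upper()

# Decoded to a character string first, case mapping sees a real letter.
upper_text = raw.decode("latin-1").upper()

print(upper_bytes)   # the byte is unchanged
print(upper_text)    # the uppercase letter
```

The same distinction is what the utf8 flag plus the conversion to UTF-8 would give Perl: a character string rather than an opaque run of bytes.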

> - for output from perl convert from perls internal format to utf8
> (using SvPVutf8()), and then convert that to the current database
> encoding. This sounds unoptimized, but in the common case SvPVutf8()
> should be a noop.  Pre patch this was "random" (dependent on how perl
> decided to represent the string internally) but it worked 99% of the
> time (at least in the single byte charset or UTF8 cases).
>
> - fix errors so they properly encode their message to the current
> database encoding (pre patch we were doing no conversion at all,
> similar to the output case were it worked most of the time)

This sounds good. I imagine that in practice most errors contain just 7-bit ASCII, which should be acceptable in any
server_encoding. But in the case where something is returned that cannot be represented in the server_encoding
(I'm thinking of values defined or used in the function itself), does it degrade to the current behavior, as opposed
to failing or eating the error message without outputting it?
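
To make the question concrete, this is the kind of degradation I have in mind, sketched in Python. The helper name `encode_error_message` is hypothetical and the "?" substitution is just one possible policy; I am not claiming the patch does this:

```python
# Hypothetical helper for illustration; not how PL/Perl actually reports errors.
def encode_error_message(message: str, server_encoding: str) -> bytes:
    """Encode an error message for the server, degrading instead of failing."""
    try:
        return message.encode(server_encoding)
    except UnicodeEncodeError:
        # Characters the server encoding cannot represent become "?",
        # so the rest of the message still reaches the user.
        return message.encode(server_encoding, errors="replace")

# A plain 7-bit ASCII message passes through untouched.
ascii_msg = encode_error_message("division by zero", "latin-1")

# U+263A has no LATIN1 representation, so only that character is replaced.
mixed_msg = encode_error_message("bad value: \u263a", "latin-1")
```

The alternative behaviors would be raising a second error from inside the error path, or dropping the message entirely, and either seems worse than a lossy message.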

> - always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in
> plperl). Pre patch this only happened on a UTF8 database.  That meant
> multi-byte character regexs were broken on non utf8 databases.

This sounds good in general, but I think it should be skipped if GetDatabaseEncoding() == SQL_ASCII.

> -remove some old perl version checks for 5.6 and 5.8.  We require
> 5.8.1 so these were nothing but clutter.

+1.  Can't complain about removing clutter :-).

> Something interesting to note is when we are SQL_ASCII,
> pg_do_encoding_conversion() does nothing, yet we turn on the utf8
> flag.  This means if you pass in valid utf8 perl will treat it as
> such.  It also means on output it will hand utf8 back.  Both PL/Tcl
> and PL/Python do the same thing so I suppose its sensible to match
> their behavior (and it was the lazy thing to do).  The difference
> being with PL/Python if you return said string you get "ERROR:
> PL/Python: could not convert Python Unicode object to PostgreSQL
> server encoding".  While PL/Tcl and now Pl/perl give you back a utf8
> version.  For example:
>
> (SQL_ASCII database)
> =# select length('☺');
> length
> --------
>      3
>
> =# CREATE FUNCTION tcl_len(text) returns text as  $$ return [string
> length $1] $$ language pltcl;
> CREATE FUNCTION
> postgres=# SELECT tcl_len('☺');
> tcl_len
> ------------
> 1
> (1 row)
>
> =# CREATE FUNCTION py_len(str text) returns text as  $$ return
> len(str) $$ language plpython3;
> =# SELECT py_len('☺');
> py_len
> --------
> 1
> (1 row)
>
> I wouldn't really say thats right, but its at least consistent...

I think this could/should be adequately handled by not calling the function when the DatabaseEncoding is SQL_ASCII.
Since SQL_ASCII basically makes no assumptions about any representation other than arbitrary 8-bit encoding, this
demonstrated behavior is more-or-less attaching incorrect semantics to values that are being returned, and is
completely bunko IMHO.  (Hopefully not many people are running SQL_ASCII at this point, but you never know...)  Also,
I'd argue that pltcl and plpython should be fixed as well for the same reasons.
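
A small Python sketch of what I mean by "attaching incorrect semantics". It only assumes that the client happened to send the smiley as UTF-8; in a SQL_ASCII database nothing records that fact:

```python
# Illustration only: the three bytes a UTF-8 client would send for U+263A.
smiley_bytes = "\u263a".encode("utf-8")

# What length() reports in a SQL_ASCII database: a count of bytes.
byte_length = len(smiley_bytes)

# What the PLs report once they assume the bytes are UTF-8.
utf8_length = len(smiley_bytes.decode("utf-8"))

# An equally legal reading of the same bytes under a single-byte encoding.
latin1_length = len(smiley_bytes.decode("latin-1"))

print(byte_length, utf8_length, latin1_length)
```

The server answers 3 and the procedural language answers 1 for the same value, and the PL's answer is only right if its guess about the encoding happens to be right.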

> This does *not* address the bytea issue where currently if you have
> bytea input or output we try to encode that the same as any string.  I
> think thats going to be a bit more invasive and this patch should
> stands on its own.
> <plperl_fix_enc.patch.gz>

Yeah, I'm not sure how invasive that will end up being, or if there are other datatypes which should skip the text
processing.
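
For what it's worth, here is why running bytea through the text path is a problem, again as a Python stand-in with made-up sample bytes and an assumed LATIN1 database:

```python
# Illustration only: arbitrary binary data such as a bytea value might hold.
blob = bytes([0x00, 0xFF, 0xFE, 0x41])

# Binary data is not necessarily valid UTF-8, so decoding it can simply fail.
try:
    blob.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Transcoding it as if it were LATIN1 text "succeeds" but changes the bytes:
# each high-bit byte becomes a two-byte UTF-8 sequence.
transcoded = blob.decode("latin-1").encode("utf-8")

print(decodes_as_utf8, len(blob), len(transcoded))
```

Either outcome corrupts or rejects the value, which is why such types would need to bypass the conversion altogether.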

I noticed you moved the declaration of perl_sys_init_done; was that an independent bug fix, or did something in the
patch require that?

Cheers,

David
--
David Christensen
End Point Corporation
david@endpoint.com





