Re: plperlu problem with utf8 - Mailing list pgsql-hackers
From | David Christensen |
---|---|
Subject | Re: plperlu problem with utf8 |
Date | |
Msg-id | A52064D4-3467-49B3-AE9E-14FF22B7A57D@endpoint.com |
In response to | Re: plperlu problem with utf8 (Alex Hunsaker <badalex@gmail.com>) |
Responses | Re: plperlu problem with utf8 |
List | pgsql-hackers |
On Dec 19, 2010, at 2:20 AM, Alex Hunsaker wrote:

> On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david@kineticode.com> wrote:
>> ...
>> I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.
>
> Yeah, I don't think you will find anyone who disagrees :) PL/TCL and
> PL/Python get this right FWIW. Anyway find attached a patch that does
> just this.

Cool, thanks for taking this on.

> With the attached we:
> - for function arguments, convert (using pg_do_encoding_conversion) to
> utf8 from the current database encoding. We also turn on the utf8
> flag so perl knows it was given utf8. Pre patch things only really
> worked for SQL_ASCII or PG_UTF8 databases. In practice everything
> worked fine for single byte charsets. However things like uc() or
> lc() on bytes with high bits set were probably broken.

How does this deal with input records of composite type?

> - for output from perl convert from perls internal format to utf8
> (using SvPVutf8()), and then convert that to the current database
> encoding. This sounds unoptimized, but in the common case SvPVutf8()
> should be a noop. Pre patch this was "random" (dependent on how perl
> decided to represent the string internally) but it worked 99% of the
> time (at least in the single byte charset or UTF8 cases).
>
> - fix errors so they properly encode their message to the current
> database encoding (pre patch we were doing no conversion at all,
> similar to the output case were it worked most of the time)

This sounds good; I imagine in practice most errors contain just 7-bit ascii which should be acceptable in any server_encoding, but in the case where something is returned that is unable to be represented in the server_encoding (thinking values defined/used in the function itself), does it degrade to the current behavior, as opposed to fail or eat the error message without outputting?
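To make the question about unrepresentable characters concrete, here is a minimal sketch in Python rather than in the patch's C. The helper names `to_perl` and `to_server` and the choice of latin-1 as the server encoding are assumptions made up for illustration; they are not taken from the patch. It shows the decode-on-input and encode-on-output round trip, plus the two possible outcomes (fail or degrade) when a character has no representation in the server encoding.

```python
# Hypothetical stand-ins for the conversion steps; not the patch's code.
SERVER_ENCODING = "latin-1"  # assumed database encoding for this sketch


def to_perl(raw: bytes) -> str:
    """Argument path: decode server-encoded bytes into characters."""
    return raw.decode(SERVER_ENCODING)


def to_server(text: str, errors: str = "strict") -> bytes:
    """Result/error path: encode characters back to the server encoding."""
    return text.encode(SERVER_ENCODING, errors=errors)


# Round trip for a high-bit character: case mapping works on characters.
arg = to_perl(b"\xe9")   # e-acute as stored in a latin-1 database
out = to_server(arg.upper())

# A character that latin-1 cannot represent: strict conversion fails...
try:
    to_server("\u263a")
    failed = False
except UnicodeEncodeError:
    failed = True

# ...while a lossy conversion degrades the message instead of dropping it.
degraded = to_server("error: \u263a", errors="replace")
```

Which of the two outcomes the patch should pick for error messages is exactly the open question; the sketch only shows that both are possible.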
> - always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in
> plperl). Pre patch this only happened on a UTF8 database. That meant
> multi-byte character regexs were broken on non utf8 databases.

This sounds good in general, but I think should be skipped if GetDatabaseEncoding() == SQL_ASCII.

> - remove some old perl version checks for 5.6 and 5.8. We require
> 5.8.1 so these were nothing but clutter.

+1. Can't complain about removing clutter :-).

> Something interesting to note is when we are SQL_ASCII,
> pg_do_encoding_conversion() does nothing, yet we turn on the utf8
> flag. This means if you pass in valid utf8 perl will treat it as
> such. It also means on output it will hand utf8 back. Both PL/Tcl
> and PL/Python do the same thing so I suppose its sensible to match
> their behavior (and it was the lazy thing to do). The difference
> being with PL/Python if you return said string you get "ERROR:
> PL/Python: could not convert Python Unicode object to PostgreSQL
> server encoding". While PL/Tcl and now Pl/perl give you back a utf8
> version. For example:
>
> (SQL_ASCII database)
> =# select length('☺');
> length
> --------
> 3
>
> =# CREATE FUNCTION tcl_len(text) returns text as $$ return [string
> length $1] $$ language pltcl;
> CREATE FUNCTION
> postgres=# SELECT tcl_len('☺');
> tcl_len
> ------------
> 1
> (1 row)
>
> =# CREATE FUNCTION py_len(str text) returns text as $$ return
> len(str) $$ language plpython3;
> =# SELECT py_len('☺');
> py_len
> --------
> 1
> (1 row)
>
> I wouldn't really say thats right, but its at least consistent...

I think this could/should be adequately handled by not calling the function when the DatabaseEncoding is SQL_ASCII. Since SQL_ASCII basically makes no assumptions about any representation other than arbitrary 8-bit encoding, this demonstrated behavior is more-or-less attaching incorrect semantics to values that are being returned, and is completely bunko IMHO.
(Not that many people are hopefully running SQL_ASCII at this point, but you never know...) Also, I'd argue that pltcl and plpython should be fixed as well for the same reasons.

> This does *not* address the bytea issue where currently if you have
> bytea input or output we try to encode that the same as any string. I
> think thats going to be a bit more invasive and this patch should
> stands on its own.
> <plperl_fix_enc.patch.gz>

Yeah, I'm not sure how invasive that will end up being, or if there are other datatypes which should skip the text processing.

I noticed you moved the declaration of perl_sys_init_done; was that an independent bug fix, or did something in the patch require that?

Cheers,

David
--
David Christensen
End Point Corporation
david@endpoint.com
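For reference, the length mismatch in the quoted SQL_ASCII session (3 from SQL, 1 from the PL functions) comes down to counting bytes versus counting characters. A rough Python sketch of the same effect, assuming the client sent the smiley as UTF-8; the variable names are purely illustrative:

```python
smiley = "\u263a"             # the WHITE SMILING FACE from the example
raw = smiley.encode("utf-8")  # what an SQL_ASCII database stores verbatim

byte_len = len(raw)                  # what length() sees: bytes
char_len = len(raw.decode("utf-8"))  # what the PL sees once bytes are treated as utf8

# SQL_ASCII promises nothing about the bytes, so treating them as utf8
# is a guess. A lone high-bit byte (e.g. latin-1 e-acute) is not valid utf8:
try:
    b"\xe9".decode("utf-8")
    guess_ok = True
except UnicodeDecodeError:
    guess_ok = False
```

The last case is the concern with attaching utf8 semantics to SQL_ASCII data: the guess holds only when the stored bytes happen to be valid UTF-8.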