plperlu problem with utf8 [REVIEW] - Mailing list pgsql-hackers

From Andy Colson
Subject plperlu problem with utf8 [REVIEW]
Msg-id 4D320FA6.3000005@squeakycode.net
In response to Re: plperlu problem with utf8  (Alex Hunsaker <badalex@gmail.com>)
Responses Re: plperlu problem with utf8 [REVIEW]  (Alex Hunsaker <badalex@gmail.com>)
List pgsql-hackers
This is a review of "plperl encoding issues":

https://commitfest.postgresql.org/action/patch_view?id=452

Purpose:
========
Your database uses one encoding and passes data to perl in that same encoding, which perl is not prepared for (it
assumes UTF-8).  This patch makes sure data is converted to UTF-8 before it's passed to plperl, then converts the
response from UTF-8 back to the database encoding for storage.

My test:

ptest2=# create database ptest2 encoding 'EUC_JP' template template0;

I created a simple perl function that reverses the string.  I don't know Japanese, so I found a tattoo website that had
sayings in Japanese... I picked: "I am awesome".

create or replace function preverse(x text) returns text as $$
my $tmp = reverse($_[0]);
return $tmp;
$$ LANGUAGE plperl;


Before the patch:

ptest2=# select preverse('私はよだれを垂らす');
      preverse
--------------------
 垢蕕眇鬚譴世茲呂篁
(1 row)

It is also possible to generate invalid characters.  This function pulls off the last character in the string,
assuming it's UTF-8:

create or replace function plastchar(x text) returns text as $$
my $tmp = substr($_[0], -1);
return $tmp;
$$ LANGUAGE plperl;

ptest2=# select plastchar('私はよだれを垂らす');

ERROR:  invalid byte sequence for encoding "EUC_JP": 0xb9
CONTEXT:  PL/Perl function "plastchar"

Because the string was not UTF-8, perl got confused and returned an invalid character.
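
The 0xb9 in the error above is consistent with the function working on raw EUC_JP bytes rather than on characters: in
EUC_JP the character す is the two bytes 0xa4 0xb9, so cutting off the last byte leaves half a character.  Below is a
minimal sketch of that effect.  It is written in Python rather than PL/Perl so it can run without a database, it is not
taken from the patch, and all names in it are illustrative only.

```python
# Standalone illustration; this is not PL/Perl and does not touch PostgreSQL.
text = "私はよだれを垂らす"

# The bytes an EUC_JP database would store for this string (2 bytes per character).
raw = text.encode("euc_jp")

# Reversing the bytes also swaps the two bytes inside every character,
# so the result decodes (where it decodes at all) to unrelated characters.
byte_reversed = raw[::-1]
garbled = byte_reversed.decode("euc_jp", errors="replace")

# Taking the last *byte* keeps only the second half of the final character.
last_byte = raw[-1:]


def is_valid_euc_jp(data: bytes) -> bool:
    """Return True if data decodes cleanly as EUC_JP."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False


# Reversing or slicing decoded characters gives the intended results.
char_reversed = text[::-1]
last_char = text[-1]

print(len(text), "characters,", len(raw), "bytes")
print("byte-wise reverse:", garbled)
print("last byte:", last_byte.hex(), "valid EUC_JP:", is_valid_euc_jp(last_byte))
print("character-wise reverse:", char_reversed, "last character:", last_char)
```

The byte-wise results correspond to the "before the patch" behaviour shown above, and the character-wise results to the
"after the patch" output.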

After the patch:
The exact same plperl functions work fine:

ptest2=# select preverse('私はよだれを垂らす');
      preverse
--------------------
 すら垂をれだよは私
(1 row)

ptest2=# select plastchar('私はよだれを垂らす');
 plastchar
-----------
 す
(1 row)




Performance:
============
This is a bug fix, not a performance patch.  However, as noted by the author, many encodings will be very UTF-8'ish
and the overhead will be very small.  For those encodings that do need to be converted, you'd need to do the same
conversion inside your perl function anyway before you could use the data.  The processing has just moved from inside
your perl func to inside PG.
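
To make the "same conversion" concrete, here is a rough sketch of the round trip: database encoding to UTF-8 on the way
in, UTF-8 back to the database encoding on the way out.  It is a Python stand-in, not the patch's C/XS code; the helper
names (to_utf8, from_utf8, call_with_conversion) are hypothetical, and Python's str plays the role of perl's character
string.

```python
def to_utf8(raw: bytes, db_encoding: str) -> bytes:
    """Convert bytes in the database encoding to UTF-8 bytes."""
    return raw.decode(db_encoding).encode("utf-8")


def from_utf8(utf8: bytes, db_encoding: str) -> bytes:
    """Convert UTF-8 bytes back to the database encoding."""
    return utf8.decode("utf-8").encode(db_encoding)


def call_with_conversion(func, raw: bytes, db_encoding: str = "euc_jp") -> bytes:
    """Hypothetical wrapper: convert the argument in, run func, convert the result out."""
    arg = to_utf8(raw, db_encoding).decode("utf-8")  # func sees characters, not bytes
    result = func(arg)
    return from_utf8(result.encode("utf-8"), db_encoding)


def preverse(s: str) -> str:
    """Reverse a string character by character."""
    return s[::-1]


def plastchar(s: str) -> str:
    """Return the last character of a string."""
    return s[-1:]


stored = "私はよだれを垂らす".encode("euc_jp")
print(call_with_conversion(preverse, stored).decode("euc_jp"))
print(call_with_conversion(plastchar, stored).decode("euc_jp"))
```

Without the wrapper, each function would have to do the decode and encode steps itself, which is the point made above
about the work moving rather than being added.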
 




The Patch:
==========
Applies cleanly to git head as of January 15 2011.  PG built with --enable-cassert and --enable-debug seems to run fine
with no errors.

I don't think the regression tests cover plperl, so it's understandable that there are no tests in the patch.

There are no manual updates in the patch either, and I think there should be.  I think it should be made clear
that data (varchar, text, etc., but not bytea) will be passed to perl as UTF-8, regardless of database encoding.  Also
that "use utf8;" is always loaded and in use.



Code Review:
============
I am not qualified.  Looking through the patch, I'm reminded of the old saying: "Any sufficiently advanced perl XS code
is indistinguishable from magic"  :-)


Other Remarks:
==============
- Yes I know... it was a joke.
- I sure hope this posts to the news group ok
- My terminal (konsole) had a hard time displaying Japanese, so I used psql's \i and \o to read/write files that kwrite
  showed/encoded correctly via EUC_JP


Summary:
========
Looks good.  Looks needed.  Needs manual updates.


