pl/perl and utf-8 in sql_ascii databases - Mailing list pgsql-hackers

From Christoph Berg
Subject pl/perl and utf-8 in sql_ascii databases
Date
Msg-id 20120209102116.GA14429@msgid.df7cb.de
Responses Re: pl/perl and utf-8 in sql_ascii databases
List pgsql-hackers
Hi,

we have a database that is storing strings in various encodings (and
non-encodings, namely the arbitrary byte soup that you might see in
email headers from the internet). For this reason, the database uses
sql_ascii encoding. The columns are text, as most characters are
ascii, so bytea didn't seem the right way to go.

Currently we are on 8.3 and are trying to upgrade to 9.1, but our
plperlu functions are acting up.

Old behavior on 8.3 .. 9.0:

sql_ascii =# create or replace function whitespace(text) returns text
language plperlu as $$ $a = shift; $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace (E'\200'); -- 0x80 is not valid utf-8
 whitespace
------------

sql_ascii =# select whitespace (E'\200')::bytea;
 whitespace
------------
 \x80

New behavior on 9.1.2:

sql_ascii =# select whitespace (E'\200');
ERROR:  XX000: Malformed UTF-8 character (fatal) at line 1.
CONTEXT:  PL/Perl function "whitespace"
LOCATION:  plperl_call_perl_func, plperl.c:2037
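
For illustration, here is a minimal sketch of a strict UTF-8 well-formedness
check, showing why a lone 0x80 byte is rejected: it is a continuation byte
and cannot start a sequence. The function name is_valid_utf8 is made up for
this example, and this is not the check Perl itself performs (Perl's internal
rules are laxer in some respects), only the general rule.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of Perl or PostgreSQL.
 * Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 */
static int
is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        unsigned char c = buf[i];
        unsigned long cp;       /* decoded code point */
        size_t      need;       /* continuation bytes expected */
        size_t      k;

        if (c < 0x80)
        {
            need = 0;
            cp = c;
        }
        else if ((c & 0xE0) == 0xC0)
        {
            need = 1;
            cp = c & 0x1F;
        }
        else if ((c & 0xF0) == 0xE0)
        {
            need = 2;
            cp = c & 0x0F;
        }
        else if ((c & 0xF8) == 0xF0)
        {
            need = 3;
            cp = c & 0x07;
        }
        else
            return 0;           /* 0x80..0xBF and 0xF8..0xFF cannot lead */

        if (need > len - i - 1)
            return 0;           /* sequence is truncated */

        for (k = 1; k <= need; k++)
        {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;       /* not a continuation byte */
            cp = (cp << 6) | (unsigned long) (buf[i + k] & 0x3F);
        }

        /* reject overlong forms, surrogates, and values above U+10FFFF */
        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return 0;

        i += need + 1;
    }
    return 1;
}
```

Called on the single byte 0x80 this returns 0, while the two-byte sequence
0xC2 0x80 (the UTF-8 form of U+0080) passes.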

A crude workaround is:

sql_ascii =# create or replace function whitespace_utf8_off(text)
returns text language plperlu as $$ use Encode; $a = shift;
Encode::_utf8_off($a); $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace_utf8_off (E'\200');
 whitespace_utf8_off
---------------------
 \u0080

sql_ascii =# select whitespace_utf8_off (E'\200')::bytea;
 whitespace_utf8_off
---------------------
 \xc280

(Note that the workaround is not perfect, as the resulting 0x80..0xff
bytes are still tagged as utf8.)
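
The \xc280 result is what you get when the byte 0x80 is treated as the
character U+0080 and then written out as UTF-8, presumably when the return
value is converted back on the way out of the function. A small sketch of
that byte-to-UTF-8 step follows; the function name latin1_to_utf8 is made
up for this example and is not a Perl or PostgreSQL API.

```c
#include <stddef.h>

/*
 * Hypothetical helper: treat every input byte as the code point of the
 * same value (i.e. Latin-1) and write it as UTF-8.  The caller must
 * provide room for 2 * len bytes in out.  Returns the bytes written.
 */
static size_t
latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t      i;
    size_t      n = 0;

    for (i = 0; i < len; i++)
    {
        if (in[i] < 0x80)
            out[n++] = in[i];   /* ASCII is unchanged */
        else
        {
            /* U+0080..U+00FF need two bytes: 110000xx 10xxxxxx */
            out[n++] = (unsigned char) (0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char) (0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

Feeding it the single byte 0x80 yields the two bytes 0xC2 0x80, which
matches the bytea output above.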


I think the bug is in plperl_helpers.h:

/*
 * Create a new SV from a string assumed to be in the current database's
 * encoding.
 */
static inline SV *
cstr2sv(const char *str)
{
    SV         *sv;
    char       *utf8_str = utf_e2u(str);

    sv = newSVpv(utf8_str, 0);
    SvUTF8_on(sv);
    pfree(utf8_str);

    return sv;
}

In sql_ascii databases, utf_e2u does not do any recoding, but
SvUTF8_on still marks the string as utf-8 even though it isn't.
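
To make the idea concrete, here is a rough standalone model of the behaviour
I would expect instead: only set the utf-8 flag when the database encoding
actually guarantees utf-8. Everything in it (db_encoding, mock_sv,
mock_cstr2sv) is a made-up stand-in so the snippet compiles on its own; it
is not the real PostgreSQL or Perl API and not a patch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the PostgreSQL or Perl types. */
typedef enum
{
    ENC_SQL_ASCII,
    ENC_UTF8
} db_encoding;

typedef struct
{
    char       *bytes;          /* private copy of the string, or NULL */
    int         utf8_flag;      /* models Perl's SvUTF8 flag */
} mock_sv;

/*
 * Model of the proposed rule.  For simplicity it assumes str is already
 * utf-8 whenever enc is not ENC_SQL_ASCII, so the recoding step that
 * utf_e2u performs in the real code is left out.
 * The caller frees sv.bytes.
 */
static mock_sv
mock_cstr2sv(const char *str, db_encoding enc)
{
    mock_sv     sv;
    size_t      len = strlen(str) + 1;

    sv.bytes = malloc(len);
    if (sv.bytes != NULL)
        memcpy(sv.bytes, str, len);

    /* sql_ascii gives no encoding guarantee, so leave the bytes untagged */
    sv.utf8_flag = (enc != ENC_SQL_ASCII);

    return sv;
}
```

With ENC_SQL_ASCII the byte 0x80 is handed over unchanged and untagged,
which is how 8.3 .. 9.0 behaved from the function's point of view.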

(Returned values might also need fixing.)

In my view, this is clearly a bug in pl/perl on sql_ascii databases.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
