Re: Careful PL/Perl Release Not Required - Mailing list pgsql-hackers

From Alex Hunsaker
Subject Re: Careful PL/Perl Release Not Required
Date
Msg-id AANLkTimkORLgN6ib63rkZ9OjZv5jDpBpe8E+OEk-oXL-@mail.gmail.com
In response to Re: Careful PL/Perl Release Not Required  ("David E. Wheeler" <david@kineticode.com>)
Responses Re: Careful PL/Perl Release Not Required  (Alex Hunsaker <badalex@gmail.com>)
Re: Careful PL/Perl Release Not Required  ("David E. Wheeler" <david@kineticode.com>)
List pgsql-hackers
On Fri, Feb 11, 2011 at 10:16, David E. Wheeler <david@kineticode.com> wrote:
> On Feb 10, 2011, at 11:43 PM, Alex Hunsaker wrote:

> Like I said, the terminology is awful.

Yeah, I frequently use encode and decode to mean the same thing :-(.

>> In the cited case he was passing "%C3%A9" to uri_unescape() and
>> expecting it to return 1 character. The additional utf8::decode() will
>> tell perl the string is in utf8 so it will then return 1 char. The
>> point being, decode is needed and with it, the function will work pre
>> and post 9.1.
>
> Why wouldn't the string be decoded already when it's passed to the function, as it would be in 9.0 if the database was utf-8, and should be in 9.1 if the database isn't sql_ascii?

It is decoded... the input string "%C3%A9" is actually the _same_
string in utf-8, latin1 and SQL_ASCII, decoded or not. Those are all ascii
characters. Calling utf8::decode("%C3%A9") is essentially a no-op.
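
A quick way to see this outside the database, as a minimal sketch in plain Perl (the variable names are only for illustration, they are not from the function under discussion):

```perl
use strict;
use warnings;

# The escaped form is six ASCII characters; it contains no non-ASCII bytes.
my $escaped = "%C3%A9";
my $before  = $escaped;

# utf8::decode() works in place. On all-ASCII input there is nothing to decode.
utf8::decode($escaped);

print length($escaped), "\n";                             # 6
print $escaped eq $before ? "unchanged\n" : "changed\n";  # unchanged
```

The string only becomes interesting once the %XX escapes have been turned into bytes, which is what uri_unescape() does.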

>> In-fact on a latin-1 database it sure as heck better return two
>> characters, it would be a bug if it only returned 1 as that would mean
>> it would be treating a series of latin1 bytes as a series of utf8
>> bytes!
>
> If it's a latin-1 database, in 9.1, the argument should be passed decoded. That's not a utf-8 string or bytes. It's Perl's internal representation.

> If I understand the patch correctly, the decode() will no longer be needed. The string will *already* be decoded.

Ok, I think I figured out why we seem to be talking past each other. We have:
CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
use strict;
use URI::Escape;
utf8::decode($_[0]);
return uri_unescape($_[0]);
$$ LANGUAGE plperlu;

That *looks* like it is decoding the input string, which it is, but
actually that will double utf8 encode your string. It does not seem to
in this case because we are dealing with all-ascii input. The trick
here is that it's also telling perl to decode/treat the *output* string as
utf8.

uri_unescape() returns the same string you passed in, which thanks to
the utf8::decode() above has the utf8 flag set, meaning we end up
treating the result as 1 character instead of two. Basically, it has
the same effect as calling utf8::decode() on the return value.

The correct way to write that function pre 9.1 and post 9.1 would be
(in a utf8 database):
CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
use strict;
use URI::Escape;
my $str = uri_unescape($_[0]);
utf8::decode($str);
return $str;
$$ LANGUAGE plperlu;

The final utf8::decode() is optional (as we said, the data might not be
utf8), but it gives the behavior the OP was after.
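
To make the unescape-then-decode order concrete, here is a rough sketch that runs outside the database. It assumes a hypothetical percent_decode() helper standing in for URI::Escape's uri_unescape(), only so the example needs nothing beyond core Perl:

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape::uri_unescape(), so the sketch
# runs with core Perl only: each %XX escape becomes one byte.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $str = percent_decode("%C3%A9");
my $len_before = length($str);
print $len_before, "\n";     # 2: the bytes 0xC3 0xA9

# Tell perl those bytes are utf8; the decoding happens in place.
utf8::decode($str);
my $len_after = length($str);
print $len_after, "\n";      # 1: the single character U+00E9
```

Leaving out the utf8::decode() keeps the two-character string, which is what you would want when the unescaped bytes are not meant to be utf8 (the latin1 case above).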

