Re: plperlu problem with utf8 - Mailing list pgsql-hackers

From David E. Wheeler
Subject Re: plperlu problem with utf8
Msg-id ACAB08D1-A956-44F5-8200-36EB0404DD6A@kineticode.com
In response to Re: plperlu problem with utf8  (Alex Hunsaker <badalex@gmail.com>)
List pgsql-hackers
On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:

>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's
>> Latin-1.
>
> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?

Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl,
given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.

> Im saying they are not, and if you want \xc3\xa9 to be treated as
> chr(233) you need to tell perl what encoding the string is in (err
> well actually decode it so its in "perl space" as unicode characters
> correctly).

PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to encode from
Perl's internal format on output.
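
To make the distinction in the exchange above concrete, here is a minimal sketch in plain Perl (outside PL/Perl), assuming the core Encode module. It only illustrates that the two octets "\xc3\xa9" are not the same string as chr(233) until they have been decoded; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+00E9 (e with acute accent), as two raw octets.
my $bytes = "\xc3\xa9";

# Undecoded, Perl sees two characters: code points 0xC3 and 0xA9.
my $undecoded_len = length $bytes;

# Decoded, it is a single character equal to chr(233).
my $chars       = decode('UTF-8', $bytes);
my $decoded_len = length $chars;

printf "undecoded: %d, decoded: %d, equals chr(233): %s\n",
    $undecoded_len, $decoded_len, ($chars eq chr(233) ? 'yes' : 'no');
```

Run on its own, this should report 2 characters before decoding and 1 after.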

>> Maybe I'm misunderstanding, but it seems to me that:
>>
>> * String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal
>> representation before the function actually gets them.
>
> Currently postgres has 2 behaviors:
> 1) If the database is utf8, turn on the utf8 flag. According to the
> perldoc snippet I quoted this should mean its a sequence of utf8 bytes
> and should interpret it as such.

Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?

> 2) its not utf8, so we just leave it as octets.

Which means Perl will assume that it's Latin-1, IIUC.

> So in "perl space" length($_[0]) returns the number of characters when
> you pass in a multibyte char *not* the number of bytes.  Which is
> correct, so um check we do that.  Right?

Yeah. So I just wrote and tested this function on 9.0 with Perl 5.12.2:

    CREATE OR REPLACE FUNCTION perlgets(
        TEXT
    ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
        my $text = shift;
        return_next {
            length  => length $text,
            is_utf8 => utf8::is_utf8($text) ? 1 : 0
        };
    $$;

In a utf-8 database:

    utf8=# select * from perlgets('foo');
     length │ is_utf8
    ────────┼─────────
          8 │ t
    (1 row)


In a latin-1 database:

    latin=# select * from perlgets('foo');
     length │ is_utf8
    ────────┼─────────
          8 │ f
    (1 row)

I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to
Perl's internal form.
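
As a rough illustration of what decoding from Latin-1 would mean, here is a sketch in plain Perl using the core Encode module. It is not what PL/Perl does internally; the octet string simply stands in for what a Latin-1 database might hand to the function.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as a LATIN1 database might pass them: 0xE9 is e with acute accent.
my $octets = "r\xe9veillon";

# Decode from Latin-1 into Perl's internal character form.
my $text = decode('ISO-8859-1', $octets);

# Record the utf8 flag before and after decoding.
my $flag_before = utf8::is_utf8($octets) ? 1 : 0;
my $flag_after  = utf8::is_utf8($text)   ? 1 : 0;

printf "length=%d flag_before=%d flag_after=%d\n",
    length($text), $flag_before, $flag_after;
```

The character count is the same either way here, since Latin-1 is one byte per character; what changes is that the decoded string is marked as character data.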

Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database.
That doesn't seem right to me.

> In the URI::Escape example we have:
>
> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
>   use URI::Escape;
>   warn(length($_[0]));
>   return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>
> # select url_decode('comment%20passer%20le%20r%C3%A9veillon');
> WARNING: 38 at line 2

What's the output? And what's the encoding of the database?

> Ok that length looks right, just for grins lets try add one multibyte char:
>
> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
> WARNING:  39 CONTEXT:  PL/Perl function "url_decode" at line 2.
>          url_decode
> -------------------------------
> comment passer le réveillon☺
> (1 row)
>
> Still right,

The length is right, but the é is wrong. It looks like Perl thinks it's latin-1. Or, rather, uri_unescape() doesn't know
that it should be returning utf-8 characters. That *might* be a bug in URI::Escape.

> now lets try the utf8::decode version that "works".  Only
> lets look at the length of the string we are returning instead of the
> one we are passing in:
>
> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
>   use URI::Escape;
>   utf8::decode($_[0]);
>   my $str = uri_unescape($_[0]);
>   warn(length($str));
>   return $str;
> $$ LANGUAGE plperlu;
>
> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
> WARNING:  28 at line 5.
> CONTEXT:  PL/Perl function "url_decode"
>         url_decode
> -----------------------------
> comment passer le réveillon
> (1 row)
>
> Looks harmless enough...

Looks far better, in fact. Interesting that URI::Escape does the right thing only if the utf8 flag has been turned on
in the string passed to it. But in Perl it usually won't be, because the encoded string should generally have only ASCII
characters.

> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
> WARNING:  28 at line 5.
> CONTEXT:  PL/Perl function "url_decode"
> length
> --------
>     27
> (1 row)
>
> Wait a minute... those lengths should match.
>
> Post patch they do:
> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
> WARNING:  28 at line 5.
> CONTEXT:  PL/Perl function "url_decode"
> length
> --------
>     28
> (1 row)
>
> Still confused? Yeah me too.

Yeah…

> Maybe this will help:
>
> #!/usr/bin/perl
> use URI::Escape;
> my $str = uri_unescape("%c3%a9");
> die "first match" if($str =~ m/\xe9/);
> utf8::decode($str);
> die "2nd match" if($str =~ m/\xe9/);
>
> gives:
> $ perl t.pl
> 2nd match at t.pl line 6.
>
> see? Either uri_unescape() should be decoding that utf8() or you need
> to do it *after* you call uri_unescape().  Hence the maybe it could be
> considered a bug in uri_unescape().

Agreed.
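
For reference, the "decode *after* unescaping" order can be sketched like this. To keep the example free of non-core modules, percent_decode() below is a hypothetical stand-in for uri_unescape(), not URI::Escape's actual implementation; it assumes the escapes represent UTF-8 octets.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for uri_unescape(): turn each %XX escape into one octet.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Step 1: unescape, which yields raw octets (the e-acute is two bytes).
my $octets = percent_decode('comment%20passer%20le%20r%C3%A9veillon');

# Step 2: decode those octets as UTF-8 into Perl characters.
my $text = decode('UTF-8', $octets);

printf "octets: %d, characters: %d\n", length($octets), length($text);
```

This should give 28 octets and 27 characters, the same two numbers that disagree in the pre-patch output quoted above.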

>> * Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the
>> server encoding before they're returned.
>> I didn't really follow all of the above; are you aiming for the same thing?
>
> Yeah, the patch address this part.  Right now we just spit out
> whatever the internal format happens to be.

Ah, excellent.
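
The output side might look roughly like this in plain Perl. It is only a sketch assuming the core Encode module, not the patch itself, and the two encodings are just examples of possible server encodings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string in Perl's internal form (9 characters).
my $text = "r\x{e9}veillon";

# Encode into the server encoding before handing it back.
my $utf8_out   = encode('UTF-8',      $text);  # for a UTF8 database
my $latin1_out = encode('ISO-8859-1', $text);  # for a LATIN1 database

printf "chars=%d utf8_bytes=%d latin1_bytes=%d\n",
    length($text), length($utf8_out), length($latin1_out);
```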

> Anyway its all probably clear as mud, this part of perl is one of the
> hardest IMO.

No question.

Best,

David


