Re: plperlu problem with utf8 - Mailing list pgsql-hackers

From Alex Hunsaker
Subject Re: plperlu problem with utf8
Msg-id AANLkTinch0U5CE5B8pNsSVu5bh7eOynsjKpugG4sfG92@mail.gmail.com
In response to Re: plperlu problem with utf8  (David Christensen <david@endpoint.com>)
List pgsql-hackers
On Fri, Dec 17, 2010 at 22:32, David Christensen <david@endpoint.com> wrote:
>
> On Dec 17, 2010, at 7:04 PM, David E. Wheeler wrote:
>
>> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:
>>
>>>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>>>
>>> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?
>>
>> Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.
>
> This is a correct assertion as to Perl's behavior.  As far as PostgreSQL is/should be concerned in this case, this is the correct handling for URI::Escape,

Right, so no postgres bug here. Postgres showing Ã© instead of é is
right as far as it's concerned.
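
The difference between "\xc3\xa9" and chr(233) asked about above can be checked with a few lines of Perl. This is only a minimal sketch using the core Encode module; the variable names are illustrative and not from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two UTF-8 octets for "é" (U+00E9), as an undecoded byte string.
my $octets = "\xc3\xa9";

# Decoded to Perl's internal character form: one character, chr(233).
my $chars = decode('UTF-8', $octets);

printf "octets: length %d\n", length $octets;
printf "chars:  length %d\n", length $chars;
print  "decoded string eq chr(233)\n" if $chars eq chr(233);
print  "raw octets ne chr(233)\n"     if $octets ne chr(233);

# Without the decode step, Perl treats each octet as a Latin-1 character.
# Encoding those two characters to UTF-8 gives four octets, the "Ã©" form.
my $mojibake = encode('UTF-8', $octets);
printf "mojibake octets: %s\n",
    join ' ', map { sprintf '%02x', ord } split //, $mojibake;
```

The last line should print `c3 83 c2 a9`, which is what a UTF-8 client renders as the two-character sequence instead of a single é.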

>> PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to decode from Perl's internal format on output.
>
> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments. In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type.

Right, and that's what we do for the above, minus some mishandling of
non-character datatypes like bytea in the UTF-8 case.

> For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path.  A similar handling would need to be done for the return values, again datatype-dependent.

Yeah, that's what we *should* do.  Right now we just leave it as byte
soup for the user to decode/encode. :(
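
For illustration, the manual decode/encode step that is currently left to the user might look roughly like this inside a function body. It is a sketch under stated assumptions: the server_encoding is taken to be LATIN2 (known to Encode as "iso-8859-2"), the name is hard-coded because nothing in this thread says PL/Perl exposes it, and `upper_text` is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: server_encoding is LATIN2, i.e. "iso-8859-2" for Encode.
# The name is hard-coded here; it is not supplied by PL/Perl.
my $server_encoding = 'iso-8859-2';

# Hypothetical function body: upper-case a text argument that arrives
# as raw octets in the server encoding.
sub upper_text {
    my ($raw) = @_;
    my $text = decode($server_encoding, $raw);   # octets -> characters
    my $out  = uc $text;                         # character-aware operation
    return encode($server_encoding, $out);       # characters -> octets
}

# "\xb9" is "š" (U+0161) in ISO-8859-2; its upper-case "Š" is "\xa9".
my $result = upper_text("\xb9");
printf "%02x\n", ord $result;
```

Without the decode step, `uc` would see a Latin-1 superscript one (0xB9) rather than "š", which is the kind of incorrect character operation being discussed.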

> [ correctness of perl character ops in the non utf8 case] One thought I had was that we could expose the server_encoding to the plperl interpreters in a special variable to make it easy to explicitly decode...

We should not need to do anything as complicated as that. We can just
convert the string to utf8 before we hand it off to perl.

[...]
> $ perl -MURI::Escape -e'print length(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon}))'
> 28
>
> $ perl -MEncode -MURI::Escape -e'print length(decode_utf8(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon})))'
> 27
[...]
> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28.  I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and

And why should it? Properly escaped URIs should have all those
escaped, I imagine.  Anyway, not really relevant for postgres.

> b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

>>> Either uri_unescape() should be decoding that utf8() or you need
>>> to do it *after* you call uri_unescape().  Hence the maybe it could be
>>> considered a bug in uri_unescape().
>>
>> Agreed.
>
> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it.

-1? That's basically what I said:  "... you need to do it (decode the
utf8) *after* you call uri_unescape"

>  Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift)) }.  This is definitely not a bug in uri_escape, as it is only defined to return octets.

Ahh, so -1 because I said maybe you could call it a bug in
uri_unescape(). Really, I was only saying you *might* be able to
consider it a bug (or perhaps deficiency is a better word) in
uri_unescape iff URIs are defined to have escaped characters as a
%-escaped utf8 sequence.  I don't know that they are, so I don't know
if it's a bug :)
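
For reference, the wrapper quoted above can be tried out as a small script. This is only a sketch: `unescape_octets` is a stand-in for URI::Escape's uri_unescape, written with a regex so that the example needs core modules only, and `uri_unescape_utf8` is the hypothetical helper from the quoted text, not an existing URI::Escape function as far as this thread establishes.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for URI::Escape::uri_unescape (core modules only):
# each %XX escape becomes one octet; nothing is decoded.
sub unescape_octets {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $str;
}

# Hypothetical helper: unescape first, then decode the octets as UTF-8.
sub uri_unescape_utf8 {
    return decode_utf8(unescape_octets(shift));
}

my $escaped = 'comment%20passer%20le%20r%C3%A9veillon';
print length(unescape_octets($escaped)),   "\n";   # octet count
print length(uri_unescape_utf8($escaped)), "\n";   # character count
```

This should print 28 and then 27, matching the octet and character lengths in the quoted one-liners.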

