Re: plperlu problem with utf8 - Mailing list pgsql-hackers
From | Alex Hunsaker |
---|---|
Subject | Re: plperlu problem with utf8 |
Date | |
Msg-id | AANLkTinch0U5CE5B8pNsSVu5bh7eOynsjKpugG4sfG92@mail.gmail.com |
In response to | Re: plperlu problem with utf8 (David Christensen <david@endpoint.com>) |
List | pgsql-hackers |
On Fri, Dec 17, 2010 at 22:32, David Christensen <david@endpoint.com> wrote:
>
> On Dec 17, 2010, at 7:04 PM, David E. Wheeler wrote:
>
>> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:
>>
>>>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>>>
>>> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?
>>
>> Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.
>
> This is a correct assertion as to Perl's behavior. As far as PostgreSQL is/should be concerned in this case, this is the correct handling for URI::Escape,

Right, so no postgres bug here. Postgres showing Ã© instead of é is right as far as it's concerned.

>> PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to decode from Perl's internal format on output.
>
> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments. In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type.

Right, and that's what we do for the above, minus some mishandling of non-character datatypes like bytea in the UTF-8 case.

> For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path. A similar handling would need to be done for the return values, again datatype-dependent.

Yeah, that's what we *should* do. Right now we just leave it as byte soup for the user to decode/encode.
:(

> [ correctness of perl character ops in the non utf8 case] One thought I had was that we could expose the server_encoding to the plperl interpreters in a special variable to make it easy to explicitly decode...

We should not need to do anything as complicated as that. We can just encode the string to utf8 before we hand it off to perl.

[...]
> $ perl -MURI::Escape -e'print length(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon}))'
> 28
>
> $ perl -MEncode -MURI::Escape -e'print length(decode_utf8(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon})))'
> 27
[...]
> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28. I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and

And why should it? Properly escaped URIs should have all those escaped, I imagine. Anyway, not really relevant for postgres.

> b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

>>> Either uri_unescape() should be decoding that utf8() or you need
>>> to do it *after* you call uri_unescape(). Hence the maybe it could be
>>> considered a bug in uri_unescape().
>>
>> Agreed.
>
> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it.

-1? That's basically what I said: "... you need to do it (decode the utf8) *after* you call uri_unescape"

> Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift)) }. This is definitely not a bug in uri_escape, as it is only defined to return octets.

Ahh. So -1 because I said maybe you could call it a bug in uri_unescape(). Really, I was only saying you *might* be able to consider it a bug (or perhaps "deficiency" is a better word) in uri_unescape iff URIs are defined to have escaped characters as a %-escaped utf8 sequence. I don't know that they are, so I don't know if it's a bug :)
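
For anyone following along, here is a minimal sketch of the octet versus character distinction discussed above. It is an illustration only, not code from plperl or URI::Escape: `percent_unescape` is a hypothetical stand-in for `uri_unescape`, written so the example needs nothing beyond core modules, and the variable names are made up for this sketch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for URI::Escape::uri_unescape, written here only so
# the sketch depends on core modules: each %XX becomes one octet.
sub percent_unescape {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Unescaping yields octets; Perl has no way to know they are UTF-8.
my $octets = percent_unescape('comment%20passer%20le%20r%C3%A9veillon');

# Explicitly decoding the octets gives a string of characters.
my $chars = decode('UTF-8', $octets);

# If the undecoded octets are treated as Latin-1 characters and encoded to
# UTF-8 on output, "\xc3\xa9" becomes four octets, displayed as "Ã©".
my $double = encode('UTF-8', $octets);

printf "octets:         %d\n", length($octets);   # 28
printf "characters:     %d\n", length($chars);    # 27
printf "double-encoded: %d\n", length($double);   # 30

# The two-octet string is not the one-character string until it is decoded.
my $two_octets = "\xc3\xa9";
print "raw eq:     ", ($two_octets eq "\xe9" ? "yes" : "no"), "\n";                    # no
print "decoded eq: ", (decode('UTF-8', $two_octets) eq chr(233) ? "yes" : "no"), "\n"; # yes
```

The 28 and 27 match the one-liners quoted earlier in the thread. The 30 is what you would expect when undecoded octets are handed back to a UTF-8 database as if they were Latin-1 characters.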