Thread: How to convert HEX to ASCII?
Hello, after several attempts I have finally succeeded in developing a urlencode() function to encode text correctly like defined in RFC 1738. Now i have a big problem: how to decode the text? Example: # SELECT urlencode('Hellö World!'); urlencode ----------------------- Hell%C3%B6%20World%21 Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? Thanks for your help and greetings from Germany, Torsten -- http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8 verschiedenen Datenbanksystemen abstrahiert, Queries von Applikationen trennt und automatisch die Query-Ergebnisse auswerten kann.
On 2 December 2011 13:18, Torsten Zuehlsdorff <foo@meisterderspiele.de> wrote: > Hello, > > after several attempts I have finally succeeded in developing a urlencode() > function to encode text correctly like defined in RFC 1738. > > Now i have a big problem: how to decode the text? > > Example: > # SELECT urlencode('Hellö World!'); > urlencode > ----------------------- > Hell%C3%B6%20World%21 > > Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? > I've extracted the unquote method [0] from urllib in the python stdlib that decodes urlencoded strings. Hopefully be some use! [0] http://pastie.org/2954968
Damien Churchill schrieb: >> after several attempts I have finally succeeded in developing a urlencode() >> function to encode text correctly like defined in RFC 1738. >> >> Now i have a big problem: how to decode the text? >> >> Example: >> # SELECT urlencode('Hellö World!'); >> urlencode >> ----------------------- >> Hell%C3%B6%20World%21 >> >> Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? > > I've extracted the unquote method [0] from urllib in the python stdlib > that decodes urlencoded strings. Hopefully be some use! Not directly, but it gives me some helpful hints. For example i'm now able to decode some basic characters, for example: # SELECT chr(x'21'::int); chr ----- ! (1 row) But i clearly have a missunderstanding of other chars, like umlauts or utf-8 chars. This, for example, should return a 'ö': # SELECT chr(x'C3B6'::int); chr ----- 쎶 (1 row) Also i'm not sure how to figure out, when to decode '%C3' and when to decode '%C3%B6'. Thanks for your help, Torsten -- http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8 verschiedenen Datenbanksystemen abstrahiert, Queries von Applikationen trennt und automatisch die Query-Ergebnisse auswerten kann.
On Fri, Dec 2, 2011 at 8:16 AM, Torsten Zuehlsdorff <foo@meisterderspiele.de> wrote: > Damien Churchill schrieb: > > >>> after several attempts I have finally succeeded in developing a >>> urlencode() >>> function to encode text correctly like defined in RFC 1738. >>> >>> Now i have a big problem: how to decode the text? >>> >>> Example: >>> # SELECT urlencode('Hellö World!'); >>> urlencode >>> ----------------------- >>> Hell%C3%B6%20World%21 >>> >>> Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? >> >> >> I've extracted the unquote method [0] from urllib in the python stdlib >> that decodes urlencoded strings. Hopefully be some use! > > > Not directly, but it gives me some helpful hints. For example i'm now able > to decode some basic characters, for example: > > # SELECT chr(x'21'::int); > chr > ----- > ! > (1 row) > > But i clearly have a missunderstanding of other chars, like umlauts or utf-8 > chars. This, for example, should return a 'ö': > > # SELECT chr(x'C3B6'::int); > chr > ----- > 쎶 > (1 row) > > Also i'm not sure how to figure out, when to decode '%C3' and when to decode > '%C3%B6'. > > Thanks for your help, You're welcome. get ready for some seriously abusive sql: create or replace function unencode(text) returns text as $$ with q as ( select (regexp_matches($1, '(%..|.)', 'g'))[1] as v ) select string_agg(case when length(v) = 3 then chr(replace(v, '%', 'x')::bit(8)::int) else v end, '') from q; $$ language sql immutable; set client_encoding to latin1; SET postgres=# select unencode('Hell%C3%B6%20World%21'); unencode --------------- Hellö World! (1 row) Time: 1.908 ms (maybe this isn't really an immutable function, but oh well). merlin
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff <foo@meisterderspiele.de> wrote: > But i clearly have a missunderstanding of other chars, like umlauts or utf-8 > chars. This, for example, should return a 'ö': > > # SELECT chr(x'C3B6'::int); > chr > ----- > 쎶 > (1 row) That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded and actually decodes to the codepoint 00F6. There is a fundamental problem that a decoded URL may actually be a binary string -- it might not have a textual representation at all. But if text is what you want, RFC3986 strongly suggests using UTF-8 for encoding text strings in URLs, and that works almost always in the real world. So the *right* way is to first convert the URL to a binary "bytea" type by fixing all the % escapes, then convert that to UTF-8 encoding to handle multibyte characters. What I came up with is far from elegant because PostgreSQL lacks convenient functions for bytea manipulation (no bytea_agg, etc). Stealing a little from Merlin, this is what it looks like: CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text LANGUAGE plpgsql IMMUTABLE STRICT AS $$ DECLARE bin bytea = ''; byte text; BEGIN FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP IF length(byte) = 3 THEN bin = bin || decode(substring(byte, 2, 2), 'hex'); ELSE bin = bin || byte::bytea; END IF; END LOOP; RETURN convert_from(bin, 'utf8'); END $$; db=# select url_decode('Hell%C3%B6%20World%21'); url_decode -------------- Hellö World! db=# select url_decode('%EC%8E%B6'); url_decode ------------ 쎶 This will break for binary-encoded data in URLs, though. db=# select url_decode('%fa%fa%fa'); ERROR: invalid byte sequence for encoding "UTF8": 0xfa CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN ---- On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote: > set client_encoding to latin1; > postgres=# select unencode('Hell%C3%B6%20World%21'); > unencode > --------------- > Hellö World! > (1 row) Sorry, but AFAICT this makes a mess of encodings and only works by pure luck. The server thinks it's sending the client LATIN1 text, but it's actually UTF8-encoded and the last decoding step is done by your terminal. Regards, Marti
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote: > Sorry, but AFAICT this makes a mess of encodings and only works by > pure luck. The server thinks it's sending the client LATIN1 text, but > it's actually UTF8-encoded and the last decoding step is done by your > terminal. yup -- your're right -- what a coincidence! I still prefer the 1 liner sql variant vs plpgsql loop though. nicely done. merlin
2011/12/2 Merlin Moncure <mmoncure@gmail.com>: > On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote: >> Sorry, but AFAICT this makes a mess of encodings and only works by >> pure luck. The server thinks it's sending the client LATIN1 text, but >> it's actually UTF8-encoded and the last decoding step is done by your >> terminal. > > yup -- your're right -- what a coincidence! I still prefer the 1 > liner sql variant vs plpgsql loop though. nicely done. so bytea_agg - one param aggregate has sense it's very easy to implement it Pavel > > merlin > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > 2011/12/2 Merlin Moncure <mmoncure@gmail.com>: >> On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote: >>> Sorry, but AFAICT this makes a mess of encodings and only works by >>> pure luck. The server thinks it's sending the client LATIN1 text, but >>> it's actually UTF8-encoded and the last decoding step is done by your >>> terminal. >> >> yup -- your're right -- what a coincidence! I still prefer the 1 >> liner sql variant vs plpgsql loop though. nicely done. > > so bytea_agg - one param aggregate has sense > > it's very easy to implement it yup: create aggregate bytea_agg (bytea) ( sfunc=byteacat, stype=bytea ); merlin
2011/12/2 Merlin Moncure <mmoncure@gmail.com>: > On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> 2011/12/2 Merlin Moncure <mmoncure@gmail.com>: >>> On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote: >>>> Sorry, but AFAICT this makes a mess of encodings and only works by >>>> pure luck. The server thinks it's sending the client LATIN1 text, but >>>> it's actually UTF8-encoded and the last decoding step is done by your >>>> terminal. >>> >>> yup -- your're right -- what a coincidence! I still prefer the 1 >>> liner sql variant vs plpgsql loop though. nicely done. >> >> so bytea_agg - one param aggregate has sense >> >> it's very easy to implement it > > yup: > > create aggregate bytea_agg (bytea) > ( > sfunc=byteacat, > stype=bytea > ); this is workaround :) without a memory preallocating it has same speed like cycle in plpgsql. Regards Pavel > > merlin
On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>> so bytea_agg - one param aggregate has sense >>> >>> it's very easy to implement it >> >> yup: >> >> create aggregate bytea_agg (bytea) >> ( >> sfunc=byteacat, >> stype=bytea >> ); > > this is workaround :) > > without a memory preallocating it has same speed like cycle in plpgsql. sure, but I prefer to code against the workaround because it's cleaner and it makes things easier to port over when such a feature makes it into core. also, one liner sql has better chance of inlining as a general rule. merlin
2011/12/2 Merlin Moncure <mmoncure@gmail.com>: > On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>>> so bytea_agg - one param aggregate has sense >>>> >>>> it's very easy to implement it >>> >>> yup: >>> >>> create aggregate bytea_agg (bytea) >>> ( >>> sfunc=byteacat, >>> stype=bytea >>> ); >> >> this is workaround :) >> >> without a memory preallocating it has same speed like cycle in plpgsql. > > sure, but I prefer to code against the workaround because it's cleaner > and it makes things easier to port over when such a feature makes it > into core. also, one liner sql has better chance of inlining as a > general rule. ook Pavel > > merlin
Marti Raudsepp schrieb: > On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff > <foo@meisterderspiele.de> wrote: >> But i clearly have a missunderstanding of other chars, like umlauts or utf-8 >> chars. This, for example, should return a 'ö': >> >> # SELECT chr(x'C3B6'::int); >> chr >> ----- >> 쎶 >> (1 row) > > That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded > and actually decodes to the codepoint 00F6. > > There is a fundamental problem that a decoded URL may actually be a > binary string -- it might not have a textual representation at all. > But if text is what you want, RFC3986 strongly suggests using UTF-8 > for encoding text strings in URLs, and that works almost always in the > real world. Text is what i want. :) I've created a highly specialiced CMS, which handle a bunch of big sites (in meaning of a great numbers of users and content). It has a build-in traffic-analyze and with this function it creates a real time analyze of the keywords, a user used to find the sites in search engines. This is very needful if you try to do SEO for websites with more than 20.000 unique content-pages. :) > CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text > LANGUAGE plpgsql IMMUTABLE STRICT AS $$ > DECLARE > bin bytea = ''; > byte text; > BEGIN > FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP > IF length(byte) = 3 THEN > bin = bin || decode(substring(byte, 2, 2), 'hex'); > ELSE > bin = bin || byte::bytea; > END IF; > END LOOP; > RETURN convert_from(bin, 'utf8'); > END > $$; Hey, this function looks similar to my encoding function :) Thank you very munch! > This will break for binary-encoded data in URLs, though. Thats no problem, i just have text. Big thanks to all of you, Torsten
Hello all just note 9.1 will have a bytea_agg aggregate regards Pavel Stehule 2011/12/2 Marti Raudsepp <marti@juffo.org>: > On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff > <foo@meisterderspiele.de> wrote: >> But i clearly have a missunderstanding of other chars, like umlauts or utf-8 >> chars. This, for example, should return a 'ö': >> >> # SELECT chr(x'C3B6'::int); >> chr >> ----- >> 쎶 >> (1 row) > > That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded > and actually decodes to the codepoint 00F6. > > There is a fundamental problem that a decoded URL may actually be a > binary string -- it might not have a textual representation at all. > But if text is what you want, RFC3986 strongly suggests using UTF-8 > for encoding text strings in URLs, and that works almost always in the > real world. > > So the *right* way is to first convert the URL to a binary "bytea" > type by fixing all the % escapes, then convert that to UTF-8 encoding > to handle multibyte characters. > > What I came up with is far from elegant because PostgreSQL lacks > convenient functions for bytea manipulation (no bytea_agg, etc). > Stealing a little from Merlin, this is what it looks like: > > CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text > LANGUAGE plpgsql IMMUTABLE STRICT AS $$ > DECLARE > bin bytea = ''; > byte text; > BEGIN > FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP > IF length(byte) = 3 THEN > bin = bin || decode(substring(byte, 2, 2), 'hex'); > ELSE > bin = bin || byte::bytea; > END IF; > END LOOP; > RETURN convert_from(bin, 'utf8'); > END > $$; > > db=# select url_decode('Hell%C3%B6%20World%21'); > url_decode > -------------- > Hellö World! > > db=# select url_decode('%EC%8E%B6'); > url_decode > ------------ > 쎶 > > This will break for binary-encoded data in URLs, though. > db=# select url_decode('%fa%fa%fa'); > ERROR: invalid byte sequence for encoding "UTF8": 0xfa > CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN > > ---- > > On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote: >> set client_encoding to latin1; > >> postgres=# select unencode('Hell%C3%B6%20World%21'); >> unencode >> --------------- >> Hellö World! >> (1 row) > > Sorry, but AFAICT this makes a mess of encodings and only works by > pure luck. The server thinks it's sending the client LATIN1 text, but > it's actually UTF8-encoded and the last decoding step is done by your > terminal. > > Regards, > Marti > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general