Thread: TODO: Expose parser support for decoding unicode escape literals to user

TODO: Expose parser support for decoding unicode escape literals to user

From

Craig Ringer

Date:

15 May 2014, 08:31:59

Hi all

I just noticed a Stack Overflow question
(http://stackoverflow.com/q/20124393/398670) where someone's asking how
to decode '\u0000` style escapes *stored in database text fields* into
properly encoded text strings.

The parser supports this for escape-strings, and you can write E'\u011B'
to get 'ě' because of
http://postgresql.1045698.n5.nabble.com/Unicode-escapes-in-literals-td1992313.html.

I don't see this exposed in a way that users can call directly, though.
'decode(bytea, text)' has the 'escape' input, but it expects octal.

It's possible to use PL/PgSQL's 'EXECUTE' to use the parser to do the
work, but that's downright awful.

Am I missing something obvious, or is this something that'd be a good
new-developer TODO?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: TODO: Expose parser support for decoding unicode escape literals to user

From

Adrian Klaver

Date:

15 May 2014, 13:56:29

On 05/15/2014 01:31 AM, Craig Ringer wrote:
> Hi all
>
> I just noticed a Stack Overflow question
> (http://stackoverflow.com/q/20124393/398670) where someone's asking how
> to decode '\u0000` style escapes *stored in database text fields* into
> properly encoded text strings.
>
> The parser supports this for escape-strings, and you can write E'\u011B'
> to get 'ě' because of
> http://postgresql.1045698.n5.nabble.com/Unicode-escapes-in-literals-td1992313.html.
>
> I don't see this exposed in a way that users can call directly, though.
> 'decode(bytea, text)' has the 'escape' input, but it expects octal.
>
> It's possible to use PL/PgSQL's 'EXECUTE' to use the parser to do the
> work, but that's downright awful.
>
> Am I missing something obvious, or is this something that'd be a good
> new-developer TODO?
>

Not sure if this is what you want?:

test=> SELECT quote_literal(E'test \u011B');
  quote_literal
---------------
  'test ě'


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: TODO: Expose parser support for decoding unicode escape literals to user

From

David G Johnston

Date:

15 May 2014, 14:13:12

Adrian Klaver-4 wrote
> On 05/15/2014 01:31 AM, Craig Ringer wrote:
>> Hi all
>>
>> I just noticed a Stack Overflow question
>> (http://stackoverflow.com/q/20124393/398670) where someone's asking how
>> to decode '\u0000` style escapes *stored in database text fields* into
>> properly encoded text strings.
>>
>> The parser supports this for escape-strings, and you can write E'\u011B'
>> to get 'ě' because of
>> http://postgresql.1045698.n5.nabble.com/Unicode-escapes-in-literals-td1992313.html.
>>
>> I don't see this exposed in a way that users can call directly, though.
>> 'decode(bytea, text)' has the 'escape' input, but it expects octal.
>>
>> It's possible to use PL/PgSQL's 'EXECUTE' to use the parser to do the
>> work, but that's downright awful.
>>
>> Am I missing something obvious, or is this something that'd be a good
>> new-developer TODO?
>>
>
> Not sure if this is what you want?:
>
> test=> SELECT quote_literal(E'test \u011B');
>   quote_literal
> ---------------
>   'test ě'

Except the data is already in the database and there is no way to put an "E"
in front of a column name and cause PostgreSQL to process the escapes
embedded in the column's value in the same way it processes a literal.

WITH src (txt) AS ( VALUES ('A \u011B C') )
SELECT txt FROM src;

Hence the need for a function to perform the same process that the parser
performs when dealing with literals.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/TODO-Expose-parser-support-for-decoding-unicode-escape-literals-to-user-tp5804012p5804042.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Re: TODO: Expose parser support for decoding unicode escape literals to user

From

Adrian Klaver

Date:

15 May 2014, 14:41:53

On 05/15/2014 07:13 AM, David G Johnston wrote:
> Adrian Klaver-4 wrote
>> On 05/15/2014 01:31 AM, Craig Ringer wrote:
>>> Hi all
>>>
>>> I just noticed a Stack Overflow question
>>> (http://stackoverflow.com/q/20124393/398670) where someone's asking how
>>> to decode '\u0000` style escapes *stored in database text fields* into
>>> properly encoded text strings.
>>>
>>> The parser supports this for escape-strings, and you can write E'\u011B'
>>> to get 'ě' because of
>>> http://postgresql.1045698.n5.nabble.com/Unicode-escapes-in-literals-td1992313.html.
>>>
>>> I don't see this exposed in a way that users can call directly, though.
>>> 'decode(bytea, text)' has the 'escape' input, but it expects octal.
>>>
>>> It's possible to use PL/PgSQL's 'EXECUTE' to use the parser to do the
>>> work, but that's downright awful.
>>>
>>> Am I missing something obvious, or is this something that'd be a good
>>> new-developer TODO?
>>>
>>
>> Not sure if this is what you want?:
>>
>> test=> SELECT quote_literal(E'test \u011B');
>>    quote_literal
>> ---------------
>>    'test ě'
>
> Except the data is already in the database and there is no way to put an "E"
> in front of a column name and cause PostgreSQL to process the escapes
> embedded in the column's value in the same way it processes a literal.

Yea, that is a problem.

>
> WITH src (txt) AS ( VALUES ('A \u011B C') )
> SELECT txt FROM src;
>
> Hence the need for a function to perform the same process that the parser
> performs when dealing with literals.
>
> David J.
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: TODO: Expose parser support for decoding unicode escape literals to user

From

Craig Ringer

Date:

16 May 2014, 11:13:00

On 05/15/2014 09:56 PM, Adrian Klaver wrote:
>
> test=> SELECT quote_literal(E'test \u011B');
>  quote_literal
> ---------------
>  'test ě'

That's another case where the function isn't doing what you expect.
quote_literal has nothing to do with what's happening, it's
escape-string processing in the parser doing the work. Compare:

regress=> SELECT 'test \u011B';
  ?column?
-------------
 test \u011B
(1 row)

regress=> SELECT E'test \u011B';
 ?column?
----------
 test ě
(1 row)

now, the problem posed is if you had this:

regress=> CREATE TABLE test AS SELECT TEXT 'test \u011B' dummy;
SELECT 1
regress=> SELECT * FROM test;
    dummy
-------------
 test \u011B
(1 row)

how would you get 'test ě' ?


The parser can do it, but I don't think anyone would consider this an
acceptable solution to this problem (anybody reading this, UNDER NO
CIRCUMSTANCES USE THIS FUNCTION, EVER):


regress=> CREATE OR REPLACE FUNCTION ohmygod(text) RETURNS text AS $$
DECLARE
    retval text;
BEGIN
    -- If you use this in real code, I hate you
    EXECUTE 'SELECT E'''||$1||''';' INTO retval;
    RETURN retval;
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION

regress=> SELECT ohmygod(dummy) FROM test;
 ohmygod
---------
 test ě
(1 row)



It'd be nice to expose this capability to users without requiring that
kind of horror.

Hence: exposing parser support for decoding unicode escape literals to
the user.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: TODO: Expose parser support for decoding unicode escape literals to user

From

Adrian Klaver

Date:

16 May 2014, 14:33:14

On 05/16/2014 04:12 AM, Craig Ringer wrote:
> On 05/15/2014 09:56 PM, Adrian Klaver wrote:
>>
>> test=> SELECT quote_literal(E'test \u011B');
>>   quote_literal
>> ---------------
>>   'test ě'
>
> That's another case where the function isn't doing what you expect.
> quote_literal has nothing to do with what's happening, it's
> escape-string processing in the parser doing the work. Compare:
>
> regress=> SELECT 'test \u011B';
>    ?column?
> -------------
>   test \u011B
> (1 row)
>
> regress=> SELECT E'test \u011B';
>   ?column?
> ----------
>   test ě
> (1 row)

Davids comments and some playing around with strings showed me what was
going on and the error of my ways.

>
> now, the problem posed is if you had this:
>
> regress=> CREATE TABLE test AS SELECT TEXT 'test \u011B' dummy;
> SELECT 1
> regress=> SELECT * FROM test;
>      dummy
> -------------
>   test \u011B
> (1 row)
>
> how would you get 'test ě' ?
>
>
> The parser can do it, but I don't think anyone would consider this an
> acceptable solution to this problem (anybody reading this, UNDER NO
> CIRCUMSTANCES USE THIS FUNCTION, EVER):
>
>
> regress=> CREATE OR REPLACE FUNCTION ohmygod(text) RETURNS text AS $$
> DECLARE
>      retval text;
> BEGIN
>      -- If you use this in real code, I hate you
>      EXECUTE 'SELECT E'''||$1||''';' INTO retval;
>      RETURN retval;
> END;
> $$ LANGUAGE plpgsql;
> CREATE FUNCTION


Actually I started down this dark path also, before I was interrupted by
something else:)

>
> regress=> SELECT ohmygod(dummy) FROM test;
>   ohmygod
> ---------
>   test ě
> (1 row)
>
>
>
> It'd be nice to expose this capability to users without requiring that
> kind of horror.
>
> Hence: exposing parser support for decoding unicode escape literals to
> the user.

Yes, so as an example something like:

decode_u('test \u011B')

that would decode the escaped values automatically.

>


--
Adrian Klaver
adrian.klaver@aklaver.com