Thread: proposal: unescape_text function
Hi
There is a user request for an unescape function in core.
This request is about the possibility of doing, via a functional interface, what today can be done only with string literals.
I wrote a plpgsql function, but a built-in function could be simpler:
CREATE OR REPLACE FUNCTION public.unescape(text, text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
  EXECUTE format('SELECT U&%s UESCAPE %s',
                 quote_literal(replace($1, '\u', '^')),
                 quote_literal($2))
    INTO result;
  RETURN result;
END;
$function$
postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
    unescape
-----------------
 Odpovědná osoba
(1 row)
What do you think about this?
Regards
Pavel
On Mon 22 Jun 2020 at 5:48, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I changed the name to the more accurate "unicode_unescape". A patch is attached.
Regards
Pavel
Attachment
> On 23 Jun 2020, at 11:51, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I changed the name to more accurately "unicode_unescape". Patch is assigned

You've made this function return Oid, where it used to be void.  Was that a
copy-paste mistake?  Else the code needs fixing as it doesn't return an Oid.

+Oid
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape value")));
+}

cheers ./daniel
On Thu 2 Jul 2020 at 17:27, Daniel Gustafsson <daniel@yesql.se> wrote:
> On 23 Jun 2020, at 11:51, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I changed the name to more accurately "unicode_unescape". Patch is assigned
You've made this function return Oid, where it used to be void. Was that a
copy-paste mistake? Else the code needs fixing as it doesn't return an Oid.
+Oid
+check_unicode_value(pg_wchar c)
+{
+ if (!is_valid_unicode_codepoint(c))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid Unicode escape value")));
+}
Yes, it is my error.
I am sending a fixed patch.
Thank you for the check.
Pavel
Attachment
On Thu, Jul 2, 2020 at 10:10 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi Pavel,
Since the idea originated from unescaping unicode string literals, i.e.
select unescape('Odpov\u011Bdn\u00E1 osoba');
Shouldn't the built-in function support the above syntax as well?
--
Asif Rehman
Hi
> Since the idea originated from unescaping unicode string literals i.e.
> select unescape('Odpov\u011Bdn\u00E1 osoba');
> Shouldn't the built-in function support the above syntax as well?
good idea. The prefixes u (4 digits) and U (8 digits) are supported
Regards
Pavel
Attachment
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi,

The patch looks good to me.

The new status of this patch is: Ready for Committer
On Wed 29 Jul 2020 at 8:18, Pavel Stehule <pavel.stehule@gmail.com> wrote:
rebase
Regards
Pavel
Attachment
On 2020-10-07 11:00, Pavel Stehule wrote:
> Since the idea originated from unescaping unicode string
> literals i.e.
> select unescape('Odpov\u011Bdn\u00E1 osoba');
>
> Shouldn't the built-in function support the above syntax as well?
>
> good idea. The prefixes u (4 digits) and U (8 digits) are supported

I don't really get the point of this function.  There is AFAICT no function to produce this escaped format, and it's not a recognized interchange format.  So under what circumstances would one need to use this?
On Fri 27 Nov 2020 at 15:37, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-10-07 11:00, Pavel Stehule wrote:
> Since the idea originated from unescaping unicode string
> literals i.e.
> select unescape('Odpov\u011Bdn\u00E1 osoba');
>
> Shouldn't the built-in function support the above syntax as well?
>
>
> good idea. The prefixes u (4 digits) and U (8 digits) are supported
I don't really get the point of this function. There is AFAICT no
function to produce this escaped format, and it's not a recognized
interchange format. So under what circumstances would one need to use this?
Some corporate data can come in CSV format with escaped unicode characters. Without this function it is not possible to decode such files without an external application.
Postgres has support for this conversion, but only for string literals.
CREATE OR REPLACE FUNCTION public.unescape(text, text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
  EXECUTE format('SELECT U&%s UESCAPE %s',
                 quote_literal(replace($1, '\u', '^')),
                 quote_literal($2))
    INTO result;
  RETURN result;
END;
$function$
Because unicode is a major encoding, I think this conversion should be supported. There is another question about the implementation: the unicode_unescape function as implemented in this patch, or some new conversion. Using a conversion (https://www.postgresql.org/docs/current/sql-createconversion.html) is probably better, but I am not sure how intuitive it is, and it is hard to use (without ugly workarounds) in plpgsql.
I don't expect that Postgres should produce data in unicode-escaped format, but it can be useful if Postgres can work with data in a special format of a major encoding.
postgres=# select convert_from(E'Odpov\u011Bdn\u00E1 osoba', 'UTF8');
┌─────────────────┐
│ convert_from │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)
I can do this with bytea, but it is hard to use with text fields.
I didn't find any way to do it without ugly steps.
Regards
Pavel
On 2020-11-29 18:36, Pavel Stehule wrote:
>     I don't really get the point of this function.  There is AFAICT no
>     function to produce this escaped format, and it's not a recognized
>     interchange format.  So under what circumstances would one need to
>     use this?
>
> Some corporate data can be in CSV format with escaped unicode
> characters. Without this function it is not possible to decode these
> files without external application.

I would like some supporting documentation on this.  So far we only have one stackoverflow question, and then this implementation, and they are not even the same format.  My worry is that if there is not precise specification, then people are going to want to add things in the future, and there will be no way to analyze such requests in a principled way.
On Mon 30 Nov 2020 at 14:14, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-11-29 18:36, Pavel Stehule wrote:
>
> I don't really get the point of this function. There is AFAICT no
> function to produce this escaped format, and it's not a recognized
> interchange format. So under what circumstances would one need to
> use this?
>
>
> Some corporate data can be in CSV format with escaped unicode
> characters. Without this function it is not possible to decode these
> files without external application.
I would like some supporting documentation on this. So far we only have
one stackoverflow question, and then this implementation, and they are
not even the same format. My worry is that if there is not precise
specification, then people are going to want to add things in the
future, and there will be no way to analyze such requests in a
principled way.
I checked this, and it is the "prefix backslash-u hex" format used by Java, JavaScript or RTF - https://billposer.org/Software/ListOfRepresentations.html
In some languages (Python), there is a decoder "unicode-escape". Java has a method escapeJava for conversion from unicode to ascii. I can imagine that these data come from Java systems exported to 8-bit strings, so this implementation can be accepted as referential. This format is used by the native2ascii tool (https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html) too.

Postgres can decode this format too, and the patch is based on the Postgres implementation. I just implemented a different interface.
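For reference, the Python "unicode-escape" codec mentioned above decodes exactly this backslash-u format; a minimal sketch:

```python
# decode Java/Python-style \uXXXX escapes with Python's unicode_escape codec;
# the input holds literal backslash-u sequences, not already-decoded characters
s = 'Odpov\\u011Bdn\\u00E1 osoba'
decoded = s.encode('ascii').decode('unicode_escape')
print(decoded)  # Odpovědná osoba
```

Note this codec also interprets other backslash escapes (\n, \t, ...), so it is a superset of the plain backslash-u scheme discussed here.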
Currently the decode function does only the text->bytea transformation. Maybe more generic functions "decode_text" and "encode_text" would be better for similar cases (here we need a text->text transformation). But that looks like overengineering now.

Maybe we could introduce a new encoding "ascii" and implement new conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the cleanest solution. What do you think about it?
Regards
Pavel
On Mon 30 Nov 2020 at 22:15, Pavel Stehule <pavel.stehule@gmail.com> wrote:
a better name for the new encoding might be "unicode-escape" rather than "ascii". We use the "to_ascii" function for a different use case.
set client_encoding to unicode-escape;
copy tab from xxx;
...
but it doesn't help when only a few columns from the table are in unicode-escape format.
Regards
Pavel
On Mon 30 Nov 2020 at 22:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:
probably the most complete solution would consist of two steps:
1. introducing a new encoding - "ascii_unicode_escape" - with related conversions
2. introducing two new functions - text_escape and text_unescape - with two parameters: source text and conversion name
select text_convert_to('Тимати', 'ascii_unicode_escape')
\u0422\u0438\u043c\u0430\u0442\u0438 .. result is text
select text_convert_from('\u0422\u0438\u043c\u0430\u0442\u0438', 'ascii_unicode_escape')
┌──────────┐
│ ?column? │
╞══════════╡
│ Тимати │
└──────────┘
(1 row)
Regards
Pavel
>> On Mon 30 Nov 2020 at 22:15, Pavel Stehule <pavel.stehule@gmail.com>
>> wrote:
>>> I checked this and it is "prefix backslash-u hex" used by Java,
>>> JavaScript or RTF -
>>> https://billposer.org/Software/ListOfRepresentations.html

If I look on that page, it appears that RTF is using a similar-looking escape but in decimal rather than hex.

It would be important to define what is done with non-BMP characters. Will there be another escape for a six- or eight-hexdigit format for the codepoint, or will it be represented by two four-hexdigit escapes for consecutive UTF-16 surrogates?

Regards,
-Chap
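For background, the UTF-16 surrogate-pair arithmetic in question works like this (a sketch for illustration, not code from the patch):

```python
def from_surrogate_pair(hi, lo):
    # combine a UTF-16 high/low surrogate pair into one codepoint:
    # take 10 bits from each surrogate and add the 0x10000 offset
    assert 0xD800 <= hi <= 0xDBFF, "high surrogate expected"
    assert 0xDC00 <= lo <= 0xDFFF, "low surrogate expected"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# U+1F600 (a non-BMP emoji) is written \uD83D\uDE00 in Java's four-digit scheme
print(hex(from_surrogate_pair(0xD83D, 0xDE00)))  # 0x1f600
```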
On Tue 1 Dec 2020 at 20:20, Chapman Flack <chap@anastigmatix.net> wrote:
>> On Mon 30 Nov 2020 at 22:15, Pavel Stehule <pavel.stehule@gmail.com>
>> wrote:
>>> I checked this and it is "prefix backslash-u hex" used by Java,
>>> JavaScript or RTF -
>>> https://billposer.org/Software/ListOfRepresentations.html
If I look on that page, it appears that RTF is using a similar-looking
escape but in decimal rather than hex.
It would be important to define what is done with non-BMP characters?
Will there be another escape for a six- or eight-hexdigit format for
the codepoint, or will it be represented by two four-hexdigit escapes
for consecutive UTF-16 surrogates?
the detection of decimal or hexadecimal codes can be a hard problem - the string "12" is valid in both systems, but the numbers are different. So there would have to be an external specification as an argument.
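The ambiguity is easy to see with a one-line check:

```python
# the same digit string "12" names different codepoints in different radixes
decimal_cp = int("12", 10)   # 12 -> U+000C (form feed)
hex_cp = int("12", 16)       # 18 -> U+0012
print(decimal_cp, hex_cp)    # 12 18
```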
Regards
Pavel
On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> On 2020-11-29 18:36, Pavel Stehule wrote:
>> I don't really get the point of this function.  There is AFAICT no
>> function to produce this escaped format, and it's not a recognized
>> interchange format.  So under what circumstances would one need to
>> use this?
>>
>> Some corporate data can be in CSV format with escaped unicode
>> characters. Without this function it is not possible to decode these
>> files without external application.
>
> I would like some supporting documentation on this.  So far we only
> have one stackoverflow question, and then this implementation, and
> they are not even the same format.  My worry is that if there is not
> precise specification, then people are going to want to add things in
> the future, and there will be no way to analyze such requests in a
> principled way.

Also, should this be an extension? I'm dubious about including such marginal uses in the core code unless there's a really good case for it.

cheers

andrew
On Wed 2 Dec 2020 at 0:05, Andrew Dunstan <andrew@dunslane.net> wrote:
On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> On 2020-11-29 18:36, Pavel Stehule wrote:
>>
>> I don't really get the point of this function. There is AFAICT no
>> function to produce this escaped format, and it's not a recognized
>> interchange format. So under what circumstances would one need to
>> use this?
>>
>>
>> Some corporate data can be in CSV format with escaped unicode
>> characters. Without this function it is not possible to decode these
>> files without external application.
>
> I would like some supporting documentation on this. So far we only
> have one stackoverflow question, and then this implementation, and
> they are not even the same format. My worry is that if there is not
> precise specification, then people are going to want to add things in
> the future, and there will be no way to analyze such requests in a
> principled way.
>
>
>
Also, should this be an extension? I'm dubious about including such
marginal uses in the core code unless there's a really good case for it.
I am not sure, but I am inclined to think it should be core functionality.
1. Although this use case is marginal, it is related to the most used encodings - ascii and unicode. 8-bit encodings enhanced with escaped multibyte characters will be used for a very long time. Unfortunately this will get worse, because Postgres will be used more in corporate environments, where there is more pressure to conserve very legacy technologies without correct multibyte support. The core problem is that this issue falls outside the bytea -> text or text -> bytea transformations supported by Postgres. This is a text -> text transformation (for almost all encodings based on ascii), which Postgres does not support now.
2. Postgres already has this functionality - but unfortunately it is limited to literal constants only.
create or replace function uunescape(text)
returns text as $$
declare r text;
begin
  -- don't use this code!!!
  execute 'select e''' || $1 || '''' into r;
  return r;
end;
$$ language plpgsql immutable;
But the only way anybody can use it is SQL-injection vulnerable and slow. So a simple built-in solution can be protection against future security issues. Personally, I am happy with just this limited function that is safe (although a design based on introducing a new encoding and conversions would be more complete and accurate). I agree that this case is marginal, but it is a fully valid use case, and supporting unicode escape codes only in the parser is a needless limit.
3. there are new disadvantages of extensions in current DBaaS times. Unless an extension is directly accepted by a cloud provider, it is not available to users. The acceptance of extensions is not very agile - so moving this code to an extension doesn't solve the problem. Without DBaaS, implementing this feature as an extension could be good enough.
Regards
Pavel
cheers
andrew
On 2020-11-30 22:15, Pavel Stehule wrote:
>     I would like some supporting documentation on this.  So far we only
>     have one stackoverflow question, and then this implementation, and
>     they are not even the same format.  My worry is that if there is not
>     precise specification, then people are going to want to add things
>     in the future, and there will be no way to analyze such requests in
>     a principled way.
>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript or RTF -
> https://billposer.org/Software/ListOfRepresentations.html

Heh.  The fact that there is a table of two dozen possible representations kind of proves my point that we should be deliberate in picking one.

I do see Oracle unistr() on that list, which appears to be very similar to what you are trying to do here.  Maybe look into aligning with that.
On Wed 2 Dec 2020 at 9:23, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-11-30 22:15, Pavel Stehule wrote:
> I would like some supporting documentation on this. So far we only
> have
> one stackoverflow question, and then this implementation, and they are
> not even the same format. My worry is that if there is not precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.
>
>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript or RTF -
> https://billposer.org/Software/ListOfRepresentations.html
Heh. The fact that there is a table of two dozen possible
representations kind of proves my point that we should be deliberate in
picking one.
I do see Oracle unistr() on that list, which appears to be very similar
to what you are trying to do here. Maybe look into aligning with that.
unistr is a primitive form of the proposed function. But it can be used as a base. The format is compatible with our "4.1.2.3. String Constants with Unicode Escapes".

What do you think about the following proposal?

1. unistr(text) .. compatible with Postgres unicode escapes - it is enhanced over Oracle, because Oracle's unistr doesn't support 6-digit unicode escapes.

2. there can be an optional parameter "prefix" with default "\". With "\u" it can be compatible with Java or Python.

What do you think about it?
Pavel
On 12/2/20 12:48 AM, Pavel Stehule wrote:
> On Wed 2 Dec 2020 at 0:05, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Also, should this be an extension? I'm dubious about including such
>> marginal uses in the core code unless there's a really good case
>> for it.
>
> [...]
> 3. there are new disadvantages of extensions in current DBaaS times.
> Until the extension is not directly accepted by a cloud provider, then
> the extension is not available for users. The acceptance of extensions
> is not too agile - so moving this code to extension doesn't solve this
> problem. Without DBaaS the implementation of this feature as the
> extensions can be good enough.

That argument can apply to any extension someone wants to use. If your DBaaS provider doesn't support some extension you need to lobby them or find another that does support it, rather than try to put it in core code. Some extensions, such as untrusted PLs, will naturally almost never be supported by DBaaS providers because they are inherently unsafe. That's not the case here.

cheers

andrew
On 12/02/20 05:37, Pavel Stehule wrote:
> 2. there can be optional parameter "prefix" with default "\". But with "\u"
> it can be compatible with Java or Python.

Java's unicode escape form is one of those early ones that lack a six-digit form, and where any character outside of the basic multilingual plane has to be represented by two four-digit escapes in a row, encoding the two surrogates that would make up the character's representation in UTF-16.

Obviously that's an existing form that's out there, so it's not a bad thing to have some kind of support for it, but it's not a great representation to encourage people to use.

Python, by contrast, has both \uxxxx and \Uxxxxxxxx where you would use the latter to represent a non-BMP character directly. So the Java and Python schemes should be considered distinct.

In Perl, there is a useful extension to regexp substitution where you specify the replacement not as a string or even a string with & and \1 \2 ... magic, but as essentially a lambda that is passed the match and returns a computed replacement. That makes conversions of the sort discussed here generally trivial to implement.

Would it be worth considering to add something of general utility like that, and then there could be a small library of pure SQL functions (or a wiki page or GitHub gist) covering a bunch of the two dozen representations on that page linked above?

Regards,
-Chap
On 12/02/20 09:55, Chapman Flack wrote:
> In Perl, there is a useful extension to regexp substitution where
> you specify the replacement not as a string or even a string with &
> and \1 \2 ... magic, but as essentially a lambda that is passed the
> match and returns a computed replacement. That makes conversions of
> the sort discussed here generally trivial to implement.

Python, I should have added, allows that also. Java too, since release 9.

Regards,
-Chap
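The lambda-replacement approach described here can be sketched in Python; the helper name is hypothetical, and this sketch handles only the four-hexdigit form, ignoring surrogate pairing:

```python
import re

def unescape_backslash_u(s):
    # replace each \uXXXX escape with the character it names;
    # the replacement is computed by a function, not a fixed string
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  s)

print(unescape_backslash_u(r'Odpov\u011Bdn\u00E1 osoba'))  # Odpovědná osoba
```

Swapping in a different pattern and replacement function covers most of the other representations on the linked page, which is the generality being argued for.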
On Wed 2 Dec 2020 at 11:37, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I thought about it a little bit more, and the prefix specification does not make much sense (more so if we implement this functionality as the function "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is the same. Now it supports the Oracle convention, Java and Python (for Python, \UXXXXXXXX) and \+XXXXXX. These formats were already supported. The compatibility with Oracle is nice.
postgres=# select
'Arabic : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' ) || '
Chinese : ' || unistr( '\4E2D\6587' ) || '
English : ' || unistr( 'English' ) || '
French : ' || unistr( 'Fran\00E7ais' ) || '
German : ' || unistr( 'Deutsch' ) || '
Greek : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) || '
Hebrew : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' ) || '
Japanese : ' || unistr( '\65E5\672C\8A9E' ) || '
Korean : ' || unistr( '\D55C\AD6D\C5B4' ) || '
Portuguese : ' || unistr( 'Portugu\00EAs' ) || '
Russian : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' ) || '
Spanish : ' || unistr( 'Espa\00F1ol' ) || '
Thai : ' || unistr( '\0E44\0E17\0E22' )
as unicode_test_string;
┌──────────────────────────┐
│ unicode_test_string │
╞══════════════════════════╡
│ Arabic : العربية ↵│
│ Chinese : 中文 ↵│
│ English : English ↵│
│ French : Français ↵│
│ German : Deutsch ↵│
│ Greek : Ελληνικά ↵│
│ Hebrew : עברית ↵│
│ Japanese : 日本語 ↵│
│ Korean : 한국어 ↵│
│ Portuguese : Português↵│
│ Russian : Русский ↵│
│ Spanish : Español ↵│
│ Thai : ไทย │
└──────────────────────────┘
(1 row)
postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│ unistr │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)
New patch attached
Regards
Pavel
Attachment
On Wed, Dec 2, 2020 at 07:30:39PM +0100, Pavel Stehule wrote:
> ┌──────────────────────────┐
> │ unicode_test_string │
> ╞══════════════════════════╡
> │ Arabic : العربية ↵│
> │ Chinese : 中文 ↵│
> │ English : English ↵│
> │ French : Français ↵│
> │ German : Deutsch ↵│
> │ Greek : Ελληνικά ↵│
> │ Hebrew : עברית ↵│
> │ Japanese : 日本語 ↵│
> │ Korean : 한국어 ↵│
> │ Portuguese : Português↵│
> │ Russian : Русский ↵│
> │ Spanish : Español ↵│
> │ Thai : ไทย │
> └──────────────────────────┘

Offlist, this table output is super-cool!

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
On 10.03.21 14:52, David Steele wrote:
>> I thought about it a little bit more, and the prefix specification
>> does not make much sense (more so if we implement this functionality
>> as the function "unistr"). I removed the optional argument and renamed
>> the function to "unistr". The functionality is the same. Now it
>> supports the Oracle convention, Java and Python (for Python
>> UXXXXXXXX) and \+XXXXXX. These formats were already supported. The
>> compatibility with Oracle is nice.
>
> Peter, it looks like Pavel has aligned this function with unistr() as
> you suggested. Thoughts?

I haven't read through the patch in detail yet, but I support the
proposed details of the functionality.
On 25.03.21 10:44, Peter Eisentraut wrote:
>
> On 10.03.21 14:52, David Steele wrote:
>>> I thought about it a little bit more, and the prefix specification
>>> does not make much sense (more so if we implement this functionality
>>> as the function "unistr"). I removed the optional argument and renamed
>>> the function to "unistr". The functionality is the same. Now it
>>> supports the Oracle convention, Java and Python (for Python
>>> UXXXXXXXX) and \+XXXXXX. These formats were already supported. The
>>> compatibility with Oracle is nice.
>>
>> Peter, it looks like Pavel has aligned this function with unistr() as
>> you suggested. Thoughts?
>
> I haven't read through the patch in detail yet, but I support the
> proposed details of the functionality.

Committed.

I made two major changes: I moved the tests from unicode.sql to
strings.sql. The first file is for tests that only work in UTF8
encoding, which is not the case here. Also, I wasn't comfortable with
exposing little utility functions from the parser in an ad hoc way. So
I made local copies, which also allows us to make more
locally-appropriate error messages. I think there is some potential for
refactoring here (see also src/common/hex.c), but that's perhaps better
done separately and more comprehensively.
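As discussed above, the committed unistr accepts several escape syntaxes: \XXXX (4 hex digits, the Oracle convention), \uXXXX (Java style), \UXXXXXXXX (Python style), and \+XXXXXX, with a doubled backslash for a literal backslash. A hedged Python sketch of the equivalent substitution (the name `unistr_sketch` is invented for illustration; the committed C implementation additionally combines UTF-16 surrogate pairs and rejects invalid code points, which this sketch omits):

```python
import re

# One alternation per accepted escape form, tried in this order.
_ESCAPE = re.compile(
    r'\\(\\'                 # \\        -> literal backslash
    r'|u[0-9A-Fa-f]{4}'      # \uXXXX    (Java style)
    r'|U[0-9A-Fa-f]{8}'      # \UXXXXXXXX (Python style)
    r'|\+[0-9A-Fa-f]{6}'     # \+XXXXXX
    r'|[0-9A-Fa-f]{4})'      # \XXXX     (Oracle style)
)

def unistr_sketch(s: str) -> str:
    def repl(m):
        t = m.group(1)
        if t == '\\':
            return '\\'
        # Strip the leading u/U/+ marker if present, then decode the hex digits.
        digits = t[1:] if t[0] in 'uU+' else t
        return chr(int(digits, 16))
    return _ESCAPE.sub(repl, s)

print(unistr_sketch(r'd\0061t\+000061'))     # data
print(unistr_sketch(r'd\u0061t\U00000061'))  # data
```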
po 29. 3. 2021 v 12:19 odesílatel Peter Eisentraut <peter.eisentraut@enterprisedb.com> napsal:
On 25.03.21 10:44, Peter Eisentraut wrote:
>
> On 10.03.21 14:52, David Steele wrote:
>>> I thought about it a little bit more, and the prefix specification
>>> does not make much sense (more so if we implement this functionality
>>> as the function "unistr"). I removed the optional argument and renamed
>>> the function to "unistr". The functionality is the same. Now it
>>> supports the Oracle convention, Java and Python (for Python
>>> UXXXXXXXX) and \+XXXXXX. These formats were already supported. The
>>> compatibility with Oracle is nice.
>>
>> Peter, it looks like Pavel has aligned this function with unistr() as
>> you suggested. Thoughts?
>
> I haven't read through the patch in detail yet, but I support the
> proposed details of the functionality.
Committed.
I made two major changes: I moved the tests from unicode.sql to
strings.sql. The first file is for tests that only work in UTF8
encoding, which is not the case here. Also, I wasn't comfortable with
exposing little utility functions from the parser in an ad hoc way. So
I made local copies, which also allows us to make more
locally-appropriate error messages. I think there is some potential for
refactoring here (see also src/common/hex.c), but that's perhaps better
done separately and more comprehensively.
Thank you very much
Pavel