Thread: proposal: unescape_text function
Hi
There is a user request for an unescape function in core.
This request is about the possibility of doing, via a functional interface, what today can be done only with string literals.
I wrote a plpgsql function, but a built-in function could be simpler:
CREATE OR REPLACE FUNCTION public.unescape(text, text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
  EXECUTE format('SELECT U&%s UESCAPE %s',
                 quote_literal(replace($1, '\u', '^')),
                 quote_literal($2))
    INTO result;
  RETURN result;
END;
$function$
postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
    unescape
-----------------
 Odpovědná osoba
(1 row)
What do you think about this?
Regards
Pavel
On Mon 22 Jun 2020 at 5:48, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I changed the name to the more accurate "unicode_unescape". A patch is attached.
Regards
Pavel
Attachment
> On 23 Jun 2020, at 11:51, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I changed the name to more accurately "unicode_unescape". Patch is assigned

You've made this function return Oid, where it used to be void.  Was that a
copy-paste mistake?  Else the code needs fixing as it doesn't return an Oid.

+Oid
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape value")));
+}

cheers ./daniel
On Thu 2 Jul 2020 at 17:27, Daniel Gustafsson <daniel@yesql.se> wrote:
> On 23 Jun 2020, at 11:51, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I changed the name to more accurately "unicode_unescape". Patch is assigned
You've made this function return Oid, where it used to be void. Was that a
copy-paste mistake? Else the code needs fixing as it doesn't return an Oid.
+Oid
+check_unicode_value(pg_wchar c)
+{
+ if (!is_valid_unicode_codepoint(c))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid Unicode escape value")));
+}
Yes, it is my error.
I am sending a fixed patch.
Thank you for the check.
Pavel
Attachment
On Thu, Jul 2, 2020 at 10:10 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi Pavel,
Since the idea originated from unescaping unicode string literals, i.e.
select unescape('Odpov\u011Bdn\u00E1 osoba');
Shouldn't the built-in function support the above syntax as well?
--
Asif Rehman
Hi
> Since the idea originated from unescaping unicode string literals i.e.
> select unescape('Odpov\u011Bdn\u00E1 osoba');
> Shouldn't the built-in function support the above syntax as well?
good idea. The prefixes u (4 digits) and U (8 digits) are supported
Regards
Pavel
Attachment
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi,

The patch looks good to me.

The new status of this patch is: Ready for Committer
On Wed 29 Jul 2020 at 8:18, Pavel Stehule <pavel.stehule@gmail.com> wrote:
rebase
Regards
Pavel
Attachment
On 2020-10-07 11:00, Pavel Stehule wrote:
> Since the idea originated from unescaping unicode string
> literals i.e.
> select unescape('Odpov\u011Bdn\u00E1 osoba');
>
> Shouldn't the built-in function support the above syntax as well?
>
> good idea. The prefixes u (4 digits) and U (8 digits) are supported

I don't really get the point of this function.  There is AFAICT no function to produce this escaped format, and it's not a recognized interchange format.  So under what circumstances would one need to use this?
On Fri 27 Nov 2020 at 15:37, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-10-07 11:00, Pavel Stehule wrote:
> Since the idea originated from unescaping unicode string
> literals i.e.
> select unescape('Odpov\u011Bdn\u00E1 osoba');
>
> Shouldn't the built-in function support the above syntax as well?
>
>
> good idea. The prefixes u (4 digits) and U (8 digits) are supported
I don't really get the point of this function. There is AFAICT no
function to produce this escaped format, and it's not a recognized
interchange format. So under what circumstances would one need to use this?
Some corporate data can come in CSV format with escaped unicode characters. Without this function it is not possible to decode such files without an external application.
Postgres has support for this conversion, but only for string literals.
CREATE OR REPLACE FUNCTION public.unescape(text, text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
  EXECUTE format('SELECT U&%s UESCAPE %s',
                 quote_literal(replace($1, '\u', '^')),
                 quote_literal($2))
    INTO result;
  RETURN result;
END;
$function$
Because unicode is a major encoding, I think this conversion should be supported. There is another question about the implementation: the unicode_unescape function as implemented in this patch, or some new conversion. Using a conversion (https://www.postgresql.org/docs/current/sql-createconversion.html) is probably better, but I am not sure how intuitive it is, and it is hard to use (without ugly workarounds) in plpgsql.
I don't expect that Postgres should produce data in unicode-escaped format, but it can be useful if Postgres can work with data in a special format of a major encoding.
postgres=# select convert_from(E'Odpov\u011Bdn\u00E1 osoba', 'UTF8');
┌─────────────────┐
│ convert_from │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)
I can do this with bytea, but it is hard to use with text fields.
I didn't find any way to do it without ugly steps.
Regards
Pavel
On 2020-11-29 18:36, Pavel Stehule wrote:
>     I don't really get the point of this function.  There is AFAICT no
>     function to produce this escaped format, and it's not a recognized
>     interchange format.  So under what circumstances would one need to
>     use this?
>
> Some corporate data can be in CSV format with escaped unicode
> characters. Without this function it is not possible to decode these
> files without external application.

I would like some supporting documentation on this.  So far we only have one stackoverflow question, and then this implementation, and they are not even the same format.  My worry is that if there is not precise specification, then people are going to want to add things in the future, and there will be no way to analyze such requests in a principled way.
On Mon 30 Nov 2020 at 14:14, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-11-29 18:36, Pavel Stehule wrote:
>
> I don't really get the point of this function. There is AFAICT no
> function to produce this escaped format, and it's not a recognized
> interchange format. So under what circumstances would one need to
> use this?
>
>
> Some corporate data can be in CSV format with escaped unicode
> characters. Without this function it is not possible to decode these
> files without external application.
I would like some supporting documentation on this. So far we only have
one stackoverflow question, and then this implementation, and they are
not even the same format. My worry is that if there is not precise
specification, then people are going to want to add things in the
future, and there will be no way to analyze such requests in a
principled way.
I checked this, and it is the "prefix backslash-u hex" format used by Java, JavaScript or RTF - https://billposer.org/Software/ListOfRepresentations.html
In some languages (Python), there is a decoder "unicode-escape". Java has a method escapeJava for conversion from unicode to ascii. I can imagine that these data come from Java systems exported to 8-bit strings, so this implementation can be accepted as referential. This format is used by the native2ascii tool (https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html) too.

Postgres can decode this format too, and the patch is based on the Postgres implementation. I just implemented a different interface.
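For reference, the Python "unicode-escape" codec mentioned above decodes exactly this backslash-u format; a minimal sketch:

```python
# decode Java/Python-style \uXXXX escapes with Python's unicode_escape codec;
# the input holds literal backslash-u sequences, not already-decoded characters
s = 'Odpov\\u011Bdn\\u00E1 osoba'
decoded = s.encode('ascii').decode('unicode_escape')
print(decoded)  # Odpovědná osoba
```

Note this codec also interprets other backslash escapes (\n, \t, ...), so it is a superset of the plain backslash-u scheme discussed here.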
Currently the decode function does only the text->bytea transformation. Maybe more generic functions "decode_text" and "encode_text" would be better for similar cases (here we need a text->text transformation). But that looks like overengineering now.

Maybe we could introduce a new encoding "ascii" and implement new conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the cleanest solution. What do you think about it?
Regards
Pavel
On Mon 30 Nov 2020 at 22:15, Pavel Stehule <pavel.stehule@gmail.com> wrote:
a better name for the new encoding might be "unicode-escape" rather than "ascii". We use the "to_ascii" function for a different use case.
set client_encoding to unicode-escape;
copy tab from xxx;
...
but it doesn't help when only a few columns from the table are in unicode-escape format.
Regards
Pavel
On Mon 30 Nov 2020 at 22:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:
probably the most complete solution would consist of two steps:
1. introducing a new encoding - "ascii_unicode_escape" - with related conversions
2. introducing two new functions - text_escape and text_unescape - with two parameters: source text and conversion name
select text_convert_to('Тимати', 'ascii_unicode_escape')
\u0422\u0438\u043c\u0430\u0442\u0438 .. result is text
select text_convert_from('\u0422\u0438\u043c\u0430\u0442\u0438', 'ascii_unicode_escape')
┌──────────┐
│ ?column? │
╞══════════╡
│ Тимати │
└──────────┘
(1 row)
Regards
Pavel
>> On Mon 30 Nov 2020 at 22:15, Pavel Stehule <pavel.stehule@gmail.com>
>> wrote:
>>> I checked this and it is "prefix backslash-u hex" used by Java,
>>> JavaScript or RTF -
>>> https://billposer.org/Software/ListOfRepresentations.html

If I look on that page, it appears that RTF is using a similar-looking escape but in decimal rather than hex.

It would be important to define what is done with non-BMP characters. Will there be another escape for a six- or eight-hexdigit format for the codepoint, or will it be represented by two four-hexdigit escapes for consecutive UTF-16 surrogates?

Regards,
-Chap
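For background, the UTF-16 surrogate-pair arithmetic in question works like this (a sketch for illustration, not code from the patch):

```python
def from_surrogate_pair(hi, lo):
    # combine a UTF-16 high/low surrogate pair into one codepoint:
    # take 10 bits from each surrogate and add the 0x10000 offset
    assert 0xD800 <= hi <= 0xDBFF, "high surrogate expected"
    assert 0xDC00 <= lo <= 0xDFFF, "low surrogate expected"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# U+1F600 (a non-BMP emoji) is written \uD83D\uDE00 in Java's four-digit scheme
print(hex(from_surrogate_pair(0xD83D, 0xDE00)))  # 0x1f600
```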
On Tue 1 Dec 2020 at 20:20, Chapman Flack <chap@anastigmatix.net> wrote:
>> On Mon 30 Nov 2020 at 22:15, Pavel Stehule <pavel.stehule@gmail.com>
>> wrote:
>>> I checked this and it is "prefix backslash-u hex" used by Java,
>>> JavaScript or RTF -
>>> https://billposer.org/Software/ListOfRepresentations.html
If I look on that page, it appears that RTF is using a similar-looking
escape but in decimal rather than hex.
It would be important to define what is done with non-BMP characters?
Will there be another escape for a six- or eight-hexdigit format for
the codepoint, or will it be represented by two four-hexdigit escapes
for consecutive UTF-16 surrogates?
the detection of decimal or hexadecimal codes can be a hard problem - the string "12" is valid in both systems, but the numbers are different. So there would have to be an external specification as an argument.
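The ambiguity is easy to see with a one-line check:

```python
# the same digit string "12" names different codepoints in different radixes
decimal_cp = int("12", 10)   # 12 -> U+000C (form feed)
hex_cp = int("12", 16)       # 18 -> U+0012
print(decimal_cp, hex_cp)    # 12 18
```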
Regards
Pavel
On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> On 2020-11-29 18:36, Pavel Stehule wrote:
>> I don't really get the point of this function.  There is AFAICT no
>> function to produce this escaped format, and it's not a recognized
>> interchange format.  So under what circumstances would one need to
>> use this?
>>
>> Some corporate data can be in CSV format with escaped unicode
>> characters. Without this function it is not possible to decode these
>> files without external application.
>
> I would like some supporting documentation on this.  So far we only
> have one stackoverflow question, and then this implementation, and
> they are not even the same format.  My worry is that if there is not
> precise specification, then people are going to want to add things in
> the future, and there will be no way to analyze such requests in a
> principled way.

Also, should this be an extension? I'm dubious about including such marginal uses in the core code unless there's a really good case for it.

cheers

andrew
On Wed 2 Dec 2020 at 0:05, Andrew Dunstan <andrew@dunslane.net> wrote:
On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> On 2020-11-29 18:36, Pavel Stehule wrote:
>>
>> I don't really get the point of this function. There is AFAICT no
>> function to produce this escaped format, and it's not a recognized
>> interchange format. So under what circumstances would one need to
>> use this?
>>
>>
>> Some corporate data can be in CSV format with escaped unicode
>> characters. Without this function it is not possible to decode these
>> files without external application.
>
> I would like some supporting documentation on this. So far we only
> have one stackoverflow question, and then this implementation, and
> they are not even the same format. My worry is that if there is not
> precise specification, then people are going to want to add things in
> the future, and there will be no way to analyze such requests in a
> principled way.
>
>
>
Also, should this be an extension? I'm dubious about including such
marginal uses in the core code unless there's a really good case for it.
I am not sure, but I am inclined to think it should be core functionality.
1. Although this use case is marginal, it is related to the most used encodings - ascii and unicode. 8-bit encodings enhanced with escaped multibyte characters will be used for a very long time. Unfortunately this will get worse, because Postgres will be used more in corporate environments, where there is more pressure to conserve very legacy technologies without correct multibyte support. The core problem is that this issue falls outside the bytea -> text or text -> bytea transformations supported by Postgres. This is a text -> text transformation (for almost all encodings based on ascii), which Postgres does not support now.
2. Postgres already has this functionality - but unfortunately it is limited to literal constants only.
create or replace function uunescape(text)
returns text as $$
declare r text;
begin
  -- don't use this code!!!
  execute 'select e''' || $1 || '''' into r;
  return r;
end;
$$ language plpgsql immutable;
But the only way anybody can use it is SQL-injection vulnerable and slow. So a simple built-in solution can be protection against future security issues. Personally, I am happy with just this limited function that is safe (although a design based on introducing a new encoding and conversions would be more complete and accurate). I agree that this case is marginal, but it is a fully valid use case, and supporting unicode escape codes only in the parser is a needless limit.
3. there are new disadvantages of extensions in current DBaaS times. Unless an extension is directly accepted by a cloud provider, it is not available to users. The acceptance of extensions is not very agile - so moving this code to an extension doesn't solve the problem. Without DBaaS, implementing this feature as an extension could be good enough.
Regards
Pavel
cheers
andrew
On 2020-11-30 22:15, Pavel Stehule wrote:
>     I would like some supporting documentation on this.  So far we only
>     have one stackoverflow question, and then this implementation, and
>     they are not even the same format.  My worry is that if there is not
>     precise specification, then people are going to want to add things
>     in the future, and there will be no way to analyze such requests in
>     a principled way.
>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript or RTF -
> https://billposer.org/Software/ListOfRepresentations.html

Heh.  The fact that there is a table of two dozen possible representations kind of proves my point that we should be deliberate in picking one.

I do see Oracle unistr() on that list, which appears to be very similar to what you are trying to do here.  Maybe look into aligning with that.
On Wed 2 Dec 2020 at 9:23, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-11-30 22:15, Pavel Stehule wrote:
> I would like some supporting documentation on this. So far we only
> have
> one stackoverflow question, and then this implementation, and they are
> not even the same format. My worry is that if there is not precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.
>
>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript or RTF -
> https://billposer.org/Software/ListOfRepresentations.html
Heh. The fact that there is a table of two dozen possible
representations kind of proves my point that we should be deliberate in
picking one.
I do see Oracle unistr() on that list, which appears to be very similar
to what you are trying to do here. Maybe look into aligning with that.
unistr is a primitive form of the proposed function. But it can be used as a base. The format is compatible with our "4.1.2.3. String Constants with Unicode Escapes".

What do you think about the following proposal?

1. unistr(text) .. compatible with Postgres unicode escapes - it is enhanced over Oracle, because Oracle's unistr doesn't support 6-digit unicode escapes.

2. there can be an optional parameter "prefix" with default "\". With "\u" it can be compatible with Java or Python.

What do you think about it?
Pavel
On 12/2/20 12:48 AM, Pavel Stehule wrote:
> On Wed 2 Dec 2020 at 0:05, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Also, should this be an extension? I'm dubious about including such
>> marginal uses in the core code unless there's a really good case
>> for it.
>
> [...]
> 3. there are new disadvantages of extensions in current DBaaS times.
> Until the extension is not directly accepted by a cloud provider, then
> the extension is not available for users. The acceptance of extensions
> is not too agile - so moving this code to extension doesn't solve this
> problem. Without DBaaS the implementation of this feature as the
> extensions can be good enough.

That argument can apply to any extension someone wants to use. If your DBaaS provider doesn't support some extension you need to lobby them or find another that does support it, rather than try to put it in core code. Some extensions, such as untrusted PLs, will naturally almost never be supported by DBaaS providers because they are inherently unsafe. That's not the case here.

cheers

andrew
On 12/02/20 05:37, Pavel Stehule wrote:
> 2. there can be optional parameter "prefix" with default "\". But with "\u"
> it can be compatible with Java or Python.

Java's unicode escape form is one of those early ones that lack a six-digit form, and where any character outside of the basic multilingual plane has to be represented by two four-digit escapes in a row, encoding the two surrogates that would make up the character's representation in UTF-16.

Obviously that's an existing form that's out there, so it's not a bad thing to have some kind of support for it, but it's not a great representation to encourage people to use.

Python, by contrast, has both \uxxxx and \Uxxxxxxxx where you would use the latter to represent a non-BMP character directly. So the Java and Python schemes should be considered distinct.

In Perl, there is a useful extension to regexp substitution where you specify the replacement not as a string or even a string with & and \1 \2 ... magic, but as essentially a lambda that is passed the match and returns a computed replacement. That makes conversions of the sort discussed here generally trivial to implement.

Would it be worth considering to add something of general utility like that, and then there could be a small library of pure SQL functions (or a wiki page or GitHub gist) covering a bunch of the two dozen representations on that page linked above?

Regards,
-Chap
On 12/02/20 09:55, Chapman Flack wrote:
> In Perl, there is a useful extension to regexp substitution where
> you specify the replacement not as a string or even a string with &
> and \1 \2 ... magic, but as essentially a lambda that is passed the
> match and returns a computed replacement. That makes conversions of
> the sort discussed here generally trivial to implement.

Python, I should have added, allows that also. Java too, since release 9.

Regards,
-Chap
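The lambda-replacement approach described here can be sketched in Python; the helper name is hypothetical, and this sketch handles only the four-hexdigit form, ignoring surrogate pairing:

```python
import re

def unescape_backslash_u(s):
    # replace each \uXXXX escape with the character it names;
    # the replacement is computed by a function, not a fixed string
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  s)

print(unescape_backslash_u(r'Odpov\u011Bdn\u00E1 osoba'))  # Odpovědná osoba
```

Swapping in a different pattern and replacement function covers most of the other representations on the linked page, which is the generality being argued for.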
On Wed 2 Dec 2020 at 11:37, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I thought about it a little bit more, and the prefix specification does not make much sense (more so if we implement this functionality as the function "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is the same. Now it supports the Oracle convention, Java and Python (for Python, \UXXXXXXXX) and \+XXXXXX. These formats were already supported. The compatibility with Oracle is nice.
postgres=# select
'Arabic : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' ) || '
Chinese : ' || unistr( '\4E2D\6587' ) || '
English : ' || unistr( 'English' ) || '
French : ' || unistr( 'Fran\00E7ais' ) || '
German : ' || unistr( 'Deutsch' ) || '
Greek : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) || '
Hebrew : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' ) || '
Japanese : ' || unistr( '\65E5\672C\8A9E' ) || '
Korean : ' || unistr( '\D55C\AD6D\C5B4' ) || '
Portuguese : ' || unistr( 'Portugu\00EAs' ) || '
Russian : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' ) || '
Spanish : ' || unistr( 'Espa\00F1ol' ) || '
Thai : ' || unistr( '\0E44\0E17\0E22' )
as unicode_test_string;
┌──────────────────────────┐
│ unicode_test_string │
╞══════════════════════════╡
│ Arabic : العربية ↵│
│ Chinese : 中文 ↵│
│ English : English ↵│
│ French : Français ↵│
│ German : Deutsch ↵│
│ Greek : Ελληνικά ↵│
│ Hebrew : עברית ↵│
│ Japanese : 日本語 ↵│
│ Korean : 한국어 ↵│
│ Portuguese : Português↵│
│ Russian : Русский ↵│
│ Spanish : Español ↵│
│ Thai : ไทย │
└──────────────────────────┘
(1 row)
postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│ unistr │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)
New patch attached
Regards
Pavel
Attachment
On Wed, Dec 2, 2020 at 07:30:39PM +0100, Pavel Stehule wrote:
> ┌──────────────────────────┐
> │ unicode_test_string │
> ╞══════════════════════════╡
> │ Arabic : العربية ↵│
> │ Chinese : 中文 ↵│
> │ English : English ↵│
> │ French : Français ↵│
> │ German : Deutsch ↵│
> │ Greek : Ελληνικά ↵│
> │ Hebrew : עברית ↵│
> │ Japanese : 日本語 ↵│
> │ Korean : 한국어 ↵│
> │ Portuguese : Português↵│
> │ Russian : Русский ↵│
> │ Spanish : Español ↵│
> │ Thai : ไทย │
> └──────────────────────────┘

Offlist, this table output is super-cool!

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
On 10.03.21 14:52, David Steele wrote:
>> I thought about it a little bit more, and the prefix specification
>> does not make much sense (more so if we implement this functionality
>> as the function "unistr"). I removed the optional argument and renamed
>> the function to "unistr". The functionality is the same. Now it
>> supports the Oracle convention, Java and Python (for Python
>> UXXXXXXXX) and \+XXXXXX. These formats were already supported. The
>> compatibility with Oracle is nice.
>
> Peter, it looks like Pavel has aligned this function with unistr() as
> you suggested. Thoughts?

I haven't read through the patch in detail yet, but I support the
proposed details of the functionality.
On 25.03.21 10:44, Peter Eisentraut wrote:
>
> On 10.03.21 14:52, David Steele wrote:
>>> I thought about it a little bit more, and the prefix specification
>>> does not make much sense (more so if we implement this functionality
>>> as the function "unistr"). I removed the optional argument and renamed
>>> the function to "unistr". The functionality is the same. Now it
>>> supports the Oracle convention, Java and Python (for Python
>>> UXXXXXXXX) and \+XXXXXX. These formats were already supported. The
>>> compatibility with Oracle is nice.
>>
>> Peter, it looks like Pavel has aligned this function with unistr() as
>> you suggested. Thoughts?
>
> I haven't read through the patch in detail yet, but I support the
> proposed details of the functionality.

Committed.

I made two major changes: I moved the tests from unicode.sql to
strings.sql. The first file is for tests that only work in UTF8
encoding, which is not the case here. Also, I wasn't comfortable with
exposing little utility functions from the parser in an ad hoc way. So
I made local copies, which also allows us to make more
locally-appropriate error messages. I think there is some potential for
refactoring here (see also src/common/hex.c), but that's perhaps better
done separately and more comprehensively.
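As discussed above, the committed unistr accepts several escape syntaxes: \XXXX (4 hex digits, the Oracle convention), \uXXXX (Java style), \UXXXXXXXX (Python style), and \+XXXXXX, with a doubled backslash for a literal backslash. A hedged Python sketch of the equivalent substitution (the name `unistr_sketch` is invented for illustration; the committed C implementation additionally combines UTF-16 surrogate pairs and rejects invalid code points, which this sketch omits):

```python
import re

# One alternation per accepted escape form, tried in this order.
_ESCAPE = re.compile(
    r'\\(\\'                 # \\        -> literal backslash
    r'|u[0-9A-Fa-f]{4}'      # \uXXXX    (Java style)
    r'|U[0-9A-Fa-f]{8}'      # \UXXXXXXXX (Python style)
    r'|\+[0-9A-Fa-f]{6}'     # \+XXXXXX
    r'|[0-9A-Fa-f]{4})'      # \XXXX     (Oracle style)
)

def unistr_sketch(s: str) -> str:
    def repl(m):
        t = m.group(1)
        if t == '\\':
            return '\\'
        # Strip the leading u/U/+ marker if present, then decode the hex digits.
        digits = t[1:] if t[0] in 'uU+' else t
        return chr(int(digits, 16))
    return _ESCAPE.sub(repl, s)

print(unistr_sketch(r'd\0061t\+000061'))     # data
print(unistr_sketch(r'd\u0061t\U00000061'))  # data
```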
po 29. 3. 2021 v 12:19 odesílatel Peter Eisentraut <peter.eisentraut@enterprisedb.com> napsal:
On 25.03.21 10:44, Peter Eisentraut wrote:
>
> On 10.03.21 14:52, David Steele wrote:
>>> I thought about it a little bit more, and the prefix specification
>>> does not make much sense (more so if we implement this functionality
>>> as the function "unistr"). I removed the optional argument and renamed
>>> the function to "unistr". The functionality is the same. Now it
>>> supports the Oracle convention, Java and Python (for Python
>>> UXXXXXXXX) and \+XXXXXX. These formats were already supported. The
>>> compatibility with Oracle is nice.
>>
>> Peter, it looks like Pavel has aligned this function with unistr() as
>> you suggested. Thoughts?
>
> I haven't read through the patch in detail yet, but I support the
> proposed details of the functionality.
Committed.
I made two major changes: I moved the tests from unicode.sql to
strings.sql. The first file is for tests that only work in UTF8
encoding, which is not the case here. Also, I wasn't comfortable with
exposing little utility functions from the parser in an ad hoc way. So
I made local copies, which also allows us to make more
locally-appropriate error messages. I think there is some potential for
refactoring here (see also src/common/hex.c), but that's perhaps better
done separately and more comprehensively.
Thank you very much
Pavel