Re: proposal: unescape_text function - Mailing list pgsql-hackers
From | Pavel Stehule |
---|---|
Subject | Re: proposal: unescape_text function |
Date | |
Msg-id | CAFj8pRC1UufDW45WOFz5rH6uiOTBaWU-sQ5BLkEyeAiV9M6VLA@mail.gmail.com |
In response to | Re: proposal: unescape_text function (Pavel Stehule <pavel.stehule@gmail.com>) |
Responses |
Re: proposal: unescape_text function
|
List | pgsql-hackers |
On Wed, Dec 2, 2020 at 11:37, Pavel Stehule <pavel.stehule@gmail.com> wrote:
On Wed, Dec 2, 2020 at 9:23, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-11-30 22:15, Pavel Stehule wrote:
> I would like some supporting documentation on this. So far we only
> have
> one stackoverflow question, and then this implementation, and they are
> not even the same format. My worry is that if there is not precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.
>
>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript or RTF -
> https://billposer.org/Software/ListOfRepresentations.html
Heh. The fact that there is a table of two dozen possible
representations kind of proves my point that we should be deliberate in
picking one.
I do see Oracle unistr() on that list, which appears to be very similar
to what you are trying to do here. Maybe look into aligning with that.

unistr is a primitive form of the proposed function, but it can be used as a base. The format is compatible with our "4.1.2.3. String Constants with Unicode Escapes".

What do you think about the following proposal?

1. unistr(text): compatible with Postgres Unicode escapes. It is enhanced relative to Oracle's version, because Oracle's unistr doesn't support 6-digit Unicode code points.
2. There can be an optional parameter "prefix" with default "\". With "\u" it can be compatible with Java or Python.

What do you think about it?
I thought about it a little bit more, and the prefix specification does not make much sense (even less so if we implement this functionality as the function "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is the same. It now supports the Oracle convention (\XXXX), the Java and Python conventions (\uXXXX, and \UXXXXXXXX for Python), and \+XXXXXX. These formats were already supported. The compatibility with Oracle is nice.
postgres=# select
'Arabic : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' ) || '
Chinese : ' || unistr( '\4E2D\6587' ) || '
English : ' || unistr( 'English' ) || '
French : ' || unistr( 'Fran\00E7ais' ) || '
German : ' || unistr( 'Deutsch' ) || '
Greek : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) || '
Hebrew : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' ) || '
Japanese : ' || unistr( '\65E5\672C\8A9E' ) || '
Korean : ' || unistr( '\D55C\AD6D\C5B4' ) || '
Portuguese : ' || unistr( 'Portugu\00EAs' ) || '
Russian : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' ) || '
Spanish : ' || unistr( 'Espa\00F1ol' ) || '
Thai : ' || unistr( '\0E44\0E17\0E22' )
as unicode_test_string;
┌──────────────────────────┐
│ unicode_test_string │
╞══════════════════════════╡
│ Arabic : العربية ↵│
│ Chinese : 中文 ↵│
│ English : English ↵│
│ French : Français ↵│
│ German : Deutsch ↵│
│ Greek : Ελληνικά ↵│
│ Hebrew : עברית ↵│
│ Japanese : 日本語 ↵│
│ Korean : 한국어 ↵│
│ Portuguese : Português↵│
│ Russian : Русский ↵│
│ Spanish : Español ↵│
│ Thai : ไทย │
└──────────────────────────┘
(1 row)
postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│ unistr │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)
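As a rough illustration only (this is not the C code from the patch), the escape forms mentioned above can be sketched in Python. The helper name `unistr_sketch`, the treatment of a doubled backslash as a literal backslash, and raising an error on a malformed escape are assumptions made for this sketch. Surrogate pairs are not handled.

```python
import re

# Hypothetical sketch, not the patch itself. One alternative per escape form;
# the whole group is optional so that a stray backslash can be reported.
_ESCAPE = re.compile(
    r"\\(?:"
    r"(\\)"                   # doubled backslash -> literal backslash (assumed)
    r"|\+([0-9A-Fa-f]{6})"    # \+XXXXXX  (6 hex digits)
    r"|u([0-9A-Fa-f]{4})"     # \uXXXX    (Java style)
    r"|U([0-9A-Fa-f]{8})"     # \UXXXXXXXX (Python style)
    r"|([0-9A-Fa-f]{4})"      # \XXXX     (Oracle style)
    r")?"
)


def unistr_sketch(text: str) -> str:
    """Replace the escape forms above with the characters they denote."""

    def decode(match: re.Match) -> str:
        if match.group(1):
            return "\\"
        # Exactly one of the hex groups is set for a well-formed escape.
        hex_digits = next((g for g in match.groups()[1:] if g), None)
        if hex_digits is None:
            raise ValueError(f"invalid escape at position {match.start()}")
        # chr() itself raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(decode, text)


if __name__ == "__main__":
    print(unistr_sketch(r"Odpov\u011Bdn\u00E1 osoba"))
    print(unistr_sketch(r"Fran\00E7ais"))
```

Because the regular expression tries the longer, prefixed forms before the bare four-digit form, an input such as `\u011B` is read as one Java-style escape rather than failing on the `u`.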
New patch attached
Regards
Pavel