Re: proposal: unescape_text function - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: proposal: unescape_text function
Date
Msg-id CAFj8pRC1UufDW45WOFz5rH6uiOTBaWU-sQ5BLkEyeAiV9M6VLA@mail.gmail.com
In response to Re: proposal: unescape_text function  (Pavel Stehule <pavel.stehule@gmail.com>)
Responses Re: proposal: unescape_text function
List pgsql-hackers


On Wed, Dec 2, 2020 at 11:37, Pavel Stehule <pavel.stehule@gmail.com> wrote:


On Wed, Dec 2, 2020 at 9:23, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-11-30 22:15, Pavel Stehule wrote:
>     I would like some supporting documentation on this.  So far we only
>     have
>     one stackoverflow question, and then this implementation, and they are
>     not even the same format.  My worry is that if there is not precise
>     specification, then people are going to want to add things in the
>     future, and there will be no way to analyze such requests in a
>     principled way.
>
>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript  or RTF -
> https://billposer.org/Software/ListOfRepresentations.html

Heh.  The fact that there is a table of two dozen possible
representations kind of proves my point that we should be deliberate in
picking one.

I do see Oracle unistr() on that list, which appears to be very similar
to what you are trying to do here.  Maybe look into aligning with that.

unistr is a primitive form of the proposed function, but it can be used as a base. Its format is compatible with our "4.1.2.3. String Constants with Unicode Escapes".

What do you think about the following proposal?

1. unistr(text): compatible with Postgres Unicode escapes. It is an enhancement over Oracle's version, because Oracle's unistr doesn't support 6-digit Unicode code points.

2. There can be an optional parameter "prefix" with default "\". With "\u" it can be compatible with Java or Python.

What do you think about it?

I thought about it a little more, and the prefix specification does not make much sense (even less so if we implement this functionality as a function named "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is the same. It now supports the Oracle convention (\XXXX), the Java and Python conventions (\uXXXX, and \UXXXXXXXX for Python), and \+XXXXXX. These formats were already supported. The compatibility with Oracle is nice.

postgres=# select
 'Arabic     : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' )      || '
  Chinese    : ' || unistr( '\4E2D\6587' )                               || '
  English    : ' || unistr( 'English' )                                  || '
  French     : ' || unistr( 'Fran\00E7ais' )                             || '
  German     : ' || unistr( 'Deutsch' )                                  || '
  Greek      : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) || '
  Hebrew     : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' )                || '
  Japanese   : ' || unistr( '\65E5\672C\8A9E' )                          || '
  Korean     : ' || unistr( '\D55C\AD6D\C5B4' )                          || '
  Portuguese : ' || unistr( 'Portugu\00EAs' )                            || '
  Russian    : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' )      || '
  Spanish    : ' || unistr( 'Espa\00F1ol' )                              || '
  Thai       : ' || unistr( '\0E44\0E17\0E22' )
  as unicode_test_string;
┌──────────────────────────┐
│   unicode_test_string    │
╞══════════════════════════╡
│ Arabic     : العربية    ↵│
│   Chinese    : 中文     ↵│
│   English    : English  ↵│
│   French     : Français ↵│
│   German     : Deutsch  ↵│
│   Greek      : Ελληνικά ↵│
│   Hebrew     : עברית    ↵│
│   Japanese   : 日本語   ↵│
│   Korean     : 한국어   ↵│
│   Portuguese : Português↵│
│   Russian    : Русский  ↵│
│   Spanish    : Español  ↵│
│   Thai       : ไทย       │
└──────────────────────────┘
(1 row)


postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│     unistr      │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)
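
To make the accepted escape forms concrete, here is a rough sketch in Python (not the C code from the patch). It only assumes the four forms named above (\XXXX, \+XXXXXX, \uXXXX, \UXXXXXXXX). The name unistr_sketch is hypothetical, treating a doubled backslash as a literal backslash is an assumption, and unlike a real implementation it leaves malformed escapes untouched and does not combine surrogate pairs.

```python
import re

# Escape forms named in the message: \XXXX, \+XXXXXX, \uXXXX, \UXXXXXXXX.
# Assumption: a doubled backslash stands for one literal backslash.
_ESCAPE = re.compile(
    r"\\(?:"
    r"(\\)"                     # group 1: doubled backslash
    r"|\+([0-9A-Fa-f]{6})"      # group 2: \+XXXXXX
    r"|u([0-9A-Fa-f]{4})"       # group 3: \uXXXX
    r"|U([0-9A-Fa-f]{8})"       # group 4: \UXXXXXXXX
    r"|([0-9A-Fa-f]{4})"        # group 5: \XXXX
    r")"
)


def unistr_sketch(text: str) -> str:
    """Replace the recognized escapes in text with the characters they name."""

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            return "\\"
        # Exactly one of the hex groups (2..5) is set for any other match.
        hex_digits = next(g for g in match.groups()[1:] if g is not None)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(unistr_sketch(r"Odpov\u011Bdn\u00E1 osoba"))
    print(unistr_sketch(r"Fran\00E7ais"))
```

The alternatives are tried in order, so the prefixed forms (\+, \u, \U) are checked before the bare four-digit form; since "+", "u" and "U" are not hex digits, the forms cannot be confused with each other.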

New patch attached

Regards

Pavel

