Re: proposal: unescape_text function - Mailing list pgsql-hackers

From	Pavel Stehule
Subject	Re: proposal: unescape_text function
Date	December 2, 2020 18:30:39
Msg-id	CAFj8pRC1UufDW45WOFz5rH6uiOTBaWU-sQ5BLkEyeAiV9M6VLA@mail.gmail.com
In response to	Re: proposal: unescape_text function (Pavel Stehule <pavel.stehule@gmail.com>)
Responses	Re: proposal: unescape_text function
List	pgsql-hackers

On Wed, 2 Dec 2020 at 11:37, Pavel Stehule <pavel.stehule@gmail.com> wrote:

On Wed, 2 Dec 2020 at 9:23, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 2020-11-30 22:15, Pavel Stehule wrote:
> I would like some supporting documentation on this. So far we only have
> one stackoverflow question, and then this implementation, and they are
> not even the same format. My worry is that if there is not precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.
>
>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript or RTF -
> https://billposer.org/Software/ListOfRepresentations.html

Heh. The fact that there is a table of two dozen possible
representations kind of proves my point that we should be deliberate in
picking one.

I do see Oracle unistr() on that list, which appears to be very similar
to what you are trying to do here. Maybe look into aligning with that.

unistr is a primitive form of the proposed function, but it can be used as a base. Its format is compatible with section "4.1.2.3. String Constants with Unicode Escapes" of our documentation.

What do you think about the following proposal?

1. unistr(text): compatible with Postgres Unicode escapes. This is an enhancement over Oracle, because Oracle's unistr does not support 6-digit code points.

2. There can be an optional parameter "prefix" with default "\". With "\u" it can be compatible with Java or Python.

What do you think about it?

I thought about it a little more, and specifying the prefix does not make much sense (even less so if we implement this functionality as a function named "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is unchanged: it supports the Oracle convention (\XXXX), the Java and Python convention (\uXXXX, plus \UXXXXXXXX for Python), and \+XXXXXX. These formats were already supported before the rename. The compatibility with Oracle is nice.
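As a rough illustration of the escape forms listed above, here is a minimal Python sketch. It is not the code from the attached patch (which is written in C); the name unistr_sketch and the regular expression are assumptions made only for this example. Error reporting, doubled backslashes, and surrogate pairs are deliberately left out.

```python
import re

# Hypothetical sketch, not the patch: recognizes the four escape forms
# mentioned in this mail. Longer, prefixed forms are tried before \XXXX.
_ESCAPE = re.compile(
    r"\\(?:"
    r"\+([0-9A-Fa-f]{6})"   # \+XXXXXX   (6 hex digits)
    r"|U([0-9A-Fa-f]{8})"   # \UXXXXXXXX (8 hex digits, Python style)
    r"|u([0-9A-Fa-f]{4})"   # \uXXXX     (4 hex digits, Java/Python style)
    r"|([0-9A-Fa-f]{4})"    # \XXXX      (4 hex digits, Oracle style)
    r")"
)


def unistr_sketch(text):
    """Replace each recognized escape with the character it denotes.

    Backslashes that do not start a recognized escape are left untouched,
    which is a simplification chosen for this sketch.
    """
    def replace(match):
        # Exactly one capture group is set for any successful match.
        hex_digits = next(g for g in match.groups() if g is not None)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(unistr_sketch(r"Fran\00E7ais"))
    print(unistr_sketch(r"Odpov\u011Bdn\u00E1 osoba"))
```

The single regular expression keeps the four forms side by side, so it is easy to see that they differ only in prefix and digit count.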

postgres=# select
'Arabic : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' ) || '
Chinese : ' || unistr( '\4E2D\6587' ) || '
English : ' || unistr( 'English' ) || '
French : ' || unistr( 'Fran\00E7ais' ) || '
German : ' || unistr( 'Deutsch' ) || '
Greek : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) || '
Hebrew : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' ) || '
Japanese : ' || unistr( '\65E5\672C\8A9E' ) || '
Korean : ' || unistr( '\D55C\AD6D\C5B4' ) || '
Portuguese : ' || unistr( 'Portugu\00EAs' ) || '
Russian : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' ) || '
Spanish : ' || unistr( 'Espa\00F1ol' ) || '
Thai : ' || unistr( '\0E44\0E17\0E22' )
as unicode_test_string;
┌──────────────────────────┐
│ unicode_test_string │
╞══════════════════════════╡
│ Arabic : العربية ↵│
│ Chinese : 中文 ↵│
│ English : English ↵│
│ French : Français ↵│
│ German : Deutsch ↵│
│ Greek : Ελληνικά ↵│
│ Hebrew : עברית ↵│
│ Japanese : 日本語 ↵│
│ Korean : 한국어 ↵│
│ Portuguese : Português↵│
│ Russian : Русский ↵│
│ Spanish : Español ↵│
│ Thai : ไทย │
└──────────────────────────┘
(1 row)

postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│ unistr │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)

New patch attached

Regards

Pavel

Attachment

unistr.patch
