Home > mailing lists

Re: proposal: unescape_text function - Mailing list pgsql-hackers

From	Pavel Stehule
Subject	Re: proposal: unescape_text function
Date	December 1, 2020 08:43:27
Msg-id	CAFj8pRAOoDk+fR57tj7PitYd=v2eobTz4+vb4-YZ5hPMmxNr1A@mail.gmail.com Whole thread Raw
In response to	Re: proposal: unescape_text function (Pavel Stehule <pavel.stehule@gmail.com>)
Responses	Re: proposal: unescape_text function
List	pgsql-hackers

Tree view

po 30. 11. 2020 v 22:56 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:

po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:

po 30. 11. 2020 v 14:14 odesílatel Peter Eisentraut <peter.eisentraut@enterprisedb.com> napsal:
On 2020-11-29 18:36, Pavel Stehule wrote:
>
> I don't really get the point of this function. There is AFAICT no
> function to produce this escaped format, and it's not a recognized
> interchange format. So under what circumstances would one need to
> use this?
>
>
> Some corporate data can be in CSV format with escaped unicode
> characters. Without this function it is not possible to decode these
> files without external application.

I would like some supporting documentation on this. So far we only have
one stackoverflow question, and then this implementation, and they are
not even the same format. My worry is that if there is not precise
specification, then people are going to want to add things in the
future, and there will be no way to analyze such requests in a
principled way.

I checked this and it is "prefix backslash-u hex" used by Java, JavaScript or RTF - https://billposer.org/Software/ListOfRepresentations.html

In some languages (Python), there is decoder "unicode-escape". Java has a method escapeJava, for conversion from unicode to ascii. I can imagine so these data are from Java systems exported to 8bit strings - so this implementation can be accepted as referential. This format is used by https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html tool too.

Postgres can decode this format too, and the patch is based on Postgres implementation. I just implemented a different interface.

Currently decode function does only text->bytea transformation. Maybe a more generic function "decode_text" and "encode_text" for similar cases can be better (here we need text->text transformation). But it looks like overengineering now.

Maybe we introduce new encoding "ascii" and we can implement new conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the most clean solution. What do you think about it?

a better name of new encoding can be "unicode-escape" than "ascii". We use "to_ascii" function for different use case.

set client_encoding to unicode-escape;
copy tab from xxx;
...

but it doesn't help when only a few columns from the table are in unicode-escape format.

probably the most complete solution can be from two steps:

1. introducing new encoding - "ascii_unicode_escape" with related conversions

2. introducing two new functions - text_escape and text_unescape with two parameters - source text and conversion name

select text_convert_to('Тимати', 'ascii_unicode_escape')

\u0422\u0438\u043c\u0430\u0442\u0438 .. result is text

select text_convert_from('\u0422\u0438\u043c\u0430\u0442\u0438', 'ascii_unicode_escape')

┌──────────┐
│ ?column? │
╞══════════╡
│ Тимати │
└──────────┘
(1 row)

Regards

Pavel

pgsql-hackers by date:

From: "tsunakawa.takay@fujitsu.com"
Date: 01 December 2020, 08:39:59
Subject: RE: [POC] Fast COPY FROM command for the table with foreign partitions

From: "Hou, Zhijie"
Date: 01 December 2020, 08:45:35
Subject: RE: [PATCH] Keeps tracking the uniqueness with UniqueKey

Re: proposal: unescape_text function - Mailing list pgsql-hackers

Previous

Next