Thread: Deprecate custom encoding conversions

Deprecate custom encoding conversions

From

Heikki Linnakangas

Date:

02 December 2020, 16:04:11

Hi,

PostgreSQL allows writing custom encoding conversion functions between 
any character encodings, using the CREATE CONVERSION command. It's 
pretty flexible, you can define default and non-default conversions, and 
the conversions live in schemas so you can have multiple conversions 
installed in a system and you can switch between them by changing 
search_path.

However:

We never use non-default conversions for anything. All code that 
performs encoding conversions only cares about the default ones.

I think this flexibility is kind of ridiculous anyway. If a built-in 
conversion routine doesn't handle some characters correctly, surely we 
should fix the built-in function, rather than provide a mechanism for 
having your own conversion functions. If you truly need to perform the 
conversions differently than the built-in routines do, you can always 
perform the conversion in the client instead.

Note that we don't support adding completely new custom encodings, all 
this is just for conversions between the built-in encodings that we have.

I propose that we add a notice to the CREATE CONVERSION docs to say that 
it is deprecated, and remove it in a few years.

Any objections? Anyone using custom encoding conversions in production?

- Heikki

Re: Deprecate custom encoding conversions

From

Tom Lane

Date:

02 December 2020, 16:18:18

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> I propose that we add a notice to the CREATE CONVERSION docs to say that 
> it is deprecated, and remove it in a few years.

While I agree that it's probably not that useful, what would we gain
by removing it?  If you intend to remove those catalogs, what lookup
mechanism would replace them?  We can't exactly drop the encoding
conversion functionality.

            regards, tom lane

Re: Deprecate custom encoding conversions

From

Heikki Linnakangas

Date:

02 December 2020, 16:31:53

On 02/12/2020 18:18, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I propose that we add a notice to the CREATE CONVERSION docs to say that
>> it is deprecated, and remove it in a few years.
> 
> While I agree that it's probably not that useful, what would we gain
> by removing it?  If you intend to remove those catalogs, what lookup
> mechanism would replace them?  We can't exactly drop the encoding
> conversion functionality.

I'm thinking of replacing the catalog with a hard-coded 2D array of 
function pointers. Or something along those lines.

I had this idea while looking at the encoding conversions performed in 
COPY. The current conversion functions return a null-terminated, 
palloc'd string, which is a bit awkward for the callers. The caller 
needs to call strlen() on the result, and it would be nice to reuse the 
same buffer for all the conversions. And I've got a hunch that it'd be 
faster to convert the whole 64 kB raw input buffer in one go, rather 
than convert each line separately, but the current function signature 
doesn't make that easy either.

So I'm looking for refactoring the conversion routines to be more 
convenient for the callers. But the current function signature is 
exposed at the SQL level, thanks to CREATE CONVERSION.

- Heikki

Re: Deprecate custom encoding conversions

From

Tom Lane

Date:

02 December 2020, 18:02:53

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 02/12/2020 18:18, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> I propose that we add a notice to the CREATE CONVERSION docs to say that
>>> it is deprecated, and remove it in a few years.

>> While I agree that it's probably not that useful, what would we gain
>> by removing it?  If you intend to remove those catalogs, what lookup
>> mechanism would replace them?  We can't exactly drop the encoding
>> conversion functionality.

> I'm thinking of replacing the catalog with a hard-coded 2D array of 
> function pointers. Or something along those lines.

I like the current structure in which the encoding functions are in
separate .so's, so that you don't load the ones you don't need.
It's not real clear how to preserve that if we hard-code things.

> So I'm looking for refactoring the conversion routines to be more 
> convenient for the callers. But the current function signature is 
> exposed at the SQL level, thanks to CREATE CONVERSION.

I'd be the first to agree that the current API for conversion functions is
not terribly well-designed.  But what if we just change it?  That can't be
any worse of a compatibility hit than removing CREATE CONVERSION
altogether.

            regards, tom lane

RE: Deprecate custom encoding conversions

From

"tsunakawa.takay@fujitsu.com"

Date:

03 December 2020, 00:54:56

From: Heikki Linnakangas <hlinnaka@iki.fi>
> I propose that we add a notice to the CREATE CONVERSION docs to say that
> it is deprecated, and remove it in a few years.
> 
> Any objections? Anyone using custom encoding conversions in production?

I can't answer deeper questions because I'm not familiar with character sets, but I saw some Japanese web articles that
useCREATE CONVERSION to handle external characters.  So, I think we may as well retain it.  (OTOH, I'm wondering how
otherDBMSs support external characters and what Postgres should implement it.)
 

Also, the SQL standard has CREATE CHARACTER SET and CREATE TRANSLATION.  I don't know how these are useful, but the
mechanismof CREATE CONVERSION can be used to support them.
 


CREATE CHARACTER SET <character set name> [ AS ]

<character set source> [ <collate clause> ]


CREATE TRANSLATION <transliteration name> FOR <source character set specification>

TO <target character set specification> FROM <transliteration source>


Regards
Takayuki Tsunakawa

Re: Deprecate custom encoding conversions

From

Tom Lane

Date:

03 December 2020, 01:47:35

"tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> writes:
> From: Heikki Linnakangas <hlinnaka@iki.fi>
>> Any objections? Anyone using custom encoding conversions in production?

> I can't answer deeper questions because I'm not familiar with character sets, but I saw some Japanese web articles
thatuse CREATE CONVERSION to handle external characters.  So, I think we may as well retain it.  (OTOH, I'm wondering
howother DBMSs support external characters and what Postgres should implement it.) 

I recall a discussion in which someone explained that things are messy for
Japanese because there's not really just one version of SJIS; there are
several, because various groups extended the original standard in not-
quite-compatible ways.  In turn this means that there's not really one
well-defined conversion to/from Unicode --- which is the reason why
allowing custom conversions seemed like a good idea.  I don't know
whether anyone over there is actually using custom conversions, but
I'd be hesitant to just nuke the feature altogether.

Having said that, it doesn't seem like the conversion API is necessarily
any more set in stone than any of our other C-level APIs.  We break things
at the C API level whenever there's sufficient reason.

            regards, tom lane

Re: Deprecate custom encoding conversions

From

Michael Paquier

Date:

03 December 2020, 01:50:40

On Thu, Dec 03, 2020 at 12:54:56AM +0000, tsunakawa.takay@fujitsu.com wrote:
> I can't answer deeper questions because I'm not familiar with
> character sets, but I saw some Japanese web articles that use CREATE
> CONVERSION to handle external characters.  So, I think we may as
> well retain it.  (OTOH, I'm wondering how other DBMSs support
> external characters and what Postgres should implement it.)

Tsunakawa-san, could you post a link to this article, if possible?  I
am curious about their problem and why they used CREATE CONVERSION as
a way to solve it.  That's fine even if it is in Japanese.
--
Michael

Attachment

signature.asc

Re: Deprecate custom encoding conversions

From

Tatsuo Ishii

Date:

03 December 2020, 02:27:28

> I recall a discussion in which someone explained that things are messy for
> Japanese because there's not really just one version of SJIS; there are
> several, because various groups extended the original standard in not-
> quite-compatible ways.  In turn this means that there's not really one
> well-defined conversion to/from Unicode

That's true.

> --- which is the reason why
> allowing custom conversions seemed like a good idea.  I don't know
> whether anyone over there is actually using custom conversions, but
> I'd be hesitant to just nuke the feature altogether.

By Googling I found an instance which is using CREATE CONVERSION
(Japanese article).

http://grep.blog49.fc2.com/blog-entry-87.html
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

RE: Deprecate custom encoding conversions

From

"tsunakawa.takay@fujitsu.com"

Date:

03 December 2020, 02:48:13

From: Michael Paquier <michael@paquier.xyz>
> Tsunakawa-san, could you post a link to this article, if possible?  I am curious
> about their problem and why they used CREATE CONVERSION as a way to
> solve it.  That's fine even if it is in Japanese.

I just pulled info from my old memory in my previous mail.  Now the following links seem to be relevant.  (They should
bedifferent from what I saw in the past.) 

https://ml.postgresql.jp/pipermail/pgsql-jp/2007-January/013103.html

https://teratail.com/questions/295988

IIRC, the open source Postgres extension EUDC also uses CREATE CONVERSION.



Regards
Takayuki Tsunakawa

Re: Deprecate custom encoding conversions

From

Fujii Masao

Date:

03 December 2020, 03:57:35


On 2020/12/03 11:48, tsunakawa.takay@fujitsu.com wrote:
> From: Michael Paquier <michael@paquier.xyz>
>> Tsunakawa-san, could you post a link to this article, if possible?  I am curious
>> about their problem and why they used CREATE CONVERSION as a way to
>> solve it.  That's fine even if it is in Japanese.
> 
> I just pulled info from my old memory in my previous mail.  Now the following links seem to be relevant.  (They
shouldbe different from what I saw in the past.)
 
> 
> https://ml.postgresql.jp/pipermail/pgsql-jp/2007-January/013103.html
> 
> https://teratail.com/questions/295988
> 
> IIRC, the open source Postgres extension EUDC also uses CREATE CONVERSION.

FWIW, about four years before, for the real project, I wrote
the extension [1] that provides yet another encoding conversion
from UTF-8 to EUC_JP, to handle some external characters.
This extension uses CREATE CONVERSION.

[1]
https://github.com/MasaoFujii/pg_fallback_utf8_to_euc_jp

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Deprecate custom encoding conversions

From

Fujii Masao

Date:

03 December 2020, 03:58:45


On 2020/12/03 1:04, Heikki Linnakangas wrote:
> Hi,
> 
> PostgreSQL allows writing custom encoding conversion functions between any character encodings, using the CREATE
CONVERSIONcommand. It's pretty flexible, you can define default and non-default conversions, and the conversions live
inschemas so you can have multiple conversions installed in a system and you can switch between them by changing
search_path.
> 
> However:
> 
> We never use non-default conversions for anything. All code that performs encoding conversions only cares about the
defaultones.
 

Yes. I had to update pg_conversion.condefault directly so that we can
use custom encoding when I registered it. The direct update of
pg_conversion is of course not good thing. So I was wondering
if we should have something like ALTER CONVERSION SET DEFAULT.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Deprecate custom encoding conversions

From

Tom Lane

Date:

03 December 2020, 04:07:22

Fujii Masao <masao.fujii@oss.nttdata.com> writes:
> On 2020/12/03 1:04, Heikki Linnakangas wrote:
>> We never use non-default conversions for anything. All code that performs encoding conversions only cares about the
defaultones. 

> Yes. I had to update pg_conversion.condefault directly so that we can
> use custom encoding when I registered it. The direct update of
> pg_conversion is of course not good thing. So I was wondering
> if we should have something like ALTER CONVERSION SET DEFAULT.

Perhaps, but what I'd think is more urgent is having some way for
a session to select a non-default conversion for client I/O.

            regards, tom lane

Re: Deprecate custom encoding conversions

From

Fujii Masao

Date:

03 December 2020, 04:18:51


On 2020/12/03 13:07, Tom Lane wrote:
> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>> On 2020/12/03 1:04, Heikki Linnakangas wrote:
>>> We never use non-default conversions for anything. All code that performs encoding conversions only cares about the
defaultones.
 
> 
>> Yes. I had to update pg_conversion.condefault directly so that we can
>> use custom encoding when I registered it. The direct update of
>> pg_conversion is of course not good thing. So I was wondering
>> if we should have something like ALTER CONVERSION SET DEFAULT.
> 
> Perhaps, but what I'd think is more urgent is having some way for
> a session to select a non-default conversion for client I/O.

+1

What about adding new libpq parameter like encoding_conversion
(like client_encoding) so that a client can specify what conversion to use?
That setting is sent to the server while establishing the connection.
Probably we would need to verify that the setting of this new parameter
is consistent with that of client_encoding.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION