Thread: Deprecate custom encoding conversions
Hi,

PostgreSQL allows writing custom encoding conversion functions between any character encodings, using the CREATE CONVERSION command. It's pretty flexible: you can define default and non-default conversions, and the conversions live in schemas, so you can have multiple conversions installed in a system and switch between them by changing search_path.

However: we never use non-default conversions for anything. All code that performs encoding conversions only cares about the default ones.

I think this flexibility is kind of ridiculous anyway. If a built-in conversion routine doesn't handle some characters correctly, surely we should fix the built-in function, rather than provide a mechanism for having your own conversion functions. If you truly need to perform the conversions differently than the built-in routines do, you can always perform the conversion in the client instead.

Note that we don't support adding completely new custom encodings; all this is just for conversions between the built-in encodings that we have.

I propose that we add a notice to the CREATE CONVERSION docs to say that it is deprecated, and remove it in a few years.

Any objections? Anyone using custom encoding conversions in production?

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> I propose that we add a notice to the CREATE CONVERSION docs to say that
> it is deprecated, and remove it in a few years.

While I agree that it's probably not that useful, what would we gain by removing it? If you intend to remove those catalogs, what lookup mechanism would replace them? We can't exactly drop the encoding conversion functionality.

regards, tom lane
On 02/12/2020 18:18, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I propose that we add a notice to the CREATE CONVERSION docs to say that
>> it is deprecated, and remove it in a few years.
>
> While I agree that it's probably not that useful, what would we gain
> by removing it? If you intend to remove those catalogs, what lookup
> mechanism would replace them? We can't exactly drop the encoding
> conversion functionality.

I'm thinking of replacing the catalog with a hard-coded 2D array of function pointers, or something along those lines.

I had this idea while looking at the encoding conversions performed in COPY. The current conversion functions return a null-terminated, palloc'd string, which is a bit awkward for the callers: the caller needs to call strlen() on the result, and it would be nice to reuse the same buffer for all the conversions. And I've got a hunch that it'd be faster to convert the whole 64 kB raw input buffer in one go, rather than convert each line separately, but the current function signature doesn't make that easy either.

So I'm looking into refactoring the conversion routines to be more convenient for the callers. But the current function signature is exposed at the SQL level, thanks to CREATE CONVERSION.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 02/12/2020 18:18, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> I propose that we add a notice to the CREATE CONVERSION docs to say that
>>> it is deprecated, and remove it in a few years.
>
>> While I agree that it's probably not that useful, what would we gain
>> by removing it? If you intend to remove those catalogs, what lookup
>> mechanism would replace them? We can't exactly drop the encoding
>> conversion functionality.

> I'm thinking of replacing the catalog with a hard-coded 2D array of
> function pointers. Or something along those lines.

I like the current structure in which the encoding functions are in separate .so's, so that you don't load the ones you don't need. It's not real clear how to preserve that if we hard-code things.

> So I'm looking for refactoring the conversion routines to be more
> convenient for the callers. But the current function signature is
> exposed at the SQL level, thanks to CREATE CONVERSION.

I'd be the first to agree that the current API for conversion functions is not terribly well-designed. But what if we just change it? That can't be any worse of a compatibility hit than removing CREATE CONVERSION altogether.

regards, tom lane
From: Heikki Linnakangas <hlinnaka@iki.fi>
> I propose that we add a notice to the CREATE CONVERSION docs to say that
> it is deprecated, and remove it in a few years.
>
> Any objections? Anyone using custom encoding conversions in production?

I can't answer the deeper questions because I'm not familiar with character sets, but I saw some Japanese web articles that use CREATE CONVERSION to handle external characters. So, I think we may as well retain it. (OTOH, I'm wondering how other DBMSs support external characters and what Postgres should implement for that.)

Also, the SQL standard has CREATE CHARACTER SET and CREATE TRANSLATION. I don't know how useful these are, but the mechanism of CREATE CONVERSION could be used to support them.

CREATE CHARACTER SET <character set name> [ AS ] <character set source>
    [ <collate clause> ]

CREATE TRANSLATION <transliteration name>
    FOR <source character set specification>
    TO <target character set specification>
    FROM <transliteration source>

Regards
Takayuki Tsunakawa
"tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> writes:
> From: Heikki Linnakangas <hlinnaka@iki.fi>
>> Any objections? Anyone using custom encoding conversions in production?

> I can't answer the deeper questions because I'm not familiar with
> character sets, but I saw some Japanese web articles that use CREATE
> CONVERSION to handle external characters. So, I think we may as well
> retain it. (OTOH, I'm wondering how other DBMSs support external
> characters and what Postgres should implement.)

I recall a discussion in which someone explained that things are messy for Japanese because there's not really just one version of SJIS; there are several, because various groups extended the original standard in not-quite-compatible ways. In turn this means that there's not really one well-defined conversion to/from Unicode --- which is the reason why allowing custom conversions seemed like a good idea. I don't know whether anyone over there is actually using custom conversions, but I'd be hesitant to just nuke the feature altogether.

Having said that, it doesn't seem like the conversion API is necessarily any more set in stone than any of our other C-level APIs. We break things at the C API level whenever there's sufficient reason.

regards, tom lane
On Thu, Dec 03, 2020 at 12:54:56AM +0000, tsunakawa.takay@fujitsu.com wrote:
> I can't answer the deeper questions because I'm not familiar with
> character sets, but I saw some Japanese web articles that use CREATE
> CONVERSION to handle external characters. So, I think we may as
> well retain it. (OTOH, I'm wondering how other DBMSs support
> external characters and what Postgres should implement.)

Tsunakawa-san, could you post a link to this article, if possible? I am curious about their problem and why they used CREATE CONVERSION as a way to solve it. That's fine even if it is in Japanese.
--
Michael
> I recall a discussion in which someone explained that things are messy for
> Japanese because there's not really just one version of SJIS; there are
> several, because various groups extended the original standard in
> not-quite-compatible ways. In turn this means that there's not really one
> well-defined conversion to/from Unicode

That's true.

> --- which is the reason why
> allowing custom conversions seemed like a good idea. I don't know
> whether anyone over there is actually using custom conversions, but
> I'd be hesitant to just nuke the feature altogether.

By Googling I found an instance which is using CREATE CONVERSION (Japanese article):
http://grep.blog49.fc2.com/blog-entry-87.html
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
From: Michael Paquier <michael@paquier.xyz>
> Tsunakawa-san, could you post a link to this article, if possible? I am curious
> about their problem and why they used CREATE CONVERSION as a way to
> solve it. That's fine even if it is in Japanese.

I just pulled info from my old memory in my previous mail. Now the following links seem to be relevant. (They should be different from what I saw in the past.)

https://ml.postgresql.jp/pipermail/pgsql-jp/2007-January/013103.html

https://teratail.com/questions/295988

IIRC, the open source Postgres extension EUDC also uses CREATE CONVERSION.

Regards
Takayuki Tsunakawa
On 2020/12/03 11:48, tsunakawa.takay@fujitsu.com wrote:
> From: Michael Paquier <michael@paquier.xyz>
>> Tsunakawa-san, could you post a link to this article, if possible? I am curious
>> about their problem and why they used CREATE CONVERSION as a way to
>> solve it. That's fine even if it is in Japanese.
>
> I just pulled info from my old memory in my previous mail. Now the following links seem to be relevant. (They should be different from what I saw in the past.)
>
> https://ml.postgresql.jp/pipermail/pgsql-jp/2007-January/013103.html
>
> https://teratail.com/questions/295988
>
> IIRC, the open source Postgres extension EUDC also uses CREATE CONVERSION.

FWIW, about four years ago, for a real project, I wrote an extension [1] that provides yet another encoding conversion from UTF-8 to EUC_JP, to handle some external characters. This extension uses CREATE CONVERSION.

[1] https://github.com/MasaoFujii/pg_fallback_utf8_to_euc_jp

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/12/03 1:04, Heikki Linnakangas wrote:
> Hi,
>
> PostgreSQL allows writing custom encoding conversion functions between any character encodings, using the CREATE CONVERSION command. It's pretty flexible: you can define default and non-default conversions, and the conversions live in schemas, so you can have multiple conversions installed in a system and switch between them by changing search_path.
>
> However:
>
> We never use non-default conversions for anything. All code that performs encoding conversions only cares about the default ones.

Yes. I had to update pg_conversion.condefault directly so that the custom conversion I had registered would be used. Directly updating pg_conversion is of course not a good thing, so I was wondering if we should have something like ALTER CONVERSION SET DEFAULT.

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
> On 2020/12/03 1:04, Heikki Linnakangas wrote:
>> We never use non-default conversions for anything. All code that performs
>> encoding conversions only cares about the default ones.

> Yes. I had to update pg_conversion.condefault directly so that we can
> use custom encoding when I registered it. The direct update of
> pg_conversion is of course not good thing. So I was wondering
> if we should have something like ALTER CONVERSION SET DEFAULT.

Perhaps, but what I'd think is more urgent is having some way for a session to select a non-default conversion for client I/O.

regards, tom lane
On 2020/12/03 13:07, Tom Lane wrote:
> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>> Yes. I had to update pg_conversion.condefault directly so that we can
>> use custom encoding when I registered it. The direct update of
>> pg_conversion is of course not good thing. So I was wondering
>> if we should have something like ALTER CONVERSION SET DEFAULT.
>
> Perhaps, but what I'd think is more urgent is having some way for
> a session to select a non-default conversion for client I/O.

+1

What about adding a new libpq parameter like encoding_conversion (similar to client_encoding) so that a client can specify which conversion to use? That setting would be sent to the server while establishing the connection. We would probably need to verify that the setting of this new parameter is consistent with that of client_encoding.

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION