Re: Questionable description about character sets - Mailing list pgsql-hackers

From	Tatsuo Ishii
Subject	Re: Questionable description about character sets
Date	February 14 13:20:33
Msg-id	20260214.192033.705419152780150580.ishii@postgresql.org
In response to	Re: Questionable description about character sets (Andreas Karlsson <andreas@proxel.se>)
Responses	Re: Questionable description about character sets
List	pgsql-hackers

> Wouldn't that make the table very wide?

I don't think it would make the table very wide, only a little
wider. So I think adding the character set information to the
"Description" column is better. Some of the encodings already have
this information. See the attached patch.

> And for e.g. European
> character encodings I am not sure it is that useful since most or
> maybe even all of them are subsets of unicode, it mostly gets
> interesting for encodings which support characters not in unicode,
> right?

Choosing UTF8 or not is just one of the use cases.

I am thinking about the use case in which a user wants to continue
using another encoding (e.g. to avoid conversion to UTF8).
Example: suppose the user has a legacy system in which EUC_JP is
used. The data in the system includes JIS X 0201, JIS X 0208 and JIS X
0212, and the user wants to make sure that PostgreSQL supports all of
those character sets in EUC_JP, because some tools do not support JIS X
0212; they support only JIS X 0201 and JIS X 0208. Currently the
information (whether JIS X 0212 is supported or not) does not exist
anywhere in our docs. It's only in the source code. I think it's better
to have the information in our docs so that users do not need to look
into the source code.
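
To illustrate why one encoding can carry several character sets, here
is a minimal Python sketch (not PostgreSQL code; the function name
euc_jp_charsets is hypothetical) that classifies each character of an
EUC-JP byte string by its lead byte. It assumes the standard EUC-JP
layout: bytes below 0x80 are code set 0 (labeled "ASCII" here), 0x8E
introduces one byte of JIS X 0201 katakana, 0x8F introduces two bytes
of JIS X 0212, and a lead byte in 0xA1-0xFE starts a two-byte JIS X
0208 character. Trail bytes are not validated.

```python
# Sketch only: classify each character of an EUC-JP byte string by the
# coded character set that its lead byte selects.
SS2 = 0x8E  # single shift 2: one byte of JIS X 0201 katakana follows
SS3 = 0x8F  # single shift 3: two bytes of JIS X 0212 follow


def euc_jp_charsets(data: bytes) -> list[str]:
    """Return the character set name for each character in data."""
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            name, length = "ASCII", 1
        elif lead == SS2:
            name, length = "JIS X 0201", 2
        elif lead == SS3:
            name, length = "JIS X 0212", 3
        elif 0xA1 <= lead <= 0xFE:
            name, length = "JIS X 0208", 2
        else:
            raise ValueError(f"invalid EUC-JP lead byte 0x{lead:02X} at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        result.append(name)
        i += length
    return result


# One character from each code set, written as raw bytes.
sample = b"A" + b"\x8e\xb1" + b"\xa4\xa2" + b"\x8f\xb0\xa1"
print(euc_jp_charsets(sample))
# prints ['ASCII', 'JIS X 0201', 'JIS X 0208', 'JIS X 0212']
```

A tool that handles only JIS X 0201 and JIS X 0208 would fail on the
last, three-byte sequence, which is the situation described above.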

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

From 98c97f670ce647003ce467a84f81cec0cb463c18 Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii <ishii@postgresql.org>
Date: Sat, 14 Feb 2026 16:26:01 +0900
Subject: [PATCH v1] doc: Enhance "PostgreSQL Character Sets" table.

Previously some encodings lacked a description of the coded character
sets used in the encoding. For most European encodings this is
obvious because there is only one or a few character sets per encoding,
but it's not true for some Asian encodings. For example, the EUC_JP
encoding corresponds to multiple character sets: namely JIS X 0201,
JIS X 0208 and JIS X 0212. This commit adds the information to the
"Description" column.

Discussion: https://postgr.es/m/20260211.185847.1679085676298121526.ishii%40postgresql.org
---
 doc/src/sgml/charset.sgml | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 3aabc798012..32c6280489b 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1831,7 +1831,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_CN</literal></entry>
-         <entry>Extended UNIX Code-CN</entry>
+         <entry>Extended UNIX Code-CN, GB 2312</entry>
          <entry>Simplified Chinese</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -1840,7 +1840,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_JP</literal></entry>
-         <entry>Extended UNIX Code-JP</entry>
+         <entry>Extended UNIX Code-JP, JIS X 0201, JIS X 0208, JIS X 0212</entry>
          <entry>Japanese</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -1849,7 +1849,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_JIS_2004</literal></entry>
-         <entry>Extended UNIX Code-JP, JIS X 0213</entry>
+         <entry>Extended UNIX Code-JP, JIS X 0201, JIS X 0213</entry>
          <entry>Japanese</entry>
          <entry>Yes</entry>
          <entry>No</entry>
@@ -1858,7 +1858,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_KR</literal></entry>
-         <entry>Extended UNIX Code-KR</entry>
+         <entry>Extended UNIX Code-KR, KS X 1001</entry>
          <entry>Korean</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -1867,7 +1867,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_TW</literal></entry>
-         <entry>Extended UNIX Code-TW</entry>
+         <entry>Extended UNIX Code-TW, CNS 11643</entry>
          <entry>Traditional Chinese, Taiwanese</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -2056,7 +2056,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>SJIS</literal></entry>
-         <entry>Shift JIS</entry>
+         <entry>Shift JIS, JIS X 0201, JIS X 0208</entry>
          <entry>Japanese</entry>
          <entry>No</entry>
          <entry>No</entry>
-- 
2.43.0
