Thread: Can PostgreSQL support the utf8mb4 character set?

Can PostgreSQL support the utf8mb4 character set?

From
"lsliang"
Date:
 
 
Can PostgreSQL support the utf8mb4 character set?

Today's mobile apps use 4-byte characters, and MySQL's utf8 can only support characters of 1-3 bytes. Loading a string that contains a 4-byte character into the database will fail.

MySQL has supported the utf8mb4 character set since 5.5.3, but I can't find any information about PostgreSQL.

Thanks

Re: Can PostgreSQL support the utf8mb4 character set?

From
Adrian Klaver
Date:
On 03/05/2015 01:45 AM, lsliang wrote:
> Can PostgreSQL support the utf8mb4 character set?
> Today's mobile apps use 4-byte characters, and MySQL's utf8 can only
> support characters of 1-3 bytes.

The docs would seem to indicate otherwise:

http://www.postgresql.org/docs/9.3/interactive/multibyte.html

http://en.wikipedia.org/wiki/UTF-8


> Loading a string that contains a 4-byte character into the database
> will fail.

Have you actually tried to load strings into Postgres?

If so, and it failed, what method did you use and what was the error?

> MySQL has supported the utf8mb4 character set since 5.5.3,
> but I can't find any information about PostgreSQL.
> Thanks


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Re: [GENERAL] Can PostgreSQL support the utf8mb4 character set?

From
"lsliang"
Date:

 
 
 
 
2015-03-06

From: Adrian Klaver
Sent: 2015-03-05 21:31:39
To: lsliang; pgsql-general
Cc:
Subject: Re: [GENERAL] Can PostgreSQL support the utf8mb4 character set?
 
Thanks for your help. PostgreSQL can support 4-byte characters:
 
test=> select * from utf8mb4_test ;
ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "GB18030"
test=> \encoding utf8 
test=> select * from utf8mb4_test ;
 content 
---------
 😄
 😄
 
pcauto=> 
 
 

Re: Can PostgreSQL support the utf8mb4 character set?

From
Adrian Klaver
Date:
On 03/05/2015 06:55 PM, lsliang wrote:
> Thanks for your help. PostgreSQL can support 4-byte characters:
> test=> select * from utf8mb4_test ;
> ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "GB18030"
> test=> \encoding utf8
> test=> select * from utf8mb4_test ;
>   content
> ---------
>   😄
>   😄
> pcauto=>


FYI, you can force the client encoding:

http://www.postgresql.org/docs/9.4/interactive/app-psql.html

"If both standard input and standard output are a terminal, then psql
sets the client encoding to "auto", which will detect the appropriate
client encoding from the locale settings (LC_CTYPE environment variable
on Unix systems). If this doesn't work out as expected, the client
encoding can be overridden using the environment variable PGCLIENTENCODING."
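
For example, a minimal sketch (using the "test" database from the session above; SHOW client_encoding reports the setting psql ended up with):

$ PGCLIENTENCODING=UTF8 psql -d test
test=> SHOW client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)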



--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Re: Re: [GENERAL] Can PostgreSQL support the utf8mb4 character set?

From
Arjen Nienhuis
Date:
On Fri, Mar 6, 2015 at 3:55 AM, lsliang <lsliang@pconline.com.cn> wrote:
 
 
 
 
Thanks for your help. PostgreSQL can support 4-byte characters:
 
test=> select * from utf8mb4_test ;
ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "GB18030"
test=> \encoding utf8 
test=> select * from utf8mb4_test ;
 content 
---------
 😄
 😄
 
pcauto=> 
 
 

UTF-8 support works fine. The 3-byte limit was something MySQL invented. But it only works if your client encoding is UTF-8. In your example, your terminal is not set to UTF-8.

create table test (glyph text);
insert into test values ('A'), ('馬'), ('𐁀'), ('😄'), ('🇪🇸');

select glyph, convert_to(glyph, 'utf-8'), length(glyph) FROM test;
 glyph |     convert_to     | length
-------+--------------------+--------
 A     | \x41               |      1
 馬    | \xe9a6ac           |      1
 𐁀     | \xf0908180         |      1
 😄     | \xf09f9884         |      1
 🇪🇸    | \xf09f87aaf09f87b8 |      2
(5 rows)

What doesn't work is GB18030:

select glyph, convert_to(glyph, 'GB18030'), length(glyph) FROM test;
ERROR:  character with byte sequence 0xf0 0x90 0x81 0x80 in encoding "UTF8" has no equivalent in encoding "GB18030"


I think that is a bug.

Gr. Arjen


May "PostgreSQL server-side GB18030 character set support" be reconsidered?

From
Han Parker
Date:
Hi,

Might "GB18030 server-side support" deserve reconsideration, some 15 years after the release of GB18030-2005?
It could be one of the greenest features for PostgreSQL.

1. In this big-data and mobile era, in the world's most populous country, the roughly 50% extra disk and energy consumption for Chinese characters (UTF-8 usually takes 3 bytes per Chinese character, while GB18030 takes only 2) is a real cost to "Carbon Neutral" goals, polar ice melting and all. A quick psql check after this list illustrates the point.

2. "Setting the client side to UTF-8, just like setting the server side to UTF-8", as in the mail quoted below, is not practical for most Chinese IT projects, especially publicly funded ones, because GB18030 compatibility is required by law in Mainland China.
A client-side encoding setting exposed in a GUI is usually harder to hide, and most MS Windows users are familiar with GB18030.
MySQL has supported GB18030 on the server side since v5.7 in 2015, and I am not sure how much this feature contributed to MySQL's greater popularity in Mainland China.
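
Here is the psql check of the storage claim in point 1 (a sketch assuming a UTF8 database; '汉' is an ordinary Chinese character from the two-byte GBK range):

test=> SELECT octet_length(convert_to('汉', 'UTF8'))    AS utf8_bytes,
test->        octet_length(convert_to('汉', 'GB18030')) AS gb18030_bytes;
 utf8_bytes | gb18030_bytes
------------+---------------
          3 |             2
(1 row)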


Parker Han


From: pgsql-general-owner@postgresql.org <pgsql-general-owner@postgresql.org> on behalf of Arjen Nienhuis <a.g.nienhuis@gmail.com>
Sent: Saturday, March 7, 2015 8:18
To: lsliang <lsliang@pconline.com.cn>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>; pgsql-general <pgsql-general@postgresql.org>
Subject: Re: Re: Re: [GENERAL] Can PostgreSQL support the utf8mb4 character set?
 
On Fri, Mar 6, 2015 at 3:55 AM, lsliang <lsliang@pconline.com.cn> wrote:
 
 
 
 

UTF-8 support works fine. The 3-byte limit was something MySQL invented. But it only works if your client encoding is UTF-8. In your example, your terminal is not set to UTF-8.


Re: May "PostgreSQL server-side GB18030 character set support" be reconsidered?

From
Tatsuo Ishii
Date:
> Hi,
> 
> Might "GB18030 server-side support" deserve reconsideration, some 15 years after the release of GB18030-2005?
> It could be one of the greenest features for PostgreSQL.

Moving GB18030 to the server side poses a technical challenge:
currently PostgreSQL's SQL parser, and perhaps other parts of the
backend, assume that no byte in string data can be confused with an
ASCII byte. Since GB18030's second and fourth bytes are in the range
0x40 to 0x7e, the backend would be confused. How exactly do you
propose to resolve this technical challenge?
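
To make the hazard concrete, here is a small sketch (which character the bytes decode to does not matter; the point is the second byte):

-- 0x81 0x5c is a valid two-byte GB18030 sequence whose second byte,
-- 0x5c, is the ASCII code for backslash (\); a scanner that walks the
-- string byte by byte would misread it. In UTF-8 this cannot happen,
-- because every byte of a multi-byte character is 0x80 or above.
SELECT convert_from('\x815c'::bytea, 'GB18030');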

> 1. In this big-data and mobile era, in the world's most populous country, the roughly 50% extra disk and energy consumption for Chinese characters (UTF-8 usually takes 3 bytes per Chinese character, while GB18030 takes only 2) is a real cost to "Carbon Neutral" goals, polar ice melting and all.
 

Really? I thought GB18030 uses up to 4 bytes.
https://en.wikipedia.org/wiki/GB_18030#Encoding

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Thanks for your comments.
My replies are inserted inline below.


From: Tatsuo Ishii <ishii@sraoss.co.jp>
Sent: 2020-10-05 8:41
To: parker.han@outlook.com <parker.han@outlook.com>
Cc: pgsql-general@postgresql.org <pgsql-general@postgresql.org>
Subject: Re: May "PostgreSQL server-side GB18030 character set support" be reconsidered?
 
> Hi,
>
> Might "GB18030 server-side support" deserve reconsideration, some 15 years after the release of GB18030-2005?
> It could be one of the greenest features for PostgreSQL.

Moving GB18030 to the server side poses a technical challenge:
currently PostgreSQL's SQL parser, and perhaps other parts of the
backend, assume that no byte in string data can be confused with an
ASCII byte. Since GB18030's second and fourth bytes are in the range
0x40 to 0x7e, the backend would be confused. How exactly do you
propose to resolve this technical challenge?

--Parker:
I do not have an exact solution proposal yet.
Maybe an investigation of MySQL's mechanism would help.


> 1. In this big-data and mobile era, in the world's most populous country, the roughly 50% extra disk and energy consumption for Chinese characters (UTF-8 usually takes 3 bytes per Chinese character, while GB18030 takes only 2) is a real cost to "Carbon Neutral" goals, polar ice melting and all.

Really? I thought GB18030 uses up to 4 bytes.
https://en.wikipedia.org/wiki/GB_18030#Encoding

--Parker:
More precisely: GB18030 uses 2 or 4 bytes for Chinese characters.
It is a bit complicated to explain in words alone, but easy with the help of the following graph.

The 20,902 most frequently used Chinese characters and 984 symbols of GBK, which is a subset of GB18030, are encoded with 2 bytes.

The characters and symbols newly added in GB18030, less frequently used but indeed used, take 4 bytes.



Attachment
>> 1. In this big-data and mobile era, in the world's most populous country, the roughly 50% extra disk and energy consumption for Chinese characters (UTF-8 usually takes 3 bytes per Chinese character, while GB18030 takes only 2) is a real cost to "Carbon Neutral" goals, polar ice melting and all.
 
> 
> Really? I thought GB18030 uses up to 4 bytes.
> https://en.wikipedia.org/wiki/GB_18030#Encoding
> 
> --Parker:
> More precisely: GB18030 uses 2 or 4 bytes for Chinese characters.
> It is a bit complicated to explain in words alone, but easy with the help of the following graph.
> 
> The 20,902 most frequently used Chinese characters and 984 symbols of GBK, which is a subset of GB18030, are encoded with 2 bytes.

That does not sound like a fair argument unless you are going to
implement only the GBK-compatible part of GB18030.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Han Parker <parker.han@outlook.com> writes:
> From: Tatsuo Ishii <ishii@sraoss.co.jp>
>> Moving GB18030 to the server side poses a technical challenge:
>> currently PostgreSQL's SQL parser, and perhaps other parts of the
>> backend, assume that no byte in string data can be confused with an
>> ASCII byte. Since GB18030's second and fourth bytes are in the range
>> 0x40 to 0x7e, the backend would be confused. How exactly do you
>> propose to resolve this technical challenge?

> I do not have an exact solution proposal yet.
> Maybe an investigation of MySQL's mechanism would help.

TBH, even if you came up with a complete patch, we'd probably
reject it as unmaintainable and a security hazard.  The problem
is that code may scan a string looking for certain ASCII characters
such as backslash (\), which up to now it's always been able to do
byte-by-byte without fear that non-ASCII characters could confuse it.
To support GB18030 (or other encodings with the same issue, such as
SJIS), every such loop would have to be modified to advance character
by character, thus roughly "p += pg_mblen(p)" instead of "p++".
Anyplace that neglected to do that would have a bug --- one that
could only be exposed by careful testing using GB18030 encoding.
What's more, such bugs could easily be security problems.
Mis-detecting a backslash, for example, could lead to wrong decisions
about where string literals end, allowing SQL-injection exploits.

> The 20,902 most frequently used Chinese characters and 984 symbols of GBK, which is a subset of GB18030, are encoded with 2 bytes.
> The characters and symbols newly added in GB18030, less frequently used but indeed used, take 4 bytes.

Any efficiency argument has to consider processing costs not just
storage costs.  As I showed above, catering for GB18030 would make
certain loops substantially slower, so that you might pay in CPU
cycles what you saved on disk space.  It doesn't help any that the
extra processing costs would be paid by every Postgres user on the
planet, whether they used GB18030 or not.

In short, I think this is very unlikely to happen.

            regards, tom lane



> TBH, even if you came up with a complete patch, we'd probably
> reject it as unmaintainable and a security hazard.  The problem
> is that code may scan a string looking for certain ASCII characters
> such as backslash (\), which up to now it's always been able to do
> byte-by-byte without fear that non-ASCII characters could confuse it.
> To support GB18030 (or other encodings with the same issue, such as
> SJIS), every such loop would have to be modified to advance character
> by character, thus roughly "p += pg_mblen(p)" instead of "p++".
> Anyplace that neglected to do that would have a bug --- one that
> could only be exposed by careful testing using GB18030 encoding.
> What's more, such bugs could easily be security problems.
> Mis-detecting a backslash, for example, could lead to wrong decisions
> about where string literals end, allowing SQL-injection exploits.

One idea to avoid the concern could be "shifting" GB18030 code
points into an "ASCII-safe" code range with some calculation, so
that the backend could handle them without worrying about the problem
above. This way we could also avoid the table-lookup overhead that is
necessary in conversions between GB18030 and UTF8 and so on.

However, I have not come up with such a mathematical conversion
method so far.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> One of ideas to avoid the concern could be "shifting" GB18030 code
> points into "ASCII safe" code range with some calculations so that
> backend can handle them without worrying about the concern above. This
> way, we could avoid a table lookup overhead which is necessary in
> conversion between GB18030 and UTF8 and so on.

Hmm ... interesting idea, basically invent our own modified version
of GB18030 (or SJIS?) for backend-internal storage.  But I'm not
sure how to make it work without enlarging the string, which'd defeat
the OP's argument.  It looks to me like the second-byte code space is
already pretty full in both encodings.

            regards, tom lane



Thanks for your reply.

Regards,

Parker Han


From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: 2020-10-05 14:30
To: Han Parker <parker.han@outlook.com>
Cc: Tatsuo Ishii <ishii@sraoss.co.jp>; pgsql-general@postgresql.org <pgsql-general@postgresql.org>
Subject: Re: Re: May "PostgreSQL server-side GB18030 character set support" be reconsidered?
 
Han Parker <parker.han@outlook.com> writes:
> From: Tatsuo Ishii <ishii@sraoss.co.jp>
>> Moving GB18030 to the server side poses a technical challenge:
>> currently PostgreSQL's SQL parser, and perhaps other parts of the
>> backend, assume that no byte in string data can be confused with an
>> ASCII byte. Since GB18030's second and fourth bytes are in the range
>> 0x40 to 0x7e, the backend would be confused. How exactly do you
>> propose to resolve this technical challenge?

> I do not have an exact solution proposal yet.
> Maybe an investigation of MySQL's mechanism would help.

TBH, even if you came up with a complete patch, we'd probably
reject it as unmaintainable and a security hazard.  The problem
is that code may scan a string looking for certain ASCII characters
such as backslash (\), which up to now it's always been able to do
byte-by-byte without fear that non-ASCII characters could confuse it.
To support GB18030 (or other encodings with the same issue, such as
SJIS), every such loop would have to be modified to advance character
by character, thus roughly "p += pg_mblen(p)" instead of "p++".
Anyplace that neglected to do that would have a bug --- one that
could only be exposed by careful testing using GB18030 encoding.
What's more, such bugs could easily be security problems.
Mis-detecting a backslash, for example, could lead to wrong decisions
about where string literals end, allowing SQL-injection exploits.

--From Parker:
Agreed, it may not be a cheap feature.
1. pg_mblen() etc. would involve long-term costs as well as a short-term refactoring cost.
2. 0x5c (backslash) appearing in the trailing bytes of a multi-byte character increases the exposure to SQL-injection risk.
--

> The 20,902 most frequently used Chinese characters and 984 symbols of GBK, which is a subset of GB18030, are encoded with 2 bytes.
> The characters and symbols newly added in GB18030, less frequently used but indeed used, take 4 bytes.

Any efficiency argument has to consider processing costs not just
storage costs.  As I showed above, catering for GB18030 would make
certain loops substantially slower, so that you might pay in CPU
cycles what you saved on disk space.  It doesn't help any that the
extra processing costs would be paid by every Postgres user on the
planet, whether they used GB18030 or not.

--From Parker:
It depends on the usage scenario, so leaving this decision to application developers via a GUC may be an option.
OLTP may pay more in processing cost than it saves in storage, while OLAP may pay less in processing cost than it saves in storage.
---

In short, I think this is very unlikely to happen.


                        regards, tom lane
> Hmm ... interesting idea, basically invent our own modified version
> of GB18030 (or SJIS?) for backend-internal storage.  But I'm not
> sure how to make it work without enlarging the string, which'd defeat
> the OP's argument.  It looks to me like the second-byte code space is
> already pretty full in both encodings.

But as he already admitted, GB18030 is actually a 4-byte encoding,
rather than 2 bytes. So maybe we could find a way to map original
GB18030 to an ASCII-safe GB18030 using 4 bytes.

As for SJIS, there is no big demand for that encoding in Japan these
days, so I think we can leave it as it is.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



> But as he already admitted, GB18030 is actually a 4-byte encoding,
> rather than 2 bytes. So maybe we could find a way to map original
> GB18030 to an ASCII-safe GB18030 using 4 bytes.

Here is an idea (in-byte represents GB18030, out-byte represents the
internal server encoding):

if (in-byte1 is 0x00-0x7f)    /* ASCII */
   out-byte1 = in-byte1

else if (in-byte1 is 0x81-0xfe && in-byte2 is 0x40-0x7f)    /* 2 bytes GB18030 */
   out-byte1 = in-byte1
   out-byte2 = 0x80
   out-byte3 = in-byte2 + 0x80 (should be 0xc0-0xff)
   out-byte4 = 0x80

else if (in-byte1 is 0x81-0xfe && in-byte2 is 0x80-0xfe)    /* 2 bytes GB18030 */
   out-byte1 = in-byte1
   out-byte2 = 0x80
   out-byte3 = 0x80
   out-byte4 = in-byte2 (should be 0x80-0xfe)

else if (in-byte1 is 0x81-0xfe && in-byte2 is 0x30-0x39)    /* 4 bytes GB18030 */
   out-byte1 = in-byte1
   out-byte2 = in-byte2 + 0x80 (should be 0xb0-0xb9)
   out-byte3 = in-byte3
   out-byte4 = in-byte4 + 0x80 (should be 0xb0-0xb9)
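
As a worked example of this mapping: '啊' (GB18030 0xb0 0xa1, whose second byte falls in the 0x80-0xfe branch) would come out as 0xb0 0x80 0x80 0xa1 internally, four bytes with none of them in the ASCII range.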

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp






From: Tatsuo Ishii <ishii@sraoss.co.jp>
Sent: 2020-10-06 2:15
To: tgl@sss.pgh.pa.us <tgl@sss.pgh.pa.us>
Cc: parker.han@outlook.com <parker.han@outlook.com>; pgsql-general@postgresql.org <pgsql-general@postgresql.org>
Subject: Re: Re: May "PostgreSQL server-side GB18030 character set support" be reconsidered?
 
> Hmm ... interesting idea, basically invent our own modified version
> of GB18030 (or SJIS?) for backend-internal storage.  But I'm not
> sure how to make it work without enlarging the string, which'd defeat
> the OP's argument.  It looks to me like the second-byte code space is
> already pretty full in both encodings.

> But as he already admitted, GB18030 is actually a 4-byte encoding,
> rather than 2 bytes. So maybe we could find a way to map original
> GB18030 to an ASCII-safe GB18030 using 4 bytes.
> 
> As for SJIS, there is no big demand for that encoding in Japan these
> days, so I think we can leave it as it is.


So the key lies in a simple ASCII-safe GB18030 mapping algorithm (maybe named with the abbreviation "GB18030as", for GB18030_ascii_safe?), which would not break ASCII-safety while saving a lot of storage (the ASCII-safe GB2312 contains the 6,763 most frequently used characters).
In fact, it was GBK, designed by Microsoft around 1995 with the rise of Win95, that broke ASCII-safety; GB18030 later inherited this because it had to be compatible with GBK.

Thanks.
I will try to find out whether any opinions on a simple ASCII-safe GB18030 mapping algorithm exist in the GB18030 standard maintainers' community.