Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered? - Mailing list pgsql-general

From Tom Lane
Subject Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?
Date
Msg-id 1720141.1601908234@sss.pgh.pa.us
In response to Re: May "PostgreSQL server side GB18030 character set support" reconsidered?  (Han Parker <parker.han@outlook.com>)
Responses Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?  (Tatsuo Ishii <ishii@sraoss.co.jp>)
Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?  (Han Parker <parker.han@outlook.com>)
List pgsql-general
Han Parker <parker.han@outlook.com> writes:
> From: Tatsuo Ishii <ishii@sraoss.co.jp>
>> Moving GB18030 to server side encoding requires a technical challenge:
>> currently PostgreSQL's SQL parser and perhaps in other parts of
>> backend assume that each byte in a string data is not confused with
>> ASCII byte. Since GB18030's second and fourth byte are in range of
>> 0x40 to 0x7e, backend will be confused. How do you resolve the
>> technical challenge exactly?

> I do not have an exact solution proposal yet.
> Maybe an investigation on MySQL's mechanism would be of help.

TBH, even if you came up with a complete patch, we'd probably
reject it as unmaintainable and a security hazard.  The problem
is that code may scan a string looking for certain ASCII characters
such as backslash (\), which up to now it's always been able to do
byte-by-byte without fear that non-ASCII characters could confuse it.
To support GB18030 (or other encodings with the same issue, such as
SJIS), every such loop would have to be modified to advance character
by character, thus roughly "p += pg_mblen(p)" instead of "p++".
Anyplace that neglected to do that would have a bug --- one that
could only be exposed by careful testing using GB18030 encoding.
What's more, such bugs could easily be security problems.
Mis-detecting a backslash, for example, could lead to wrong decisions
about where string literals end, allowing SQL-injection exploits.
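
To make that concrete, here is a minimal, self-contained sketch (illustrative only, not actual backend source; gb18030_mblen below is a simplified stand-in for the backend's pg_mblen) of how a byte-wise scan misdetects a backslash inside the two-byte GB18030 sequence 0x81 0x5C:

    #include <stdio.h>

    /* Simplified stand-in for pg_mblen() under GB18030; assumes valid
     * input.  GB18030 has 1-byte (ASCII), 2-byte (lead 0x81-0xFE,
     * trail 0x40-0xFE excluding 0x7F), and 4-byte (second byte
     * 0x30-0x39) forms. */
    static int
    gb18030_mblen(const unsigned char *p)
    {
        if (*p < 0x80)
            return 1;           /* ASCII */
        if (p[1] >= 0x30 && p[1] <= 0x39)
            return 4;           /* four-byte form */
        return 2;               /* two-byte form */
    }

    int
    main(void)
    {
        /* 'X', one two-byte character (0x81 0x5C), then 'Y' */
        const unsigned char s[] = {'X', 0x81, 0x5C, 'Y', '\0'};
        const unsigned char *p;

        /* Byte-wise scan: finds a "backslash" that is really the
         * trailing byte of a multibyte character. */
        for (p = s; *p; p++)
            if (*p == '\\')
                printf("byte scan: bogus backslash at offset %d\n",
                       (int) (p - s));

        /* Character-wise scan, i.e. "p += pg_mblen(p)": no false hit. */
        for (p = s; *p; p += gb18030_mblen(p))
            if (*p == '\\')
                printf("character scan: backslash at offset %d\n",
                       (int) (p - s));

        return 0;
    }

The byte-wise loop reports a bogus backslash at offset 2; the character-wise loop steps over the trailing 0x5C byte and never fires.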

> The most frequently used 20,902 Chinese characters and 984 symbols in GBK are encoded with 2 bytes; GBK itself is a subset of GB18030.
> Characters and symbols newly added in GB18030, less frequently used but still needed, are encoded with 4 bytes.

Any efficiency argument has to consider processing costs not just
storage costs.  As I showed above, catering for GB18030 would make
certain loops substantially slower, so that you might pay in CPU
cycles what you saved on disk space.  It doesn't help any that the
extra processing costs would be paid by every Postgres user on the
planet, whether they used GB18030 or not.
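
In loop form, the tradeoff looks roughly like this (a sketch, not backend source; handle_backslash is a hypothetical helper, while pg_mblen is the backend's real multibyte-length routine):

    /* Declarations only, to make the sketch compile standalone. */
    extern int pg_mblen(const char *mbstr);      /* real backend function */
    extern void handle_backslash(const char *p); /* hypothetical helper */

    /* Today: one tight byte loop.  Safe because no allowed server-side
     * encoding reuses bytes 0x00-0x7F inside multibyte sequences. */
    void
    scan_bytewise(const char *str)
    {
        for (const char *p = str; *p; p++)
            if (*p == '\\')
                handle_backslash(p);
    }

    /* GB18030-safe version: every iteration pays for a pg_mblen()
     * call, a per-encoding dispatch, in every database regardless of
     * its actual encoding. */
    void
    scan_charwise(const char *str)
    {
        for (const char *p = str; *p; p += pg_mblen(p))
            if (*p == '\\')
                handle_backslash(p);
    }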

In short, I think this is very unlikely to happen.

            regards, tom lane


