Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered? - Mailing list pgsql-general

From Tom Lane
Subject Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?
Date
Msg-id 1720141.1601908234@sss.pgh.pa.us
In response to Re: May "PostgreSQL server side GB18030 character set support" reconsidered?  (Han Parker <parker.han@outlook.com>)
Responses Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?  (Tatsuo Ishii <ishii@sraoss.co.jp>)
Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?  (Han Parker <parker.han@outlook.com>)
List pgsql-general
Han Parker <parker.han@outlook.com> writes:
> From: Tatsuo Ishii <ishii@sraoss.co.jp>
>> Moving GB18030 to server side encoding requires a technical challenge:
>> currently PostgreSQL's SQL parser and perhaps in other parts of
>> backend assume that each byte in a string data is not confused with
>> ASCII byte. Since GB18030's second and fourth byte are in range of
>> 0x40 to 0x7e, backend will be confused. How do you resolve the
>> technical challenge exactly?

> I do not have an exact solution proposal yet.
> Maybe an investigation on MySQL's mechanism would be of help.

TBH, even if you came up with a complete patch, we'd probably
reject it as unmaintainable and a security hazard.  The problem
is that code may scan a string looking for certain ASCII characters
such as backslash (\), which up to now it's always been able to do
byte-by-byte without fear that non-ASCII characters could confuse it.
To support GB18030 (or other encodings with the same issue, such as
SJIS), every such loop would have to be modified to advance character
by character, thus roughly "p += pg_mblen(p)" instead of "p++".
Anyplace that neglected to do that would have a bug --- one that
could only be exposed by careful testing using GB18030 encoding.
What's more, such bugs could easily be security problems.
Mis-detecting a backslash, for example, could lead to wrong decisions
about where string literals end, allowing SQL-injection exploits.
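
To make that concrete, here is a minimal, self-contained sketch (illustrative only, not actual backend source; gb18030_mblen below is a simplified stand-in for the backend's pg_mblen) of how a byte-wise scan misdetects a backslash inside the two-byte GB18030 sequence 0x81 0x5C:

    #include <stdio.h>

    /* Simplified stand-in for pg_mblen() under GB18030; assumes valid
     * input.  GB18030 has 1-byte (ASCII), 2-byte (lead 0x81-0xFE,
     * trail 0x40-0xFE excluding 0x7F), and 4-byte (second byte
     * 0x30-0x39) forms. */
    static int
    gb18030_mblen(const unsigned char *p)
    {
        if (*p < 0x80)
            return 1;           /* ASCII */
        if (p[1] >= 0x30 && p[1] <= 0x39)
            return 4;           /* four-byte form */
        return 2;               /* two-byte form */
    }

    int
    main(void)
    {
        /* 'X', one two-byte character (0x81 0x5C), then 'Y' */
        const unsigned char s[] = {'X', 0x81, 0x5C, 'Y', '\0'};
        const unsigned char *p;

        /* Byte-wise scan: finds a "backslash" that is really the
         * trailing byte of a multibyte character. */
        for (p = s; *p; p++)
            if (*p == '\\')
                printf("byte scan: bogus backslash at offset %d\n",
                       (int) (p - s));

        /* Character-wise scan, i.e. "p += pg_mblen(p)": no false hit. */
        for (p = s; *p; p += gb18030_mblen(p))
            if (*p == '\\')
                printf("character scan: backslash at offset %d\n",
                       (int) (p - s));

        return 0;
    }

The byte-wise loop reports a bogus backslash at offset 2; the character-wise loop steps over the trailing 0x5C byte and never fires.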

> The most frequently used 20,902 Chinese characters and 984 symbols in GBK are encoded with 2 bytes; GBK itself is a subset of GB18030.
> Characters and symbols newly added in GB18030, less frequently used but still needed, are encoded with 4 bytes.

Any efficiency argument has to consider processing costs not just
storage costs.  As I showed above, catering for GB18030 would make
certain loops substantially slower, so that you might pay in CPU
cycles what you saved on disk space.  It doesn't help any that the
extra processing costs would be paid by every Postgres user on the
planet, whether they used GB18030 or not.
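
In loop form, the tradeoff looks roughly like this (a sketch, not backend source; handle_backslash is a hypothetical helper, while pg_mblen is the backend's real multibyte-length routine):

    /* Declarations only, to make the sketch compile standalone. */
    extern int pg_mblen(const char *mbstr);      /* real backend function */
    extern void handle_backslash(const char *p); /* hypothetical helper */

    /* Today: one tight byte loop.  Safe because no allowed server-side
     * encoding reuses bytes 0x00-0x7F inside multibyte sequences. */
    void
    scan_bytewise(const char *str)
    {
        for (const char *p = str; *p; p++)
            if (*p == '\\')
                handle_backslash(p);
    }

    /* GB18030-safe version: every iteration pays for a pg_mblen()
     * call, a per-encoding dispatch, in every database regardless of
     * its actual encoding. */
    void
    scan_charwise(const char *str)
    {
        for (const char *p = str; *p; p += pg_mblen(p))
            if (*p == '\\')
                handle_backslash(p);
    }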

In short, I think this is very unlikely to happen.

            regards, tom lane


