Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered? - Mailing list pgsql-general

From Han Parker
Subject Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?
Date
Msg-id ME2PR01MB253270F46FFF2668F396BEEB8A0C0@ME2PR01MB2532.ausprd01.prod.outlook.com
In response to Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Thanks for your reply.

Regards,

Parker Han


From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: October 5, 2020 14:30
To: Han Parker <parker.han@outlook.com>
Cc: Tatsuo Ishii <ishii@sraoss.co.jp>; pgsql-general@postgresql.org <pgsql-general@postgresql.org>
Subject: Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?
 
Han Parker <parker.han@outlook.com> writes:
> From: Tatsuo Ishii <ishii@sraoss.co.jp>
>> Moving GB18030 to the server-side encodings poses a technical challenge:
>> currently PostgreSQL's SQL parser, and perhaps other parts of the
>> backend, assume that no byte of a multibyte character can be confused
>> with an ASCII byte. Since GB18030's second and fourth bytes are in the
>> range 0x40 to 0x7E, the backend would be confused. How would you
>> resolve that technical challenge, exactly?

> I do not have an exact solution proposal yet.
> Maybe an investigation on MySQL's mechanism would be of help.

TBH, even if you came up with a complete patch, we'd probably
reject it as unmaintainable and a security hazard.  The problem
is that code may scan a string looking for certain ASCII characters
such as backslash (\), which up to now it's always been able to do
byte-by-byte without fear that non-ASCII characters could confuse it.
To support GB18030 (or other encodings with the same issue, such as
SJIS), every such loop would have to be modified to advance character
by character, thus roughly "p += pg_mblen(p)" instead of "p++".
Anyplace that neglected to do that would have a bug --- one that
could only be exposed by careful testing using GB18030 encoding.
What's more, such bugs could easily be security problems.
Mis-detecting a backslash, for example, could lead to wrong decisions
about where string literals end, allowing SQL-injection exploits.
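The hazard Tom describes can be illustrated outside PostgreSQL. A minimal Python sketch (not PostgreSQL source; the byte sequence below is just one arbitrary valid two-byte GB18030 character whose trailing byte happens to be 0x5C):

```python
# In GB18030, the trailing byte of a multibyte character may fall in the
# ASCII range 0x40-0x7E, which includes backslash (0x5C).
raw = b"\x81\x5c"                 # one GB18030 character, second byte 0x5C
text = raw.decode("gb18030")      # decodes to a single CJK character
assert len(text) == 1

# Naive byte-by-byte scan ("p++" style): falsely reports a backslash.
naive_hits = [i for i, b in enumerate(raw) if b == 0x5C]

# Character-aware scan ("p += pg_mblen(p)" style): no backslash exists.
aware_hits = [i for i, ch in enumerate(text) if ch == "\\"]

print(naive_hits, aware_hits)     # byte scan finds a hit; char scan finds none
```

Under UTF-8 (and the other server-side encodings PostgreSQL supports), every non-first byte of a multibyte character is >= 0x80, so the naive byte scan is safe; that is the property GB18030 breaks.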

--From Parker:
Agreed that this would not be a cheap feature.
1. The pg_mblen()-style loops would carry a long-term maintenance cost on top of the short-term refactoring cost.
2. A 0x5C byte (backslash) appearing as a trailing byte of a multibyte character increases exposure to SQL-injection risk.
--

> The 20,902 most frequently used Chinese characters and 984 symbols in GBK are encoded with 2 bytes, and GBK is a subset of GB18030.
> The newly added, less frequently used characters and symbols in GB18030 take 4 bytes.
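The byte counts above are easy to verify with any language that ships both codecs; a quick sketch in Python (the specific characters are illustrative choices, not from the thread):

```python
# Storage size of a common BMP Chinese character and a supplementary-plane
# character under GB18030 versus UTF-8.
common = "中"         # in the GBK subset of GB18030
rare = "\U00020000"   # U+20000, outside GBK, 4 bytes in GB18030

print(len(common.encode("gb18030")), len(common.encode("utf-8")))  # 2 vs 3
print(len(rare.encode("gb18030")), len(rare.encode("utf-8")))      # 4 vs 4
```

So for text dominated by common Chinese characters, GB18030 stores roughly two thirds the bytes of UTF-8, while rare characters cost the same in both.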

Any efficiency argument has to consider processing costs not just
storage costs.  As I showed above, catering for GB18030 would make
certain loops substantially slower, so that you might pay in CPU
cycles what you saved on disk space.  It doesn't help any that the
extra processing costs would be paid by every Postgres user on the
planet, whether they used GB18030 or not.

--From Parker:
It depends on the workload, so leaving this decision to application developers via a GUC may be an option.
OLTP workloads may pay more in processing cost than they save in storage, while OLAP workloads may save more in storage than they pay in processing.
---

In short, I think this is very unlikely to happen.


                        regards, tom lane
