Re: GB18030-2022 Support in PostgreSQL - Mailing list pgsql-hackers
From | Chao Li |
---|---|
Subject | Re: GB18030-2022 Support in PostgreSQL |
Date | |
Msg-id | 0D2511DA-D935-4D17-ACA2-B3027C7F1F3F@gmail.com |
In response to | Re: GB18030-2022 Support in PostgreSQL (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: GB18030-2022 Support in PostgreSQL
Re: GB18030-2022 Support in PostgreSQL |
List | pgsql-hackers |
On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So on the whole I'd lean a bit towards just redefining GB18030 as
meaning the new standard. The fact that we don't support it as a
server-side encoding perhaps makes that idea more tenable than it
would be if the encoding governed the interpretation of our own
stored data.
regards, tom lane
I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard.
As John Naylor pointed out, the 2022 standard is not backward compatible. However, I went through all the incompatible changes, and they all affect rarely used characters. So I would guess that most existing databases won’t be impacted, and the rest that use GB18030 would need a data migration before upgrading to a PG version that switches to GB18030-2022. I think PG could leave data migration to third-party PG service vendors, who may develop simple or complex migration tools for different use cases.
One use case I have in mind: say a database uses the default encoding (UTF-8) and the ICU locale provider. ICU has supported GB18030-2022 since version 73.1. If the database has been working with a pre-73.1 version of ICU and ICU is then upgraded to 73.1 or later, the database may face the same backward compatibility risk. For example, the GB code 0xA6D9 maps to U+E78D under the old GB18030 and to U+FE10 under GB18030-2022. If the character 0xA6D9 was given to the database, it would have been stored on disk as U+E78D. After the ICU upgrade, ICU would no longer treat U+E78D as 0xA6D9. So, to keep the data’s original meaning, a data migration has to be done to update U+E78D to U+FE10. In this example the PG version does not change, but the database still needs a data migration.
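The remapping described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not a migration tool: the names `PUA_TO_2022`, `needs_migration`, and `remap_to_2022` are hypothetical, and the table holds only the single pair mentioned in this message (U+E78D to U+FE10). A real migration would need the full list of changed mappings from the standard.

```python
# Hypothetical, deliberately incomplete remap table.
# It contains only the one pair named in this message; the complete
# set of changed code points must come from the GB18030-2022 standard.
PUA_TO_2022 = {
    0xE78D: 0xFE10,  # old mapping of GB code 0xA6D9 -> new mapping
}


def needs_migration(text: str) -> bool:
    """Return True if the text contains any code point listed in the table."""
    return any(ord(ch) in PUA_TO_2022 for ch in text)


def remap_to_2022(text: str) -> str:
    """Replace each listed old code point with its GB18030-2022 counterpart."""
    # str.translate accepts a dict mapping code point -> code point.
    return text.translate(PUA_TO_2022)


if __name__ == "__main__":
    stored = "abc\ue78ddef"  # value as stored on disk before the upgrade
    if needs_migration(stored):
        migrated = remap_to_2022(stored)
        print([hex(ord(ch)) for ch in migrated])
```

A tool along these lines would first scan text columns with something like `needs_migration` to find affected rows, and rewrite only those rows.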
The other reason I don’t think a new encoding GB18030_2022 is needed is that GB18030-2022 is a hard requirement from the government, so most likely all commercial databases must comply with it. Thus a lot of current databases using GB18030 must be migrated to GB18030-2022. As PG doesn’t support changing a database’s encoding, adding a new encoding would mean that an existing database must be migrated to a new database. If we only redefine GB18030, existing databases need only some data migration, which should be easier.
So I think PG doesn’t need to worry too much about the backward compatibility problem. All PG needs to do is state clearly in the release notes that a data migration might be required. If some third-party migration tools are known to work well when the new version is released, the release notes could recommend them.
Regards,
Chao Li (Evan)
------------------------------
HighGo Infra. Software Inc.
https://www.highgo.com/