Thank you for sharing this very interesting and creative approach. Encoding is indeed a crucial factor in capacity planning and performance benchmarking — I find this direction quite compelling.
I'm currently working on a few other things, so my responses may not always be quick, but I wanted to let you know I'm genuinely interested in following this work.
As it happens, I'm currently collaborating with Ishii-san — who, as you know, is one of the original architects of multibyte/CJK support in PostgreSQL — on Row Pattern Recognition; that might also be a thread worth keeping an eye on.
It also strikes me that this topic is worth considering in the context of the rapid growth of social media (SNS) and AI-generated data. The pervasive use of emoji, which legacy encodings such as EUC-KR cannot represent at all, is accelerating the migration toward Unicode in Korea and other Asian markets. This makes the storage efficiency of Unicode encodings for CJK characters an increasingly practical concern, not just a theoretical one.
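As a rough illustration of that trade-off, here is a small sketch using Python's built-in codecs rather than PostgreSQL's own conversion routines. The helper name `encoded_size` and the sample strings are my own, chosen only for this example:

```python
# Illustrative sketch only: compares encoded sizes with Python's standard
# codecs. This is not taken from the patches and does not use PostgreSQL.

def encoded_size(text, encoding):
    """Return the byte length of text in encoding, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

# Hypothetical sample strings: two Hangul syllables and one emoji.
samples = {"hangul": "한글", "emoji": "😀"}

sizes = {
    name: {enc: encoded_size(text, enc) for enc in ("euc_kr", "utf-8")}
    for name, text in samples.items()
}

for name, per_encoding in sizes.items():
    print(name, per_encoding)
```

In this sketch each Hangul syllable takes 2 bytes in EUC-KR and 3 bytes in UTF-8, while the emoji can only be encoded in UTF-8.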
I'd like to take some time to analyze the current state of character encoding in Korea, where legacy EUC-KR systems and UTF-8 deployments coexist in complex ways, and to review the patches you've attached. I'll then share some thoughts and feedback.