Re: Experimenting with wider Unicode storage - Mailing list pgsql-hackers

From Henson Choi
Subject Re: Experimenting with wider Unicode storage
Date
Msg-id CAAAe_zCktovow1irTy0eD1Lmu2UMQi+DN9uGTFoWrcyXea7SMg@mail.gmail.com
In response to Re: Experimenting with wider Unicode storage  (Henson Choi <assam258@gmail.com>)
List pgsql-hackers
Hi Thomas,


Thank you again for sharing this exploration, and for including
Korean in your experiment table.  Rather than comment on the
patch itself, let me offer a ground-level report on where Korean
encoding reality stands in April 2026, because the picture has
shifted enough that it seems worth putting on the record before
this thread builds momentum on motivations that may no longer
fully hold on the Korean side.


UTF-8 has already won in Korea, largely by inertia rather than
active choice.  Public web statistics put .kr sites at roughly
96% UTF-8 with a small EUC-KR residual of about 4% [1] —
noticeably higher than the ~1% Shift-JIS residual on .jp [2],
but steadily shrinking.  The mechanism is mundane: modern Linux
distributions default to UTF-8 locales, PostgreSQL's initdb
inherits that, and every new cluster is therefore UTF-8 from
birth.  The remaining legacy installations are not "haven't
migrated yet" — they are "have decided not to migrate," which is
a different and much slower population.


A clarification that often trips people up: in Korean practice,
"EUC-KR" is the label written down and CP949 is what actually
moves on the wire.  Microsoft's UHC has been the Windows default
for decades, and the MIME label has simply stuck.  The historical
stack goes KS X 1001 (완성형, 2,350 syllables) → EUC-KR → CP949
(11,172 syllables) → UTF-8.  PostgreSQL's strict EUC_KR decoder
rejects the bytes CP949 adds, which occasionally causes real
incidents when Windows-exchanged files are loaded.  For any
design choice about "Korean legacy support", this matters — what
needs supporting is usually CP949, not EUC-KR proper.


Server encoding and client encoding are also routinely split.  A
common Korean deployment pattern is a PostgreSQL cluster with
UTF-8 as server encoding, while legacy Windows / Delphi / C++ /
older Java clients connect with client_encoding set to EUC-KR or
CP949 and let PostgreSQL transcode at the wire boundary.  Many
systems that look like "EUC-KR systems" from the outside are
actually UTF-8 storage with an EUC-KR wire.  The storage-layer
share of legacy is therefore probably smaller still than the
~4% web figure would suggest.


On the Korean row of your table landing at -16% under UTF-16:
that is structural, not noise.  Modern Korean writing mandates
word-space separation (unlike Chinese and Japanese), has
effectively abandoned hanja since the 1990s, and freely
interleaves ASCII acronyms (IT, AI, CEO).  As a result Korean
carries the highest ASCII share among CJK languages, and UTF-16
pays for each ASCII position (one byte → two) in exactly the
range where the Hangul savings are meant to come from.  Columns
without spaces — names, titles, addresses — could approach -33%,
but general prose cannot.  Those same short columns are, however,
exactly where the compression angle I return to below could
capture an equivalent saving without a new data type.
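The arithmetic is easy to check; a quick sketch with illustrative
strings of my own choosing:

```python
# Byte counts under UTF-8 vs UTF-16: pure Hangul gets the full -33%
# (3 bytes -> 2 per syllable), but every ASCII position costs double
# (1 byte -> 2), which drags mixed modern prose well away from the ideal.
def sizes(text: str) -> tuple[int, int]:
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for label, t in [("pure Hangul", "서울특별시"),
                 ("mixed prose", "카카오톡 CEO 인터뷰")]:
    u8, u16 = sizes(t)
    print(f"{label}: utf8={u8} utf16={u16} delta={(u16 - u8) / u8:+.0%}")
```

The pure-Hangul string lands at exactly -33%, while three ASCII letters
and two spaces are enough to pull the mixed string into single digits.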


Storage pressure, to the extent modern operators feel it at all,
has largely migrated to other layers.  Memory and disk have both
followed exponential price/volume curves, and the CPU cost of
text comparison has disappeared inside other costs — network,
storage I/O, planning, JIT — to the point of invisibility in
profiler output.  For OLTP, the 2-vs-3-byte difference on Korean
columns does not feel meaningful on modern hardware.  For bulk
scans where byte counts still do matter, the industry answer has
already been columnar + zstd, which routinely reaches 90%+
compression on natural-language text and flattens the
CJK-vs-Latin ratio to irrelevance.  Embedded and edge are not
PostgreSQL's primary target, and archival sits in zstd territory
too.  The domains that historically motivated "we must narrow
CJK storage" have either moved outside the PostgreSQL shape or
been absorbed by general-purpose compression.


Meanwhile the cultural arrow points toward more Unicode, not
less.  KakaoTalk (which saturates domestic messaging), Naver
comments, Instagram captions, and YouTube normalise emoji in
everyday prose, while AI-generated Korean text contributes
middle dots, em dashes, and curly quotes at a scale that was
not present a few years ago.  The share of non-EUC-KR content
in everyday Korean prose is, informally, rising steadily.  A
typical emoji takes four UTF-8 bytes and is representable in no
legacy Korean encoding at all.
A partial-coverage alternative looks increasingly awkward against
that trend.
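For the record, the byte-level claim is easy to verify (😀 here stands
in for any post-BMP emoji):

```python
# A typical emoji lives outside the BMP: four bytes in UTF-8, and no
# representation at all in the legacy Korean encodings.
emoji = "\U0001F600"  # 😀
print(len(emoji.encode("utf-8")))      # 4

for legacy in ("euc_kr", "cp949"):
    try:
        emoji.encode(legacy)
    except UnicodeEncodeError:
        print(f"unrepresentable in {legacy}")
```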


Korean upstream feedback on encoding has also been notably quiet
despite a very active de-Oracle migration wave in the late 2010s.
I suspect this silence is not apathy but absence of a felt
problem — most of the community has simply moved on.


I should be careful here.  The "Korean side needs narrower CJK
storage" argument was genuinely strong around 2010, and I
remember when it motivated serious engineering time.  It is much
weaker in 2026: UTF-8 has won by default, legacy survivors are
confined to wire protocols and specific applications, OLTP does
not feel the byte cost, and bulk scan is already handled
elsewhere.  I raise this not to dismiss the technical work — the
patch shows real craft and the exploration is interesting on its
own terms.  But if the cover-letter motivation rests partly on
"this will help East Asian users, including Korea," I wanted you
to have a ground-level report: for Korean users specifically, the
pressure may no longer be strong enough to justify the complexity
described.  The calculus may well differ in Japanese or Chinese
markets — that is not for me to say.


One broader question, then, that I wanted to put to you: there
are three distinct axes on which utf16 could be pursued — as a
server character set, as a data type, or as a compression angle.
The character-set direction runs straight into the "continuation
byte must not look like ASCII" rule, as you already noted, and
is therefore effectively closed on PostgreSQL.  The data-type
direction is the current patch, which carries substantial
catalogue and operator surface, while the storage wins mostly
accrue on wider values — where columnar + zstd is already doing
the work.  What still seems genuinely unaddressed in practice is
the short-value regime: word-sized strings such as names,
titles, cities, and tags, which fall below the TOAST compression
threshold and therefore never see a compressor at all.  Would
framing this as "a compression method effective on word-sized
values" be a more productive angle than either of the other two?
The storage outcome could be similar with much less surface area
to maintain.
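The short-value point is easy to see with any general-purpose stream
compressor (zlib here purely as a stand-in; the per-value overhead, not
the particular algorithm, is what matters):

```python
# Generic stream compressors carry fixed per-value overhead (headers,
# checksums, block framing), so word-sized values come out *larger*,
# while long natural-language text shrinks dramatically.  This is the
# regime where short, below-threshold columns never see a compressor.
import zlib

short = "서울특별시".encode("utf-8")                         # 15 bytes
long_text = ("서울은 대한민국의 수도이다. " * 200).encode("utf-8")

print(len(short), "->", len(zlib.compress(short)))           # grows
print(len(long_text), "->", len(zlib.compress(long_text)))   # shrinks a lot
```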


A fair counter on memory, before I go on: disk pressure has
clearly migrated elsewhere, but shared_buffers and work_mem
remain finite, and compression primarily addresses the disk
side.  A data-type approach that goes far enough to shrink the
in-memory representation — modifying every string function
along the way — tends to become a degraded form of a new
character set: doing most of the character-set work without the
character-set slot in PostgreSQL's encoding machinery, which as
above is closed.  None of the three axes therefore cleanly
solves the in-memory case; for truly memory-bound CJK workloads
the honest answer is probably just more RAM.


One concrete instantiation of that compression angle, if Korean
capacity specifically is the example that matters: take CP949
(which is what actually circulates under the EUC-KR label) as a
compression base and, for any character CP949 cannot represent,
spell it inline as a readable textual escape such as \u2603 or
U+2603 rather than a binary marker byte.  Native Korean text
then stays at two bytes per Hangul, emoji and modern Unicode
remain fully representable (at a modest cost per occurrence),
the in-memory representation stays plain UTF-8, and the on-disk
byte stream stays entirely within ASCII + CP949 — no new marker
byte, no collision with existing code paths that scan for raw
ASCII bytes.  If the source text itself contains sequences that
look like the escape syntax (for instance documentation quoting
\u-style literals), a simple doubling rule disambiguates them;
such cases are vanishingly rare in Korean business data.  This
targets exactly the short-value regime above, with far less
surface than a new data type.
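To make the shape concrete, here is a minimal encoder/decoder sketch
under two assumptions of my own (a fixed-width "\u" plus six hex digits
so astral characters fit in one escape, and the doubling rule for
literal backslash-u in source text); illustrative only, not a
proposal-quality format:

```python
# Sketch of the CP949-plus-readable-escape idea: characters CP949 can
# represent are stored as CP949 bytes; anything else becomes a "\u" +
# 6-hex-digit escape (assumed format); a literal "\u" in the source is
# doubled so the decoder can tell the two apart.  Note 0x5C ("\") can
# never appear inside a CP949 2-byte sequence, so the scan is safe.
ESC = b"\\u"

def kr_compress(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\u", i):        # doubling rule
            out += ESC + ESC
            i += 2
            continue
        ch = text[i]
        try:
            out += ch.encode("cp949")        # 1 byte ASCII, 2 bytes Hangul
        except UnicodeEncodeError:
            out += b"\\u%06X" % ord(ch)      # e.g. U+1F600 -> \u01F600
        i += 1
    return bytes(out)

def kr_decompress(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        if data.startswith(ESC, i):
            if data.startswith(ESC + ESC, i):    # doubled -> literal "\u"
                out.append("\\u")
                i += 4
            else:
                out.append(chr(int(data[i + 2:i + 8], 16)))
                i += 8
        else:
            # lead byte >= 0x80 starts a 2-byte CP949 sequence
            n = 2 if data[i] >= 0x80 else 1
            out.append(data[i:i + n].decode("cp949"))
            i += n
    return "".join(out)
```

Native Korean text stays at two bytes per syllable (a pure-Hangul value
compresses to exactly its CP949 encoding), emoji cost eight bytes per
occurrence, and the output stream contains only ASCII and CP949 bytes.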


For tighter byte density, one could go further by devising a
dedicated binary-level encoding, or by wiring zstd's external
dictionary feature into the column-compression path with a
pre-trained per-language dictionary — but either of those paths
carries its own implementation and operational costs.


Should you nonetheless decide to press on with utf16 as a data
type, I am willing to take the patch through a proper review; I
have already applied it on top of master and confirmed that the
regression tests pass, so the mechanical footing is in place.


[1] https://w3techs.com/technologies/segmentation/tld-kr-/character_encoding
[2] https://w3techs.com/technologies/segmentation/tld-jp-/character_encoding


Best regards,
Henson
