Re: Experimenting with wider Unicode storage - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: Experimenting with wider Unicode storage |
| Date | |
| Msg-id | CA+hUKGJvFV3Bd=dxN1C2eOvhxAki363j1jmoxrkw2MkyK_3Kig@mail.gmail.com |
| In response to | Re: Experimenting with wider Unicode storage (Henson Choi <assam258@gmail.com>) |
| List | pgsql-hackers |
On Tue, Apr 21, 2026 at 1:16 PM Henson Choi <assam258@gmail.com> wrote:
> Thank you again for sharing this exploration, and for including
> Korean in your experiment table. Rather than comment on the
> patch itself, let me offer a ground-level report on where Korean
> encoding reality sits in April 2026, because the picture has
> shifted enough that I think it is worth entering into the record
> before this thread accumulates momentum on motivations that may
> no longer fully hold on this side of the region.

Hi Henson,

Thank you for this thoughtful and broad feedback, which provided a lot
of useful context. I appreciated all of it, and have responses to a
couple of the most actionable paragraphs:

> One broader question, then, that I wanted to put to you: there
> are three distinct axes on which utf16 could be pursued -- as a
> server character set, as a data type, or as a compression angle.
> The character-set direction runs straight into the "continuation
> byte must not look like ASCII" rule, as you already noted, and
> is therefore effectively closed on PostgreSQL. The data-type
> direction is the current patch, which carries substantial
> catalogue and operator surface, while the storage wins mostly
> accrue on wider values -- where columnar + zstd is already doing
> the work. What still seems genuinely unaddressed in practice is
> the short-value regime: word-sized strings such as names,
> titles, cities, and tags, which fall below the TOAST compression
> threshold and therefore never see a compressor at all. Would
> framing this as "a compression method effective on word-sized
> values" be a more productive angle than either of the other two?
> The storage outcome could be similar with much less surface area
> to maintain.

Yeah, that is an interesting angle that I hadn't considered, at least
not with that framing. There are even a couple of Unicode standards
that might apply here, and that I believe some other systems are using:

https://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode
https://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_Unicode
https://www.unicode.org/notes/tn6/

BOCU-1 maintains binary codepoint order and reports typical
English/French as no size change compared to UTF-8,
Greek/Russian/Arabic/Hebrew as -40%, Hindi as -60% (this makes sense:
it's almost a generalised ISCII, so you get down to one byte per
character in any given Indian language), Japanese as -40%, and
Chinese/Korean as -25% (Japanese presumably wins with kana sequences).

One of the ideas already mentioned in comments in the experimental
patch was that the iterator abstraction could allow for incremental
decompression, and I suppose there might be a way to expand BOCU-1 or
similar to UTF-8 incrementally in that layer. I haven't looked into
that seriously though; so far I had only been thinking of that as a
way of generalising some open-coded special cases that appear in a few
places to avoid detoasting. ICU might also be able to consume it
incrementally, IDK.

zstd etc can clearly compress much more than that, as you say, but
then you have to deal with dictionary problems, and that's hard to do
for small values in a row-oriented system. BOCU-1 is dictionary-free,
so you can decode it in a single forward pass with only a tiny amount
of state in a register or two, which seems to be potentially along the
lines you're suggesting. Food for thought.
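To make that a little more concrete, here is a toy sketch (in C,
untested, with made-up names and thresholds) of the delta-encoding
idea BOCU-1 is built around. It is deliberately not BOCU-1 itself: the
real scheme has intermediate two- and three-byte ranges and more
careful block alignment, which is where the quoted ratios come from.
The point is only that runs of characters from one small alphabet
collapse towards a byte per character, decoding needs about one
codepoint of state, and memcmp() order follows codepoint order.

/*
 * Illustrative sketch only: an order-preserving codepoint-delta
 * encoder in the spirit of BOCU-1, not the real BOCU-1 algorithm.
 */
#include <stddef.h>
#include <stdint.h>

/*
 * Align the "previous" state to the middle of a 128-codepoint block,
 * so consecutive characters from one small alphabet give tiny deltas.
 */
static int32_t
adjust_prev(int32_t cp)
{
    return (cp & ~0x7F) + 0x40;
}

/* Encode one codepoint given the running state; returns bytes written. */
static size_t
encode_cp(int32_t cp, int32_t *prev, uint8_t *out)
{
    int32_t delta = cp - *prev;

    *prev = adjust_prev(cp);

    if (delta >= -63 && delta <= 63)
    {
        /* Single byte, centred on 0x80 so byte order tracks delta order. */
        out[0] = (uint8_t) (0x80 + delta);
        return 1;
    }
    if (delta < -63)
    {
        /*
         * Lead byte below the single-byte range keeps large negative
         * deltas sorting first; the bias keeps the payload non-negative.
         */
        int32_t v = delta + 0x110000;

        out[0] = 0x01;
        out[1] = (uint8_t) (v >> 16);
        out[2] = (uint8_t) (v >> 8);
        out[3] = (uint8_t) v;
        return 4;
    }
    /* Large positive delta: lead byte above the single-byte range. */
    out[0] = 0xFE;
    out[1] = (uint8_t) (delta >> 16);
    out[2] = (uint8_t) (delta >> 8);
    out[3] = (uint8_t) delta;
    return 4;
}

/* Encode a codepoint sequence; caller provides 4 bytes per codepoint. */
static size_t
encode_string(const int32_t *cps, size_t ncps, uint8_t *out)
{
    int32_t prev = 0x40;    /* initial state, like BOCU-1's reset value */
    size_t  len = 0;

    for (size_t i = 0; i < ncps; i++)
        len += encode_cp(cps[i], &prev, out + len);
    return len;
}

A decoder is just the same thing in reverse, one forward pass carrying
the same tiny state, which is the property that makes me think it could
plausibly be done lazily in that iterator layer, though I haven't tried.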
> A fair counter on memory, before I go on: disk pressure has
> clearly migrated elsewhere, but shared_buffers and work_mem
> remain finite, and compression primarily addresses the disk
> side. A data-type approach that goes far enough to shrink the
> in-memory representation -- modifying every string function
> along the way -- tends to become a degraded form of a new
> character set: doing most of the character-set work without the
> character-set slot in PostgreSQL's encoding machinery, which as
> above is closed. None of the three axes therefore cleanly
> solves the in-memory case; for truly memory-bound CJK workloads
> the honest answer is probably just more RAM.

Yeah. It's an annoying set of constraints that led me to consider
this, while surveying text handling choices made in lots of database
systems. Of course it wouldn't be my preference to introduce a new
type, but I couldn't see how else to fit it in, and since I was
already investigating "modifying every string function along the way"
for other reasons, I wanted to explore what it would take to do that
generically enough to handle something as different as this while
remaining maintainable...

BTW here is the link that I forgot to add to the bottom of my earlier
email as reference [3], which is a blog post from when SQL Server
introduced the *opposite* thing: UTF-8 support (like Windows itself,
in 2019). Previously they had only legacy single/multi-byte encodings
in VARCHAR and UTF-16 in NVARCHAR, so there they were discussing this
tradeoff in reverse, i.e. space savings for some languages, but they
reported a 25% increase in disk I/O for CJK databases moved to UTF-8.
(I don't immediately know why SCSU didn't fix that.)

https://techcommunity.microsoft.com/blog/sqlserver/introducing-utf-8-support-for-sql-server/734928

> Should you nonetheless decide to press on with utf16 as a data
> type, I am willing to take the patch through a proper review; I
> have already applied it on top of master and confirmed that the
> regression tests pass, so the mechanical footing is in place.

Thanks. I'm not planning to do more with the "separate UTF-16 type"
concept at this stage, based on your feedback so far. I am still
working on a couple of text/encoding refactoring prototypes with other
goals, and will try to think about that "special Unicode compression"
angle while doing so.