Re: Trying out native UTF-8 locales on Windows - Mailing list pgsql-hackers
| From | Bryan Green |
|---|---|
| Subject | Re: Trying out native UTF-8 locales on Windows |
| Date | |
| Msg-id | 28c40eba-c169-4fdb-9f99-42d246922b84@gmail.com |
| In response to | Trying out native UTF-8 locales on Windows (Thomas Munro <thomas.munro@gmail.com>) |
| List | pgsql-hackers |
On 10/27/2025 10:22 PM, Thomas Munro wrote:
> Here's a very short patch to experiment with the idea of using
> Windows' native UTF-8 support when possible, ie when using
> "en-US.UTF-8" in a UTF-8 database. Otherwise it continues to use the
> special Windows-only wchar_t conversion that allows for locales with
> non-matching encodings, ie the reason you're allowed to use
> "English_United States.1252" in a UTF-8 database on that OS, something
> we wouldn't allow on Unix.
>
> As I understand it, that mechanism dates from the pre-Windows 10 era
> when it had no .UTF-8 locales but users wanted or needed to use UTF-8
> databases. I think some locales used encodings that we don't even
> support as server encodings, eg SJIS in Japan, so that was a
> workaround. I assume you could use "ja-JP.UTF-8" these days.
>
> CI tells me it compiles and passes, but I am not a Windows person, I'm
> primarily interested in code cleanup and removing weird platform
> differences. I wonder if someone directly interested in Windows would
> like to experiment with this and report whether (1) it works as
> expected and (2) "en-US.UTF-8" loses performance compared to "en-US"
> (which I guess uses WIN1252 encoding and triggers the conversion
> path?), and similarly for other locale pairs you might be interested
> in?
I wrote a standalone test to check this. Results on Windows 11 x64,
16 cores, ACP=1252.
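Roughly, the two UTF-8 comparison paths in the test look like this (a
simplified sketch, not the patch's actual code: the helper names are
mine, error handling and buffer sizing are glossed over, and MSVC spells
the functions _strcoll_l/_wcscoll_l):

    #include <windows.h>
    #include <locale.h>
    #include <string.h>
    #include <wchar.h>

    /* Locales are created once at startup; creating them per call
     * would dominate any timing. */
    static _locale_t loc_utf8;    /* _create_locale(LC_COLLATE, "en-US.UTF-8") */
    static _locale_t loc_legacy;  /* _create_locale(LC_COLLATE,
                                     "English_United States.1252") */

    /* Proposed path: hand the UTF-8 bytes straight to strcoll_l. */
    static int
    compare_utf8_new(const char *a, const char *b)
    {
        return _strcoll_l(a, b, loc_utf8);
    }

    /* Current path: convert UTF-8 to wchar_t first, then compare
     * under the legacy locale. */
    static int
    compare_utf8_cur(const char *a, const char *b)
    {
        wchar_t wa[1024], wb[1024];

        MultiByteToWideChar(CP_UTF8, 0, a, -1, wa, 1024);
        MultiByteToWideChar(CP_UTF8, 0, b, -1, wb, 1024);
        return _wcscoll_l(wa, wb, loc_legacy);
    }

    /* WIN1252 baseline: strcoll_l on WIN1252-encoded bytes, i.e.
     * _strcoll_l(a1252, b1252, loc_legacy). */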
(1) Correctness: PASS. strcoll_l() with the UTF-8 locale matches
wcscoll_l() for all 26 test cases (ASCII, accents, umlauts, ß, Greek,
etc.). Sorting 38 German/French words with both methods produces
identical order.
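The correctness check just compares the sign of the two paths' results
for each pair, along these lines (reusing the helpers above; the pair
list here is abbreviated, and the file is compiled with /utf-8 so the
literals are UTF-8 bytes):

    static int sgn(int x) { return (x > 0) - (x < 0); }

    static const char *pairs[][2] = {
        {"cote", "côte"}, {"Strasse", "Straße"}, {"abc", "abd"},
    };

    /* Returns the number of pairs where the two paths disagree. */
    static int
    check_correctness(void)
    {
        int fails = 0;

        for (size_t i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++)
            if (sgn(compare_utf8_new(pairs[i][0], pairs[i][1])) !=
                sgn(compare_utf8_cur(pairs[i][0], pairs[i][1])))
                fails++;
        return fails;
    }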
(2) Performance vs WIN1252: It depends on the data.
Basic comparison with short real-world strings (1M iterations each):
Test                      UTF8-new   UTF8-cur   WIN1252
----                      --------   --------   -------
'hello' vs 'world'           82 ms     108 ms     76 ms
'apple' vs 'banana'          85 ms     110 ms     77 ms
'PostgreSQL' vs 'MySQL'      89 ms     113 ms     83 ms

UTF8-new = strcoll_l with UTF-8 locale (proposed patch)
UTF8-cur = wcscoll_l via conversion (current PostgreSQL)
WIN1252  = strcoll_l with legacy locale (baseline)
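The timings come from a plain QueryPerformanceCounter loop, roughly:

    /* Time iters comparisons of a vs b; returns milliseconds.  The
     * volatile sink keeps the optimizer from discarding the calls. */
    static double
    bench(int (*cmp)(const char *, const char *),
          const char *a, const char *b, int iters)
    {
        LARGE_INTEGER freq, t0, t1;
        volatile int sink = 0;

        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        for (int i = 0; i < iters; i++)
            sink += cmp(a, b);
        QueryPerformanceCounter(&t1);
        (void) sink;
        return (double) (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;
    }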
For ASCII strings, there's a crossover around 15-20 characters
(500K iterations each):
Length    UTF8     WIN1252   Ratio
------    ----     -------   -----
     5    43 ms     40 ms    0.93x (UTF8 7% slower)
    10    50 ms     48 ms    0.96x (UTF8 4% slower)
    20    57 ms     65 ms    1.13x (UTF8 13% faster)
    50   104 ms    122 ms    1.17x (UTF8 17% faster)
   100   150 ms    195 ms    1.30x (UTF8 30% faster)
   500   550 ms    783 ms    1.43x (UTF8 43% faster)
For accented characters (á is 2 bytes in UTF-8, 1 byte in WIN1252),
UTF-8 trends toward ~2x slower as strings get longer, as expected from
the doubled byte count (500K iterations each):
Chars    UTF8     WIN1252   Ratio
-----    ----     -------   -----
    5    55 ms     42 ms    0.76x
   50   233 ms    117 ms    0.50x
  200   694 ms    342 ms    0.49x
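The byte count difference is just the encodings, e.g.:

    /* "á" (U+00E1) in the two encodings: */
    const char a_acute_utf8[]    = "\xC3\xA1";  /* 2 bytes */
    const char a_acute_win1252[] = "\xE1";      /* 1 byte  */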
With 200-char ASCII strings, UTF-8 beats WIN1252 even when the
difference is at position 0 (500K iterations each):
Difference at    UTF8     WIN1252
-------------    ----     -------
Position 0      168 ms    260 ms
Position 199    252 ms    342 ms
This suggests the WIN1252 strcoll_l path scales worse with string
length than the UTF-8 implementation does. I don't have an explanation
for why that is.
The patch behaves correctly, and the new strcoll_l() path is 10-25%
faster than the current wcscoll_l() conversion path. Whether a UTF-8
locale is faster or slower than WIN1252 depends on string length and
content, but users choosing UTF-8 locales presumably want Unicode
support, not WIN1252 compatibility.
I'm happy to test more if needed, and can share the test program with
anyone who wants it.
--
Bryan Green
EDB: https://www.enterprisedb.com