Re: Questionable description about character sets - Mailing list pgsql-hackers
| From | Henson Choi |
|---|---|
| Subject | Re: Questionable description about character sets |
| Date | |
| Msg-id | CAAAe_zBdGXsALm=GkUPtPx9MLcjcM5hBg3HZU+nh8gKXSjXJJw@mail.gmail.com |
| In response to | Re: Questionable description about character sets (Tatsuo Ishii <ishii@postgresql.org>) |
| List | pgsql-hackers |
Thanks Thomas for looping me in, and thanks Tatsuo-san for driving
this. Before getting to the Korean Description-column wording
itself, the main thing I want to surface from my audit is a pair of
Bytes/Char corrections on this very table -- they turn out to be
the most concrete thing I can offer.
* JOHAB row Bytes/Char = 1-3. This is wrong. I posted a
separate patch for bug #19354 [1] that rewrites
pg_johab_mblen() / pg_johab_verifychar() to follow
KS X 1001:2004 Annex 3 Table 1 directly, instead of borrowing
from pg_euc_mblen() / IS_EUC_RANGE_VALID(). (JOHAB's Hangul
lead-byte range 0x84-0xD3 spans 0x8E and 0x8F, which EUC
reserves as SS2/SS3, so it was never an EUC profile to begin
with.) That patch also corrects pg_wchar_table's maxmblen for
JOHAB from 3 to 2 and the Bytes/Char column of this same
Table 23.3 from "1-3" to "1-2".
* EUC_KR row Bytes/Char = 1-3. Overstated in the same way, but
with a twist: the validator is already correct. EUC-KR per
KS X 2901 / RFC 1557 designates only G0 (ASCII) and G1
(KS X 1001), so the maximum valid sequence length is 2.
pg_euckr_verifychar() already rejects 0x8E and 0x8F via
IS_EUC_RANGE_VALID (0xA1-0xFE), so no 3-byte sequence is ever
accepted in practice. The stale "3" only survives in
pg_wchar_table[PG_EUC_KR].maxmblen and in this docs cell, as a
leftover from pg_euckr_mblen() delegating to the shared
pg_euc_mblen(). Correcting both to 2 is a pure cleanup with
no behavior change and no backward-compatibility impact.
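
To make the EUC_KR point concrete, here is a minimal standalone
sketch of the two pieces involved. It is not the code in
src/common/wchar.c: the sketch_* names are hypothetical, the length
rule is my simplified reading of a generic EUC-style mblen (where SS3
announces a 3-byte sequence), and the verifier assumes the G0/G1-only
reading of EUC-KR described above.

```c
#include <stdbool.h>
#include <stddef.h>

#define SS2 0x8E				/* EUC single shift 2 */
#define SS3 0x8F				/* EUC single shift 3 */

/*
 * Generic EUC-style length rule (assumed, simplified): SS3 announces a
 * 3-byte sequence, which is where a maximum length of 3 would come from.
 */
static int
sketch_euc_mblen(const unsigned char *s)
{
	if (*s == SS2)
		return 2;
	if (*s == SS3)
		return 3;
	return (*s & 0x80) ? 2 : 1;
}

/* The 0xA1-0xFE range the message attributes to IS_EUC_RANGE_VALID. */
static bool
sketch_euc_range_valid(unsigned char c)
{
	return c >= 0xA1 && c <= 0xFE;
}

/*
 * Returns the length of the character at s (1 or 2), or -1 if the
 * bytes are not valid EUC-KR under the G0/G1-only reading.
 */
static int
sketch_euckr_verifychar(const unsigned char *s, size_t len)
{
	if (len < 1)
		return -1;
	if (!(s[0] & 0x80))
		return 1;				/* G0: ASCII */
	if (len < 2)
		return -1;				/* truncated 2-byte character */
	if (!sketch_euc_range_valid(s[0]) || !sketch_euc_range_valid(s[1]))
		return -1;				/* covers SS2 (0x8E) and SS3 (0x8F) leads */
	return 2;					/* G1: KS X 1001 */
}
```

Under these assumptions the length rule can report 3 for a lead byte
of 0x8F, but the verifier never returns anything larger than 2, which
is the gap between the advertised maximum and the accepted one.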
If the JOHAB fix lands first, that row's Bytes/Char can inherit
the corrected value. For EUC_KR, I could go either way and would
rather let you pick the direction: fold the maxmblen/docs cleanup
into v1 (since the change is behavior-free), or keep it out and
let me post it as its own small patch in a separate thread (since
it touches src/common/wchar.c as well as the docs, while your v1
is docs-only). I'm happy to prepare it either way.
As for the Korean Description-column wording itself, I'd rather
offer input than a finished proposal -- I'm honestly not confident
about the right naming convention, especially for UHC. For what
it's worth:
* EUC_KR's coded character set is just KS X 1001 (plus ASCII);
there is no KS equivalent of JIS X 0212.
* JOHAB shares the same character repertoire as EUC_KR --
KS X 1001 + ASCII -- and simply arranges those characters into
bytes via the combinational code in Annex 3. So if the column
is about coded character sets rather than encodings, JOHAB's
entry would arguably read identically to EUC_KR's. That's
actually a clean illustration of the encoding-vs-character-set
distinction you raised in the original post.
* UHC / CP949 is the Microsoft superset of EUC-KR that adds the
8822 precomposed Hangul syllables missing from KS X 1001 (for a
total of 11172), but those
extra syllables aren't standardized as a separately-named
coded character set as far as I know -- "CP949" tends to refer
to the encoding. I don't have a confident answer for the
wording; if you have a preferred convention I'll defer to it.
(Structural note in passing: despite the "superset of EUC-KR"
framing, UHC is not itself an EUC profile. To fit the extra
syllables, it extends the lead-byte range down to 0x81, which
necessarily swallows 0x8E and 0x8F -- the bytes EUC reserves
as SS2 and SS3. So by extending EUC-KR, CP949 steps outside
the EUC family. Mentioning this only because it mirrors the
JOHAB situation.)
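
The structural point about lead-byte ranges can be checked
mechanically. A small sketch, with a hypothetical helper name; the
JOHAB range 0x84-0xD3 and the EUC range 0xA1-0xFE are the ones quoted
in this message, while the upper bound 0xFE for UHC lead bytes is my
assumption (only the lower bound 0x81 is stated above).

```c
#include <stdbool.h>

#define SS2 0x8E				/* EUC single shift 2 */
#define SS3 0x8F				/* EUC single shift 3 */

/*
 * True if the inclusive lead-byte range [lo, hi] contains either of
 * the bytes EUC reserves as single shifts.
 */
static bool
lead_range_hits_single_shift(unsigned char lo, unsigned char hi)
{
	return (lo <= SS2 && SS2 <= hi) || (lo <= SS3 && SS3 <= hi);
}
```

With those ranges, the helper returns false for EUC-KR (0xA1-0xFE) and
true for both JOHAB Hangul (0x84-0xD3) and UHC (0x81-0xFE).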
One more observation, and apologies in advance for wandering a bit
beyond the scope of this thread: while auditing those code paths I
noticed that pg_uhc_verifychar() appears quite loose on trail
bytes (it only rejects \0), while CP949's actual trail-byte range
is somewhat narrower. Tightening this would be a real behavior
change -- existing databases may contain byte sequences that are
currently accepted but would be rejected under a stricter verifier
-- so it needs its own discussion. I'll raise that in a
separate thread regardless of how the EUC_KR question above is
resolved. (UHC's 1-2 / maxmblen = 2 are already correct, so this
is purely a verifier-strictness question, not a table-cell
question.)
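
For a rough sense of the size of that behavior change, here is a
sketch comparing a trail-byte check that only rejects \0 with a
stricter one. The function names are hypothetical, and the strict
ranges (0x41-0x5A, 0x61-0x7A, 0x81-0xFE) are an assumption taken from
commonly published CP949 descriptions, not something settled in this
thread; any real patch would need to agree on them first.

```c
#include <stdbool.h>

/* Loose check, as described above: any trail byte except \0 passes. */
static bool
sketch_uhc_trail_loose(unsigned char c)
{
	return c != '\0';
}

/* Stricter check using the assumed CP949 trail-byte ranges. */
static bool
sketch_uhc_trail_strict(unsigned char c)
{
	return (c >= 0x41 && c <= 0x5A) ||
		(c >= 0x61 && c <= 0x7A) ||
		(c >= 0x81 && c <= 0xFE);
}

/* Counts trail-byte values the loose check accepts but the strict one rejects. */
static int
sketch_count_newly_rejected(void)
{
	int			count = 0;

	for (int c = 0; c <= 0xFF; c++)
	{
		if (sketch_uhc_trail_loose((unsigned char) c) &&
			!sketch_uhc_trail_strict((unsigned char) c))
			count++;
	}
	return count;
}
```

Under the assumed ranges, 77 of the 255 currently accepted trail-byte
values would start being rejected, which is why this needs consensus
rather than a drive-by fix.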
So in summary: the UHC verifier question will go to a
separate thread from my side (behavior change, needs consensus),
and the EUC_KR cleanup will go to either v1 or a separate thread
depending on your call above. Neither should block your v1 patch;
the only pieces that touch the same table cells are the two
Bytes/Char corrections, handled by [1] (JOHAB) and by the
EUC_KR cleanup, wherever it ends up.
[1] https://postgr.es/m/19354-eefe6d8b3e84f9f2@postgresql.org
Regards,
Henson Choi