Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 - Mailing list pgsql-hackers

From Zhongpu Chen
Subject Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date
Msg-id CA+1gyqLj1squrZMTu2WkjoqVBeyPx=dozfaOa5XydoBjvLp02w@mail.gmail.com
In response to Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8  (Zhongpu Chen <chenloveit@gmail.com>)
List pgsql-hackers
I ran a benchmark comparing the performance of various validation strategies over a Chinese classic novel: https://github.com/SWUFE-DB-Group/NUAV/blob/main/encoding-validation/NUAV/src/gb2312.rs

The running log of `cargo bench -- gb2312`:

```
     Running benches/gb2312.rs (target/release/deps/gb2312-53d8e01b8e6785c8)
gb2312::is_gb2312_iconv time:   [2.5621 ms 2.5681 ms 2.5740 ms]
                        change: [-2.6404% -2.3284% -2.0023%] (p = 0.00 < 0.05)
                        Performance has improved.

gb2312::is_gb2312_icu   time:   [3.2427 ms 3.2589 ms 3.2771 ms]
                        change: [-1.5710% -1.0409% -0.4387%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

gb2312::is_gb2312_rs    time:   [2.8157 ms 2.8229 ms 2.8303 ms]
                        change: [-1.6985% -1.2165% -0.7501%] (p = 0.00 < 0.05)
                        Change within noise threshold.

Benchmarking gb2312::is_gb2312_range: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.3s, enable flat sampling, or reduce sample count to 50.
gb2312::is_gb2312_range time:   [1.6237 ms 1.6294 ms 1.6351 ms]
                        change: [+3.8720% +4.2901% +4.6933%] (p = 0.00 < 0.05)
                        Performance has regressed.

gb2312::is_gb2312_lookup
                        time:   [488.12 µs 490.04 µs 491.97 µs]
                        change: [+0.9273% +2.2343% +3.2599%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

gb2312::is_gb2312_simd  time:   [181.00 µs 181.77 µs 182.53 µs]
                        change: [-4.4563% -3.6971% -3.0260%] (p = 0.00 < 0.05)
                        Performance has improved.

gb2312:is_gb2312_ranges_pg
                        time:   [467.69 µs 469.27 µs 470.82 µs]

Benchmarking gb2312:is_gb2312_ranges_mysql: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
gb2312:is_gb2312_ranges_mysql
                        time:   [1.2611 ms 1.2667 ms 1.2724 ms]

```

As we can see, the PG-style (structural) validation does not bring much of a performance advantage; it is actually slower than my stricter variants.
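To make the distinction concrete, here is a minimal sketch (not the benchmarked code) of a structural check versus a slightly stricter one. The PG-style ranges are my assumption, modeled on `IS_EUC_RANGE_VALID` (0xA1..0xFE for both bytes) in PostgreSQL's wchar.c; the stricter variant additionally rejects lead bytes beyond 0xF7, since GB2312 only defines rows 1..87:

```rust
/// Structural check: every two-byte sequence has both bytes in
/// 0xA1..=0xFE, regardless of whether the code point is assigned
/// in GB2312. (Assumed PG-style behavior.)
fn is_euc_cn_structural(s: &[u8]) -> bool {
    let mut i = 0;
    while i < s.len() {
        let b = s[i];
        if b < 0x80 {
            i += 1; // ASCII byte
        } else if (0xA1..=0xFE).contains(&b)
            && i + 1 < s.len()
            && (0xA1..=0xFE).contains(&s[i + 1])
        {
            i += 2; // structurally valid two-byte sequence
        } else {
            return false;
        }
    }
    true
}

/// Stricter check: also reject lead bytes above 0xF7 (GB2312 rows
/// 88..94 are unassigned). Still coarser than a full lookup table.
fn is_gb2312_strict_rows(s: &[u8]) -> bool {
    let mut i = 0;
    while i < s.len() {
        let b = s[i];
        if b < 0x80 {
            i += 1;
        } else if (0xA1..=0xF7).contains(&b)
            && i + 1 < s.len()
            && (0xA1..=0xFE).contains(&s[i + 1])
        {
            i += 2;
        } else {
            return false;
        }
    }
    true
}

fn main() {
    // 0xF8 0xA1 is structurally plausible EUC but lies outside
    // GB2312's assigned rows.
    assert!(is_euc_cn_structural(&[0xF8, 0xA1]));
    assert!(!is_gb2312_strict_rows(&[0xF8, 0xA1]));
    // "中" is 0xD6 0xD0 in EUC-CN; both checks accept it.
    assert!(is_euc_cn_structural(&[0xD6, 0xD0]));
    assert!(is_gb2312_strict_rows(&[0xD6, 0xD0]));
}
```

The stricter row check catches byte pairs such as `F8 A1` that a purely structural check stores happily but that the EUC_CN-to-UTF8 conversion later rejects.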

On Sat, May 2, 2026 at 12:49 PM Zhongpu Chen <chenloveit@gmail.com> wrote:

Thanks for the clarification.


I agree that validation on every input may raise runtime-cost concerns. But this cost can be well controlled. For example, MySQL performs a finer-grained check for EUC-CN (i.e., GB2312) in https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:


```

static int func_gb2312_uni_onechar(int code) {
  if ((code >= 0x2121) && (code <= 0x2658))
    return (tab_gb2312_uni0[code - 0x2121]);
  if ((code >= 0x2721) && (code <= 0x296F))
    return (tab_gb2312_uni1[code - 0x2721]);
  if ((code >= 0x3021) && (code <= 0x777E))
    return (tab_gb2312_uni2[code - 0x3021]);
  return (0);
}

```

where `code` is obtained by subtracting 0x8080 from the two EUC bytes (i.e., clearing the high bit of each byte). Of course, MySQL's check could be tightened further as well.
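As a rough illustration of that subtraction, here is a sketch (in Rust, mirroring only the range tests of the MySQL snippet above, not its lookup tables; the real function may still return 0 for unassigned cells inside these ranges):

```rust
/// Map an EUC-CN two-byte sequence to the GB2312 code that MySQL's
/// func_gb2312_uni_onechar() expects: combine the bytes into a 16-bit
/// value and subtract 0x8080. Both inputs are assumed to have the
/// high bit set (i.e., to be the two bytes of an EUC sequence).
fn euc_to_gb2312_code(b1: u8, b2: u8) -> u16 {
    (((b1 as u16) << 8) | b2 as u16) - 0x8080
}

/// The three code ranges tested in the MySQL snippet.
fn in_mysql_gb2312_ranges(code: u16) -> bool {
    (0x2121..=0x2658).contains(&code)
        || (0x2721..=0x296F).contains(&code)
        || (0x3021..=0x777E).contains(&code)
}

fn main() {
    // "中" is 0xD6 0xD0 in EUC-CN; its GB2312 code is 0x5650.
    let code = euc_to_gb2312_code(0xD6, 0xD0);
    assert_eq!(code, 0x5650);
    assert!(in_mysql_gb2312_ranges(code));
    // 0xA1 0xA0 maps to 0x2120, which falls outside all three ranges.
    assert!(!in_mysql_gb2312_ranges(euc_to_gb2312_code(0xA1, 0xA0)));
}
```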


Anyway, it is reasonable to note these details in the documentation.


On Sat, May 2, 2026 at 11:28 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Friday, May 1, 2026, Zhongpu Chen <chenloveit@gmail.com> wrote:
The issue is not specific to E'\\x..' literals. A normal COPY FROM data file with ENCODING 'EUC_CN' can create text rows that later cannot be retrieved with SELECT.

 This suggests that input validation for EUC_CN is only structural, while the EUC_CN-to-UTF8 conversion table is stricter.

I suspect a lack of desire to maintain and ensure that specific values are verified; or accepting the runtime cost to do so.  It is indeed structural.  This point should probably be documented better.  But it’s hard to feel too bad if the input claims it is providing verifiable EUC_CN data then proceeds to supply data that lacks meaning in reality.  We are happy to just store and return your data to you - but it’s unreasonable to ask for it to be converted.  It would be nice for the database to provide an extra layer of protection, so I’m not against the idea.  Either automatically or at least providing a function that could, say, be called in a trigger for opt-in.  But definitely feels like a problematic benefit-to-cost proposition.

David J.



--
Zhongpu Chen


--
Zhongpu Chen
