Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 - Mailing list pgsql-hackers

From	Zhongpu Chen
Subject	Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date	May 2 07:49:00
Msg-id	CA+1gyqJwhQ5n4VZmJdnouaq7yMgYR+w_RiY=A6VWz4TzcUiHkw@mail.gmail.com
In response to	Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 ("David G. Johnston" <david.g.johnston@gmail.com>)
List	pgsql-hackers

Thanks for the clarification.

I agree that validating every input may raise runtime-cost concerns, but that cost can be kept under control. For example, MySQL applies a finer-grained check for EUC-CN (i.e., GB2312) in https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:

```
static int func_gb2312_uni_onechar(int code) {
  if ((code >= 0x2121) && (code <= 0x2658))
    return (tab_gb2312_uni0[code - 0x2121]);
  if ((code >= 0x2721) && (code <= 0x296F))
    return (tab_gb2312_uni1[code - 0x2721]);
  if ((code >= 0x3021) && (code <= 0x777E))
    return (tab_gb2312_uni2[code - 0x3021]);
  return (0);
}
```

where `code` is obtained by subtracting 0x8080 from the two-byte EUC-CN value. Of course, MySQL's check could itself be made stricter.
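
To make the difference concrete, here is a minimal sketch contrasting a purely structural check with a range check modeled on the three ranges in the MySQL function above. The function names are hypothetical, and the structural rule (both bytes in 0xA1..0xFE) is an assumed simplification, not a copy of PostgreSQL's verifier. Passing the range check still does not guarantee a mapping, since a lookup table can presumably hold 0 for unassigned slots.

```c
#include <stdbool.h>
#include <stdint.h>

/* Structural check only (assumed simplification, not PostgreSQL's code):
 * lead byte and trail byte must both lie in 0xA1..0xFE. */
static bool euc_cn_structurally_valid(uint8_t b1, uint8_t b2)
{
    return b1 >= 0xA1 && b1 <= 0xFE && b2 >= 0xA1 && b2 <= 0xFE;
}

/* Finer check (hypothetical helper): subtract 0x8080 from the two-byte
 * value and require the result to fall in one of the three ranges used
 * by the quoted MySQL function. */
static bool euc_cn_in_gb2312_ranges(uint8_t b1, uint8_t b2)
{
    if (!euc_cn_structurally_valid(b1, b2))
        return false;

    int code = (((int) b1 << 8) | b2) - 0x8080;

    return (code >= 0x2121 && code <= 0x2658) ||
           (code >= 0x2721 && code <= 0x296F) ||
           (code >= 0x3021 && code <= 0x777E);
}
```

For instance, the bytes 0xAA 0xA1 pass `euc_cn_structurally_valid` but give `code` 0x2A21, which falls in the gap between 0x296F and 0x3021, so `euc_cn_in_gb2312_ranges` rejects them. The extra cost per two-byte character is one subtraction and a handful of comparisons.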

Anyway, it is reasonable to note these details in the documentation.

On Sat, May 2, 2026 at 11:28 AM David G. Johnston <david.g.johnston@gmail.com> wrote:

On Friday, May 1, 2026, Zhongpu Chen <chenloveit@gmail.com> wrote:
The issue is not specific to E'\\x..' literals. A normal COPY FROM data file with ENCODING 'EUC_CN' can create text rows that later cannot be retrieved with SELECT.

This suggests that input validation for EUC_CN is only structural, while the EUC_CN-to-UTF8 conversion table is stricter.

I suspect a lack of desire to maintain and ensure that specific values are verified; or accepting the runtime cost to do so. It is indeed structural. This point should probably be documented better. But it’s hard to feel too bad if the input claims it is providing verifiable EUC_CN data then proceeds to supply data that lacks meaning in reality. We are happy to just store and return your data to you - but it’s unreasonable to ask for it to be converted. It would be nice for the database to provide an extra layer of protection, so I’m not against the idea. Either automatically or at least by providing a function that could, say, be called in a trigger for opt-in. But definitely feels like a problematic benefit-to-cost proposition.

David J.

Zhongpu Chen
