Thread: Re: Decision by Monday: PQescapeString() vs. encoding violation
On Fri, 2025-02-14 at 17:27 -0800, Noah Misch wrote: > I'm attaching a WIP patch from Andres Freund. I am not suggesting a change, but there's a minor point about the behavior of the replacement that I'd like to highlight: Unicode discusses a choice[1]: "An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors." The patch implements the latter. Escaping: <7A F0 80 80 41 7A> results in: <7A C0 20 C0 20 C0 20 41 7A> The Unicode standard suggests[2] that the former approach may provide more consistency in how it's done, but that doesn't seem important or relevant for our purposes. I'd favor whichever approach results in simpler code. Regards, Jeff Davis [1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G48534 [2] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453
Hi, On 2025-02-15 12:35:45 -0800, Jeff Davis wrote: > I am not suggesting a change, but there's a minor point about the > behavior of the replacement that I'd like to highlight: > > Unicode discusses a choice[1]: "An ill-formed subsequence consisting of > more than one code unit could be treated as a single error or as > multiple errors." > > The patch implements the latter. Escaping: > <7A F0 80 80 41 7A> > results in: > <7A C0 20 C0 20 C0 20 41 7A> > > The Unicode standard suggests[2] that the former approach may provide > more consistency in how it's done, but that doesn't seem important or > relevant for our purposes. I'd favor whichever approach results in > simpler code. It seems completely infeasible to me to to implement the "single error" approach in a minor version. It'd afaict require non-trivial new infrastructure. We can't just consume up to the next byte without a high bit, because some encodings have subsequent bytes that are not guaranteed to have a high bit set. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2025-02-15 12:35:45 -0800, Jeff Davis wrote: >> I am not suggesting a change, but there's a minor point about the >> behavior of the replacement that I'd like to highlight: >> Unicode discusses a choice[1]: "An ill-formed subsequence consisting of >> more than one code unit could be treated as a single error or as >> multiple errors." > It seems completely infeasible to me to to implement the "single error" > approach in a minor version. It'd afaict require non-trivial new > infrastructure. We can't just consume up to the next byte without a high bit, > because some encodings have subsequent bytes that are not guaranteed to have a > high bit set. Yeah. Also I think that probably depends on being able to tell the difference between a first byte and a not-first byte of a multibyte character, something that works in UTF-8 but not necessarily elsewhere. As I commented in the security thread, Unicode's recommendations seem pretty UTF-8-centric; I'm hesitant to adopt them wholesale in code that has to deal with other encodings. The v5 patch seems Good Enough(TM) to me. We can refine it later perhaps; I don't think something like the above would affect anything that external code should care about. regards, tom lane
On Sat, 2025-02-15 at 15:43 -0500, Andres Freund wrote: > It seems completely infeasible to me to to implement the "single > error" > approach in a minor version. +1, keep with the behavior in the proposed patches. Even in the next major version, I'd be inclined to try to move toward interfaces that can produce an error reliably rather than spending more time trying to use a "better" pattern of invalid bytes. I just wanted to point out that some kind of decision was made and link to relevant background information. Regards, Jeff Davis
Hi, On 2025-02-15 15:52:01 -0500, Tom Lane wrote: > The v5 patch seems Good Enough(TM) to me. Agreed. > We can refine it later perhaps; I don't think something like the above would > affect anything that external code should care about. I don't really think it's worth spending cycles on this anytime soon. It makes sense to put the effort in to replace invalid "characters" in a minimal way when intending to actually use the "stripped" output permanently. But all we're trying to do here is to a) ensure that the backend will error out b) reduce the chances that other tooling (psql, xml parsers, ..) get confused due to invalidly encoded data. Greetings, Andres Freund