Re: [PATCH] json_lex_string: don't overread on bad UTF8 - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date
Msg-id CAOYmi+=BomJrQUBgy5FQY9ZtHvuK7WOJNB6foPUv21qfb2+YPw@mail.gmail.com
In response to Re: [PATCH] json_lex_string: don't overread on bad UTF8  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: [PATCH] json_lex_string: don't overread on bad UTF8
List pgsql-hackers
On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 30.04.24 19:39, Jacob Champion wrote:
> > Tangentially: Should we maybe rethink pieces of the json_lex_string
> > error handling? For example, do we really want to echo an incomplete
> > multibyte sequence once we know it's bad?
>
> I can't quite find the place you might be looking at in
> json_lex_string(),

(json_lex_string() reports the beginning and end of the "area of
interest" via the JsonLexContext; it's json_errdetail() that turns
that into an error message.)
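To make that concrete, here is a minimal sketch (plain Python, not the PostgreSQL code) of the hazard: an error reporter that echoes the raw bytes between token_start and token_terminator can end up emitting an incomplete multibyte sequence when the token was cut off mid-character. The buffer contents below are illustrative.

```python
# Sketch only: echoing the "area of interest" verbatim, the way an
# errdetail-style message might, can surface dangling lead bytes.
buf = b'"\xe8\xa1'  # a string literal truncated mid-UTF-8
                    # (0xE8 0xA1 are the first two bytes of U+8868)
token_start, token_terminator = 0, len(buf)

detail = b"invalid input: " + buf[token_start:token_terminator]
# 'detail' now ends with the incomplete sequence 0xE8 0xA1, which a
# terminal or log consumer has to cope with somehow.
```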

> but for the general encoding conversion we have what
> would appear to be the same behavior in report_invalid_encoding(), and
> we go out of our way there to produce a verbose error message including
> the invalid data.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes around between delimiters, and it's
up to the caller to handle encodings... the exceptions to that are the
\uXXXX escapes and the error handling.

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")
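For the record, the Shift-JIS worry is easy to demonstrate: it is not self-synchronizing because the trail byte of a multibyte character can be a plain ASCII value. The classic case is '表' (U+8868), whose Shift-JIS encoding is the two bytes 0x95 0x5C:

```python
# Shift-JIS trail bytes overlap ASCII.  '表' (U+8868) encodes as
# 0x95 0x5C, and 0x5C by itself is ASCII backslash.
b = "表".encode("shift_jis")
assert b == b"\x95\x5c"

# Echo only part of the sequence and the dangling trail byte is
# indistinguishable from '\':
assert b[1:] == b"\\"
```

So printing a partial byte sequence in a Shift-JIS locale can silently change what the surrounding text means, rather than just showing mojibake.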

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)
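The ASCII-in-trail-bytes limitation can also be shown with a toy (again Shift-JIS, again not the actual lexer): a scanner that walks the input byte by byte looking for '\' escapes, as one assuming an ASCII-safe encoding would, misfires inside a multibyte character.

```python
# A byte-oriented scan for backslash escapes over Shift-JIS input.
js = '"表"'.encode("shift_jis")  # b'"\x95\\"'

backslash_positions = [i for i, byte in enumerate(js) if byte == 0x5C]
# The scanner reports a backslash at index 2 -- but that byte is the
# trail byte of '表', not an escape character, so a byte-oriented
# lexer would misparse the string.
```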

Thanks,
--Jacob


