Re: [PATCH] json_lex_string: don't overread on bad UTF8 - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date
Msg-id CAOYmi+=BomJrQUBgy5FQY9ZtHvuK7WOJNB6foPUv21qfb2+YPw@mail.gmail.com
In response to Re: [PATCH] json_lex_string: don't overread on bad UTF8  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: [PATCH] json_lex_string: don't overread on bad UTF8
List pgsql-hackers
On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 30.04.24 19:39, Jacob Champion wrote:
> > Tangentially: Should we maybe rethink pieces of the json_lex_string
> > error handling? For example, do we really want to echo an incomplete
> > multibyte sequence once we know it's bad?
>
> I can't quite find the place you might be looking at in
> json_lex_string(),

(json_lex_string() reports the beginning and end of the "area of
interest" via the JsonLexContext; it's json_errdetail() that turns
that into an error message.)
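To make that concrete, here is a minimal sketch (plain Python, not the PostgreSQL code) of the hazard: an error reporter that echoes the raw bytes between token_start and token_terminator can end up emitting an incomplete multibyte sequence when the token was cut off mid-character. The buffer contents below are illustrative.

```python
# Sketch only: echoing the "area of interest" verbatim, the way an
# errdetail-style message might, can surface dangling lead bytes.
buf = b'"\xe8\xa1'  # a string literal truncated mid-UTF-8
                    # (0xE8 0xA1 are the first two bytes of U+8868)
token_start, token_terminator = 0, len(buf)

detail = b"invalid input: " + buf[token_start:token_terminator]
# 'detail' now ends with the incomplete sequence 0xE8 0xA1, which a
# terminal or log consumer has to cope with somehow.
```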

> but for the general encoding conversion we have what
> would appear to be the same behavior in report_invalid_encoding(), and
> we go out of our way there to produce a verbose error message including
> the invalid data.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes around between delimiters, and it's
up to the caller to handle encodings... the exceptions to that are the
\uXXXX escapes and the error handling.

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")
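For the record, the Shift-JIS worry is easy to demonstrate: it is not self-synchronizing because the trail byte of a multibyte character can be a plain ASCII value. The classic case is '表' (U+8868), whose Shift-JIS encoding is the two bytes 0x95 0x5C:

```python
# Shift-JIS trail bytes overlap ASCII.  '表' (U+8868) encodes as
# 0x95 0x5C, and 0x5C by itself is ASCII backslash.
b = "表".encode("shift_jis")
assert b == b"\x95\x5c"

# Echo only part of the sequence and the dangling trail byte is
# indistinguishable from '\':
assert b[1:] == b"\\"
```

So printing a partial byte sequence in a Shift-JIS locale can silently change what the surrounding text means, rather than just showing mojibake.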

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)
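The ASCII-in-trail-bytes limitation can also be shown with a toy (again Shift-JIS, again not the actual lexer): a scanner that walks the input byte by byte looking for '\' escapes, as one assuming an ASCII-safe encoding would, misfires inside a multibyte character.

```python
# A byte-oriented scan for backslash escapes over Shift-JIS input.
js = '"表"'.encode("shift_jis")  # b'"\x95\\"'

backslash_positions = [i for i, byte in enumerate(js) if byte == 0x5C]
# The scanner reports a backslash at index 2 -- but that byte is the
# trail byte of '表', not an escape character, so a byte-oriented
# lexer would misparse the string.
```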

Thanks,
--Jacob


