Re: [PATCH] json_lex_string: don't overread on bad UTF8 - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: [PATCH] json_lex_string: don't overread on bad UTF8
Msg-id CAOYmi+k_U2zuXhCswwBbo8fH743n4Gzr=hsYH+eyBa+mEOs7Rg@mail.gmail.com
In response to Re: [PATCH] json_lex_string: don't overread on bad UTF8  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Wed, May 1, 2024 at 8:40 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> > About the fact that we may finish by printing unfinished UTF-8
> > sequences, I'd be curious to hear your thoughts.  Now, the information
> > provided about the partial byte sequences can be also useful for
> > debugging on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

> By the way, as long as I have that in mind..  I am not sure that it is
> worth spending cycles in detecting the unfinished sequences and making
> these printable.  Wouldn't it be enough for most cases to adjust
> token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed.

There is the matter of a client that's not using UTF-8, though. Do we
deal with that correctly today? (I understand why it was done the way
it was, at least on the server side, but it's still really weird to
have code that parses "JSON" that isn't actually Unicode.)
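
For illustration only, here is a minimal sketch of the truncation idea
quoted above, assuming the input is meant to be UTF-8. The helper name
utf8_printable_len() is hypothetical and is not part of PostgreSQL or of
token_error(); it only trims a multibyte sequence that is cut off at the
very end of the buffer and leaves every other kind of invalid input
alone.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a PostgreSQL function): return how many leading
 * bytes of s[0..len) can be printed without stopping in the middle of a
 * multibyte UTF-8 sequence.  Only a truncated *trailing* sequence is
 * trimmed; other invalid input is returned unchanged.
 */
static size_t
utf8_printable_len(const char *s, size_t len)
{
	size_t		i = len;
	size_t		need;
	size_t		have;
	unsigned char lead;

	/* Step back over at most three trailing continuation bytes (10xxxxxx). */
	while (i > 0 && len - i < 3 && ((unsigned char) s[i - 1] & 0xC0) == 0x80)
		i--;

	if (i == 0)
		return len;			/* no lead byte found; nothing sensible to trim */

	/* s[i - 1] is the candidate lead byte of the final sequence. */
	lead = (unsigned char) s[i - 1];
	if (lead < 0x80)
		need = 1;			/* ASCII */
	else if ((lead & 0xE0) == 0xC0)
		need = 2;
	else if ((lead & 0xF0) == 0xE0)
		need = 3;
	else if ((lead & 0xF8) == 0xF0)
		need = 4;
	else
		return len;			/* not a valid lead byte; leave input untouched */

	/* Bytes actually present: the lead byte plus what follows it. */
	have = len - i + 1;

	/* Drop the final sequence only if it is shorter than its lead byte claims. */
	return (have < need) ? i - 1 : len;
}
```

A caller building an error message could then print only the first
utf8_printable_len(s, len) bytes of the offending token. This does
nothing for a client whose encoding is not UTF-8, which is the open
question raised above.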

> Another thing that I think would be nice would be to calculate the
> location of what we're parsing on a given line, and provide that in
> the error context.  That would not be backpatchable as it requires a
> change in JsonLexContext, unfortunately, but it would help in making
> more sense with an error if the incomplete byte sequence is at the
> beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.
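
As a rough sketch of what "location on a given line" could mean, the
following computes a 1-based line and byte column from a byte offset
into the input. The LexPosition struct and locate_offset() are
hypothetical names invented for this example; they are not fields or
functions of JsonLexContext, and a real lexer would more likely track
the position incrementally rather than rescanning.

```c
#include <stddef.h>

/* Hypothetical position record, used only for this illustration. */
typedef struct
{
	size_t		line;		/* 1-based line number */
	size_t		column;		/* 1-based byte offset within that line */
} LexPosition;

/*
 * Scan input[0..offset) and report the line and column of the byte at
 * 'offset'.  Columns count bytes, not characters, so they stay meaningful
 * even when the input contains broken multibyte sequences.
 */
static LexPosition
locate_offset(const char *input, size_t offset)
{
	LexPosition pos = {1, 1};

	for (size_t i = 0; i < offset; i++)
	{
		if (input[i] == '\n')
		{
			/* A newline starts the next line at column 1. */
			pos.line++;
			pos.column = 1;
		}
		else
			pos.column++;
	}
	return pos;
}
```

With the input "ab\ncd" and offset 4 (the byte 'd'), this reports line 2,
column 2, which is enough to jump to the broken spot in an editor.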

Thanks,
--Jacob


