Re: [PATCH] json_lex_string: don't overread on bad UTF8 - Mailing list pgsql-hackers

From: Jacob Champion
Subject: Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date: May 2, 2024 23:29:18
Msg-id: CAOYmi+k_U2zuXhCswwBbo8fH743n4Gzr=hsYH+eyBa+mEOs7Rg@mail.gmail.com
In response to: Re: [PATCH] json_lex_string: don't overread on bad UTF8 (Michael Paquier <michael@paquier.xyz>)
List: pgsql-hackers

On Wed, May 1, 2024 at 8:40 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> > About the fact that we may finish by printing unfinished UTF-8
> > sequences, I'd be curious to hear your thoughts.  Now, the information
> > provided about the partial byte sequences can be also useful for
> > debugging on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

> By the way, as long as I have that in mind..  I am not sure that it is
> worth spending cycles in detecting the unfinished sequences and make
> these printable.  Wouldn't it be enough for more cases to adjust
> token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed.
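
To make the truncation idea concrete, here is a minimal sketch of trimming an unfinished UTF-8 sequence from the end of a buffer before it is printed. It is an illustration only: utf8_seq_len() and truncate_incomplete_utf8() are hypothetical names, not functions from the PostgreSQL tree, and the sketch is not how token_error() works today. It also only handles a sequence cut short at the very end of the buffer; invalid bytes elsewhere are left untouched.

```c
#include <stddef.h>

/*
 * Total length of a UTF-8 sequence implied by its lead byte, or 0 if the
 * byte cannot start a sequence (continuation byte or invalid lead).
 */
static size_t
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/*
 * Hypothetical helper: return the number of leading bytes of s[0..len)
 * that can be printed without ending in an unfinished multibyte sequence.
 */
static size_t
truncate_incomplete_utf8(const char *s, size_t len)
{
    size_t      i = len;
    size_t      cont = 0;
    size_t      need;

    /* Walk back over at most three trailing continuation bytes. */
    while (i > 0 && cont < 3 && (((unsigned char) s[i - 1]) & 0xC0) == 0x80)
    {
        i--;
        cont++;
    }

    if (i == 0)
        return len;             /* no lead byte found; leave input as is */

    /* s[i - 1] is the candidate lead byte; cont + 1 bytes are present. */
    need = utf8_seq_len((unsigned char) s[i - 1]);
    if (need > cont + 1)
        return i - 1;           /* sequence was cut short: drop it */

    return len;                 /* complete, or not a truncation case */
}
```

A caller would print only the first truncate_incomplete_utf8(s, len) bytes of the token. The trade-off is the one discussed above: the message stays printable, but the offending bytes are no longer visible to whoever reads it.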

There is the matter of a client that's not using UTF-8, though. Do we
deal with that correctly today? (I understand why it was done the way
it was, at least on the server side, but it's still really weird to
have code that parses "JSON" that isn't actually Unicode.)

> Another thing that I think would be nice would be to calculate the
> location of what we're parsing on a given line, and provide that in
> the error context.  That would not be backpatchable as it requires a
> change in JsonLexContext, unfortunately, but it would help in making
> more sense with an error if the incomplete byte sequence is at the
> beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.
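
As a rough sketch of what computing that location could look like, the following derives a line number and an in-line position from a byte offset into the input. ErrorLocation and locate_offset() are hypothetical names for illustration, not part of JsonLexContext or any existing API. The column is counted in bytes rather than characters, on the assumption that character counts are not reliable once the input contains a bad sequence.

```c
#include <stddef.h>

/* Hypothetical result type: both fields are 1-based. */
typedef struct
{
    size_t      line;           /* line number of the offset */
    size_t      column;         /* byte position within that line */
} ErrorLocation;

/*
 * Hypothetical helper: map a byte offset into input[0..len) to a line and
 * column by scanning for newlines. Offsets past the end are clamped.
 */
static ErrorLocation
locate_offset(const char *input, size_t len, size_t offset)
{
    ErrorLocation loc = {1, 1};
    size_t      i;

    if (offset > len)
        offset = len;

    for (i = 0; i < offset; i++)
    {
        if (input[i] == '\n')
        {
            loc.line++;
            loc.column = 1;     /* next byte starts a new line */
        }
        else
            loc.column++;
    }

    return loc;
}
```

Rescanning from the start costs time linear in the offset, which seems acceptable on an error path; a lexer that already tracks the start of the current line could compute the column with a single pointer subtraction instead.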

Thanks,
--Jacob
