Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Subject: Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date:
Msg-id: ZjHcbcFcZ7pku2VH@paquier.xyz
In response to: [PATCH] json_lex_string: don't overread on bad UTF8 (Jacob Champion <jacob.champion@enterprisedb.com>)
Responses: Re: [PATCH] json_lex_string: don't overread on bad UTF8
List: pgsql-hackers
On Tue, Apr 30, 2024 at 10:39:04AM -0700, Jacob Champion wrote:
> When json_lex_string() hits certain types of invalid input, it calls
> pg_encoding_mblen_bounded(), which assumes that its input is
> null-terminated and calls strnlen(). But the JSON lexer is constructed
> with an explicit string length, and we don't ensure that the string is
> null-terminated in all cases, so we can walk off the end of the
> buffer. This isn't really relevant on the server side, where you'd
> have to get a superuser to help you break string encodings, but for
> client-side usage on untrusted input (such as my OAuth patch) it would
> be more important.

I'm not sure I like the fact that this advances token_terminator
first.  Wouldn't it be better to calculate pg_encoding_mblen() first,
then save token_terminator?  I feel a bit uneasy about saving a value
in token_terminator that points past the end of the string.  That's a
nit in this context, still..
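
Something like this, perhaps, combining the end-of-string clamp from
your draft with the mblen-first ordering (variable names s, end and
lex mirror json_lex_string(); the exact shape in the patch may
differ):

#define FAIL_AT_CHAR_END(code) \
	do { \
		ptrdiff_t	remaining = end - s; \
		int			charlen = pg_encoding_mblen(lex->input_encoding, s); \
		/* never store a pointer past the end of the buffer */ \
		lex->token_terminator = (charlen <= remaining) ? s + charlen : end; \
		return code; \
	} while (0)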

> Attached is a draft patch that explicitly checks against the
> end-of-string pointer and clamps the token_terminator to it. Note that
> this removes the only caller of pg_encoding_mblen_bounded() and I'm
> not sure what we should do with that function. It seems like a
> reasonable API, just not here.

Hmm.  Keeping it around as currently designed means that it could
cause more harm than good in the long term, even in the stable
branches if new code uses it.  There is a risk of new code
incorrectly using it again, even if its top comment says that we rely
on the string being nul-terminated.  A safer alternative would
perhaps be to redesign it so that the length of the string is
provided as input, removing the dependency on strnlen() in its
internals.  Anyway, without any callers of it, I'd be tempted to wipe
it from HEAD and call it a day.
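
To be clear, the kind of redesign I have in mind would look something
like the following, where the caller provides the number of bytes
remaining in the buffer rather than relying on a NUL terminator (the
name and signature here are only illustrative):

int
pg_encoding_mblen_bounded_n(int encoding, const char *mbstr, size_t remaining)
{
	int			mblen = pg_encoding_mblen(encoding, mbstr);

	/* clamp to the caller-supplied length instead of calling strnlen() */
	if ((size_t) mblen <= remaining)
		return mblen;
	return (int) remaining;
}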

> The new test needs to record two versions of the error message, one
> for invalid token and one for invalid escape sequence. This is
> because, for smaller chunk sizes, the partial-token logic in the
> incremental JSON parser skips the affected code entirely when it can't
> find an ending double-quote.

Ah, that makes sense.  That looks OK here.  A comment around the test
would be appropriate to document that, I guess.

> Tangentially: Should we maybe rethink pieces of the json_lex_string
> error handling? For example, do we really want to echo an incomplete
> multibyte sequence once we know it's bad? It also looks like there are
> places where the FAIL_AT_CHAR_END macro is called after the `s`
> pointer has already advanced past the code point of interest. I'm not
> sure if that's intentional.

Advancing the tracking pointer 's' before reporting an error related
to the end of the string is a bad practice, IMO, and we should avoid
that.  json_lex_string() does not offer a warm feeling in that regard
with escape characters, at least :/
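
In other words, report the error while 's' still points at the byte
that failed, and advance only afterwards.  Roughly (is_valid_escape()
is a made-up helper, just to show the ordering):

	if (!is_valid_escape(s))
		FAIL_AT_CHAR_END(JSON_ESCAPING_INVALID);	/* 's' not advanced yet */
	s++;		/* move past the byte only once it is known good */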
--
Michael

