On Fri, Feb 13, 2026 at 09:27:02AM -0800, Noah Misch wrote:
> On Fri, Feb 13, 2026 at 07:46:22AM +0000, PG Bug reporting form wrote:
> > After upgrading from PostgreSQL 15.15 to 15.16, substring(text) raises:
> > >ERROR: invalid byte sequence for encoding "UTF8": 0xe6 0x97
> > on valid UTF-8 text stored in a TOAST-compressed column.
>
> > user=> select substring(data from 1 for 1) from toast_repro;
> > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xe6 0x97
>
> Thanks for the report. That is a bug and a regression; I regret missing it
> during review. The substring operation works by taking a 4-byte slice from
> the toasted value (4 bytes being the max length of a UTF8 char in PostgreSQL),
> then finding the actual first character within those bytes. However, it
> incorrectly requires those four bytes to be a valid UTF8 string. I'll start
> on a fix.

Attached. I may add some more tests, e.g. a toasted invalid string where the
detoasted length is less than the slice we request. This version is viable,
however.

I audited the other pg_mbstrlen_with_len() callers, and I think they're all okay with
an error if the input has an incomplete char. Hence, those don't need changes
beyond what we've already released. Most pass either parser input or an
existing datum with its len. text_position_get_match_pos() is the most subtle
caller, and I think it's fine.

I audited other uses of slice detoast. The only other one is bytea substring,
which is obviously indifferent to character encoding.