On Wed, Nov 20, 2024 at 11:29:56AM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> Upthread, you mentioned that we could bypass multiple lookups unless both
>> the NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII. But if
>> there are encodings with the high bit set that don't require multiple bytes
>> per character, then how can we do that?
>
> Well, we don't know the length of the hypothetically-truncated
> character, but if there was one then all its bytes must have had their
> high bits set. Suppose that the untruncated name has a 4-byte
> multibyte character extending from the NAMEDATALEN-3 byte through the
> NAMEDATALEN'th byte (counting in origin zero here):
>
> [...]
>
> Now as for the shortcut cases: if C3 does not have the high bit set,
> it cannot be part of a multibyte character. Therefore the original
> encoding-aware truncation would have removed C3 and following bytes,
> but no more. The character immediately before might have been one
> byte or several, but it doesn't matter. Similarly, if C2 does not
> have the high bit set, it cannot be part of a multibyte character.
> The original truncation would have removed C3 and following bytes,
> but no more.

Oh, I think I had an off-by-one error in my mental model and was thinking
of the NAMEDATALEN-1'th byte as the last possible byte in the identifier
(i.e., name[NAMEDATALEN - 2]), whereas you meant the location where the
trailing zero would go for the largest possible all-ASCII identifier (i.e.,
name[NAMEDATALEN - 1]). Thank you for elaborating.
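
To make the indexing concrete, here is a minimal sketch of the shortcut test
as I now understand it. It is not code from any patch. The helper names
(high_bit_set, single_lookup_suffices) are made up for illustration, and
NAMEDATALEN is assumed to be the default 64. It also relies on the property
stated above: every byte of a multibyte character has its high bit set.

```c
#include <stdbool.h>
#include <stddef.h>

#define NAMEDATALEN 64			/* assumed: the default build-time value */

/* Hypothetical helper: true if the byte cannot be plain ASCII. */
static bool
high_bit_set(char c)
{
	return ((unsigned char) c & 0x80) != 0;
}

/*
 * Hypothetical helper: given the untruncated name and its length in bytes,
 * report whether encoding-aware truncation must have cut the name at exactly
 * NAMEDATALEN - 1 bytes, so that only one lookup is needed.
 *
 * name[NAMEDATALEN - 1] is where the trailing zero goes for the longest
 * all-ASCII identifier ("C3" in the example above), and
 * name[NAMEDATALEN - 2] is the last byte that can be kept ("C2").
 */
static bool
single_lookup_suffices(const char *name, size_t len)
{
	/* Short enough to fit as-is: nothing was truncated. */
	if (len < NAMEDATALEN)
		return true;

	/*
	 * If either byte is ASCII, no multibyte character can straddle the
	 * cut point, so truncation removed name[NAMEDATALEN - 1] and the
	 * following bytes, but no more.
	 */
	return !high_bit_set(name[NAMEDATALEN - 1]) ||
		!high_bit_set(name[NAMEDATALEN - 2]);
}
```

Only when both bytes have the high bit set could the original truncation
have stopped earlier than NAMEDATALEN - 1, and that is the case where more
than one candidate length would need to be tried.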
--
nathan