On Tue, Nov 19, 2024 at 11:23:13PM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I'm admittedly not an expert in the multi-byte code, but since there are
>> encodings like LATIN1 that use a byte per character, don't we need to do
>> multiple lookups any time the NAMEDATALEN-1'th byte is non-ASCII?
>
> I don't think so, but maybe I'm missing something. An important
> property of backend-legal encodings is that all bytes of a multibyte
> character have their high bits set. Thus if the NAMEDATALEN-2'th
> byte does not have that, it is not part of a multibyte character.
> That's also the reason we can stop if we reach a high-bit-clear
> byte while backing up to earlier bytes.

That's good to know.  If we can assume that 1) all bytes of a multibyte
character have the high bit set and 2) all multibyte characters actually
require multiple bytes, then there are just a handful of cases that require
multiple lookups, and we can restrict even those to some extent, too.
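
To make that concrete, here is a rough sketch of how the candidate
truncation lengths could be enumerated using only the high-bit rule.  It is
not taken from any patch: SKETCH_NAMEDATALEN, high_bit_set() and
candidate_lengths() are names I made up for illustration, and the early
exit when the first dropped byte is ASCII is my own reading of "restrict
even those".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's NAMEDATALEN (64 by default). */
#define SKETCH_NAMEDATALEN 64

/* True if the byte has its high bit set, i.e. it is not plain ASCII. */
static int
high_bit_set(unsigned char b)
{
	return (b & 0x80) != 0;
}

/*
 * Fill lens[] with the byte lengths worth looking up for "name", longest
 * first, and return how many were stored (at most max_lens).
 *
 * Assumption: every byte of a multibyte character has its high bit set,
 * so a high-bit-clear byte always marks a character boundary.
 */
static size_t
candidate_lengths(const char *name, size_t *lens, size_t max_lens)
{
	size_t		len = strlen(name);
	size_t		n = 0;

	if (max_lens == 0)
		return 0;

	/* Names that already fit need exactly one lookup. */
	if (len < SKETCH_NAMEDATALEN)
	{
		lens[n++] = len;
		return n;
	}

	/* The plain byte-wise truncation is always a candidate. */
	len = SKETCH_NAMEDATALEN - 1;
	lens[n++] = len;

	/*
	 * If the first dropped byte is ASCII, the cut cannot have split a
	 * multibyte character, so no further lookups are needed.
	 */
	if (!high_bit_set((unsigned char) name[len]))
		return n;

	/* Back up over high-bit bytes; stop at the first high-bit-clear one. */
	while (len > 0 && n < max_lens &&
		   high_bit_set((unsigned char) name[len - 1]))
		lens[n++] = --len;

	return n;
}
```

With a 65-byte name made of 61 ASCII bytes followed by two 2-byte
characters, this yields lengths 63, 62 and 61.  An all-ASCII name of any
length yields just 63.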
--
nathan