Nathan Bossart <nathandbossart@gmail.com> writes:
> Upthread, you mentioned that we could bypass multiple lookups unless both
> the NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII. But if
> there are encodings with the high bit set that don't require multiple bytes
> per character, then how can we do that?
Well, we don't know the length of the hypothetically-truncated
character, but if there was one then all its bytes must have had their
high bits set. Suppose that the untruncated name has a 4-byte
multibyte character extending from the NAMEDATALEN-3'th byte through the
NAMEDATALEN'th byte (counting in origin zero here):
    ...61 irrelevant bytes... C1 C2 C3 C4 ...
The original CREATE DATABASE would have removed that whole character
and stored a name of length NAMEDATALEN-3:
    ...61 irrelevant bytes...
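As a rough illustration of that encoding-aware truncation, here is a
minimal Python sketch that assumes the name is UTF-8. The helper name
clip_utf8 and the hard-coded NAMEDATALEN of 64 are assumptions made for
the example; this is not the server's actual code.

```python
NAMEDATALEN = 64  # assumed default; stored names hold at most NAMEDATALEN-1 bytes


def clip_utf8(raw: bytes, limit: int = NAMEDATALEN - 1) -> bytes:
    """Truncate UTF-8 bytes to at most `limit` bytes without splitting a character."""
    if len(raw) <= limit:
        return raw
    n = limit
    # raw[n] is the first byte that would be cut off.  If it is a UTF-8
    # continuation byte (10xxxxxx), cutting here would split a character,
    # so back up until the cut lands on a character boundary.
    while n > 0 and (raw[n] & 0xC0) == 0x80:
        n -= 1
    return raw[:n]


# 61 ASCII bytes followed by one 4-byte character (C1 C2 C3 C4).
name = b"a" * 61 + "\U0001F600".encode("utf-8")
stored = clip_utf8(name)
print(len(stored))  # the whole 4-byte character is dropped
```

With this input the stored name is the 61 leading bytes, matching the
NAMEDATALEN-3 length described above.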
In the connection attempt, when we receive the untruncated name, we'll
first try to truncate it to NAMEDATALEN-1 bytes:
    ...61 irrelevant bytes... C1 C2
We'll look that up and not find it. At this point we remember that
C3, the first byte we cut off, had the high bit set, and we note that
C2 does too, so we try
    ...61 irrelevant bytes... C1
That still doesn't work, but C1 also has the high bit set,
so we try
    ...61 irrelevant bytes...
and find the match.
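The retry sequence above could be sketched roughly as follows, in
Python for brevity. The names candidate_names and find_stored_name, the
set standing in for the catalog lookup, and the 4-byte upper bound on
character length are all assumptions of this sketch, not existing code.

```python
NAMEDATALEN = 64   # assumed default
MAX_CHAR_LEN = 4   # assumed upper bound on bytes per character


def candidate_names(raw: bytes):
    """Yield truncations of `raw` to try, longest first, without knowing the encoding."""
    limit = NAMEDATALEN - 1
    if len(raw) <= limit:
        yield raw
        return
    yield raw[:limit]
    # raw[limit] is the first byte cut off (C3 in the example).  Only if it
    # has the high bit set could a multibyte character have been split.
    if not raw[limit] & 0x80:
        return
    n = limit
    # Keep dropping trailing bytes while they have the high bit set, but
    # never more than one maximum-length character would account for.
    while n > limit - (MAX_CHAR_LEN - 1) and raw[n - 1] & 0x80:
        n -= 1
        yield raw[:n]


def find_stored_name(raw: bytes, stored_names: set):
    """Return the first candidate present in `stored_names`, or None."""
    for cand in candidate_names(raw):
        if cand in stored_names:
            return cand
    return None


# The example from the text: 61 bytes, then C1 C2 C3 C4, all with the high bit set.
raw = b"x" * 61 + bytes([0xF0, 0x9F, 0x98, 0x80])
print([len(c) for c in candidate_names(raw)])  # tries lengths 63, 62, 61
print(find_stored_name(raw, {b"x" * 61}) == b"x" * 61)
```

The loop stops as soon as it reaches a byte with the high bit clear,
which is what keeps the number of lookups small in the common case.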
Now as for the shortcut cases: if C3 does not have the high bit set,
it cannot be part of a multibyte character. Therefore the original
encoding-aware truncation would have removed C3 and following bytes,
but no more. The character immediately before might have been one
byte or several, but it doesn't matter. Similarly, if C2 does not
have the high bit set, it cannot be part of a multibyte character.
The original truncation would have removed C3 and following bytes,
but no more.
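Put as a test, the shortcut amounts to checking two bytes. A minimal
Python sketch, where may_need_extra_lookups is a hypothetical helper
name and NAMEDATALEN is assumed to be 64:

```python
NAMEDATALEN = 64  # assumed default


def may_need_extra_lookups(raw: bytes) -> bool:
    """True only if a lookup beyond the plain NAMEDATALEN-1 truncation could be needed."""
    if len(raw) < NAMEDATALEN:
        return False  # nothing gets truncated at all
    c2 = raw[NAMEDATALEN - 2]  # last byte kept by the plain truncation
    c3 = raw[NAMEDATALEN - 1]  # first byte removed by the plain truncation
    # If either byte is ASCII, the cut cannot have landed inside a
    # multibyte character that also covers the kept bytes.
    return bool(c2 & 0x80) and bool(c3 & 0x80)


e_acute = "\u00e9".encode("utf-8")  # two bytes, both with the high bit set
print(may_need_extra_lookups(b"a" * 64))            # C2 and C3 both ASCII
print(may_need_extra_lookups(b"a" * 63 + e_acute))  # C2 ASCII, C3 high
print(may_need_extra_lookups(b"a" * 62 + e_acute))  # C2 and C3 both high
```

Only the last input reports that more than one lookup might be needed.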
Another way to think about this is that without knowledge of the
encoding, we don't know whether a run of several high-bit-set
bytes represents one character or several. But all the encodings
we support are ASCII extensions, meaning that any high-bit-clear
byte represents an ASCII character and is not part of a multibyte
character. So it would have gotten truncated or not independently
of what's around it.
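The ASCII-extension property can be checked for a given encoding with a
few lines of Python. This is only a spot check on sample text, not a
proof; only_high_bytes_in_multibyte is a made-up helper name, and the
codec names are Python's, not PostgreSQL's. Shift_JIS, which PostgreSQL
accepts only as a client encoding, is included as a contrast.

```python
def only_high_bytes_in_multibyte(text: str, codec: str) -> bool:
    """True if every non-ASCII character in `text` encodes to high-bit-set bytes only."""
    for ch in text:
        if ord(ch) < 0x80:
            continue  # plain ASCII character, nothing to check
        if any(b < 0x80 for b in ch.encode(codec)):
            return False  # an ASCII-range byte appears inside this character
    return True


print(only_high_bytes_in_multibyte("caf\u00e9", "utf-8"))
print(only_high_bytes_in_multibyte("caf\u00e9", "latin-1"))
print(only_high_bytes_in_multibyte("\u65e5\u672c\u8a9e", "euc_jp"))
# U+30BD encodes as 0x83 0x5C in Shift_JIS; 0x5C is an ASCII-range byte.
print(only_high_bytes_in_multibyte("\u30bd", "shift_jis"))
```

The first three checks pass and the Shift_JIS one fails, which is the
distinction the argument relies on.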
regards, tom lane