On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
> On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> > On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
> >> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> >> > I don't understand this logic. Why are two bytes important? If we knew
> >> > it was UTF8 we could check for non-first bytes always starting with
> >> > bits 10, but we can't know that.
> >>
> >> I think this is because this is a reliable way to detect if the truncation happened
> >> in the middle of a character, without needing to know the specifics of the encoding.
> >>
> >> My understanding is that the key insight is that in any multibyte encoding, all
> >> bytes within a multibyte character will have their high bits set.
> >>
> >> That's just my understanding from the code and Tom's previous explanations: I
> >> might be wrong as not an expert in this area.
> >
> > But the logic doesn't make sense. Why would two bytes be any different
> > than one?
>
> Tom provided a concise explanation upthread [0]. My understanding is the
> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
> where we know that we couldn't possibly have truncated in the middle of a
> multi-byte character. This allows us to avoid doing multiple pg_database
> lookups.

Where does Tom mention anything about checking two bytes? He is
basically saying remove all trailing high-bit bytes until you get a
match, because once you get a match, you have found the point of
valid truncation for the encoding. In fact, here, he specifically talks
about MAX_MULTIBYTE_CHAR_LEN-1:

    https://www.postgresql.org/message-id/3796535.1732044807%40sss.pgh.pa.us
This text:

     * If the original name is too long and we see two consecutive bytes
     * with their high bits set at the truncation point, we might have
     * truncated in the middle of a multibyte character. In multibyte
     * encodings, every byte of a multibyte character has its high bit
     * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
     * NAMEDATALEN-2, we know we're in the middle of a multibyte
     * character. We need to try truncating one more byte back to find the
     * start of the next character.

needs to be fixed, at a minimum. Specifically, this sentence: "So if
IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know
we're in the middle of a multibyte character."
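
To illustrate the objection, here is a minimal sketch, not the patch's
code. The helper name high_bits_around_cut and its "cut" parameter are
hypothetical, and I am assuming the check looks at the last kept byte and
the first dropped byte. UTF-8 is used only as a convenient example
encoding. The test is true when the cut splits a character, but it is
also true when the cut falls exactly between two complete multibyte
characters, so it can only say "might", never "know".

```c
#include <stdbool.h>
#include <stddef.h>

/* Same definition as PostgreSQL's macro, repeated so this compiles alone. */
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical helper: true if the byte just before the cut and the byte
 * just after it both have their high bit set.  "cut" is the number of
 * bytes kept and must satisfy 1 <= cut < length of s.
 */
static bool
high_bits_around_cut(const char *s, size_t cut)
{
    return IS_HIGHBIT_SET(s[cut - 1]) && IS_HIGHBIT_SET(s[cut]);
}
```

For the UTF-8 bytes C3 A9 C3 A9 (two copies of a 2-byte character), the
helper returns true for cut = 1, which splits the first character, and
also for cut = 2, which is a clean character boundary.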
> > I assumed you would just remove all trailing high-bit bytes
> > and stop at the first non-high-bit byte.
>
> I think this risks truncating more than one multi-byte character, which
> would cause the login path to truncate differently than the CREATE/ALTER
> DATABASE path (which is encoding-aware).

True, we can stop at MAX_MULTIBYTE_CHAR_LEN-1, and know there is no match.
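
A rough sketch of that bounded search, under stated assumptions:
find_truncated_name, name_exists_fn and name_equals are hypothetical
names, and the callback merely stands in for a pg_database lookup. The
constants are copied from the PostgreSQL defaults so the fragment
compiles on its own. It tries the byte-wise truncation first, then
strips at most MAX_MULTIBYTE_CHAR_LEN-1 trailing high-bit bytes, one at
a time, and gives up as soon as the byte it would strip is plain ASCII.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NAMEDATALEN 64
#define MAX_MULTIBYTE_CHAR_LEN 4
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/* Hypothetical stand-in for "does a database with this name exist?" */
typedef bool (*name_exists_fn) (const char *name, void *arg);

/* Toy lookup for demonstration: matches exactly one stored name. */
static bool
name_equals(const char *name, void *arg)
{
    return strcmp(name, (const char *) arg) == 0;
}

/*
 * Try candidate truncations of "orig" and report whether any of them
 * exists.  On success, "out" holds the matching name.
 */
static bool
find_truncated_name(const char *orig, name_exists_fn exists, void *arg,
                    char out[NAMEDATALEN])
{
    size_t      len = strlen(orig);

    assert(exists != NULL);

    if (len < NAMEDATALEN)
    {
        /* Name was never truncated, so a single lookup is enough. */
        memcpy(out, orig, len + 1);
        return exists(out, arg);
    }

    len = NAMEDATALEN - 1;      /* plain byte-wise truncation */
    for (int stripped = 0;; stripped++)
    {
        memcpy(out, orig, len);
        out[len] = '\0';
        if (exists(out, arg))
            return true;

        /* A character is at most MAX_MULTIBYTE_CHAR_LEN bytes long. */
        if (stripped == MAX_MULTIBYTE_CHAR_LEN - 1)
            break;
        /* An ASCII byte cannot be part of a multibyte character. */
        if (!IS_HIGHBIT_SET(orig[len - 1]))
            break;
        len--;
    }
    return false;
}
```

In the worst case this costs MAX_MULTIBYTE_CHAR_LEN lookups for an
over-length name, and an all-ASCII name still costs only one.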
> * Try to do multibyte-aware truncation (the patch at hand).

Yes, I am fine with that, but we need to do more than the patch does to
accomplish this, unless I am totally confused.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"