Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails - Mailing list pgsql-bugs

From Bruce Momjian
Subject Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
Date
Msg-id Zz9jfOkVmlYcYHSy@momjian.us
In response to Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
List pgsql-bugs
On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
> On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> > On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
> >> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> >> > I don't understand this logic.  Why are two bytes important?  If we knew
> >> > it was UTF8 we could check for non-first bytes always starting with
> >> > bits 10, but we can't know that.
> >> 
> >> I think this is because this is a reliable way to detect if the truncation happened
> >> in the middle of a character, without needing to know the specifics of the encoding.
> >> 
> >> My understanding is that the key insight is that in any multibyte encoding, all
> >> bytes within a multibyte character will have their high bits set.
> >> 
> >> That's just my understanding from the code and Tom's previous explanations:  I
> >> might be wrong as not an expert in this area.
> > 
> > But the logic doesn't make sense.  Why would two bytes be any different
> > than one?
> 
> Tom provided a concise explanation upthread [0].  My understanding is the
> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
> where we know that we couldn't possibly have truncated in the middle of a
> multi-byte character.  This allows us to avoid doing multiple pg_database
> lookups.

Where does Tom mention anything about checking two bytes?  He is
basically saying remove all trailing high-bit characters until you get a
match, because once you get a match, you have found the point of
valid truncation for the encoding.  In fact, here, he specifically talks
about MAX_MULTIBYTE_CHAR_LEN-1:

    https://www.postgresql.org/message-id/3796535.1732044807%40sss.pgh.pa.us

This text:

               * If the original name is too long and we see two consecutive bytes
               * with their high bits set at the truncation point, we might have
               * truncated in the middle of a multibyte character. In multibyte
               * encodings, every byte of a multibyte character has its high bit
               * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
               * NAMEDATALEN-2, we know we're in the middle of a multibyte
               * character. We need to try truncating one more byte back to find the
               * start of the next character.

needs to be fixed at a minimum, specifically this sentence: "So if
IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know
we're in the middle of a multibyte character."
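
To make the objection concrete, here is a minimal sketch, not the patch
itself.  It assumes NAMEDATALEN is 64, that the check is applied to the
untruncated name, and that the encoding is UTF8 for the second function.
The names straddles_highbit and utf8_cut_is_mid_char are hypothetical,
and IS_HIGHBIT_SET is a local stand-in for the macro the comment names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed value for illustration. */
#define NAMEDATALEN 64
/* Local stand-in for the macro named in the quoted comment. */
#define IS_HIGHBIT_SET(ch) ((((unsigned char) (ch)) & 0x80) != 0)

/*
 * Hypothetical helper: the condition described in the quoted comment.
 * name[NAMEDATALEN-2] is the last byte kept by truncation and
 * name[NAMEDATALEN-1] is the first byte dropped.
 */
bool
straddles_highbit(const char *name)
{
	assert(name != NULL);
	if (strlen(name) < NAMEDATALEN)
		return false;			/* nothing gets truncated */
	return IS_HIGHBIT_SET(name[NAMEDATALEN - 1]) &&
		IS_HIGHBIT_SET(name[NAMEDATALEN - 2]);
}

/*
 * Hypothetical helper, UTF8 only: the cut is inside a character exactly
 * when the first dropped byte is a continuation byte (bits 10xxxxxx).
 */
bool
utf8_cut_is_mid_char(const char *name)
{
	assert(name != NULL);
	if (strlen(name) < NAMEDATALEN)
		return false;
	return (((unsigned char) name[NAMEDATALEN - 1]) & 0xC0) == 0x80;
}
```

With a two-byte character occupying bytes 61-62 and another occupying
bytes 63-64, straddles_highbit() is true while utf8_cut_is_mid_char() is
false, because the cut falls exactly between the two characters.  So
under these assumptions the two-byte test can only say "might be in the
middle", never "we know".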

> > I assumed you would just remove all trailing high-bit bytes
> > and stop at the first non-high-bit byte.
> 
> I think this risks truncating more than one multi-byte character, which
> would cause the login path to truncate differently than the CREATE/ALTER
> DATABASE path (which is encoding-aware).

True, but we can stop after removing MAX_MULTIBYTE_CHAR_LEN-1 bytes, and
know at that point that there is no match.
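
As a rough sketch of one reading of that idea, and not the patch or any
existing PostgreSQL function: truncate byte-wise, try a lookup, and on
failure drop one trailing high-bit byte at a time, giving up after
MAX_MULTIBYTE_CHAR_LEN-1 bytes.  The values 64 and 4 are assumed here,
and find_truncated_name, name_exists_fn, fake_catalog_name, and
fake_exists are hypothetical names, with the callback standing in for a
pg_database lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed values for illustration. */
#define NAMEDATALEN 64
#define MAX_MULTIBYTE_CHAR_LEN 4
/* Local stand-in for the macro named in the quoted comment. */
#define IS_HIGHBIT_SET(ch) ((((unsigned char) (ch)) & 0x80) != 0)

/* Hypothetical lookup callback: true if a database has exactly this name. */
typedef bool (*name_exists_fn) (const char *name);

/* Hypothetical one-entry "catalog" so the sketch can be exercised. */
char		fake_catalog_name[NAMEDATALEN];

bool
fake_exists(const char *name)
{
	return strcmp(name, fake_catalog_name) == 0;
}

/*
 * Truncate in_name to at most NAMEDATALEN-1 bytes into out and look it up.
 * If the name had to be truncated and the lookup fails, drop trailing
 * high-bit bytes one at a time, at most MAX_MULTIBYTE_CHAR_LEN-1 of them,
 * retrying the lookup after each.  Returns true on the first match.
 */
bool
find_truncated_name(const char *in_name, name_exists_fn exists,
					char out[NAMEDATALEN])
{
	size_t		in_len;
	size_t		len;
	bool		was_truncated;
	int			dropped;

	assert(in_name != NULL && exists != NULL && out != NULL);

	in_len = strlen(in_name);
	was_truncated = (in_len > NAMEDATALEN - 1);
	len = was_truncated ? NAMEDATALEN - 1 : in_len;
	memcpy(out, in_name, len);
	out[len] = '\0';

	for (dropped = 0;; dropped++)
	{
		if (exists(out))
			return true;

		/* Never shorten a name that fit, and never cross an ASCII byte. */
		if (!was_truncated ||
			dropped == MAX_MULTIBYTE_CHAR_LEN - 1 ||
			len == 0 ||
			!IS_HIGHBIT_SET(out[len - 1]))
			return false;

		out[--len] = '\0';
	}
}
```

This costs at most MAX_MULTIBYTE_CHAR_LEN lookups in the worst case, and
because it stops at the first match it does not keep stripping past a
name that already exists.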

> * Try to do multibyte-aware truncation (the patch at hand).

Yes, I am fine with that, but we need to do more than the patch does to
accomplish this, unless I am totally confused.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"


