Thread: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18711
Logged by:          Adam Rauch
Email address:      adam@labkey.com
PostgreSQL version: 17.1
Operating system:   Windows 11
Description:

For many years, our test infrastructure has used database names that are
longer than the stated maximum length of 63 characters. The PostgreSQL
server simply truncates these names to 63 characters in all operations and
everything works fine. Starting with 17.x, our application is able to
CREATE and ALTER databases with long names, but all connection attempts
using them fail with an error:

    database "<very long name>" does not exist

I fully recognize that these names are not legal, the previous truncation
behavior was not documented, and there are obvious ways to work around this
"issue." But given the long-running truncation behavior, the now
inconsistent behavior (CREATE and ALTER happily accept long database names,
but connections fail), and the lack of any mention in the release notes, I
think this may be an unintentional change worth flagging.

I happen to be connecting via the (latest) JDBC driver. I can provide the
simple Java code that passes on PostgreSQL 16 but fails on PostgreSQL 17,
if needed.

Thanks!
Adam
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
PG Bug reporting form <noreply@postgresql.org> writes:
> For many years, our test infrastructure has used database names that are
> longer than the stated maximum length of 63 characters. The PostgreSQL
> server simply truncates these names to 63 characters in all operations and
> everything works fine. Starting with 17.x, our application is able to CREATE
> and ALTER databases with long names, but all connection attempts using them
> fail with an error: database "<very long name>" does not exist

Yeah, this was an intentional change in v17:

commit 562bee0fc13dc95710b8db6a48edad2f3d052f2e
Author: Nathan Bossart <nathan@postgresql.org>
Date:   Mon Jul 3 13:18:05 2023 -0700

    Don't truncate database and user names in startup packets.

    Unlike commands such as CREATE DATABASE, ProcessStartupPacket()
    does not perform multibyte-aware truncation of overlength names.
    This means that connection attempts might fail even if the user
    provides the same overlength names that were used in CREATE
    DATABASE, CREATE ROLE, etc.  Ideally, we'd do the same
    multibyte-aware truncation in both code paths, but it doesn't seem
    worth the added complexity of trying to discover the encoding of
    the names.  Instead, let's simply skip truncating the names in the
    startup packet and let the user/database lookup fail later on.
    With this change, users must provide the exact names stored in the
    catalogs, even if the names were truncated.

    This reverts commit d18c1d1f51.

    Author: Bertrand Drouvot
    Reviewed-by: Kyotaro Horiguchi, Tom Lane
    Discussion: https://postgr.es/m/07436793-1426-29b2-f924-db7422a05fb7%40gmail.com

As said, the difficulty is that we don't know what encoding the incoming
name is meant to be in, and with multibyte encodings that matters.  The
name actually stored in the catalog might be less than 63 bytes long if it
was truncated in a multibyte-aware way, so that the former behavior of
blindly truncating at 63 bytes can still yield unexpected no-such-database
results.

I can imagine still performing the truncation if the incoming name is
all-ASCII, but that seems like a hack.  The world isn't nearly as
ASCII-centric as it was in 2001.

			regards, tom lane
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Sun, Nov 17, 2024 at 01:00:14PM -0500, Tom Lane wrote:
> As said, the difficulty is that we don't know what encoding the
> incoming name is meant to be in, and with multibyte encodings that
> matters.  The name actually stored in the catalog might be less
> than 63 bytes long if it was truncated in a multibyte-aware way,
> so that the former behavior of blindly truncating at 63 bytes
> can still yield unexpected no-such-database results.
>
> I can imagine still performing the truncation if the incoming
> name is all-ASCII, but that seems like a hack.  The world isn't
> nearly as ASCII-centric as it was in 2001.

I wonder if we should consider removing the identifier truncation
altogether.  Granted, it mostly works (or at least did before v17), but
I'm not sure we'd make the same decision today if we were starting from
scratch.  IMHO it'd be better to ERROR so that users are forced to produce
legal identifiers.  That being said, I realize this behavior has been
present for over a quarter century now [0] [1] [2], and folks are unlikely
to be happy with even more breakage.

[0] https://postgr.es/c/d15c37c
[1] https://postgr.es/c/0672a3c
[2] https://postgr.es/c/49581f9

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Sun, Nov 17, 2024 at 01:00:14PM -0500, Tom Lane wrote:
>> As said, the difficulty is that we don't know what encoding the
>> incoming name is meant to be in, and with multibyte encodings that
>> matters.  The name actually stored in the catalog might be less
>> than 63 bytes long if it was truncated in a multibyte-aware way,
>> so that the former behavior of blindly truncating at 63 bytes
>> can still yield unexpected no-such-database results.

> I wonder if we should consider removing the identifier truncation
> altogether.  Granted, it mostly works (or at least did before v17), but I'm
> not sure we'd make the same decision today if we were starting from
> scratch.  IMHO it'd be better to ERROR so that users are forced to produce
> legal identifiers.  That being said, I realize this behavior has been
> present for over a quarter century now [0] [1] [2], and folks are unlikely
> to be happy with even more breakage.

Yeah, I think removing it now is a non-starter.

I did think of a way that we could approximate encoding-correct truncation
here, relying on the fact that what's in pg_database is encoding-correct
according to somebody:

1. If NAMEDATALEN-1'th byte is ASCII (high bit clear), just truncate
there and look up as usual.

2. If it's non-ASCII, truncate there and try to look up.  On success,
we're good.  On failure, if the next-to-last byte is non-ASCII, truncate
that too and try to look up.  Repeat a maximum of
MAX_MULTIBYTE_CHAR_LEN-1 times before failing.

I think this works unconditionally so long as all entries in
pg_database.datname are in the same encoding.  If there's a mixture of
encodings (which we don't forbid) then in principle you could probably
select a database other than the one the client thought it was asking for.
But that seems mighty improbable, and the answer can always be "so connect
using the name as it appears in the catalog".

It's ugly of course.  But considering that we got a complaint so quickly
after v17 release, I'm not sure we can just camp on 562bee0fc as being an
acceptable answer.

			regards, tom lane
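To make the proposal concrete, here is a minimal sketch of the retry lookup
outlined in the message above, written against my reading of the backend
headers (NAMEDATALEN, MAX_MULTIBYTE_CHAR_LEN, IS_HIGHBIT_SET).  The
name_exists() probe is a hypothetical stand-in for the real indexed catalog
lookup (GetDatabaseTuple() for databases), and the eventual patch may be
structured quite differently.

    #include "postgres.h"
    #include "mb/pg_wchar.h"            /* MAX_MULTIBYTE_CHAR_LEN */

    extern bool name_exists(const char *name); /* hypothetical catalog probe */

    /*
     * Look up an overlength startup-packet name ("name" is assumed to be at
     * least NAMEDATALEN bytes long).  Truncate to NAMEDATALEN - 1 bytes and
     * probe; on failure, strip at most MAX_MULTIBYTE_CHAR_LEN - 1 further
     * trailing bytes, retrying after each strip.
     */
    static bool
    lookup_overlength_name(const char *name)
    {
        char        truncated[NAMEDATALEN];
        int         len = NAMEDATALEN - 1;

        strlcpy(truncated, name, NAMEDATALEN);  /* keep the first 63 bytes */

        for (int ntries = 0;; ntries++)
        {
            if (name_exists(truncated))
                return true;

            /*
             * Give up once MAX_MULTIBYTE_CHAR_LEN - 1 extra bytes have been
             * stripped, if the first byte we dropped was ASCII (the
             * encoding-aware truncation at CREATE time must have cut in the
             * same place), or if the current last byte is ASCII (it cannot
             * belong to a multibyte character, so nothing straddles the cut).
             */
            if (ntries >= MAX_MULTIBYTE_CHAR_LEN - 1 ||
                !IS_HIGHBIT_SET(name[NAMEDATALEN - 1]) ||
                !IS_HIGHBIT_SET(truncated[len - 1]))
                return false;

            truncated[--len] = '\0';    /* strip one more byte and retry */
        }
    }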
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Tue, Nov 19, 2024 at 02:33:27PM -0500, Tom Lane wrote:
> I did think of a way that we could approximate encoding-correct
> truncation here, relying on the fact that what's in pg_database
> is encoding-correct according to somebody:
>
> 1. If NAMEDATALEN-1'th byte is ASCII (high bit clear), just truncate
> there and look up as usual.
>
> 2. If it's non-ASCII, truncate there and try to look up.  On success,
> we're good.  On failure, if the next-to-last byte is non-ASCII,
> truncate that too and try to look up.  Repeat a maximum of
> MAX_MULTIBYTE_CHAR_LEN-1 times before failing.
>
> I think this works unconditionally so long as all entries in
> pg_database.datname are in the same encoding.  If there's a
> mixture of encodings (which we don't forbid) then in principle
> you could probably select a database other than the one the
> client thought it was asking for.  But that seems mighty
> improbable, and the answer can always be "so connect using
> the name as it appears in the catalog".

That's an interesting idea.  That code would probably need to live in
GetDatabaseTuple(), but it seems doable.  We might be able to avoid the
"mighty improbable" case by always truncating up to
MAX_MULTIBYTE_CHAR_LEN-1 times and failing if there are multiple matches,
too.

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Nov 19, 2024 at 02:33:27PM -0500, Tom Lane wrote:
>> I did think of a way that we could approximate encoding-correct
>> truncation here, relying on the fact that what's in pg_database
>> is encoding-correct according to somebody:
>> ...

> That's an interesting idea.  That code would probably need to live in
> GetDatabaseTuple(), but it seems doable.  We might be able to avoid the
> "mighty improbable" case by always truncating up to
> MAX_MULTIBYTE_CHAR_LEN-1 times and failing if there are multiple matches,
> too.

Hmm ... but with short characters (e.g. LATIN1) there might be
legitimately-different names that that rule would complain about.  Still,
the workaround remains "so spell it like it is in the catalog".  On balance
I think that's an improvement over what I was visualizing.

Also, we could bypass the multiple lookups unless both the NAMEDATALEN-1'th
and NAMEDATALEN-2'th bytes are non-ASCII, which should be rare enough to
make it not much of a performance issue.

One annoying point is that we also need this for role lookup.

			regards, tom lane
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Tue, Nov 19, 2024 at 06:09:44PM -0500, Tom Lane wrote:
> Also, we could bypass the multiple lookups unless both the
> NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII, which
> should be rare enough to make it not much of a performance issue.

I'm admittedly not an expert in the multi-byte code, but since there are
encodings like LATIN1 that use a byte per character, don't we need to do
multiple lookups any time the NAMEDATALEN-1'th byte is non-ASCII?

> One annoying point is that we also need this for role lookup.

Right.

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Nov 19, 2024 at 06:09:44PM -0500, Tom Lane wrote:
>> Also, we could bypass the multiple lookups unless both the
>> NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII, which
>> should be rare enough to make it not much of a performance issue.

> I'm admittedly not an expert in the multi-byte code, but since there are
> encodings like LATIN1 that use a byte per character, don't we need to do
> multiple lookups any time the NAMEDATALEN-1'th byte is non-ASCII?

I don't think so, but maybe I'm missing something.  An important property
of backend-legal encodings is that all bytes of a multibyte character have
their high bits set.  Thus if the NAMEDATALEN-2'th byte does not have that,
it is not part of a multibyte character.  That's also the reason we can
stop if we reach a high-bit-clear byte while backing up to earlier bytes.

			regards, tom lane
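For reference, the high-bit test the thread keeps coming back to is the
backend's existing IS_HIGHBIT_SET macro, which as best I can tell is defined
in src/include/c.h roughly as shown below.  Any byte for which it is false is
plain ASCII in every server-side encoding, which is what allows the lookup to
stop backing up as soon as it reaches such a byte.

    #define HIGHBIT             (0x80)
    #define IS_HIGHBIT_SET(ch)  ((unsigned char) (ch) & HIGHBIT)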
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bertrand Drouvot
On Tue, Nov 19, 2024 at 06:09:44PM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> On Tue, Nov 19, 2024 at 02:33:27PM -0500, Tom Lane wrote:
>>> I did think of a way that we could approximate encoding-correct
>>> truncation here, relying on the fact that what's in pg_database
>>> is encoding-correct according to somebody:
>>> ...
>
>> That's an interesting idea.  That code would probably need to live in
>> GetDatabaseTuple(), but it seems doable.  We might be able to avoid the
>> "mighty improbable" case by always truncating up to
>> MAX_MULTIBYTE_CHAR_LEN-1 times and failing if there are multiple matches,
>> too.
>
> Hmm ... but with short characters (e.g. LATIN1) there might be
> legitimately-different names that that rule would complain about.

Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN?
That would then work for short characters too, IIUC.

> Also, we could bypass the multiple lookups unless both the
> NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII, which
> should be rare enough to make it not much of a performance issue.
>
> One annoying point is that we also need this for role lookup.

Yeah.  The role existence is checked before the database existence.  And
even if that were not the case, it would make little sense to rely on the
pg_encoding_max_length() of the matching database for the role, as roles
are global.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
> Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN?
> That would then work for short characters too, IIUC.

No.  We don't know which encoding it is.  Even if you wanted to say "use
the database encoding", we haven't identified the database yet.

			regards, tom lane
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bertrand Drouvot
Hi,

On Wed, Nov 20, 2024 at 10:10:51AM -0500, Tom Lane wrote:
> Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
>> Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN?
>> That would then work for short characters too, IIUC.
>
> No.  We don't know which encoding it is.  Even if you wanted to say
> "use the database encoding", we haven't identified the database yet.

I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the
datname and encoding from FormData_pg_database, and start the comparison
with the dbname passed as an argument to GetDatabaseTuple() from there.
Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Wed, Nov 20, 2024 at 03:20:45PM +0000, Bertrand Drouvot wrote:
> On Wed, Nov 20, 2024 at 10:10:51AM -0500, Tom Lane wrote:
>> Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
>>> Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN?
>>> That would then work for short characters too, IIUC.
>>
>> No.  We don't know which encoding it is.  Even if you wanted to say
>> "use the database encoding", we haven't identified the database yet.
>
> I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the
> datname and encoding from FormData_pg_database, and start the comparison
> with the dbname passed as an argument to GetDatabaseTuple() from there.
> Thoughts?

I was wondering if we could use the database encoding to disambiguate if we
found multiple matches, but IIUC the identifier will be truncated using the
encoding of the database from which it was created.  But hopefully this is
all rare enough in practice...

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Tue, Nov 19, 2024 at 11:23:13PM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I'm admittedly not an expert in the multi-byte code, but since there are
>> encodings like LATIN1 that use a byte per character, don't we need to do
>> multiple lookups any time the NAMEDATALEN-1'th byte is non-ASCII?
>
> I don't think so, but maybe I'm missing something.  An important
> property of backend-legal encodings is that all bytes of a multibyte
> character have their high bits set.  Thus if the NAMEDATALEN-2'th
> byte does not have that, it is not part of a multibyte character.
> That's also the reason we can stop if we reach a high-bit-clear
> byte while backing up to earlier bytes.

That's good to know.  If we can assume that 1) all bytes of a multibyte
character have the high bit set and 2) all multibyte characters actually
require multiple bytes, then there are just a handful of cases that require
multiple lookups, and we can restrict even those to some extent, too.

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Wed, Nov 20, 2024 at 03:20:45PM +0000, Bertrand Drouvot wrote:
>> I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the
>> datname and encoding from FormData_pg_database, and start the comparison
>> with the dbname passed as an argument to GetDatabaseTuple() from there.
>> Thoughts?

> I was wondering if we could use the database encoding to disambiguate if we
> found multiple matches, but IIUC the identifier will be truncated using the
> encoding of the database from which it was created.

Yeah, you can't really assume that a database's name is stored using the
encoding of that database.  And even if you did, you still can't make any
assumptions about pg_authid.  I doubt we want to do this differently for
roles than databases.

We've had some past discussions about tightening all this up, like
insisting that names in shared catalogs are always in UTF8.  But there are
downsides to that too.

In any case, there is no way that a seqscan of pg_database (or pg_authid)
is going to be adequately performant, let alone faster than at-most-four
indexed probes.  We just had somebody inquiring about whether it'd be okay
to create a million roles...

			regards, tom lane
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> That's good to know.  If we can assume that 1) all bytes of a multibyte
> character have the high bit set and 2) all multibyte characters actually
> require multiple bytes, then there are just a handful of cases that require
> multiple lookups, and we can restrict even those to some extent, too.

I'm failing to parse your (2).  Either that's content-free or you're
thinking something that probably isn't true.  There are encodings (mostly
the LATINn series) that have high-bit-set characters that only occupy one
byte.  So I don't think we can take any shortcuts compared to the
strip-one-byte-at-a-time approach.

			regards, tom lane
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bertrand Drouvot
On Wed, Nov 20, 2024 at 10:39:35AM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> On Wed, Nov 20, 2024 at 03:20:45PM +0000, Bertrand Drouvot wrote:
>>> I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the
>>> datname and encoding from FormData_pg_database, and start the comparison
>>> with the dbname passed as an argument to GetDatabaseTuple() from there.
>>> Thoughts?
>
>> I was wondering if we could use the database encoding to disambiguate if we
>> found multiple matches, but IIUC the identifier will be truncated using the
>> encoding of the database from which it was created.
>
> Yeah, you can't really assume that a database's name is stored using
> the encoding of that database.

Yeah, good point, let's stick to the MAX_MULTIBYTE_CHAR_LEN idea then and
discard the usage of pg_encoding_max_length().

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Wed, Nov 20, 2024 at 10:54:50AM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> That's good to know.  If we can assume that 1) all bytes of a multibyte
>> character have the high bit set and 2) all multibyte characters actually
>> require multiple bytes, then there are just a handful of cases that require
>> multiple lookups, and we can restrict even those to some extent, too.
>
> I'm failing to parse your (2).  Either that's content-free or you're
> thinking something that probably isn't true.  There are encodings
> (mostly the LATINn series) that have high-bit-set characters that
> only occupy one byte.  So I don't think we can take any shortcuts
> compared to the strip-one-byte-at-a-time approach.

I'm probably missing something here, sorry.  Upthread, you mentioned that
we could bypass multiple lookups unless both the NAMEDATALEN-1'th and
NAMEDATALEN-2'th bytes are non-ASCII.  But if there are encodings with the
high bit set that don't require multiple bytes per character, then how can
we do that?

For example, let's say the initially-truncated identifier ends with an
ASCII byte followed by a non-ASCII byte.  That last byte might be a LATIN1
character, or it could be the beginning of a character that requires
multiple bytes, so we need to look up both the initially truncated string
as well as the string with one extra byte truncated, right?

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> Upthread, you mentioned that we could bypass multiple lookups unless both
> the NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII.  But if
> there are encodings with the high bit set that don't require multiple bytes
> per character, then how can we do that?

Well, we don't know the length of the hypothetically-truncated character,
but if there was one then all its bytes must have had their high bits set.
Suppose that the untruncated name has a 4-byte multibyte character
extending from the NAMEDATALEN-3 byte through the NAMEDATALEN'th byte
(counting in origin zero here):

	...61 irrelevant bytes... C1 C2 C3 C4 ...

The original CREATE DATABASE would have removed that whole character and
stored a name of length NAMEDATALEN-3:

	...61 irrelevant bytes...

In the connection attempt, when we receive the untruncated name, we'll
first try to truncate it to NAMEDATALEN-1 bytes:

	...61 irrelevant bytes... C1 C2

We'll look that up and not find it.  At this point we remember that C3 had
the high bit set, and we note that C2 does too, so we try

	...61 irrelevant bytes... C1

That still doesn't work, but C1 still has the high bit set, so we try

	...61 irrelevant bytes...

and find the match.

Now as for the shortcut cases: if C3 does not have the high bit set, it
cannot be part of a multibyte character.  Therefore the original
encoding-aware truncation would have removed C3 and following bytes, but no
more.  The character immediately before might have been one byte or
several, but it doesn't matter.  Similarly, if C2 does not have the high
bit set, it cannot be part of a multibyte character.  The original
truncation would have removed C3 and following bytes, but no more.

Another way to think about this is that without knowledge of the encoding,
we don't know whether a run of several high-bit-set bytes represents one
character or several.  But all the encodings we support are ASCII
extensions, meaning that any high-bit-clear byte represents an ASCII
character and is not part of a multibyte character.  So it would have
gotten truncated or not independently of what's around it.

			regards, tom lane
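The shortcut described above can be captured in a small predicate.  This is a
hedged illustration using a hypothetical helper name, not code from the patch
under discussion.  With C2 = name[NAMEDATALEN - 2] (the last byte a blind
63-byte truncation keeps) and C3 = name[NAMEDATALEN - 1] (the first byte it
drops), the extra strip-and-retry lookups are needed only when both bytes have
their high bits set.

    #include "postgres.h"               /* NAMEDATALEN, IS_HIGHBIT_SET */

    /*
     * Hypothetical helper: true only if a multibyte character could straddle
     * the blind truncation point of an overlength name.  "name" is assumed
     * to be at least NAMEDATALEN bytes long.
     */
    static inline bool
    retry_lookups_needed(const char *name)
    {
        return IS_HIGHBIT_SET(name[NAMEDATALEN - 2]) && /* C2: last byte kept */
               IS_HIGHBIT_SET(name[NAMEDATALEN - 1]);   /* C3: first byte dropped */
    }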
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Wed, Nov 20, 2024 at 11:29:56AM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> Upthread, you mentioned that we could bypass multiple lookups unless both
>> the NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII.  But if
>> there are encodings with the high bit set that don't require multiple bytes
>> per character, then how can we do that?
>
> Well, we don't know the length of the hypothetically-truncated
> character, but if there was one then all its bytes must have had their
> high bits set.  Suppose that the untruncated name has a 4-byte
> multibyte character extending from the NAMEDATALEN-3 byte through the
> NAMEDATALEN'th byte (counting in origin zero here):
>
> [...]
>
> Now as for the shortcut cases: if C3 does not have the high bit set,
> it cannot be part of a multibyte character.  Therefore the original
> encoding-aware truncation would have removed C3 and following bytes,
> but no more.  The character immediately before might have been one
> byte or several, but it doesn't matter.  Similarly, if C2 does not
> have the high bit set, it cannot be part of a multibyte character.
> The original truncation would have removed C3 and following bytes,
> but no more.

Oh, I think I had an off-by-one error in my mental model and was thinking
of the NAMEDATALEN-1'th byte as the last possible byte in the identifier
(i.e., name[NAMEDATALEN - 2]), whereas you meant the location where the
trailing zero would go for the largest possible all-ASCII identifier (i.e.,
name[NAMEDATALEN - 1]).  Thank you for elaborating.

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bruce Momjian
On Thu, Nov 21, 2024 at 07:27:22AM +0000, Bertrand Drouvot wrote:
> +	/*
> +	 * If the original name is too long and we see two consecutive bytes
> +	 * with their high bits set at the truncation point, we might have
> +	 * truncated in the middle of a multibyte character. In multibyte
> +	 * encodings, every byte of a multibyte character has its high bit
> +	 * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
> +	 * NAMEDATALEN-2, we know we're in the middle of a multibyte
> +	 * character. We need to try truncating one more byte back to find the
> +	 * start of the next character.
> +	 */
...
> +	/*
> +	 * If we've hit a byte with high bit clear (an ASCII byte), we
> +	 * know we can't be in the middle of a multibyte character,
> +	 * because all bytes of a multibyte character must have their
> +	 * high bits set. Any following byte must therefore be the
> +	 * start of a new character, so we can stop looking for
> +	 * earlier truncation points.
> +	 */

I don't understand this logic.  Why are two bytes important?  If we knew it
was UTF8 we could check for non-first bytes always starting with bits 10,
but we can't know that.

-- 
Bruce Momjian  <bruce@momjian.us>        https://momjian.us
EDB                                      https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bertrand Drouvot
Hi,

On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 07:27:22AM +0000, Bertrand Drouvot wrote:
>> +	/*
>> +	 * If the original name is too long and we see two consecutive bytes
>> +	 * with their high bits set at the truncation point, we might have
>> +	 * truncated in the middle of a multibyte character. In multibyte
>> +	 * encodings, every byte of a multibyte character has its high bit
>> +	 * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
>> +	 * NAMEDATALEN-2, we know we're in the middle of a multibyte
>> +	 * character. We need to try truncating one more byte back to find the
>> +	 * start of the next character.
>> +	 */
> ...
>> +	/*
>> +	 * If we've hit a byte with high bit clear (an ASCII byte), we
>> +	 * know we can't be in the middle of a multibyte character,
>> +	 * because all bytes of a multibyte character must have their
>> +	 * high bits set. Any following byte must therefore be the
>> +	 * start of a new character, so we can stop looking for
>> +	 * earlier truncation points.
>> +	 */
>
> I don't understand this logic.  Why are two bytes important?  If we knew
> it was UTF8 we could check for non-first bytes always starting with
> bits 10, but we can't know that.

I think this is because it is a reliable way to detect whether the
truncation happened in the middle of a character, without needing to know
the specifics of the encoding.

My understanding is that the key insight is that in any multibyte encoding,
all bytes within a multibyte character will have their high bits set.

That's just my understanding from the code and Tom's previous explanations;
I might be wrong, as I am not an expert in this area.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bruce Momjian
On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
>> I don't understand this logic.  Why are two bytes important?  If we knew
>> it was UTF8 we could check for non-first bytes always starting with
>> bits 10, but we can't know that.
>
> I think this is because it is a reliable way to detect whether the
> truncation happened in the middle of a character, without needing to know
> the specifics of the encoding.
>
> My understanding is that the key insight is that in any multibyte encoding,
> all bytes within a multibyte character will have their high bits set.
>
> That's just my understanding from the code and Tom's previous explanations;
> I might be wrong, as I am not an expert in this area.

But the logic doesn't make sense.  Why would two bytes be any different
than one?  I assumed you would just remove all trailing high-bit bytes and
stop at the first non-high-bit byte.

Also, do we really expect there to be trailing multi-byte characters and
then some ASCII before it?  Isn't it likely it will be all ASCII or all
multi-byte characters?  I guess for Latin1, it would work fine, but I
assume for Asian languages, it will be almost all multi-byte characters.
I guess digits would be ASCII.  This all just seems very unfocused.

-- 
Bruce Momjian  <bruce@momjian.us>        https://momjian.us
EDB                                      https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
>> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
>>> I don't understand this logic.  Why are two bytes important?  If we knew
>>> it was UTF8 we could check for non-first bytes always starting with
>>> bits 10, but we can't know that.
>>
>> I think this is because it is a reliable way to detect whether the
>> truncation happened in the middle of a character, without needing to know
>> the specifics of the encoding.
>>
>> My understanding is that the key insight is that in any multibyte encoding,
>> all bytes within a multibyte character will have their high bits set.
>>
>> That's just my understanding from the code and Tom's previous explanations;
>> I might be wrong, as I am not an expert in this area.
>
> But the logic doesn't make sense.  Why would two bytes be any different
> than one?

Tom provided a concise explanation upthread [0].  My understanding is the
same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
where we know that we couldn't possibly have truncated in the middle of a
multi-byte character.  This allows us to avoid doing multiple pg_database
lookups.

> I assumed you would just remove all trailing high-bit bytes
> and stop at the first non-high-bit byte.

I think this risks truncating more than one multi-byte character, which
would cause the login path to truncate differently than the CREATE/ALTER
DATABASE path (which is encoding-aware).

> Also, do we really expect
> there to be trailing multi-byte characters and then some ASCII before
> it?  Isn't it likely it will be all ASCII or all multi-byte characters?
> I guess for Latin1, it would work fine, but I assume for Asian
> languages, it will be almost all multi-byte characters.  I guess digits
> would be ASCII.

All of these seem within the realm of possibility to me.

> This all just seems very unfocused.

I see the following options:

* Try to do multibyte-aware truncation (the patch at hand).

* Only truncate for all-ASCII identifiers for historical purposes.  Folks
  using non-ASCII characters in database names will need to specify the
  datname exactly during login.

* ERROR for long identifiers instead of automatically truncating (upthread
  this was considered a non-starter since this behavior has been around for
  so long).

* Revert the patch, leaving multibyte database names potentially broken
  (AFAIK Bertrand's initial report is the only one).

* Do nothing, so folks who previously relied on the truncation will now
  have to specify the datname exactly during login as of >= v17.

[0] https://postgr.es/m/158506.1732120196%40sss.pgh.pa.us

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bruce Momjian
On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
> On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
>> On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
>>> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
>>>> I don't understand this logic.  Why are two bytes important?  If we knew
>>>> it was UTF8 we could check for non-first bytes always starting with
>>>> bits 10, but we can't know that.
>>>
>>> I think this is because it is a reliable way to detect whether the
>>> truncation happened in the middle of a character, without needing to know
>>> the specifics of the encoding.
>>>
>>> My understanding is that the key insight is that in any multibyte encoding,
>>> all bytes within a multibyte character will have their high bits set.
>>>
>>> That's just my understanding from the code and Tom's previous explanations;
>>> I might be wrong, as I am not an expert in this area.
>>
>> But the logic doesn't make sense.  Why would two bytes be any different
>> than one?
>
> Tom provided a concise explanation upthread [0].  My understanding is the
> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
> where we know that we couldn't possibly have truncated in the middle of a
> multi-byte character.  This allows us to avoid doing multiple pg_database
> lookups.

Where does Tom mention anything about checking two bytes?  He is basically
saying remove all trailing high-bit characters until you get a match,
because once you get a match, you have found the point of valid truncation
for the encoding.  In fact, here, he specifically talks about
MAX_MULTIBYTE_CHAR_LEN-1:

	https://www.postgresql.org/message-id/3796535.1732044807%40sss.pgh.pa.us

This text:

	 * If the original name is too long and we see two consecutive bytes
	 * with their high bits set at the truncation point, we might have
	 * truncated in the middle of a multibyte character. In multibyte
	 * encodings, every byte of a multibyte character has its high bit
	 * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
	 * NAMEDATALEN-2, we know we're in the middle of a multibyte
	 * character. We need to try truncating one more byte back to find the
	 * start of the next character.

needs to be fixed, at a minimum, specifically, "So if IS_HIGHBIT_SET is
true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know we're in the middle
of a multibyte character."

>> I assumed you would just remove all trailing high-bit bytes
>> and stop at the first non-high-bit byte.
>
> I think this risks truncating more than one multi-byte character, which
> would cause the login path to truncate differently than the CREATE/ALTER
> DATABASE path (which is encoding-aware).

True, we can stop at MAX_MULTIBYTE_CHAR_LEN-1, and know there is no match.

> * Try to do multibyte-aware truncation (the patch at hand).

Yes, I am fine with that, but we need to do more than the patch does to
accomplish this, unless I am totally confused.

-- 
Bruce Momjian  <bruce@momjian.us>        https://momjian.us
EDB                                      https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Thu, Nov 21, 2024 at 11:44:44AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
>> Tom provided a concise explanation upthread [0].  My understanding is the
>> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
>> where we know that we couldn't possibly have truncated in the middle of a
>> multi-byte character.  This allows us to avoid doing multiple pg_database
>> lookups.
>
> Where does Tom mention anything about checking two bytes?

Here [0].  And he further elaborated on this idea here [1].

> He is basically saying remove all trailing high-bit characters until you
> get a match, because once you get a match, you have found the point of
> valid truncation for the encoding.

Yes, we still need to do that if it's possible the truncation wiped out
part of a multi-byte character.  But it's not possible that we truncated
part of a multi-byte character if the NAMEDATALEN-1'th or NAMEDATALEN-2'th
byte is ASCII, in which case we can avoid doing extra lookups.

> This text:
>
>	 * If the original name is too long and we see two consecutive bytes
>	 * with their high bits set at the truncation point, we might have
>	 * truncated in the middle of a multibyte character. In multibyte
>	 * encodings, every byte of a multibyte character has its high bit
>	 * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
>	 * NAMEDATALEN-2, we know we're in the middle of a multibyte
>	 * character. We need to try truncating one more byte back to find the
>	 * start of the next character.
>
> needs to be fixed, at a minimum, specifically, "So if IS_HIGHBIT_SET is
> true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know we're in the
> middle of a multibyte character."

Agreed, the second-to-last sentence should be adjusted to something like
"we might be in the middle of a multibyte character."  We don't know for
sure.

>> * Try to do multibyte-aware truncation (the patch at hand).
>
> Yes, I am fine with that, but we need to do more than the patch does to
> accomplish this, unless I am totally confused.

What more do you think is required?

[0] https://postgr.es/m/3976665.1732057784%40sss.pgh.pa.us
[1] https://postgr.es/m/158506.1732120196%40sss.pgh.pa.us

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Bruce Momjian
On Thu, Nov 21, 2024 at 11:09:14AM -0600, Nathan Bossart wrote:
> On Thu, Nov 21, 2024 at 11:44:44AM -0500, Bruce Momjian wrote:
>> On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
>>> Tom provided a concise explanation upthread [0].  My understanding is the
>>> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
>>> where we know that we couldn't possibly have truncated in the middle of a
>>> multi-byte character.  This allows us to avoid doing multiple pg_database
>>> lookups.
>>
>> Where does Tom mention anything about checking two bytes?
>
> Here [0].  And he further elaborated on this idea here [1].
>
>> He is basically saying remove all trailing high-bit characters until you
>> get a match, because once you get a match, you have found the point of
>> valid truncation for the encoding.
>
> Yes, we still need to do that if it's possible the truncation wiped out
> part of a multi-byte character.  But it's not possible that we truncated
> part of a multi-byte character if the NAMEDATALEN-1'th or NAMEDATALEN-2'th
> byte is ASCII, in which case we can avoid doing extra lookups.

Why would you check for two characters at the end rather than just a
normal check in the main loop?

>> needs to be fixed, at a minimum, specifically, "So if IS_HIGHBIT_SET is
>> true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know we're in the
>> middle of a multibyte character."
>
> Agreed, the second-to-last sentence should be adjusted to something like
> "we might be in the middle of a multibyte character."  We don't know for
> sure.
>
>>> * Try to do multibyte-aware truncation (the patch at hand).
>>
>> Yes, I am fine with that, but we need to do more than the patch does to
>> accomplish this, unless I am totally confused.
>
> What more do you think is required?

I think the IS_HIGHBIT_SET needs to be integrated into the 'for' loop more
clearly; the 'if' check plus the comment above it is just confusing.

-- 
Bruce Momjian  <bruce@momjian.us>        https://momjian.us
EDB                                      https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Nathan Bossart
On Thu, Nov 21, 2024 at 01:05:38PM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 11:09:14AM -0600, Nathan Bossart wrote:
>> Yes, we still need to do that if it's possible the truncation wiped out
>> part of a multi-byte character.  But it's not possible that we truncated
>> part of a multi-byte character if the NAMEDATALEN-1'th or NAMEDATALEN-2'th
>> byte is ASCII, in which case we can avoid doing extra lookups.
>
> Why would you check for two characters at the end rather than just a
> normal check in the main loop?

It might be possible to integrate this check into the loop, which could
potentially be cleaner.  The reason I didn't at first is because it
requires checking a byte that we will have already truncated away.  We
have to look at the original, non-truncated string for that.  I'll give it
a try (unless Bertrand beats me to it).

>> What more do you think is required?
>
> I think the IS_HIGHBIT_SET needs to be integrated into the 'for' loop
> more clearly; the 'if' check plus the comment above it is just
> confusing.

Got it.  Thank you for reviewing.

-- 
nathan
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> But the logic doesn't make sense.  Why would two bytes be any different
> than one?  I assumed you would just remove all trailing high-bit bytes
> and stop at the first non-high-bit byte.

To take the most obvious counterexample: what if the name contains *only*
high-bit-set bytes?

In any case, this logic must achieve the same effect as the original
encoding-aware truncation, which will not have removed more than it
absolutely had to.

			regards, tom lane
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: John Naylor
On Thu, Nov 21, 2024 at 2:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> +	/*
> +	 * If the original name is too long and we see two consecutive bytes
> +	 * with their high bits set at the truncation point, we might have
> +	 * truncated in the middle of a multibyte character. In multibyte
> +	 * encodings, every byte of a multibyte character has its high bit
> +	 * set.

Counterexample: Shift JIS -- I don't think we can short-circuit the full
check.

-- 
John Naylor
Amazon Web Services
Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
From: Tom Lane
John Naylor <johncnaylorls@gmail.com> writes:
> On Thu, Nov 21, 2024 at 2:27 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>> + * If the original name is too long and we see two consecutive bytes
>> + * with their high bits set at the truncation point, we might have
>> + * truncated in the middle of a multibyte character. In multibyte
>> + * encodings, every byte of a multibyte character has its high bit
>> + * set.

> Counterexample: Shift JIS -- I don't think we can short-circuit the full check.

Shift JIS is not an allowed server-side encoding, for precisely that
reason.  This whole proposal is based on the assumption that the string we
need to match in pg_database is valid in some server-side encoding.

			regards, tom lane