Thread: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails

The following bug has been logged on the website:

Bug reference:      18711
Logged by:          Adam Rauch
Email address:      adam@labkey.com
PostgreSQL version: 17.1
Operating system:   Windows 11
Description:

For many years, our test infrastructure has used database names that are
longer than the stated maximum length of 63 characters. The PostgreSQL
server simply truncates these names to 63 characters in all operations and
everything works fine. Starting with 17.x, our application is able to CREATE
and ALTER databases with long names, but all connection attempts using them
fail with an error:  database "<very long name>" does not exist

I fully recognize that these names are not legal, the previous truncation
behavior was not documented, and there are obvious ways to work around this
"issue." But given the long running truncation behavior, the now
inconsistent behavior (CREATE and ALTER happily accept long database names,
but connections fail), and the lack of any mention in the release notes
makes me think this may be an unintentional change worth flagging.

I happen to be connecting via the (latest) JDBC driver. I can provide the
simple Java code that passes on PostgreSQL 16 but fails on PostgreSQL 17, if
needed.

Thanks!
Adam


PG Bug reporting form <noreply@postgresql.org> writes:
> For many years, our test infrastructure has used database names that are
> longer than the stated maximum length of 63 characters. The PostgreSQL
> server simply truncates these names to 63 characters in all operations and
> everything works fine. Starting with 17.x, our application is able to CREATE
> and ALTER databases with long names, but all connection attempts using them
> fail with an error:  database "<very long name>" does not exist

Yeah, this was an intentional change in v17:

commit 562bee0fc13dc95710b8db6a48edad2f3d052f2e
Author: Nathan Bossart <nathan@postgresql.org>
Date:   Mon Jul 3 13:18:05 2023 -0700

    Don't truncate database and user names in startup packets.

    Unlike commands such as CREATE DATABASE, ProcessStartupPacket()
    does not perform multibyte-aware truncation of overlength names.
    This means that connection attempts might fail even if the user
    provides the same overlength names that were used in CREATE
    DATABASE, CREATE ROLE, etc.  Ideally, we'd do the same multibyte-
    aware truncation in both code paths, but it doesn't seem worth the
    added complexity of trying to discover the encoding of the names.
    Instead, let's simply skip truncating the names in the startup
    packet and let the user/database lookup fail later on.  With this
    change, users must provide the exact names stored in the catalogs,
    even if the names were truncated.

    This reverts commit d18c1d1f51.

    Author: Bertrand Drouvot
    Reviewed-by: Kyotaro Horiguchi, Tom Lane
    Discussion: https://postgr.es/m/07436793-1426-29b2-f924-db7422a05fb7%40gmail.com

As said, the difficulty is that we don't know what encoding the
incoming name is meant to be in, and with multibyte encodings that
matters.  The name actually stored in the catalog might be less
than 63 bytes long if it was truncated in a multibyte-aware way,
so that the former behavior of blindly truncating at 63 bytes
can still yield unexpected no-such-database results.
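
To make that concrete: in UTF8, a name of 62 ASCII bytes followed by the
3-byte character U+20AC (0xE2 0x82 0xAC) is 65 bytes long.  CREATE DATABASE
truncates character-wise and stores just the 62 ASCII bytes, while a blind
63-byte truncation of the string the client sends would keep the 62 ASCII
bytes plus the lone 0xE2, which matches nothing in pg_database.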

I can imagine still performing the truncation if the incoming
name is all-ASCII, but that seems like a hack.  The world isn't
nearly as ASCII-centric as it was in 2001.

            regards, tom lane



On Sun, Nov 17, 2024 at 01:00:14PM -0500, Tom Lane wrote:
> As said, the difficulty is that we don't know what encoding the
> incoming name is meant to be in, and with multibyte encodings that
> matters.  The name actually stored in the catalog might be less
> than 63 bytes long if it was truncated in a multibyte-aware way,
> so that the former behavior of blindly truncating at 63 bytes
> can still yield unexpected no-such-database results.
> 
> I can imagine still performing the truncation if the incoming
> name is all-ASCII, but that seems like a hack.  The world isn't
> nearly as ASCII-centric as it was in 2001.

I wonder if we should consider removing the identifier truncation
altogether.  Granted, it mostly works (or at least did before v17), but I'm
not sure we'd make the same decision today if we were starting from
scratch.  IMHO it'd be better to ERROR so that users are forced to produce
legal identifiers.  That being said, I realize this behavior has been
present for over a quarter century now [0] [1] [2], and folks are unlikely
to be happy with even more breakage.

[0] https://postgr.es/c/d15c37c
[1] https://postgr.es/c/0672a3c
[2] https://postgr.es/c/49581f9

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> On Sun, Nov 17, 2024 at 01:00:14PM -0500, Tom Lane wrote:
>> As said, the difficulty is that we don't know what encoding the
>> incoming name is meant to be in, and with multibyte encodings that
>> matters.  The name actually stored in the catalog might be less
>> than 63 bytes long if it was truncated in a multibyte-aware way,
>> so that the former behavior of blindly truncating at 63 bytes
>> can still yield unexpected no-such-database results.

> I wonder if we should consider removing the identifier truncation
> altogether.  Granted, it mostly works (or at least did before v17), but I'm
> not sure we'd make the same decision today if we were starting from
> scratch.  IMHO it'd be better to ERROR so that users are forced to produce
> legal identifiers.  That being said, I realize this behavior has been
> present for over a quarter century now [0] [1] [2], and folks are unlikely
> to be happy with even more breakage.

Yeah, I think removing it now is a non-starter.

I did think of a way that we could approximate encoding-correct
truncation here, relying on the fact that what's in pg_database
is encoding-correct according to somebody:

1. If NAMEDATALEN-1'th byte is ASCII (high bit clear), just truncate
there and look up as usual.

2. If it's non-ASCII, truncate there and try to look up. On success,
we're good.  On failure, if the next-to-last byte is non-ASCII,
truncate that too and try to look up.  Repeat a maximum of
MAX_MULTIBYTE_CHAR_LEN-1 times before failing.

I think this works unconditionally so long as all entries in
pg_database.datname are in the same encoding.  If there's a
mixture of encodings (which we don't forbid) then in principle
you could probably select a database other than the one the
client thought it was asking for.  But that seems mighty
improbable, and the answer can always be "so connect using
the name as it appears in the catalog".
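
For illustration only, here is a minimal standalone sketch of that retry
loop; lookup_name() is a hypothetical stand-in for the real indexed catalog
probe, and none of this is actual PostgreSQL source:

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    #define NAMEDATALEN 64               /* as in pg_config_manual.h */
    #define MAX_MULTIBYTE_CHAR_LEN 4     /* longest server-encoding character */
    #define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

    /* Hypothetical stand-in for an indexed probe of pg_database.datname. */
    static bool
    lookup_name(const char *name)
    {
        static const char *catalog[] = {"postgres", "template1"};

        for (size_t i = 0; i < sizeof(catalog) / sizeof(catalog[0]); i++)
            if (strcmp(catalog[i], name) == 0)
                return true;
        return false;
    }

    /*
     * Probe with the name truncated to NAMEDATALEN-1 bytes, then retry with
     * one fewer trailing high-bit byte each time, performing at most
     * MAX_MULTIBYTE_CHAR_LEN-1 extra probes.
     */
    static bool
    lookup_truncated(const char *name)
    {
        char    buf[NAMEDATALEN];
        int     len = (int) strlen(name);

        if (len < NAMEDATALEN)
            return lookup_name(name);   /* nothing to truncate */

        len = NAMEDATALEN - 1;
        memcpy(buf, name, len);
        buf[len] = '\0';
        if (lookup_name(buf))
            return true;

        /*
         * If the first byte we cut off is ASCII, the cut fell on a character
         * boundary, so no shorter candidate can be right.
         */
        if (!IS_HIGHBIT_SET(name[len]))
            return false;

        for (int i = 0; i < MAX_MULTIBYTE_CHAR_LEN - 1; i++)
        {
            /* An ASCII byte cannot be part of a multibyte character: stop. */
            if (!IS_HIGHBIT_SET(buf[len - 1]))
                break;
            buf[--len] = '\0';
            if (lookup_name(buf))
                return true;
        }
        return false;
    }

    int
    main(void)
    {
        printf("%d\n", lookup_truncated("postgres"));   /* prints 1 */
        return 0;
    }

At most four lookups happen here: the initial NAMEDATALEN-1 candidate plus
up to MAX_MULTIBYTE_CHAR_LEN-1 shorter ones.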

It's ugly of course.  But considering that we got a complaint
so quickly after v17 release, I'm not sure we can just camp on
562bee0fc as being an acceptable answer.

            regards, tom lane



On Tue, Nov 19, 2024 at 02:33:27PM -0500, Tom Lane wrote:
> I did think of a way that we could approximate encoding-correct
> truncation here, relying on the fact that what's in pg_database
> is encoding-correct according to somebody:
> 
> 1. If NAMEDATALEN-1'th byte is ASCII (high bit clear), just truncate
> there and look up as usual.
> 
> 2. If it's non-ASCII, truncate there and try to look up. On success,
> we're good.  On failure, if the next-to-last byte is non-ASCII,
> truncate that too and try to look up.  Repeat a maximum of
> MAX_MULTIBYTE_CHAR_LEN-1 times before failing.
> 
> I think this works unconditionally so long as all entries in
> pg_database.datname are in the same encoding.  If there's a
> mixture of encodings (which we don't forbid) then in principle
> you could probably select a database other than the one the
> client thought it was asking for.  But that seems mighty
> improbable, and the answer can always be "so connect using
> the name as it appears in the catalog".

That's an interesting idea.  That code would probably need to live in
GetDatabaseTuple(), but it seems doable.  We might be able to avoid the
"mighty improbable" case by always truncating up to
MAX_MULTIBYTE_CHAR_LEN-1 times and failing if there are multiple matches,
too.

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Nov 19, 2024 at 02:33:27PM -0500, Tom Lane wrote:
>> I did think of a way that we could approximate encoding-correct
>> truncation here, relying on the fact that what's in pg_database
>> is encoding-correct according to somebody:
>> ...

> That's an interesting idea.  That code would probably need to live in
> GetDatabaseTuple(), but it seems doable.  We might be able to avoid the
> "mighty improbable" case by always truncating up to
> MAX_MULTIBYTE_CHAR_LEN-1 times and failing if there are multiple matches,
> too.

Hmm ... but with short characters (e.g. LATIN1) there might be
legitimately-different names that that rule would complain about.
Still, the workaround remains "so spell it like it is in the catalog".
On balance I think that's an improvement over what I was visualizing.

Also, we could bypass the multiple lookups unless both the
NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII, which
should be rare enough to make it not much of a performance issue.

One annoying point is that we also need this for role lookup.

            regards, tom lane



On Tue, Nov 19, 2024 at 06:09:44PM -0500, Tom Lane wrote:
> Also, we could bypass the multiple lookups unless both the
> NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII, which
> should be rare enough to make it not much of a performance issue.

I'm admittedly not an expert in the multi-byte code, but since there are
encodings like LATIN1 that use a byte per character, don't we need to do
multiple lookups any time the NAMEDATALEN-1'th byte is non-ASCII?

> One annoying point is that we also need this for role lookup.

Right.

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Nov 19, 2024 at 06:09:44PM -0500, Tom Lane wrote:
>> Also, we could bypass the multiple lookups unless both the
>> NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII, which
>> should be rare enough to make it not much of a performance issue.

> I'm admittedly not an expert in the multi-byte code, but since there are
> encodings like LATIN1 that use a byte per character, don't we need to do
> multiple lookups any time the NAMEDATALEN-1'th byte is non-ASCII?

I don't think so, but maybe I'm missing something.  An important
property of backend-legal encodings is that all bytes of a multibyte
character have their high bits set.  Thus if the NAMEDATALEN-2'th
byte does not have that, it is not part of a multibyte character.
That's also the reason we can stop if we reach a high-bit-clear
byte while backing up to earlier bytes.
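
For reference, that property is exactly what the backend's IS_HIGHBIT_SET
macro (src/include/c.h) tests:

    #define HIGHBIT             (0x80)
    #define IS_HIGHBIT_SET(ch)  ((unsigned char)(ch) & HIGHBIT)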

            regards, tom lane



On Tue, Nov 19, 2024 at 06:09:44PM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > On Tue, Nov 19, 2024 at 02:33:27PM -0500, Tom Lane wrote:
> >> I did think of a way that we could approximate encoding-correct
> >> truncation here, relying on the fact that what's in pg_database
> >> is encoding-correct according to somebody:
> >> ...
> 
> > That's an interesting idea.  That code would probably need to live in
> > GetDatabaseTuple(), but it seems doable.  We might be able to avoid the
> > "mighty improbable" case by always truncating up to
> > MAX_MULTIBYTE_CHAR_LEN-1 times and failing if there are multiple matches,
> > too.
> 
> Hmm ... but with short characters (e.g. LATIN1) there might be
> legitimately-different names that that rule would complain about.

Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN? That
would then work for short characters too IIUC.

> Also, we could bypass the multiple lookups unless both the
> NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII, which
> should be rare enough to make it not much of a performance issue.
> 
> One annoying point is that we also need this for role lookup.

Yeah. Role existence is checked before database existence. And even if that
were not the case, it would make little sense to rely on the
pg_encoding_max_length() of the matching database for the role, since roles
are global.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
> Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN? That
> would then work for short characters too IIUC.

No.  We don't know which encoding it is.  Even if you wanted to say
"use the database encoding", we haven't identified the database yet.

            regards, tom lane



Hi,

On Wed, Nov 20, 2024 at 10:10:51AM -0500, Tom Lane wrote:
> Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
> > Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN? That
> > would then work for short characters too IIUC.
> 
> No.  We don't know which encoding it is.  Even if you wanted to say
> "use the database encoding", we haven't identified the database yet.

I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the datname
and encoding from FormData_pg_database and start from there the comparison 
with the dbname passed as an argument to GetDatabaseTuple(). Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Wed, Nov 20, 2024 at 03:20:45PM +0000, Bertrand Drouvot wrote:
> On Wed, Nov 20, 2024 at 10:10:51AM -0500, Tom Lane wrote:
>> Bertrand Drouvot <bertranddrouvot.pg@gmail.com> writes:
>> > Could we rely on pg_encoding_max_length() instead of MAX_MULTIBYTE_CHAR_LEN? That
>> > would then work for short characters too IIUC.
>> 
>> No.  We don't know which encoding it is.  Even if you wanted to say
>> "use the database encoding", we haven't identified the database yet.
> 
> I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the datname
> and encoding from FormData_pg_database and start from there the comparison 
> with the dbname passed as an argument to GetDatabaseTuple(). Thoughts?

I was wondering if we could use the database encoding to disambiguate if we
found multiple matches, but IIUC the identifier will be truncated using the
encoding of the database from which it was created.  But hopefully this is
all rare enough in practice...

-- 
nathan



On Tue, Nov 19, 2024 at 11:23:13PM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I'm admittedly not an expert in the multi-byte code, but since there are
>> encodings like LATIN1 that use a byte per character, don't we need to do
>> multiple lookups any time the NAMEDATALEN-1'th byte is non-ASCII?
> 
> I don't think so, but maybe I'm missing something.  An important
> property of backend-legal encodings is that all bytes of a multibyte
> character have their high bits set.  Thus if the NAMEDATALEN-2'th
> byte does not have that, it is not part of a multibyte character.
> That's also the reason we can stop if we reach a high-bit-clear
> byte while backing up to earlier bytes.

That's good to know.  If we can assume that 1) all bytes of a multibyte
character have the high bit set and 2) all multibyte characters actually
require multiple bytes, then there are just a handful of cases that require
multiple lookups, and we can restrict even those to some extent, too.

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> On Wed, Nov 20, 2024 at 03:20:45PM +0000, Bertrand Drouvot wrote:
>> I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the datname
>> and encoding from FormData_pg_database and start from there the comparison
>> with the dbname passed as an argument to GetDatabaseTuple(). Thoughts?

> I was wondering if we could use the database encoding to disambiguate if we
> found multiple matches, but IIUC the identifier will be truncated using the
> encoding of the database from which it was created.

Yeah, you can't really assume that a database's name is stored using
the encoding of that database.  And even if you did, you still can't
make any assumptions about pg_authid.  I doubt we want to do this
differently for roles than databases.

We've had some past discussions about tightening all this up, like
insisting that names in shared catalogs are always in UTF8.  But
there are downsides to that too.

In any case, there is no way that a seqscan of pg_database (or
pg_authid) is going to be adequately performant, let alone faster
than at-most-four indexed probes.  We just had somebody inquiring
about whether it'd be okay to create a million roles...

            regards, tom lane



Nathan Bossart <nathandbossart@gmail.com> writes:
> That's good to know.  If we can assume that 1) all bytes of a multibyte
> character have the high bit set and 2) all multibyte characters actually
> require multiple bytes, then there are just a handful of cases that require
> multiple lookups, and we can restrict even those to some extent, too.

I'm failing to parse your (2).  Either that's content-free or you're
thinking something that probably isn't true.  There are encodings
(mostly the LATINn series) that have high-bit-set characters that
only occupy one byte.  So I don't think we can take any shortcuts
compared to the strip-one-byte-at-a-time approach.
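
(In LATIN1, for instance, 'é' is the single high-bit byte 0xE9, so a run of
high-bit bytes at the truncation point could be several one-byte characters
just as easily as one multibyte character; each candidate length has to be
probed.)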

            regards, tom lane



On Wed, Nov 20, 2024 at 10:39:35AM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > On Wed, Nov 20, 2024 at 03:20:45PM +0000, Bertrand Drouvot wrote:
> >> I had in mind to "fully scan" pg_database in GetDatabaseTuple(), get the datname
> >> and encoding from FormData_pg_database and start from there the comparison 
> >> with the dbname passed as an argument to GetDatabaseTuple(). Thoughts?
> 
> > I was wondering if we could use the database encoding to disambiguate if we
> > found multiple matches, but IIUC the identifier will be truncated using the
> > encoding of the database from which it was created.
> 
> Yeah, you can't really assume that a database's name is stored using
> the encoding of that database.

Yeah, good point, let's stick to the MAX_MULTIBYTE_CHAR_LEN idea then and discard
the usage of pg_encoding_max_length().

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Wed, Nov 20, 2024 at 10:54:50AM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> That's good to know.  If we can assume that 1) all bytes of a multibyte
>> character have the high bit set and 2) all multibyte characters actually
>> require multiple bytes, then there are just a handful of cases that require
>> multiple lookups, and we can restrict even those to some extent, too.
> 
> I'm failing to parse your (2).  Either that's content-free or you're
> thinking something that probably isn't true.  There are encodings
> (mostly the LATINn series) that have high-bit-set characters that
> only occupy one byte.  So I don't think we can take any shortcuts
> compared to the strip-one-byte-at-a-time approach.

I'm probably missing something here, sorry.

Upthread, you mentioned that we could bypass multiple lookups unless both
the NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII.  But if
there are encodings with the high bit set that don't require multiple bytes
per character, then how can we do that?  For example, let's say the
initially-truncated identifier ends with an ASCII byte followed by a
non-ASCII byte.  That last byte might be a LATIN1 character, or it could be
the beginning of a character that requires multiple bytes, so we need to
look up both the initially truncated string as well as the string with one
extra byte truncated, right?

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> Upthread, you mentioned that we could bypass multiple lookups unless both
> the NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII.  But if
> there are encodings with the high bit set that don't require multiple bytes
> per character, then how can we do that?

Well, we don't know the length of the hypothetically-truncated
character, but if there was one then all its bytes must have had their
high bits set.  Suppose that the untruncated name has a 4-byte
multibyte character extending from the NAMEDATALEN-3 byte through the
NAMEDATALEN'th byte (counting in origin zero here):

    ...61 irrelevant bytes... C1 C2 C3 C4 ...

The original CREATE DATABASE would have removed that whole character
and stored a name of length NAMEDATALEN-3:

    ...61 irrelevant bytes...

In the connection attempt, when we
receive the untruncated name, we'll first try to truncate it to
NAMEDATALEN-1 bytes:

    ...61 irrelevant bytes... C1 C2

We'll look that up and not find it.  At this point we remember that
C3 had the high bit set, and we note that C2 does too, so we try

    ...61 irrelevant bytes... C1

That still doesn't work, but C1 still has the high bit set,
so we try

    ...61 irrelevant bytes...

and find the match.

Now as for the shortcut cases: if C3 does not have the high bit set,
it cannot be part of a multibyte character.  Therefore the original
encoding-aware truncation would have removed C3 and following bytes,
but no more.  The character immediately before might have been one
byte or several, but it doesn't matter.  Similarly, if C2 does not
have the high bit set, it cannot be part of a multibyte character.
The original truncation would have removed C3 and following bytes,
but no more.

Another way to think about this is that without knowledge of the
encoding, we don't know whether a run of several high-bit-set
bytes represents one character or several.  But all the encodings
we support are ASCII extensions, meaning that any high-bit-clear
byte represents an ASCII character and is not part of a multibyte
character.  So it would have gotten truncated or not independently
of what's around it.

            regards, tom lane



On Wed, Nov 20, 2024 at 11:29:56AM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> Upthread, you mentioned that we could bypass multiple lookups unless both
>> the NAMEDATALEN-1'th and NAMEDATALEN-2'th bytes are non-ASCII.  But if
>> there are encodings with the high bit set that don't require multiple bytes
>> per character, then how can we do that?
> 
> Well, we don't know the length of the hypothetically-truncated
> character, but if there was one then all its bytes must have had their
> high bits set.  Suppose that the untruncated name has a 4-byte
> multibyte character extending from the NAMEDATALEN-3 byte through the
> NAMEDATALEN'th byte (counting in origin zero here):
>
> [...]
> 
> Now as for the shortcut cases: if C3 does not have the high bit set,
> it cannot be part of a multibyte character.  Therefore the original
> encoding-aware truncation would have removed C3 and following bytes,
> but no more.  The character immediately before might have been one
> byte or several, but it doesn't matter.  Similarly, if C2 does not
> have the high bit set, it cannot be part of a multibyte character.
> The original truncation would have removed C3 and following bytes,
> but no more.

Oh, I think I had an off-by-one error in my mental model and was thinking
of the NAMEDATALEN-1'th byte as the last possible byte in the identifier
(i.e., name[NAMEDATALEN - 2]), whereas you meant the location where the
trailing zero would go for the largest possible all-ASCII identifier (i.e.,
name[NAMEDATALEN - 1]).  Thank you for elaborating.

-- 
nathan



On Thu, Nov 21, 2024 at 07:27:22AM +0000, Bertrand Drouvot wrote:
> +        /*
> +         * If the original name is too long and we see two consecutive bytes
> +         * with their high bits set at the truncation point, we might have
> +         * truncated in the middle of a multibyte character. In multibyte
> +         * encodings, every byte of a multibyte character has its high bit
> +         * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
> +         * NAMEDATALEN-2, we know we're in the middle of a multibyte
> +         * character. We need to try truncating one more byte back to find the
> +         * start of the next character.
> +         */
...
> +                /*
> +                 * If we've hit a byte with high bit clear (an ASCII byte), we
> +                 * know we can't be in the middle of a multibyte character,
> +                 * because all bytes of a multibyte character must have their
> +                 * high bits set. Any following byte must therefore be the
> +                 * start of a new character, so we can stop looking for
> +                 * earlier truncation points.
> +                 */

I don't understand this logic.  Why are two bytes important?  If we knew
it was UTF8 we could check for non-first bytes always starting with
bits 10, but we can't know that.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Hi,

On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 07:27:22AM +0000, Bertrand Drouvot wrote:
> > +        /*
> > +         * If the original name is too long and we see two consecutive bytes
> > +         * with their high bits set at the truncation point, we might have
> > +         * truncated in the middle of a multibyte character. In multibyte
> > +         * encodings, every byte of a multibyte character has its high bit
> > +         * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
> > +         * NAMEDATALEN-2, we know we're in the middle of a multibyte
> > +         * character. We need to try truncating one more byte back to find the
> > +         * start of the next character.
> > +         */
> ...
> > +                /*
> > +                 * If we've hit a byte with high bit clear (an ASCII byte), we
> > +                 * know we can't be in the middle of a multibyte character,
> > +                 * because all bytes of a multibyte character must have their
> > +                 * high bits set. Any following byte must therefore be the
> > +                 * start of a new character, so we can stop looking for
> > +                 * earlier truncation points.
> > +                 */
> 
> I don't understand this logic.  Why are two bytes important?  If we knew
> it was UTF8 we could check for non-first bytes always starting with
> bits 10, but we can't know that.

I think that's because it's a reliable way to detect whether the truncation
happened in the middle of a character, without needing to know the specifics
of the encoding.

My understanding is that the key insight is that in any multibyte encoding, all
bytes within a multibyte character will have their high bits set.

That's just my understanding from the code and Tom's previous explanations:  I
might be wrong as not an expert in this area.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> > I don't understand this logic.  Why are two bytes important?  If we knew
> > it was UTF8 we could check for non-first bytes always starting with
> > bits 10, but we can't know that.
> 
> I think that's because it's a reliable way to detect whether the truncation
> happened in the middle of a character, without needing to know the specifics
> of the encoding.
> 
> My understanding is that the key insight is that in any multibyte encoding, all
> bytes within a multibyte character will have their high bits set.
> 
> That's just my understanding from the code and Tom's previous explanations:  I
> might be wrong as not an expert in this area.

But the logic doesn't make sense.  Why would two bytes be any different
than one?  I assumed you would just remove all trailing high-bit bytes
and stop at the first non-high-bit byte.  Also, do we really expect
there to be trailing multi-byte characters and then some ASCII before
it? Isn't it likely it will be all ASCII or all multi-byte characters? 
I guess for Latin1, it would work fine, but I assume for Asian
languages, it will be almost all multi-byte characters.  I guess digits
would be ASCII.  This all just seems very unfocused.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
>> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
>> > I don't understand this logic.  Why are two bytes important?  If we knew
>> > it was UTF8 we could check for non-first bytes always starting with
>> > bits 10, but we can't know that.
>> 
>> I think that's because it's a reliable way to detect whether the truncation
>> happened in the middle of a character, without needing to know the specifics
>> of the encoding.
>> 
>> My understanding is that the key insight is that in any multibyte encoding, all
>> bytes within a multibyte character will have their high bits set.
>> 
>> That's just my understanding from the code and Tom's previous explanations:  I
>> might be wrong as not an expert in this area.
> 
> But the logic doesn't make sense.  Why would two bytes be any different
> than one?

Tom provided a concise explanation upthread [0].  My understanding is the
same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
where we know that we couldn't possibly have truncated in the middle of a
multi-byte character.  This allows us to avoid doing multiple pg_database
lookups.

> I assumed you would just remove all trailing high-bit bytes
> and stop at the first non-high-bit byte.

I think this risks truncating more than one multi-byte character, which
would cause the login path to truncate differently than the CREATE/ALTER
DATABASE path (which is encoding-aware).

> Also, do we really expect
> there to be trailing multi-byte characters and then some ASCII before
> it? Isn't it likely it will be all ASCII or all multi-byte characters? 
> I guess for Latin1, it would work fine, but I assume for Asian
> languages, it will be almost all multi-byte characters.  I guess digits
> would be ASCII.

All of these seem within the realm of possibility to me.

> This all just seems very unfocused.

I see the following options:

* Try to do multibyte-aware truncation (the patch at hand).
* Only truncate for all-ASCII identifiers for historical purposes.  Folks
  using non-ASCII characters in database names will need to specify the
  datname exactly during login.
* ERROR for long identifiers instead of automatically truncating (upthread
  this was considered a non-starter since this behavior has been around for
  so long).
* Revert the patch, leaving multibyte database names potentially broken
  (AFAIK Bertrand's initial report is the only one).
* Do nothing, so folks who previously relied on the truncation will now
  have to specify the datname exactly during login as of >= v17.

[0] https://postgr.es/m/158506.1732120196%40sss.pgh.pa.us

-- 
nathan



On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
> On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> > On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
> >> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> >> > I don't understand this logic.  Why are two bytes important?  If we knew
> >> > it was UTF8 we could check for non-first bytes always starting with
> >> > bits 10, but we can't know that.
> >> 
> >> I think that's because it's a reliable way to detect whether the truncation
> >> happened in the middle of a character, without needing to know the specifics
> >> of the encoding.
> >> 
> >> My understanding is that the key insight is that in any multibyte encoding, all
> >> bytes within a multibyte character will have their high bits set.
> >> 
> >> That's just my understanding from the code and Tom's previous explanations:  I
> >> might be wrong as not an expert in this area.
> > 
> > But the logic doesn't make sense.  Why would two bytes be any different
> > than one?
> 
> Tom provided a concise explanation upthread [0].  My understanding is the
> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
> where we know that we couldn't possibly have truncated in the middle of a
> multi-byte character.  This allows us to avoid doing multiple pg_database
> lookups.

Where does Tom mention anything about checking two bytes?  He is
basically saying remove all trailing high-bit characters until you get a
match, because once you get a match, you have found the point of
valid truncation for the encoding.  In fact, here, he specifically talks
about MAX_MULTIBYTE_CHAR_LEN-1:

    https://www.postgresql.org/message-id/3796535.1732044807%40sss.pgh.pa.us

This text:

               * If the original name is too long and we see two consecutive bytes
               * with their high bits set at the truncation point, we might have
               * truncated in the middle of a multibyte character. In multibyte
               * encodings, every byte of a multibyte character has its high bit
               * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
               * NAMEDATALEN-2, we know we're in the middle of a multibyte
               * character. We need to try truncating one more byte back to find the
               * start of the next character.

needs to be fixed, at a minimum, specifically, "So if IS_HIGHBIT_SET is
true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know we're in the
middle of a multibyte character."

> > I assumed you would just remove all trailing high-bit bytes
> > and stop at the first non-high-bit byte.
> 
> I think this risks truncating more than one multi-byte character, which
> would cause the login path to truncate differently than the CREATE/ALTER
> DATABASE path (which is encoding-aware).

True, we can stop after MAX_MULTIBYTE_CHAR_LEN-1 attempts and know there is no match.

> * Try to do multibyte-aware truncation (the patch at hand).

Yes, I am fine with that, but we need to do more than the patch does to
accomplish this, unless I am totally confused.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



On Thu, Nov 21, 2024 at 11:44:44AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
>> Tom provided a concise explanation upthread [0].  My understanding is the
>> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
>> where we know that we couldn't possibly have truncated in the middle of a
>> multi-byte character.  This allows us to avoid doing multiple pg_database
>> lookups.
> 
> Where does Tom mention anything about checking two bytes?

Here [0].  And he further elaborated on this idea here [1].

> He is
> basically saying remove all trailing high-bit characters until you get a
> match, because once you get a match, you have found the point of
> valid truncation for the encoding.

Yes, we still need to do that if it's possible the truncation wiped out
part of a multi-byte character.  But it's not possible that we truncated
part of a multi-byte character if the NAMEDATALEN-1'th or NAMEDATALEN-2'th
byte is ASCII, in which case we can avoid doing extra lookups.

> This text:
> 
>                * If the original name is too long and we see two consecutive bytes
>                * with their high bits set at the truncation point, we might have
>                * truncated in the middle of a multibyte character. In multibyte
>                * encodings, every byte of a multibyte character has its high bit
>                * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
>                * NAMEDATALEN-2, we know we're in the middle of a multibyte
>                * character. We need to try truncating one more byte back to find the
>                * start of the next character.
> 
> needs to be fixed, at a minimum, specifically, "So if IS_HIGHBIT_SET is
> true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know we're in the
> middle of a multibyte character."

Agreed, the second-to-last sentence should be adjusted to something like
"we might be in the middle of a multibyte character."  We don't know for
sure.

>> * Try to do multibyte-aware truncation (the patch at hand).
> 
> Yes, I am fine with that, but we need to do more than the patch does to
> accomplish this, unless I am totally confused.

What more do you think is required?

[0] https://postgr.es/m/3976665.1732057784%40sss.pgh.pa.us
[1] https://postgr.es/m/158506.1732120196%40sss.pgh.pa.us

-- 
nathan



On Thu, Nov 21, 2024 at 11:09:14AM -0600, Nathan Bossart wrote:
> On Thu, Nov 21, 2024 at 11:44:44AM -0500, Bruce Momjian wrote:
> > On Thu, Nov 21, 2024 at 09:14:23AM -0600, Nathan Bossart wrote:
> >> Tom provided a concise explanation upthread [0].  My understanding is the
> >> same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
> >> where we know that we couldn't possibly have truncated in the middle of a
> >> multi-byte character.  This allows us to avoid doing multiple pg_database
> >> lookups.
> > 
> > Where does Tom mention anything about checking two bytes?
> 
> Here [0].  And he further elaborated on this idea here [1].
> 
> > He is
> > basically saying remove all trailing high-bit characters until you get a
> > match, because once you get a match, you are have found the point of
> > valid truncation for the encoding.
> 
> Yes, we still need to do that if it's possible the truncation wiped out
> part of a multi-byte character.  But it's not possible that we truncated
> part of a multi-byte character if the NAMEDATALEN-1'th or NAMEDATALEN-2'th
> byte is ASCII, in which case we can avoid doing extra lookups.

Why would you check for two characters at the end rather than just a
normal check in the main loop?

> > needs to be fixed, at a minimum, specifically, "So if IS_HIGHBIT_SET is
> > true for both NAMEDATALEN-1 and NAMEDATALEN-2, we know we're in the
> > middle of a multibyte character."
> 
> Agreed, the second-to-last sentence should be adjusted to something like
> "we might be in the middle of a multibyte character."  We don't know for
> sure.
> 
> >> * Try to do multibyte-aware truncation (the patch at hand).
> > 
> > Yes, I am fine with that, but we need to do more than the patch does to
> > accomplish this, unless I am totally confused.
> 
> What more do you think is required?

I think the IS_HIGHBIT_SET needs to be integrated into the 'for' loop
more clearly;  the 'if' check plus the comment above it is just
confusing.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



On Thu, Nov 21, 2024 at 01:05:38PM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 11:09:14AM -0600, Nathan Bossart wrote:
>> Yes, we still need to do that if it's possible the truncation wiped out
>> part of a multi-byte character.  But it's not possible that we truncated
>> part of a multi-byte character if the NAMEDATALEN-1'th or NAMEDATALEN-2'th
>> byte is ASCII, in which case we can avoid doing extra lookups.
> 
> Why would you check for two characters at the end rather than just a
> normal check in the main loop?

It might be possible to integrate this check into the loop, which could
potentially be cleaner.  The reason I didn't at first is that it
requires checking a byte that we will have already truncated away.  We have
to look at the original, non-truncated string for that.  I'll give it a try
(unless Bertrand beats me to it).

>> What more do you think is required?
> 
> I think the IS_HIGHBIT_SET needs to be integrated into the 'for' loop
> more clearly;  the 'if' check plus the comment above it is just
> confusing.

Got it.  Thank you for reviewing.

-- 
nathan



Bruce Momjian <bruce@momjian.us> writes:
> But the logic doesn't make sense.  Why would two bytes be any different
> than one?  I assumed you would just remove all trailing high-bit bytes
> and stop at the first non-high-bit byte.

To take the most obvious counterexample: what if the name contains
*only* high-bit-set bytes?  In any case, this logic must achieve
the same effect as the original encoding-aware truncation, which
will not have removed more than it absolutely had to.
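
(Concretely: a long name made up entirely of, say, Cyrillic letters in UTF8
consists solely of high-bit-set bytes, so "remove all trailing high-bit
bytes" would strip the name down to nothing.)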

            regards, tom lane



On Thu, Nov 21, 2024 at 2:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

+ /*
+ * If the original name is too long and we see two consecutive bytes
+ * with their high bits set at the truncation point, we might have
+ * truncated in the middle of a multibyte character. In multibyte
+ * encodings, every byte of a multibyte character has its high bit
+ * set.

Counterexample: Shift JIS -- I don't think we can short-circuit the full check.

--
John Naylor
Amazon Web Services



John Naylor <johncnaylorls@gmail.com> writes:
> On Thu, Nov 21, 2024 at 2:27 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> + * If the original name is too long and we see two consecutive bytes
> + * with their high bits set at the truncation point, we might have
> + * truncated in the middle of a multibyte character. In multibyte
> + * encodings, every byte of a multibyte character has its high bit
> + * set.

> Counterexample: Shift JIS -- I don't think we can short-circuit the full check.

Shift JIS is not an allowed server-side encoding, for precisely
that reason.  This whole proposal is based on the assumption that
the string we need to match in pg_database is valid in some
server-side encoding.

            regards, tom lane