Thread: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

Hello PostgreSQL Community,

I am writing to propose an alternative compilation of PostgreSQL that allows for a 256-byte identifier length limit, alongside the existing 64-byte version.

Problem:
The current limit of 63 bytes can be quite restrictive, especially for databases that use multi-byte character sets like UTF-8. In such cases, a Chinese character takes up 3 bytes, limiting the name to just 21 characters.

Use-Case:
In specific use-cases, table names like "能源消耗统计_全球主要国家石油与天然气使用量_年度碳排放与可再生能源比例表" and "气候变化研究_全球主要地区极端天气事件频率_年度灾害损失与应对措施表" can far exceed the current limit.

Proposed Solution:
I propose that we offer an alternative compilation of PostgreSQL that increases the NAMEDATALEN constant to allow for 256-byte identifiers. This would be particularly useful for databases that make extensive use of multi-byte character sets like UTF-8.

Dual Compilation:
To maintain backward compatibility and to offer more flexibility, I suggest compiling this 256-byte version alongside the existing 64-byte version. Users could then choose which version best suits their needs.

Backward Compatibility:
I understand that this change could affect backward compatibility. Offering this as an alternative compilation could be a way to mitigate those concerns. I am open to suggestions on how this could be implemented to minimize disruptions.

I look forward to your feedback and hope we can make PostgreSQL even more adaptable to various needs.

Best regards,
Chuxiong

On Tue, 2023-10-10 at 11:49 +0800, David HJ wrote:
> I am writing to propose an alternative compilation of PostgreSQL that allows
> for a 256-byte identifier length limit, alongside the existing 64-byte version.
>
> Problem:
> The current limit of 63 bytes can be quite restrictive, especially for databases
> that use multi-byte character sets like UTF-8. In such cases, a Chinese character
> takes up 3 bytes, limiting the name to just 21 characters.
>
> Use-Case:
> In specific use-cases, table names like
> "能源消耗统计_全球主要国家石油与天然气使用量_年度碳排放与可再生能源比例表" and
> "气候变化研究_全球主要地区极端天气事件频率_年度灾害损失与应对措施表" can far exceed
> the current limit.

I have little understanding for people who name their tables

energy_consumption_statistics_oil_and_natural_gas_usage_in_major_countries_around_the_world_annual_carbon_emissions_and_renewable_energy_ratio_table.
The table could just as well be named "能源统计排放比".  The exact description could
go in the table comment.  Apart from that, it is a good idea to use table names
that are standard SQL identifiers, so that you don't have to double quote them
all the time.

Anyway, you are not the first person to hit the limit, so there is clearly a
pain that some people feel.

> Proposed Solution:
> I propose that we offer an alternative compilation of PostgreSQL that increases
> the NAMEDATALEN constant to allow for 256-byte identifiers. This would be
> particularly useful for databases that make extensive use of multi-byte character
> sets like UTF-8.

Wouldn't it be a good solution if we promote the #define to a configure option,
like "./configure --identifier-length-limit=256"?

Yours,
Laurenz Albe



On Tue, Oct 10, 2023 at 08:22:55AM +0200, Laurenz Albe wrote:
> On Tue, 2023-10-10 at 11:49 +0800, David HJ wrote:
> > I am writing to propose an alternative compilation of PostgreSQL that allows
> > for a 256-byte identifier length limit, alongside the existing 64-byte version.
> >
> > Problem:
> > The current limit of 63 bytes can be quite restrictive, especially for databases
> > that use multi-byte character sets like UTF-8. In such cases, a Chinese character
> > takes up 3 bytes, limiting the name to just 21 characters.
>
> Anyway, you are not the first person to hit the limit, so there is clearly a
> pain that some people feel.
>
> > Proposed Solution:
> > I propose that we offer an alternative compilation of PostgreSQL that increases
> > the NAMEDATALEN constant to allow for 256-byte identifiers. This would be
> > particularly useful for databases that make extensive use of multi-byte character
> > sets like UTF-8.
>
> Wouldn't it be a good solution if we promote the #define to a configure option,
> like "./configure --identifier-length-limit=256"?

Note that there was some thread recently [1] where the possibility of having
some kind of compilation matrix to generate multiple set of binaries with
various compile-time values was discussed, so I guess it could fit well with
that approach.

[1]
https://www.postgresql.org/message-id/20230630211153.kbysulcjedxa5ii6@awork3.anarazel.de
and following messages



> Hello PostgreSQL Community,
> 
> I am writing to propose an alternative compilation of PostgreSQL that
> allows for a 256-byte identifier length limit, alongside the existing
> 64-byte version.
> 
> *Problem:*
> The current limit of 63 bytes can be quite restrictive, especially for
> databases that use multi-byte character sets like UTF-8. In such cases, a
> Chinese character takes up 3 bytes, limiting the name to just 21 characters.
> 
> *Use-Case:*
> In specific use-cases, table names like
> "能源消耗统计_全球主要国家石油与天然气使用量_年度碳排放与可再生能源比例表" and
> "气候变化研究_全球主要地区极端天气事件频率_年度灾害损失与应对措施表" can far exceed the current limit.
> 
> *Proposed Solution:*
> I propose that we offer an alternative compilation of PostgreSQL that
> increases the NAMEDATALEN constant to allow for 256-byte identifiers. This
> would be particularly useful for databases that make extensive use of
> multi-byte character sets like UTF-8.

Another solution would be, letting the meaning of NAMEDATALEN to be
number of *characters*, not the number of bytes. This way, you can use
up to 64 UTF-8 characters. In my understanding MySQL already does this
way. I know this requires non trivial code modifications to PostgreSQL
but would be better than to make binaries with random NAMEDATALEN
values.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

On Tue, 2023-10-10 at 15:53 +0900, Tatsuo Ishii wrote:
> Another solution would be, letting the meaning of NAMEDATALEN to be
> number of *characters*, not the number of bytes. This way, you can use
> up to 64 UTF-8 characters. In my understanding MySQL already does this
> way. I know this requires non trivial code modifications to PostgreSQL
> but would be better than to make binaries with random NAMEDATALEN
> values.

Since "name" is a fixed-length data type, that would require the stored
size to increase to accomodate the extra bytes.  Wouldn't that change the
storage format and break pg_upgrade?

Yours,
Laurenz Albe



Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Tue, 2023-10-10 at 15:53 +0900, Tatsuo Ishii wrote:
>> Another solution would be, letting the meaning of NAMEDATALEN to be
>> number of *characters*, not the number of bytes. This way, you can use
>> up to 64 UTF-8 characters. In my understanding MySQL already does this
>> way. I know this requires non trivial code modifications to PostgreSQL
>> but would be better than to make binaries with random NAMEDATALEN
>> values.

> Since "name" is a fixed-length data type, that would require the stored
> size to increase to accomodate the extra bytes.  Wouldn't that change the
> storage format and break pg_upgrade?

Yeah, the real reason this is unlikely to happen is precisely that
"name" is fixed-length.  Increasing the standard NAMEDATALEN by 4x,
or even 2x, has been proposed and rejected many times before because
of the bloat it would cause in places like pg_attribute, pg_proc,
in-memory tuple descriptors, etc.

The real way forward IMO is to find a way to make "name" variable-length,
thus both satisfying people who need a few long names and reducing
overhead for everybody.  This is difficult to do without breaking
mountains of backend code, but there's been some discussions about
ways to accomplish that.  The most recent thread I could find is

https://www.postgresql.org/message-id/flat/CALSd-crdmj9PGdvdioU%3Da5W7P%3DTgNmEB2QP9wiF6DTUbBuMXrQ%40mail.gmail.com

            regards, tom lane



On 10.10.23 08:22, Laurenz Albe wrote:
> Apart from that, it is a good idea to use table names that are standard 
> SQL identifiers, so that you don't have to double quote them all the time.

FWIW, the Chinese character sequences posted here would be valid 
unquoted identifiers if PostgreSQL implemented standard SQL 
correctly/completely.




Peter Eisentraut <peter@eisentraut.org> writes:
> On 10.10.23 08:22, Laurenz Albe wrote:
>> Apart from that, it is a good idea to use table names that are standard 
>> SQL identifiers, so that you don't have to double quote them all the time.

> FWIW, the Chinese character sequences posted here would be valid 
> unquoted identifiers if PostgreSQL implemented standard SQL 
> correctly/completely.

I'm pretty sure they're valid unquoted identifiers today,
because by and large we'll take any non-ASCII as identifier
characters.  Conforming to the letter of the spec would
reduce, not increase, the set of strings we'll call identifiers.

            regards, tom lane