Thread: NAMEDATALEN increase because of non-latin languages

NAMEDATALEN increase because of non-latin languages

From
Денис Романенко
Date:
Hello dear hackers. I understand the position of the developers community about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but only if we speak about latin languages. 

Postgresql has wonderful support for unicode in table and column names. And it looks like very good idea to create table with names on native language for databases across the world. But when I want to create, for example, table with name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for catalog of counteragent contacts) it will be auto-shrinked to "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem - many words in Russian are just longer than it's English counterparts and I have many examples like this.

Although recompiling the source is not so hard, updating is hard. I know that is not free for disk space because of storing table names and field names but, from my point of view, in 2021 year convenience is more important than disk space. 

I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future releases. 

Sorry for wasted time for this message if this topic is not match with direction of postgresql development (and thank you for your hard work)

Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Wed, Aug 18, 2021 at 7:08 PM Денис Романенко <deromanenko@gmail.com> wrote:
>
> Hello dear hackers. I understand the position of the developers community about NAMEDATALEN length - and, in fact, 63
bytesis more than enough - but only if we speak about latin languages. 
>
> Postgresql has wonderful support for unicode in table and column names. And it looks like very good idea to create
tablewith names on native language for databases across the world. But when I want to create, for example, table with
name"Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for catalog of counteragent contacts) it will be
auto-shrinkedto "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem - many words in Russian are
justlonger than it's English counterparts and I have many examples like this. 
>
> Although recompiling the source is not so hard, updating is hard. I know that is not free for disk space because of
storingtable names and field names but, from my point of view, in 2021 year convenience is more important than disk
space.

Unfortunately, the problem isn't really the additional disk space it
would require.  The problem is the additional performance hit and
memory overhead, as the catalog names are part of the internal
syscache.

I understand your frustration, but given those problems I don't think
that postgres will increase the default NAMEDATALEN value any time
soon, even though it's in contradiction with the SQL standard.



Re: NAMEDATALEN increase because of non-latin languages

From
Ranier Vilela
Date:
Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко <deromanenko@gmail.com> escreveu:
Hello dear hackers. I understand the position of the developers community about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but only if we speak about latin languages. 

Postgresql has wonderful support for unicode in table and column names. And it looks like very good idea to create table with names on native language for databases across the world. But when I want to create, for example, table with name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for catalog of counteragent contacts) it will be auto-shrinked to "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem - many words in Russian are just longer than it's English counterparts and I have many examples like this.

Although recompiling the source is not so hard, updating is hard. I know that is not free for disk space because of storing table names and field names but, from my point of view, in 2021 year convenience is more important than disk space. 

I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future releases. 
+1 once that Oracle Database 12.2 and higher, has support for 128 bytes names.
What possibly, in the future, could impact some migration from Oracle to Postgres.

regards,
Ranier Vilela

Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:

On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Unfortunately, the problem isn't really the additional disk space it
> would require.  The problem is the additional performance hit and
> memory overhead, as the catalog names are part of the internal
> syscache.

Some actual numbers on recent hardware would show what kind of tradeoff is involved. No one has done that for a long time that I recall.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Wed, Aug 18, 2021 at 7:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > Unfortunately, the problem isn't really the additional disk space it
> > would require.  The problem is the additional performance hit and
> > memory overhead, as the catalog names are part of the internal
> > syscache.
>
> Some actual numbers on recent hardware would show what kind of tradeoff is involved. No one has done that for a long
timethat I recall.
 

Agreed, but I don't have access to such hardware.  However this won't
influence the memory overhead part, and there is already frequent
problems with that, especially since declarative partitioning, so I
don't see how we could afford that without some kind of cache TTL or
similar.  AFAIR the last discussion about it a few years ago didn't
lead anywhere :(



Re: NAMEDATALEN increase because of non-latin languages

From
Денис Романенко
Date:
I don't very close with PG testing methodology, but I can pay for a server (virtual or dedicated, DO maybe) and give access to it, if anyone has time for that.

Or if someone describes to me steps and shows where to look - I can do it by myself.

Re: NAMEDATALEN increase because of non-latin languages

From
Hannu Krosing
Date:
Could we just make the limitation to be 64 (or 128) _characters_ not _bytes_ ?

Memory sizes and processor speeds have grown by order(s) of magnitude
since the 64 byte limit was decided and supporting non-ASCII charsets
properly seems like a prudent thing to do.

Also - have we checked that at least the truncation does not cut utf-8
characters in half ?

-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.

On Wed, Aug 18, 2021 at 1:33 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Aug 18, 2021 at 7:27 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > >
> > > Unfortunately, the problem isn't really the additional disk space it
> > > would require.  The problem is the additional performance hit and
> > > memory overhead, as the catalog names are part of the internal
> > > syscache.
> >
> > Some actual numbers on recent hardware would show what kind of tradeoff is involved. No one has done that for a
longtime that I recall.
 
>
> Agreed, but I don't have access to such hardware.  However this won't
> influence the memory overhead part, and there is already frequent
> problems with that, especially since declarative partitioning, so I
> don't see how we could afford that without some kind of cache TTL or
> similar.  AFAIR the last discussion about it a few years ago didn't
> lead anywhere :(
>
>



Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:

On Wed, Aug 18, 2021 at 7:33 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > Some actual numbers on recent hardware would show what kind of tradeoff is involved. No one has done that for a long time that I recall.
>
> Agreed, but I don't have access to such hardware.  However this won't

Well, by "recent" I had in mind something more recent than 2002, which is the time where I see a lot of hits in the archives if you search for this topic.

> influence the memory overhead part, and there is already frequent
> problems with that, especially since declarative partitioning, so I

That's a fair point.

> don't see how we could afford that without some kind of cache TTL or
> similar.  AFAIR the last discussion about it a few years ago didn't
> lead anywhere :(

If you mean the thread "Protect syscache from bloating with negative cache entries", it had activity earlier this year, so I wouldn't give up hope just yet. Progress has been slow, so I'll see about putting some effort into that after concluding my attempt to speed up the syscaches first [1].

The main thing I'm worried about is the fact that a name would no longer fit in a Datum. The rest I think we can mitigate in some way.


--
John Naylor
EDB: http://www.enterprisedb.com

Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:
On Wed, Aug 18, 2021 at 8:03 AM Hannu Krosing <hannuk@google.com> wrote:
>
> Could we just make the limitation to be 64 (or 128) _characters_ not _bytes_ ?

That couldn't work because characters are variable length. The limit has to be a fixed length in bytes so we can quickly compute offsets in the attribute tuple.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Wed, Aug 18, 2021 at 8:04 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> >
> > Agreed, but I don't have access to such hardware.  However this won't
>
> Well, by "recent" I had in mind something more recent than 2002, which is the time where I see a lot of hits in the
archivesif you search for this topic.
 

Yeah, but my current laptop has a tendency to crash after a few minute
if I stress it too much, so I'm still out.

> If you mean the thread "Protect syscache from bloating with negative cache entries", it had activity earlier this
year,so I wouldn't give up hope just yet.
 

Yes that's the thread I was thinking about.  I'm not giving up hope
either, but I also don't see it being solved for v15.

> The main thing I'm worried about is the fact that a name would no longer fit in a Datum. The rest I think we can
mitigatein some way.
 

Agreed.



Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Wed, Aug 18, 2021 at 8:03 PM Hannu Krosing <hannuk@google.com> wrote:
>
> Also - have we checked that at least the truncation does not cut utf-8
> characters in half ?

Yes, same for all other places that can truncate text (like the query
text in pg_stat_activity and similar).  See usage of pg_mbcliplen() in
truncate_identifier.



Re: NAMEDATALEN increase because of non-latin languages

From
Laurenz Albe
Date:
On Wed, 2021-08-18 at 08:16 -0300, Ranier Vilela wrote:
> Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко <deromanenko@gmail.com> escreveu:
> > Hello dear hackers. I understand the position of the developers community about
> >  NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but only if we
> >  speak about latin languages. 
> > 
> > Postgresql has wonderful support for unicode in table and column names. And it
> >  looks like very good idea to create table with names on native language for
> >  databases across the world. But when I want to create, for example, table with
> >  name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for
> >  catalog of counteragent contacts) it will be auto-shrinked to
> >  "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem -
> >  many words in Russian are just longer than it's English counterparts and I
> >  have many examples like this.
> > 
> > Although recompiling the source is not so hard, updating is hard. I know that
> >  is not free for disk space because of storing table names and field names but,
> >  from my point of view, in 2021 year convenience is more important than disk space. 
> > 
> > I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future releases. 

My stance here is that you should always use ASCII only for database identifiers,
not only because of this, but also to avoid unpleasant encoding problems if
you want to do something like

 pg_dump -t Catalog_Контрагенты_КонтактнаяИнформация mydb

on a shell with an encoding different from the database encoding.

So I am not too excited about this.

> +1 once that Oracle Database 12.2 and higher, has support for 128 bytes names.
> What possibly, in the future, could impact some migration from Oracle to Postgres.

That seems to be a better argument from my point of view.

I have no idea as to how bad the additional memory impact for the catalog
caches would be...

Yours,
Laurenz Albe




Re: NAMEDATALEN increase because of non-latin languages

From
Ranier Vilela
Date:
Em qua., 18 de ago. de 2021 às 09:33, Laurenz Albe <laurenz.albe@cybertec.at> escreveu:
On Wed, 2021-08-18 at 08:16 -0300, Ranier Vilela wrote:
> Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко <deromanenko@gmail.com> escreveu:
> > Hello dear hackers. I understand the position of the developers community about
> >  NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but only if we
> >  speak about latin languages.
> >
> > Postgresql has wonderful support for unicode in table and column names. And it
> >  looks like very good idea to create table with names on native language for
> >  databases across the world. But when I want to create, for example, table with
> >  name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for
> >  catalog of counteragent contacts) it will be auto-shrinked to
> >  "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem -
> >  many words in Russian are just longer than it's English counterparts and I
> >  have many examples like this.
> >
> > Although recompiling the source is not so hard, updating is hard. I know that
> >  is not free for disk space because of storing table names and field names but,
> >  from my point of view, in 2021 year convenience is more important than disk space.
> >
> > I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future releases.

My stance here is that you should always use ASCII only for database identifiers,
not only because of this, but also to avoid unpleasant encoding problems if
you want to do something like

 pg_dump -t Catalog_Контрагенты_КонтактнаяИнформация mydb

on a shell with an encoding different from the database encoding.

So I am not too excited about this.

> +1 once that Oracle Database 12.2 and higher, has support for 128 bytes names.
> What possibly, in the future, could impact some migration from Oracle to Postgres.

That seems to be a better argument from my point of view.

I have no idea as to how bad the additional memory impact for the catalog
caches would be...
It seems to me that this is a case for macro:
HAS_SUPPORT_NAME_128_BYTES
Деnis Романенко would like and would pay the price for regression in exchange for the convenience.
What impacts him now is the difficulty of maintaining a private tree, with this support.

regards,
Ranier Vilela

Re: NAMEDATALEN increase because of non-latin languages

From
Tom Lane
Date:
John Naylor <john.naylor@enterprisedb.com> writes:
> The main thing I'm worried about is the fact that a name would no longer
> fit in a Datum. The rest I think we can mitigate in some way.

Not sure what you mean by that?  name is a pass-by-ref data type.

Anyway, this whole argument could be rendered moot if we could convert
name to a variable-length type.  That would satisfy *both* sides of
the argument, since those who need long names could have them, while
those who don't would see net reduction instead of growth in space usage.

Of course, this is far far easier said than done; else we would have
done it years ago.  But maybe it's not entirely out of reach.
I do not think it'd be hard to change "name" to have the same on-disk
storage representation as cstring; the hard part is what about its
usage in fixed-width catalog structures.  Maybe we could finesse that
by decreeing that the name column always has to be the last
non-CATALOG_VARLEN field.  (This would require fixing up the couple of
places where we let some other var-width field have that distinction;
but I suspect that would be small in comparison to the other work this
implies.  If there are any catalogs having two name columns, one of them
would become more difficult to reach from C code.)

Another fun thing --- and, I fear, another good argument against just
raising NAMEDATALEN --- is what about TupleDescs, which last I checked
used an array of fixed-width pg_attribute images.  But maybe we could
replace that with an array of pointers.  Andres already did a lot of
the heavy code churn required to hide that data structure behind
TupleDescAttr() macros, so changing the representation should be much
less painful than it would once have been.

I wonder if we'd get complaints from changing the catalog column layouts
that much.  People are used to the name at the front, I think.  OTOH,
I expected a lot of bleating about the OID column becoming frontmost,
but there hasn't been much.

Anyway, I have little desire to work on this myself, but I recommend that
somebody who is more affected by the name length restriction look into it.

            regards, tom lane



Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Wed, Aug 18, 2021 at 10:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Anyway, this whole argument could be rendered moot if we could convert
> name to a variable-length type.  That would satisfy *both* sides of
> the argument, since those who need long names could have them, while
> those who don't would see net reduction instead of growth in space usage.

Yeah it seems like the best way forward.

> Of course, this is far far easier said than done; else we would have
> done it years ago.  But maybe it's not entirely out of reach.
> I do not think it'd be hard to change "name" to have the same on-disk
> storage representation as cstring; the hard part is what about its
> usage in fixed-width catalog structures.  Maybe we could finesse that
> by decreeing that the name column always has to be the last
> non-CATALOG_VARLEN field.  (This would require fixing up the couple of
> places where we let some other var-width field have that distinction;
> but I suspect that would be small in comparison to the other work this
> implies.  If there are any catalogs having two name columns, one of them
> would become more difficult to reach from C code.)

Here is the list on some recent build (as of 17707c059c):

     relname      |            array_agg
------------------+----------------------------------
 pg_collation     | {collname,collctype,collcollate}
 pg_database      | {datctype,datcollate,datname}
 pg_event_trigger | {evtevent,evtname}
 pg_subscription  | {subname,subslotname}
 pg_trigger       | {tgname,tgnewtable,tgoldtable}
(5 rows)

> I wonder if we'd get complaints from changing the catalog column layouts
> that much.  People are used to the name at the front, I think.  OTOH,
> I expected a lot of bleating about the OID column becoming frontmost,
> but there hasn't been much.

I don't think that would be comparable.  Having an extra oid in the
1st column doesn't really make a raw SELECT * harder to read.  But
having the XXXname column way behind, and not even at the end, means
that most people will have to type an extra "xxxname," for each
throwaway query run to quickly verify something.  I know that I often
do that, and while I could live with it I'd rather not have to do it.



Re: NAMEDATALEN increase because of non-latin languages

From
Peter Eisentraut
Date:
On 18.08.21 13:33, Julien Rouhaud wrote:
> Agreed, but I don't have access to such hardware.  However this won't
> influence the memory overhead part, and there is already frequent
> problems with that, especially since declarative partitioning,

On the flip side, with partitioning you need room for longer table 
names, since you need room for the real name plus some partition identifier.



Re: NAMEDATALEN increase because of non-latin languages

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Wed, Aug 18, 2021 at 10:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if we'd get complaints from changing the catalog column layouts
>> that much.  People are used to the name at the front, I think.  OTOH,
>> I expected a lot of bleating about the OID column becoming frontmost,
>> but there hasn't been much.

> I don't think that would be comparable.  Having an extra oid in the
> 1st column doesn't really make a raw SELECT * harder to read.  But
> having the XXXname column way behind, and not even at the end, means
> that most people will have to type an extra "xxxname," for each
> throwaway query run to quickly verify something.  I know that I often
> do that, and while I could live with it I'd rather not have to do it.

Yeah, it would annoy the heck out of me too.  Again there's a potential
technical solution, which is to decouple the user-visible column order
from the storage order.  However, multiple people have tilted at that
windmill without much success, so making it a prerequisite for improving
the name-length situation doesn't seem like a smart plan.

            regards, tom lane



Re: NAMEDATALEN increase because of non-latin languages

From
Andrew Dunstan
Date:
On 8/18/21 10:53 AM, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
>> On Wed, Aug 18, 2021 at 10:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder if we'd get complaints from changing the catalog column layouts
>>> that much.  People are used to the name at the front, I think.  OTOH,
>>> I expected a lot of bleating about the OID column becoming frontmost,
>>> but there hasn't been much.
>> I don't think that would be comparable.  Having an extra oid in the
>> 1st column doesn't really make a raw SELECT * harder to read.  But
>> having the XXXname column way behind, and not even at the end, means
>> that most people will have to type an extra "xxxname," for each
>> throwaway query run to quickly verify something.  I know that I often
>> do that, and while I could live with it I'd rather not have to do it.
> Yeah, it would annoy the heck out of me too.  Again there's a potential
> technical solution, which is to decouple the user-visible column order
> from the storage order.  However, multiple people have tilted at that
> windmill without much success, so making it a prerequisite for improving
> the name-length situation doesn't seem like a smart plan.
>
>             


There might be other benefits, though. IIRC what we discussed years ago
was having for each attribute an immutable number (presumably attnum as
now) and a mutable display order and storage order. Previous patches
didn't implement this and so were rejected. I think part of the trouble
is that we'd have to go through roughly 1700 mentions of attnum in the
source and decide if it's really attnum they want or if it's
attnum_display_order or attnum_storage_order. So this has the potential
to be extraordinarily invasive and potentially bug-prone. And then
there's the world of extensions to consider.

I have a bit less sympathy for the argument that just moving it will
break things that use 'select *' on the catalogs. In general, if you
care about the order of columns you should name the columns you want in
the order you want. I've seen 'select *' break for people on other
changes, like adding or dropping a column. It might cause Postgres
developers a bit of pain, but it should be manageable, so I kind of like
your suggestion.


cheers

andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 8/18/21 10:53 AM, Tom Lane wrote:
>> Yeah, it would annoy the heck out of me too.  Again there's a potential
>> technical solution, which is to decouple the user-visible column order
>> from the storage order.  However, multiple people have tilted at that
>> windmill without much success, so making it a prerequisite for improving
>> the name-length situation doesn't seem like a smart plan.

> There might be other benefits, though. IIRC what we discussed years ago
> was having for each attribute an immutable number (presumably attnum as
> now) and a mutable display order and storage order. Previous patches
> didn't implement this and so were rejected. I think part of the trouble
> is that we'd have to go through roughly 1700 mentions of attnum in the
> source and decide if it's really attnum they want or if it's
> attnum_display_order or attnum_storage_order. So this has the potential
> to be extraordinarily invasive and potentially bug-prone. And then
> there's the world of extensions to consider.

Yeah, exactly: conceptually that's simple, but flushing all the bugs
out would be a years-long nightmare.  It'd make all the fun we had
with missed attisdropped checks look like a walk in the park.  Unless
somebody can figure out a way to mechanically check for mistakes,
I don't think I want to go there.

I wonder though if we could fix the immediate problem with something
less ambitious.  The hard part of the full proposal, I think, is
separating permanent identity from physical position.  If we were to
split out *only* the display order from that, the patch footprint
ought to be far far smaller --- basically, I think, we'd need to fix
star-expansion and not a lot more in the backend.  Of course,
client-side code like psql's \d and pg_dump would need to be upgraded
too, but any missed things there would be cosmetic that's-not-the-
expected-column-order bugs, not core dumps.  Also, at least in the v1
patch we could use this just for system catalogs without exposing it
as a feature for user tables, which would greatly restrict the set of
client-side places that really need fixed.

(I think Alvaro was the last person to mess with this issue, so I
wonder what his take is on the feasibility of such a restricted
solution.)

            regards, tom lane



Re: NAMEDATALEN increase because of non-latin languages

From
Alvaro Herrera
Date:
On 2021-Aug-18, Tom Lane wrote:

> I wonder though if we could fix the immediate problem with something
> less ambitious.  The hard part of the full proposal, I think, is
> separating permanent identity from physical position.  If we were to
> split out *only* the display order from that, the patch footprint
> ought to be far far smaller --- basically, I think, we'd need to fix
> star-expansion and not a lot more in the backend.  Of course,
> client-side code like psql's \d and pg_dump would need to be upgraded
> too, but any missed things there would be cosmetic that's-not-the-
> expected-column-order bugs, not core dumps.  Also, at least in the v1
> patch we could use this just for system catalogs without exposing it
> as a feature for user tables, which would greatly restrict the set of
> client-side places that really need fixed.
> 
> (I think Alvaro was the last person to mess with this issue, so I
> wonder what his take is on the feasibility of such a restricted
> solution.)

Yeah, my impression is that the project of just changing star expansion
is much, much easier than splitting out attribute identity from physical
location.  The former, as I recall, is a localized change in expandRTE
and friends, and you don't have to touch anything else.  The other part
of the change is much more invasive and got me into territory that I
wasn't able to navigate successfully.

(I think we should consider keeping 'attnum' as the display-order
attribute, and have the physical-and-identity attribute get a new name,
say attphysnum.  That's so that you don't have to change psql and the
whole world -- the change is transparent to them.  This means we need a
first step that's very invasive, because every single current use of
'attnum' has to be changed to attphysnum first, followed by a functional
patch that changes a few of those back to attnum.  It'd be a large
inconvenience to backend developers to ease the lives of client-side
developers.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: NAMEDATALEN increase because of non-latin languages

From
Andrew Dunstan
Date:
On 8/18/21 12:39 PM, Alvaro Herrera wrote:
> On 2021-Aug-18, Tom Lane wrote:
>
>> I wonder though if we could fix the immediate problem with something
>> less ambitious.  The hard part of the full proposal, I think, is
>> separating permanent identity from physical position.  If we were to
>> split out *only* the display order from that, the patch footprint
>> ought to be far far smaller --- basically, I think, we'd need to fix
>> star-expansion and not a lot more in the backend.  Of course,
>> client-side code like psql's \d and pg_dump would need to be upgraded
>> too, but any missed things there would be cosmetic that's-not-the-
>> expected-column-order bugs, not core dumps.  Also, at least in the v1
>> patch we could use this just for system catalogs without exposing it
>> as a feature for user tables, which would greatly restrict the set of
>> client-side places that really need fixed.
>>
>> (I think Alvaro was the last person to mess with this issue, so I
>> wonder what his take is on the feasibility of such a restricted
>> solution.)
> Yeah, my impression is that the project of just changing star expansion
> is much, much easier than splitting out attribute identity from physical
> location.  The former, as I recall, is a localized change in expandRTE
> and friends, and you don't have to touch anything else.  The other part
> of the change is much more invasive and got me into territory that I
> wasn't able to navigate successfully.
>
> (I think we should consider keeping 'attnum' as the display-order
> attribute, and have the physical-and-identity attribute get a new name,
> say attphysnum.  That's so that you don't have to change psql and the
> whole world -- the change is transparent to them.  This means we need a
> first step that's very invasive, because every single current use of
> 'attnum' has to be changed to attphysnum first, followed by a functional
> patch that changes a few of those back to attnum.  It'd be a large
> inconvenience to backend developers to ease the lives of client-side
> developers.)


Can we call it attid just in case we decide some day to break the nexus
on the other side?


I like the idea of keeping it to the catalog to start with.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Thu, Aug 19, 2021 at 12:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Yeah, exactly: conceptually that's simple, but flushing all the bugs
> out would be a years-long nightmare.  It'd make all the fun we had
> with missed attisdropped checks look like a walk in the park.  Unless
> somebody can figure out a way to mechanically check for mistakes,
> I don't think I want to go there.

Maybe a silly idea, but we could have some shared_preload_libraries
module with a command_utility_hook that intercept table creation and
randomize the logical column order.  If we also have some GUC (assert
only if needed) to switch star expansion to physical-order rather than
logical-order, we could then use the regression test as a big hammer
to see if anything breaks.  That's clearly not ideal (it would
obviously need some other hacks to avoid other breakage like \d and
other things) and not 100% coverage, but it should give some
confidence in any patch completeness.



Re: NAMEDATALEN increase because of non-latin languages

From
Andres Freund
Date:
Hi,

On 2021-08-18 10:21:03 -0400, Tom Lane wrote:
> Anyway, this whole argument could be rendered moot if we could convert
> name to a variable-length type.  That would satisfy *both* sides of
> the argument, since those who need long names could have them, while
> those who don't would see net reduction instead of growth in space usage.

Yes, I think that's the best direction to go. We're loosing a fair bit
of in-memory efficiency with current NAMEDATALEN already, it'd imo be
beneficial to go for variable length encoding from that angle alone.


> Of course, this is far far easier said than done; else we would have
> done it years ago.  But maybe it's not entirely out of reach.
> I do not think it'd be hard to change "name" to have the same on-disk
> storage representation as cstring; the hard part is what about its
> usage in fixed-width catalog structures.

Indeed. ISTM that the hardest part of that is dealing with copying around
Form* structs?

I wonder if we're getting closer to the time where we should just give up on
the struct / ondisk layout mirroring for catalog tables, and generate explicit
transformation routines via Catalog.pm. If we have to touch places handling
Form_* structs anyway, we could make that more worthwhile by doing away with
CATALOG_VARLEN etc - the transformation routines would just make those fields
pointers which then can be accessed normally.


> Maybe we could finesse that
> by decreeing that the name column always has to be the last
> non-CATALOG_VARLEN field.  (This would require fixing up the couple of
> places where we let some other var-width field have that distinction;
> but I suspect that would be small in comparison to the other work this
> implies.  If there are any catalogs having two name columns, one of them
> would become more difficult to reach from C code.)

I wish we could find a way to make "relative pointers" (i.e. a variable offset
from some base struct) work on the C level in a transparent and portable
manner. Just about any instruction set has them natively anyway, and for a lot
of tasks like variable length members it can be considerably more
efficient... It'd also make DSM using code simpler and faster. Oh well, one
can dream.



> Another fun thing --- and, I fear, another good argument against just
> raising NAMEDATALEN --- is what about TupleDescs, which last I checked
> used an array of fixed-width pg_attribute images.  But maybe we could
> replace that with an array of pointers.  Andres already did a lot of
> the heavy code churn required to hide that data structure behind
> TupleDescAttr() macros, so changing the representation should be much
> less painful than it would once have been.

I was recently wondering if we shouldn't go to a completely bespoke
datastructure for TupleDesc->attrs, rather than reusing FormData_pg_attribute.

Right now every attribute uses nearly two cachelines (112 bytes). Given how
frequent a task tuple [de]forming is, and how often it's a bottleneck,
increasing the cache efficiency of tupledescs would worth quite a bit of
effort - I do see tupledesc attr cache misses in profiles. A secondary benefit
would be that we do create a lot of short-lived descs in the executor,
slimming those down obviously would be good on its own. A third benefit would
be that we could get rid of attcacheoff in pg_attribute, that always smelled
funny to me.

One possible way to structure such future tupledescs would be to have multiple
arrays in struct TupleDescData. With an array of just the data necessary for
[de]forming at the place ->attrs is, and other stuff in one or more separate
arrays. The other option could perhaps be omitted for some tupledescs or
computed lazily.

For deforming we just need attlen (2byte), attbyval (1 byte), attalign (1byte)
and optionally attcacheoff (4 byte), for forming we also need attstorage (1
byte). Naively that ends up being 12 bytes - 5 attrs / cacheline is a heck of
a lot better than ~0.5.

Greetings,

Andres Freund



Re: NAMEDATALEN increase because of non-latin languages

From
Matthias van de Meent
Date:
On Thu, 19 Aug 2021 at 13:44, Andres Freund <andres@anarazel.de> wrote:
>
> > Another fun thing --- and, I fear, another good argument against just
> > raising NAMEDATALEN --- is what about TupleDescs, which last I checked
> > used an array of fixed-width pg_attribute images.  But maybe we could
> > replace that with an array of pointers.  Andres already did a lot of
> > the heavy code churn required to hide that data structure behind
> > TupleDescAttr() macros, so changing the representation should be much
> > less painful than it would once have been.
>
> I was recently wondering if we shouldn't go to a completely bespoke
> datastructure for TupleDesc->attrs, rather than reusing FormData_pg_attribute.
>
> Right now every attribute uses nearly two cachelines (112 bytes). Given how
> frequent a task tuple [de]forming is, and how often it's a bottleneck,
> increasing the cache efficiency of tupledescs would worth quite a bit of
> effort - I do see tupledesc attr cache misses in profiles. A secondary benefit
> would be that we do create a lot of short-lived descs in the executor,
> slimming those down obviously would be good on its own. A third benefit would
> be that we could get rid of attcacheoff in pg_attribute, that always smelled
> funny to me.
>
> One possible way to structure such future tupledescs would be to have multiple
> arrays in struct TupleDescData. With an array of just the data necessary for
> [de]forming at the place ->attrs is, and other stuff in one or more separate
> arrays. The other option could perhaps be omitted for some tupledescs or
> computed lazily.
>
> For deforming we just need attlen (2byte), attbyval (1 byte), attalign (1byte)
> and optionally attcacheoff (4 byte), for forming we also need attstorage (1
> byte). Naively that ends up being 12 bytes - 5 attrs / cacheline is a heck of
> a lot better than ~0.5.

I tried to implement this 'compact attribute access descriptor' a few
months ago in my effort to improve btree index performance.

I abandoned the idea at the time as I didn't find any measurable
difference for the (limited!) tests I ran, where the workload was
mainly re-indexing, select * into, and similar items while
benchmarking reindexing in the 'pp-complete' dataset. But, seeing that
there might be interest outside this effort on a basis seperate from
just plain performance, I'll share the results.

Attached is the latest version of my patch that I could find; it might
be incorrect or fail, as this is something I sent to myself between 2
of my systems during development of the patch. Also, attached as .txt,
as I don't want any CFBot coverage on this (this is not proposed for
inclusion, it is just a show of work, and might be basis for future
work).

The patch allocates an array of 'TupleAttrAlignData'-structs at the
end of the attrs-array, fills it with the correct data upon
TupleDesc-creation, and uses this TupleAttrAlign-data for constructing
and destructing tuples.

One main difference from what you described was that I used a union
for storing attbyval and attstorage, as the latter is only applicable
to attlen < 0, and the first only for attlen >= 0. This keeps the
whole structure in 8 bytes, whilst also being useable in both tuple
forming and deforming.

I hope this can is useful, otherwise sorry for the noise.


Kind regards,

Matthias van de Meent

Attachment

Re: NAMEDATALEN increase because of non-latin languages

From
Andres Freund
Date:
Hi,

On 2021-08-19 14:47:42 +0200, Matthias van de Meent wrote:
> I tried to implement this 'compact attribute access descriptor' a few
> months ago in my effort to improve btree index performance.

cool


> The patch allocates an array of 'TupleAttrAlignData'-structs at the
> end of the attrs-array, fills it with the correct data upon
> TupleDesc-creation, and uses this TupleAttrAlign-data for constructing
> and destructing tuples.

> One main difference from what you described was that I used a union
> for storing attbyval and attstorage, as the latter is only applicable
> to attlen < 0, and the first only for attlen >= 0. This keeps the
> whole structure in 8 bytes, whilst also being useable in both tuple
> forming and deforming.

That's why I just talked about the naive way - it's clearly possible to
do better... ;)


> I hope this can is useful, otherwise sorry for the noise.

It is!


I haven't looked at your patch in detail, but I suspect that one reason
that you didn't see performance benefits is that you added overhead as
well. The computation of the "compact" memory location now will need a
few more instructions than before, and I suspect the compiler may not
even be able to optimize out some of the redundant accesses in loops.

It'd be interesting to see what you'd get if you stored the compact
array as the flexible-array and stored a pointer to the "full" attrs
array (while still keeping it allocated together).


Another reason is that it looks like you didn't touch
slot_deform_heap_tuple(), which is I think the hottest of the deforming
routines...

Greetings,

Andres Freund



Re: NAMEDATALEN increase because of non-latin languages

From
Matthias van de Meent
Date:
On Thu, 19 Aug 2021 at 14:58, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-19 14:47:42 +0200, Matthias van de Meent wrote:
> > I tried to implement this 'compact attribute access descriptor' a few
> > months ago in my effort to improve btree index performance.
>
> cool
>
>
> > The patch allocates an array of 'TupleAttrAlignData'-structs at the
> > end of the attrs-array, fills it with the correct data upon
> > TupleDesc-creation, and uses this TupleAttrAlign-data for constructing
> > and destructing tuples.
>
> > One main difference from what you described was that I used a union
> > for storing attbyval and attstorage, as the latter is only applicable
> > to attlen < 0, and the first only for attlen >= 0. This keeps the
> > whole structure in 8 bytes, whilst also being useable in both tuple
> > forming and deforming.
>
> That's why I just talked about the naive way - it's clearly possible to
> do better... ;)
>
>
> > I hope this can is useful, otherwise sorry for the noise.
>
> It is!

Great!

> I haven't looked at your patch in detail, but I suspect that one reason
> that you didn't see performance benefits is that you added overhead as
> well. The computation of the "compact" memory location now will need a
> few more instructions than before, and I suspect the compiler may not
> even be able to optimize out some of the redundant accesses in loops.
>
> It'd be interesting to see what you'd get if you stored the compact
> array as the flexible-array and stored a pointer to the "full" attrs
> array (while still keeping it allocated together).

Yes, I remember testing swapping the order of the compact array with the FormData_pg_attribute array as well, with no clear results.

I think this can partially be attributed to the split access methods of the data in the attribute descriptor: some of it is 'give me the name', some of it is 'does this attribute exist, what type description does it have?' (atttypid, attnum, atttypmod, , and others are only interested in the physical representation information. Prioritizing some over the other might work, but I think to make full use of that it'd need a lot of work.

> Another reason is that it looks like you didn't touch
> slot_deform_heap_tuple(), which is I think the hottest of the deforming
> routines...

That might be for normal operations, but I'm not certain that code is in the hot path for (btree) indexing workloads, due to the relatively high number of operations on each tuple whilst sorting, or finding an insertion point or scan start point.

Anyway, after some digging I found the final state of this patch before I stopped working on it, and after polishing it up a bit with your suggestions it now passes check-world on the latest head (8d2d6ec7).


Kind regards,

Matthias van de Meent
Attachment

Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:
Hi,

I wanted to revive this thread to summarize what was discussed and get
a sense of next steps we could take.

The idea that gained the most traction is to make identifiers
variable-length in the catalogs, which has the added benefit of
reducing memory in syscaches in the common case. That presented two
challenges:

1. That would require putting the name physically closer to the end of
the column list. To make this less annoying for users, we'd need to
separate physical order from display order (at least/at first only for
system catalogs). This would require:

- changing star expansion in SELECTs (expandRTE etc)
- adjusting pg_dump, \d, etc

That much seems clear and agreed upon.

2. We'd need to change how TupleDescs are stored and accessed.

For this point, Matthias shared a prototype patch for a 'compact
attribute access descriptor' and Andres wondered if we should "give up
on the struct / ondisk layout mirroring for catalog tables, and
generate explicit transformation routines via Catalog.pm".

After this discussion dropped off, and it's not quite clear to me what
the first logical step is to make progress on the thread subject, and
what's nice to have for other reasons. Is Matthias's patch or
something like it a good next step?

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
Hi,

On Fri, Jun 03, 2022 at 01:28:16PM +0700, John Naylor wrote:
> 
> I wanted to revive this thread to summarize what was discussed and get
> a sense of next steps we could take.

Thanks for reviving this topic!

> The idea that gained the most traction is to make identifiers
> variable-length in the catalogs, which has the added benefit of
> reducing memory in syscaches in the common case. That presented two
> challenges:
> 
> 1. That would require putting the name physically closer to the end of
> the column list. To make this less annoying for users, we'd need to
> separate physical order from display order (at least/at first only for
> system catalogs). This would require:
> 
> - changing star expansion in SELECTs (expandRTE etc)
> - adjusting pg_dump, \d, etc

I tried to work on a physical / logical attribute position patch (more at the
end of the mail).  I based it on Álvaro's feedback at [1], which is:

- rename current pg_attribute.attnum to attphysnum
- add a new pg_attribute.attnum for the now logical position
- tweak a few places to use attnum rather than attphysnum (which however isn't
  what I did)

The idea, IIUC, is to make a physical position reordering entirely transparent
to clients.

However, once you change the star expansion, things start to subtly break in a
lot of places.  I try to dig in a bit to see if things could be made to work
the same but it opened a whole can of worms.  So before spending too much time
on something that could be a non-starter I'd like to get some feedback on the
general approach.

While some problem wouldn't happen if we restricted the feature to system
catalogs only (e.g. with renamed / dropped attributes, inheritance...), a lot
would still exist and would have to be dealt with initially.  However I'm not
sure what behavior would be wanted or acceptable, especially if we want
something that can eventually be used for user relations too, with possibly
dynamic positions.

I'll describe some of those problems, and just to make things easier I will use
a normal table "ab" with 2 attributes, a and b, with their physical / logical
position reversed.

Obviously

SELECT * FROM ab

should return a and b in that order.  The aliases should also probably match
the logical position, as in:

SELECT * FROM ab ab(aa, bb)

should probably map aa to a and bb to b.

But should COPY FROM / COPY TO / INSERT INTO use the logical position too if
not column list is provided?  I'd think so, but it goes further from the
original "only handle star expansion" idea.

And should record_in / record_out use the logical position, as in:
SELECT ab::text FROM ab / SELECT (a, b)::ab;

I would think not, as relying on a possibly dynamic order could break things if
you store the results somewhere, but YMMV.

Note that there a variations of that, like
SELECT ab::ab FROM (SELECT * FROM ab) ab;

But those should probably work as intended (for now it doesn't) as you can't
store a bare record, and storing a plain "ab" type wouldn't be problematic with
dynamic changes.


Then, what about joinrels expansion?  I learned that the column ordering rules
are far from being obvious, and I didn't find those in the documentation (note
that I don't know if that something actually described in the SQL standard).
So for instance, if a join is using an explicit USING clause rather than an ON
clause, the merged columns are expanded first, so:

SELECT * FROM ab ab1 JOIN ab ab2 USING (b)

should unexpectedly expand to (b, a, a).  Is this order a strict requirement?

For now I sort the expanded columns by (varno, attnum) but that's clearly a
noticeable change.

Another problem (that probably wouldn't be a problem for system catalogs) is
that defaults are evaluated in the physical position.  This example from the
regression test will clearly have a different behavior if the columns are in a
different physical order:

 CREATE TABLE INSERT_TBL (
     x INT DEFAULT nextval('insert_seq'),
     y TEXT DEFAULT '-NULL-',
     z INT DEFAULT -1 * currval('insert_seq'),
     CONSTRAINT INSERT_TBL_CON CHECK (x >= 3 AND y <> 'check failed' AND x < 8),
     CHECK (x + z = 0));

But changing the behavior to rely on the logical position seems quite
dangerous.


I'm also attaching some few patches as a basis for an implementation
discussion.  Those are just POC level prototypes with the bare minimum changes
to try to have a behavior based on the logical position, and absolutely no
effort to optimize anything.  I'm also not familiar with all the impacted
places, so I'm definitely not claiming that this is a good approach.  Any
suggestion on a better approach is welcome.

0001 is the s/attphysnum/attnum/ suggested by Álvaro.  Here again it's the bare
minimum to make the code compile, a lot of the code still reference attnum in
some ways (attnameAttnum(), variables like attnums and so on) that would need
to be cleaned up.

0002 is the addition of the new pg_attribute.attnum with some changes trying to
make things work, but no infrastructure to change it anywhere.

The relevant changes are there.  In expandNSItemAttrs and some other places,
the difference is that the used TupleDesc is first copied and sorted by logical
position, and then the attphysnum attributes are used rather than a loop
counter, or the targetlist are sorted and then the resno" are computed .  Other
places (only the relation columns aliases I think for now) also builds an array
for the physical / logical mappings,.

I also changed psql to display the column in logical position, and emit an
extra line with the physical position in the verbose mode, but that's a clearly
poor design which would need a lot more thoughts.  No changes in pg_dump or any
other tool.

0003 adds some infrastructure to specify the logical position for system
catalogs.  It also moves relname and attname near the end of the non-varlena
part of pg_class / pg_attribute (just before the end so no change is needed in
the *_FIXED_PART_SIZE macros)

I also added some patches I used for testing:

0004 automatically reverses relation attributes order and preserves the
original logical position, so you can run the whole regression tests and see
what breaks.

0005 adds an ORDER (column_name, [...]) clause in the simplest CREATE TABLE
command form to declare the logical order and manually test things, something
like:

CREATE TABLE ab (b int, a int) ORDER (a, b);

[1] https://www.postgresql.org/message-id/202108181639.xjuovrpwgkr2@alvherre.pgsql

Attachment

Re: NAMEDATALEN increase because of non-latin languages

From
Andres Freund
Date:
Hi,

On 2022-06-03 13:28:16 +0700, John Naylor wrote:
> 1. That would require putting the name physically closer to the end of
> the column list. To make this less annoying for users, we'd need to
> separate physical order from display order (at least/at first only for
> system catalogs).

FWIW, it's not at all clear to me that this would be required. If I were to
tackle this I'd look at breaking up the mirroring of C structs to catalog
struct layout, by having genbki (or such) generate functions to map to/from
tuple layout. It'll be an annoying amount of changes, but feasible, I think.


> This would require:
> 
> - changing star expansion in SELECTs (expandRTE etc)
> - adjusting pg_dump, \d, etc
> 
> That much seems clear and agreed upon.

FWIW, I don't agree that this is a reasonable way to tackle changing
NAMEDATALEN. It'd be nice to have, but it to me it seems a pretty small
fraction of the problem of making Names variable length. You'll still have all
the problems of name fields being accessed all over, but now not being
included in the part of the struct visible to C etc.

Greetings,

Andres Freund



Re: NAMEDATALEN increase because of non-latin languages

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 6:13 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> While some problem wouldn't happen if we restricted the feature to system
> catalogs only (e.g. with renamed / dropped attributes, inheritance...), a lot
> would still exist and would have to be dealt with initially.  However I'm not
> sure what behavior would be wanted or acceptable, especially if we want
> something that can eventually be used for user relations too, with possibly
> dynamic positions.

I am not very convinced that this would make the project a whole lot
easier, but it does seem like the result would be less nice.

> I'll describe some of those problems, and just to make things easier I will use
> a normal table "ab" with 2 attributes, a and b, with their physical / logical
> position reversed.
>
> Obviously
>
> SELECT * FROM ab
>
> should return a and b in that order.  The aliases should also probably match
> the logical position, as in:
>
> SELECT * FROM ab ab(aa, bb)
>
> should probably map aa to a and bb to b.

Check.

> But should COPY FROM / COPY TO / INSERT INTO use the logical position too if
> not column list is provided?  I'd think so, but it goes further from the
> original "only handle star expansion" idea.

I think yes.

> And should record_in / record_out use the logical position, as in:
> SELECT ab::text FROM ab / SELECT (a, b)::ab;
>
> I would think not, as relying on a possibly dynamic order could break things if
> you store the results somewhere, but YMMV.

I think here the answer is yes again. I mean, consider that you could
also ALTER TABLE DROP COLUMN and then ALTER TABLE ADD COLUMN with the
same name. That is surely going to affect the meaning of such things.
I don't think we want to have one meaning if you reorder things that
way and a different meaning if you reorder things using whatever
commands we create for changing the display column positions.

> Note that there a variations of that, like
> SELECT ab::ab FROM (SELECT * FROM ab) ab;
>
> But those should probably work as intended (for now it doesn't) as you can't
> store a bare record, and storing a plain "ab" type wouldn't be problematic with
> dynamic changes.

If the sub-SELECT and the cast both use the display order, which I
think they should, then there's no problem here, I believe.

> Then, what about joinrels expansion?  I learned that the column ordering rules
> are far from being obvious, and I didn't find those in the documentation (note
> that I don't know if that something actually described in the SQL standard).
> So for instance, if a join is using an explicit USING clause rather than an ON
> clause, the merged columns are expanded first, so:
>
> SELECT * FROM ab ab1 JOIN ab ab2 USING (b)
>
> should unexpectedly expand to (b, a, a).  Is this order a strict requirement?

I dunno, but I can't see why it creates a problem for this patch to
maintain the current behavior. I mean, just use the logical column
position instead of the physical one here and forget about the details
of how it works beyond that.

> Another problem (that probably wouldn't be a problem for system catalogs) is
> that defaults are evaluated in the physical position.  This example from the
> regression test will clearly have a different behavior if the columns are in a
> different physical order:
>
>  CREATE TABLE INSERT_TBL (
>         x INT DEFAULT nextval('insert_seq'),
>         y TEXT DEFAULT '-NULL-',
>         z INT DEFAULT -1 * currval('insert_seq'),
>         CONSTRAINT INSERT_TBL_CON CHECK (x >= 3 AND y <> 'check failed' AND x < 8),
>         CHECK (x + z = 0));
>
> But changing the behavior to rely on the logical position seems quite
> dangerous.

Why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 10:17 AM Andres Freund <andres@anarazel.de> wrote:
> > This would require:
> >
> > - changing star expansion in SELECTs (expandRTE etc)
> > - adjusting pg_dump, \d, etc
> >
> > That much seems clear and agreed upon.
>
> FWIW, I don't agree that this is a reasonable way to tackle changing
> NAMEDATALEN. It'd be nice to have, but it to me it seems a pretty small
> fraction of the problem of making Names variable length. You'll still have all
> the problems of name fields being accessed all over, but now not being
> included in the part of the struct visible to C etc.

Yeah, I agree. I think that being able to reorder columns logically
without changing anything physically would be cool, but I think we
could find some way of making the catalog columns variable-length
without introducing that feature.

I'm not sure whether your idea of creating translator functions is a
good one or not, but it doesn't seem too crazy. It'd be like a special
purpose tuple deformer.

deform_pg_class_tuple(&pg_class_struct, pg_class_tuple);

The code in there could be automatically generated statements that
maybe look like this:

memcpy(&struct.relnamespace, tuple + Aoffset_pg_class_relnamespace,
sizeof(Oid));

Once you hit the first variable-length field you'd need something more
clever, but because the code would be specialized for each catalog, it
would probably be quite fast anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 23, 2022 at 10:17 AM Andres Freund <andres@anarazel.de> wrote:
>> FWIW, I don't agree that this is a reasonable way to tackle changing
>> NAMEDATALEN. It'd be nice to have, but it to me it seems a pretty small
>> fraction of the problem of making Names variable length. You'll still have all
>> the problems of name fields being accessed all over, but now not being
>> included in the part of the struct visible to C etc.

> Yeah, I agree. I think that being able to reorder columns logically
> without changing anything physically would be cool, but I think we
> could find some way of making the catalog columns variable-length
> without introducing that feature.

Agreed.  Bearing in mind that multiple smart people have tackled that idea
and gotten nowhere, I think setting it as a prerequisite for this project
is a good way of ensuring this won't happen.

> I'm not sure whether your idea of creating translator functions is a
> good one or not, but it doesn't seem too crazy. It'd be like a special
> purpose tuple deformer.

Sounds worth investigating, anyway.  It'd also get us out from under
C-struct-related problems such as the nearby AIX alignment issue.

The extra cost of the deforming step could also be repaid, in some
cases, by not having to use SysCacheGetAttr etc later on to fetch
variable-length fields.  That is, I'm imagining that the deformer
would extract all the fields, even varlena ones, and drop pointers
or whatever into fields of the C struct.

            regards, tom lane



Re: NAMEDATALEN increase because of non-latin languages

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 2:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sounds worth investigating, anyway.  It'd also get us out from under
> C-struct-related problems such as the nearby AIX alignment issue.

Yeah.

> The extra cost of the deforming step could also be repaid, in some
> cases, by not having to use SysCacheGetAttr etc later on to fetch
> variable-length fields.  That is, I'm imagining that the deformer
> would extract all the fields, even varlena ones, and drop pointers
> or whatever into fields of the C struct.

Yeah, if we were going to do something like this, I can't see why we
wouldn't do it this way. It wouldn't make sense to do it for only some
of the attributes.

I'm not sure exactly where we would put this translation step, though.
I think for the syscaches and relcache we'd want to translate when
populating the cache so that when you do a cache lookup you get the
data already translated. It's hard to be sure without testing, but
that seems like it would make this cheap enough that we wouldn't have
to be too worried, since the number of times we build new cache
entries should be small compared to the number of times we access
existing ones. The trickier thing might be code that uses
systable_beginscan() et. al. directly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Andres Freund
Date:
Hi,

On 2022-06-23 14:42:17 -0400, Robert Haas wrote:
> On Thu, Jun 23, 2022 at 2:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The extra cost of the deforming step could also be repaid, in some
> > cases, by not having to use SysCacheGetAttr etc later on to fetch
> > variable-length fields.  That is, I'm imagining that the deformer
> > would extract all the fields, even varlena ones, and drop pointers
> > or whatever into fields of the C struct.

I was also thinking we'd translate all attributes. Not entirely sure whether
we'd want to use "plain" pointers though - there are some places where we rely
on being able to copy such structs around. That'd be a bit easier with
relative pointers, pointing to the end of the struct. But likely the
notational overhead of dealing with relative pointers would be far higher than
the notational overhead of having to invoke a generated "tuple struct" copy
function. Which we'd likely need anyway, because some previously statically
sized allocations would end up being variable sized?


> Yeah, if we were going to do something like this, I can't see why we
> wouldn't do it this way. It wouldn't make sense to do it for only some
> of the attributes.

Agreed.


> I'm not sure exactly where we would put this translation step, though.
> I think for the syscaches and relcache we'd want to translate when
> populating the cache so that when you do a cache lookup you get the
> data already translated. It's hard to be sure without testing, but
> that seems like it would make this cheap enough that we wouldn't have
> to be too worried, since the number of times we build new cache
> entries should be small compared to the number of times we access
> existing ones. The trickier thing might be code that uses
> systable_beginscan() et. al. directly.

I was thinking we'd basically do it wherever we do a GETSTRUCT() today.

A first step could be to transform code like
    (Form_pg_attribute) GETSTRUCT(tuple)
into
   GETSTRUCT(pg_attribute, tuple)

then, in a subsequent step, we'd redefine GETSTRUCT as something
#define GESTRUCT(catalog, tuple) tuple_to_struct_##catalog(tuple)

Greetings,

Andres Freund



Re: NAMEDATALEN increase because of non-latin languages

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:
> I was thinking we'd basically do it wherever we do a GETSTRUCT() today.
>
> A first step could be to transform code like
>     (Form_pg_attribute) GETSTRUCT(tuple)
> into
>    GETSTRUCT(pg_attribute, tuple)
>
> then, in a subsequent step, we'd redefine GETSTRUCT as something
> #define GESTRUCT(catalog, tuple) tuple_to_struct_##catalog(tuple)

That seems a little fraught, because you'd be turning what's now
basically a trivial operation into a non-trivial operation involving
memory allocation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 23, 2022 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:
>> I was thinking we'd basically do it wherever we do a GETSTRUCT() today.

> That seems a little fraught, because you'd be turning what's now
> basically a trivial operation into a non-trivial operation involving
> memory allocation.

Nonetheless, the presence of GETSTRUCT calls should be a good guide
to where we need to do something.

            regards, tom lane



Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:
On Thu, Jun 23, 2022 at 9:17 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-06-03 13:28:16 +0700, John Naylor wrote:
> > 1. That would require putting the name physically closer to the end of
> > the column list. To make this less annoying for users, we'd need to
> > separate physical order from display order (at least/at first only for
> > system catalogs).
>
> FWIW, it's not at all clear to me that this would be required. If I were to
> tackle this I'd look at breaking up the mirroring of C structs to catalog
> struct layout, by having genbki (or such) generate functions to map to/from
> tuple layout. It'll be an annoying amount of changes, but feasible, I think.

Hmm, I must have misunderstood this aspect. In my mind I was thinking
that if a varlen attribute were at the end, these functions would make
it easier to access them quickly. But from this and the follow-on
responses, these would be used to access varlen attributes wherever
they may be. I'll take a look at the current uses of GETSTRUCT().

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 6:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Nonetheless, the presence of GETSTRUCT calls should be a good guide
> to where we need to do something.

Indubitably.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 11:11 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Hmm, I must have misunderstood this aspect. In my mind I was thinking
> that if a varlen attribute were at the end, these functions would make
> it easier to access them quickly. But from this and the follow-on
> responses, these would be used to access varlen attributes wherever
> they may be. I'll take a look at the current uses of GETSTRUCT().

I don't know whether we can or should move all the "name" columns to
the end of the catalog. It would be user-visible and probably not
user-desirable, but it would save something in terms of tuple
deforming cost. I'm just not sure how much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: NAMEDATALEN increase because of non-latin languages

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I don't know whether we can or should move all the "name" columns to
> the end of the catalog. It would be user-visible and probably not
> user-desirable,

I'm a strong -1 on changing that if we're not absolutely forced to.

> but it would save something in terms of tuple
> deforming cost. I'm just not sure how much.

I'd guess nothing, if we are deforming all the fields.

            regards, tom lane



Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Thu, Jun 23, 2022 at 10:19:44AM -0400, Robert Haas wrote:
> On Thu, Jun 23, 2022 at 6:13 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> > And should record_in / record_out use the logical position, as in:
> > SELECT ab::text FROM ab / SELECT (a, b)::ab;
> >
> > I would think not, as relying on a possibly dynamic order could break things if
> > you store the results somewhere, but YMMV.
>
> I think here the answer is yes again. I mean, consider that you could
> also ALTER TABLE DROP COLUMN and then ALTER TABLE ADD COLUMN with the
> same name. That is surely going to affect the meaning of such things.
> I don't think we want to have one meaning if you reorder things that
> way and a different meaning if you reorder things using whatever
> commands we create for changing the display column positions.

It indeed would, but ALTER TABLE DROP COLUMN is a destructive operation, and
I'm assuming that anyone doing that is aware that it will have an impact on
stored data and such.  I initially thought that changing the display order of
columns shouldn't have the same impact with the stability of otherwise
unchanged record definition, as it would make such reorder much more impacting.
But I agree that having different behaviors seems worse.

> > Then, what about joinrels expansion?  I learned that the column ordering rules
> > are far from being obvious, and I didn't find those in the documentation (note
> > that I don't know if that something actually described in the SQL standard).
> > So for instance, if a join is using an explicit USING clause rather than an ON
> > clause, the merged columns are expanded first, so:
> >
> > SELECT * FROM ab ab1 JOIN ab ab2 USING (b)
> >
> > should unexpectedly expand to (b, a, a).  Is this order a strict requirement?
>
> I dunno, but I can't see why it creates a problem for this patch to
> maintain the current behavior. I mean, just use the logical column
> position instead of the physical one here and forget about the details
> of how it works beyond that.

I'm not that familiar with this part of the code so I may have missed
something, but I didn't see any place where I could just simply do that.

To be clear, the approach I used is to change the expansion ordering but
otherwise keep the current behavior, to try to minimize the changes.  This is
done by keeping the attribute in the physical ordering pretty much everywhere,
including in the nsitems, and just logically reorder them during the expansion.
In other words all the code still knows that the 1st column is the first
physical column and so on.

So in that query, the ordering is supposed to happen when handling the "SELECT
*", which makes it impossible to retain that order.

I'm assuming that what you meant is to change the ordering when processing the
JOIN and retain the old "SELECT *" behavior, which is to emit items in the
order they're found.  But IIUC the only way to do that would be to change the
order when building the nsitems themselves, and have the code believe that the
attributes are physically stored in the logical order.  That's probably doable,
but that looks like a way more impacting change.  Or did you mean to keep the
approach I used, and just have some special case for "SELECT *" when referring
to a joinrel and instead try to handle the logical expansion in the join?
AFAICS it would require to add some extra info in the parsing structures, as it
doesn't really really store any position, just relies on array offset / list
position and maps things that way.

> > Another problem (that probably wouldn't be a problem for system catalogs) is
> > that defaults are evaluated in the physical position.  This example from the
> > regression test will clearly have a different behavior if the columns are in a
> > different physical order:
> >
> >  CREATE TABLE INSERT_TBL (
> >         x INT DEFAULT nextval('insert_seq'),
> >         y TEXT DEFAULT '-NULL-',
> >         z INT DEFAULT -1 * currval('insert_seq'),
> >         CONSTRAINT INSERT_TBL_CON CHECK (x >= 3 AND y <> 'check failed' AND x < 8),
> >         CHECK (x + z = 0));
> >
> > But changing the behavior to rely on the logical position seems quite
> > dangerous.
>
> Why?

It feels to me like a POLA violation, and probably people wouldn't expect it to
behave this way (even if this is clearly some corner case problem).  Even if
you argue that this is not simply a default display order but something more
like real column order, the physical position being some implementation detail,
it still doesn't really feels right.

The main reason for having the possibility to change the logical position is to
have "better looking", easier to work with, relations even if you have some
requirements with the real physical order like trying to optimize things as
much as possible (reordering columns to avoid padding space, put non-nullable
columns first...).  The order in which defaults are evaluated looks like the
same kind of requirements.  How useful would it be if you could chose a logical
order, but not being able to chose the one you actually want because it would
break your default values?

Anyway, per the nearby discussions I don't see much interest, especially not in
the context of varlena identifiers (I should have started a different thread,
sorry about that), so I don't think it's worth investing more efforts into it.



Re: NAMEDATALEN increase because of non-latin languages

From
Andres Freund
Date:
Hi,

On 2022-06-26 10:48:24 +0800, Julien Rouhaud wrote:
> Anyway, per the nearby discussions I don't see much interest, especially not in
> the context of varlena identifiers (I should have started a different thread,
> sorry about that), so I don't think it's worth investing more efforts into
> it.

FWIW, I still think it's a worthwhile feature - just, to me, unrelated to
varlena identifiers. I've seen tremendous amounts of space wasted in tables
just because of alignment issues...

- Andres



Re: NAMEDATALEN increase because of non-latin languages

From
Julien Rouhaud
Date:
On Sat, Jun 25, 2022 at 08:00:04PM -0700, Andres Freund wrote:
>
> On 2022-06-26 10:48:24 +0800, Julien Rouhaud wrote:
> > Anyway, per the nearby discussions I don't see much interest, especially not in
> > the context of varlena identifiers (I should have started a different thread,
> > sorry about that), so I don't think it's worth investing more efforts into
> > it.
>
> FWIW, I still think it's a worthwhile feature

Oh, ok, I will start a new thread then.

> - just, to me, unrelated to varlena identifiers.

Yeah I get that and I agree.

> I've seen tremendous amounts of space wasted in tables
> just because of alignment issues...

Same here unfortunately.



Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:
On Fri, Jun 24, 2022 at 4:49 AM Andres Freund <andres@anarazel.de> wrote:
> A first step could be to transform code like
>     (Form_pg_attribute) GETSTRUCT(tuple)
> into
>    GETSTRUCT(pg_attribute, tuple)

To get the ball rolling, I've done this much in 0001. This is an easy
mechanical change, although we'll next want a third argument to pass
the Form_* pointer.

0002 changes this macro to

#define GETSTRUCT(CAT, TUP) Deform_##CAT##_tuple((char *)
((TUP)->t_data) + (TUP)->t_data->t_hoff)

...where genbki.pl generates stub macros in the form of

#define Deform_pg_am_tuple(TUP) (Form_pg_am) (TUP)

...which live in pg_*_deform.h and are #included into each main
catalog header after the Form_* is defined. Feedback on whether a more
tasteful way should be sought would be appreciated.

While not very interesting on its own, it gives an idea of how the
build would work. Next step would be to turn these into functions like

static inline void
Deform_pg_am_tuple(Form_pg_am, char * tuple)
{...}

that just memcpy() things over as already discussed, adjusting the
callers of GETSTRUCT to match. This won't be quite as mechanical and
will require some manual work since some not least because some call
sites mix declaration and assignment. I imagine there will also be a
need for the inverse operation, forming a tuple from a struct.

We will also have to take care to work around cases where the Form_*
pointer is currently used for in-place updates. I imagine there are
not many, but they will require extra care. The pseudo-catalog
pg_sequence_data is an example, and a bit of a special case anyway.

On Fri, Jun 24, 2022 at 1:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sounds worth investigating, anyway.  It'd also get us out from under
> C-struct-related problems such as the nearby AIX alignment issue.

Getting around that issue sounds like a good first goal for committing.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:

On Thu, Jun 23, 2022 at 9:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I'm not sure whether your idea of creating translator functions is a
> good one or not, but it doesn't seem too crazy. It'd be like a special
> purpose tuple deformer.
>
> deform_pg_class_tuple(&pg_class_struct, pg_class_tuple);
>
> The code in there could be automatically generated statements that
> maybe look like this:
>
> memcpy(&struct.relnamespace, tuple + Aoffset_pg_class_relnamespace,
> sizeof(Oid));

I've made a small step in this direction.

0001 is just boilerplate, same as v1
0002 teaches Catalog.pm to export both C attr name and SQL attr name, so we can use "sizeof".
0003 generates static inline functions that work the same as the current GETSTRUCT macro, i.e. just cast to the right pointer and return it.
0004 generates a function for pg_cast that memcpys to a passed struct, for demonstration.

Here's the full deforming function generated for pg_cast. In English, the current offset is the previous offset plus the previous type length, plus any alignment padding suitable for the current type (there is none here, so the alignment aspect is not tested). I'm hoping something like this will be sufficient for what's in the current structs, but of course will need additional work when expanding those to include pointers to varlen attributes. I've not yet inspected the emitted assembly language, but regression tests pass.

static inline void
Deform_pg_cast_tuple(Form_pg_cast pg_cast_struct, char * pg_cast_tuple)
{
#define Aoff_pg_cast_oid 0
memcpy(&pg_cast_struct->oid, pg_cast_tuple + Aoff_pg_cast_oid, sizeof(Oid));

#define Aoff_pg_cast_castsource att_align_nominal(Aoff_pg_cast_oid + sizeof(Oid), 'i')
memcpy(&pg_cast_struct->castsource, pg_cast_tuple + Aoff_pg_cast_castsource, sizeof(Oid));

#define Aoff_pg_cast_casttarget att_align_nominal(Aoff_pg_cast_castsource + sizeof(Oid), 'i')
memcpy(&pg_cast_struct->casttarget, pg_cast_tuple + Aoff_pg_cast_casttarget, sizeof(Oid));

#define Aoff_pg_cast_castfunc att_align_nominal(Aoff_pg_cast_casttarget + sizeof(Oid), 'i')
memcpy(&pg_cast_struct->castfunc, pg_cast_tuple + Aoff_pg_cast_castfunc, sizeof(Oid));

#define Aoff_pg_cast_castcontext att_align_nominal(Aoff_pg_cast_castfunc + sizeof(Oid), 'c')
memcpy(&pg_cast_struct->castcontext, pg_cast_tuple + Aoff_pg_cast_castcontext, sizeof(char));

#define Aoff_pg_cast_castmethod att_align_nominal(Aoff_pg_cast_castcontext + sizeof(char), 'c')
memcpy(&pg_cast_struct->castmethod, pg_cast_tuple + Aoff_pg_cast_castmethod, sizeof(char));
}

Here's an example of the changes done in the call sites. For the rest of the catalogs, perhaps many of them can be done mechanically. I still anticipate a lot of manual work. Note that GETSTRUCT_MEMCPY is just temporary scaffolding to distinguish from just returning a pointer, and is not proposed for commit.

-               Form_pg_cast castForm = GETSTRUCT(pg_cast, tuple);
+               FormData_pg_cast castForm;
                CoercionContext castcontext;
 
+               GETSTRUCT_MEMCPY(pg_cast, &castForm, tuple);
+
                /* convert char value for castcontext to CoercionContext enum */
-               switch (castForm->castcontext)
+               switch (castForm.castcontext)

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: NAMEDATALEN increase because of non-latin languages

From
Andres Freund
Date:
Hi,

On 2022-07-18 09:46:44 +0700, John Naylor wrote:
> I've made a small step in this direction.

Thanks for working on this!


> 0001 is just boilerplate, same as v1

If we were to go for this, I wonder if we should backpatch the cast containing
version of GESTRUCT for less pain backpatching bugfixes. That'd likely require
using a different name for the cast containing one.


> 0002 teaches Catalog.pm to export both C attr name and SQL attr name, so we
> can use "sizeof".
> 0003 generates static inline functions that work the same as the current
> GETSTRUCT macro, i.e. just cast to the right pointer and return it.

It seems likely that inline functions are going to be too large for
this. There's a lot of GESTRUCTs in a lot of files, emitting a copy of the
function every time doesn't seem great.


> current offset is the previous offset plus the previous type length, plus
> any alignment padding suitable for the current type (there is none here, so
> the alignment aspect is not tested). I'm hoping something like this will be
> sufficient for what's in the current structs, but of course will need
> additional work when expanding those to include pointers to varlen
> attributes. I've not yet inspected the emitted assembly language, but
> regression tests pass.

Hm. Wouldn't it make sense to just use the normal tuple deforming routines and
then map the results to the structs?

Greetings,

Andres Freund



Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:

On Mon, Jul 18, 2022 at 9:58 AM Andres Freund <andres@anarazel.de> wrote:

> > 0001 is just boilerplate, same as v1
>
> If we were to go for this, I wonder if we should backpatch the cast containing
> version of GESTRUCT for less pain backpatching bugfixes. That'd likely require
> using a different name for the cast containing one.

The new version in this series was meant to be temporary scaffolding, but in the back of my mind I wondered if we should go ahead and keep the simple cast for catalogs that have no varlenas or alignment issues. It sounds like you'd be in favor of that. 

> > 0003 generates static inline functions that work the same as the current
> > GETSTRUCT macro, i.e. just cast to the right pointer and return it.
>
> It seems likely that inline functions are going to be too large for
> this. There's a lot of GESTRUCTs in a lot of files, emitting a copy of the
> function every time doesn't seem great.

Ok.

> > current offset is the previous offset plus the previous type length, plus
> > any alignment padding suitable for the current type (there is none here, so
> > the alignment aspect is not tested). I'm hoping something like this will be
> > sufficient for what's in the current structs, but of course will need
> > additional work when expanding those to include pointers to varlen
> > attributes. I've not yet inspected the emitted assembly language, but
> > regression tests pass.
>
> Hm. Wouldn't it make sense to just use the normal tuple deforming routines and
> then map the results to the structs?

I wasn't sure if they'd be suitable for this, but if they are, that'd make this easier and more maintainable. I'll look into it.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:

I wrote:

> On Mon, Jul 18, 2022 at 9:58 AM Andres Freund <andres@anarazel.de> wrote:
> > Hm. Wouldn't it make sense to just use the normal tuple deforming routines and
> > then map the results to the structs?
>
> I wasn't sure if they'd be suitable for this, but if they are, that'd make this easier and more maintainable. I'll look into it.

This would seem to have its own problems: heap_deform_tuple writes to passed arrays of datums and bools. The lower level parts like fetchatt and nocachegetattr return datums, so still need some generated boilerplate. Some of these also assume they can write cached offsets on a passed tuple descriptor.

I'm thinking where the first few attributes are fixed length, not null, and (because of AIX) not double-aligned, we can do a single memcpy on multiple columns at once. That will still be a common pattern after namedata is varlen. Otherwise, use helper functions/macros similar to the above but instead of passing a tuple descriptor, use info we have at compile time. 

--
John Naylor
EDB: http://www.enterprisedb.com

Re: NAMEDATALEN increase because of non-latin languages

From
Andres Freund
Date:
Hi,

On 2022-07-19 14:30:34 +0700, John Naylor wrote:
> I wrote:
> 
> > On Mon, Jul 18, 2022 at 9:58 AM Andres Freund <andres@anarazel.de> wrote:
> > > Hm. Wouldn't it make sense to just use the normal tuple deforming
> routines and
> > > then map the results to the structs?
> >
> > I wasn't sure if they'd be suitable for this, but if they are, that'd
> make this easier and more maintainable. I'll look into it.
> 
> This would seem to have its own problems: heap_deform_tuple writes to
> passed arrays of datums and bools. The lower level parts like fetchatt and
> nocachegetattr return datums, so still need some generated boilerplate.
> Some of these also assume they can write cached offsets on a passed tuple
> descriptor.

Sure. But that'll just be a form of conversion we do all over, rather than
encoding low-level data layout details. Basically
struct->member1 = DatumGetInt32(values[0]);
struct->member2 = DatumGetChar(values[1]);

etc.


> I'm thinking where the first few attributes are fixed length, not null, and
> (because of AIX) not double-aligned, we can do a single memcpy on multiple
> columns at once. That will still be a common pattern after namedata is
> varlen. Otherwise, use helper functions/macros similar to the above but
> instead of passing a tuple descriptor, use info we have at compile time.

I think that might be over-optimizing things. I don't think we do these
conversions at a rate that's high enough to warrant it - the common stuff
should be in relcache etc.  It's possible that we might want to optimize the
catcache case specifically - but that'd be more optimizing memory usage than
"conversion" imo.


Greetings,

Andres Freund



Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:
On Tue, Jul 19, 2022 at 10:57 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-07-19 14:30:34 +0700, John Naylor wrote:
> > I'm thinking where the first few attributes are fixed length, not null, and
> > (because of AIX) not double-aligned, we can do a single memcpy on multiple
> > columns at once. That will still be a common pattern after namedata is
> > varlen. Otherwise, use helper functions/macros similar to the above but
> > instead of passing a tuple descriptor, use info we have at compile time.
>
> I think that might be over-optimizing things. I don't think we do these
> conversions at a rate that's high enough to warrant it - the common stuff
> should be in relcache etc.  It's possible that we might want to optimize the
> catcache case specifically - but that'd be more optimizing memory usage than
> "conversion" imo.

Okay, here is a hackish experiment that applies on top of v2 but also invalidates some of that earlier work. Since there is already a pg_cast.c, I demoed a new function there which looks like this:

void
Deform_pg_cast_tuple(Form_pg_cast pg_cast_struct, HeapTuple pg_cast_tuple, TupleDesc pg_cast_desc)
{
    Datum values[Natts_pg_cast];
    bool isnull[Natts_pg_cast];

    heap_deform_tuple(pg_cast_tuple, pg_cast_desc, values, isnull);

    pg_cast_struct->oid = DatumGetObjectId(values[Anum_pg_cast_oid - 1]);
    pg_cast_struct->castsource = DatumGetObjectId(values[Anum_pg_cast_castsource - 1]);
    pg_cast_struct->casttarget = DatumGetObjectId(values[Anum_pg_cast_casttarget - 1]);
    pg_cast_struct->castfunc = DatumGetObjectId(values[Anum_pg_cast_castfunc - 1]);
    pg_cast_struct->castcontext = DatumGetChar(values[Anum_pg_cast_castcontext - 1]);
    pg_cast_struct->castmethod = DatumGetChar(values[Anum_pg_cast_castmethod - 1]);
}

For the general case we can use pg_*_deform.c or something like that, with extern declarations in the main headers. To get this to work, I had to add a couple pointless table open/close calls to get the tuple descriptor, since currently the whole tuple is stored in the syscache, but that's not good even as a temporary measure. Storing the full struct in the syscache is a good future step, as noted upthread, but to get there without a bunch more churn, maybe the above function can copy the tuple descriptor into a local stack variable from an expanded version of schemapg.h. Once the deformed structs are stored in caches, I imagine most of the times we want to deform are when we have the table open, and we can pass the descriptor as above without additional code.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:
Attached is v3, which is mostly putting Andres's suggestion above (use heap_deform_tuple) into a code generation context. Everything interesting is in 0002, and again only touches pg_cast as an experiment. Some design points which could be debatable:

- I thought individual C files would be a pain for the build process, so I just dumped the new functions into a single file, with the extern declarations in the pg_*.h files.
- I used `#include "catalog/pg_cast.h"` just for Form_pg_cast, but maybe a forward declaration would work.
- To get the struct member assignments, I resorted to a Perl hash to map the attribute types to the DatumGet* macros. That may not be great, but I don't have a better idea at the moment.
- I directly called this function for the table scan rather than hide it behind a new variant of GETSTRUCT as I did before. That seems clearer to me about the intent.

The syscache use of GETSTRUCT still uses a simple cast of the tuple (for pg_cast those calls live in parse_coerce.c, which is unchanged from master in v3). Next step I think is to see about the syscache piece -- teaching a syscache miss to deform the entire tuple into a struct and store it in the syscache.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: NAMEDATALEN increase because of non-latin languages

From
John Naylor
Date:
I wrote:

> The syscache use of GETSTRUCT still uses a simple cast of the tuple (for pg_cast those calls live in parse_coerce.c,
whichis unchanged from master in v3). Next step I think is to see about the syscache piece -- teaching a syscache miss
todeform the entire tuple into a struct and store it in the syscache. 

v4 is a hackish attempt at getting the syscache to deform the tuple
and store the struct. Using only pg_cast again for ease, it seems to
work for that. It's now about as far as I can get without thinking
about byref types.

0001 just adds copies of some syscache / catcache functions for
private use by pg_cast, as scaffolding.
0002 teaches the temporary CatalogCacheCreateEntry_STRUCT() to call
heap_deform_tuple(). This then calls a for-now handwritten function
(similar to the one generated in v3) which palloc's space for the
struct and copies the fields over. Only this function knows the
catalog struct type, so a future design could call the per-cache
function via a pointer in the catcache control array. Since we already
have the isnull/values array, it's also trivial to copy the datums for
the cache keys. WIP: CatCTup->tuple is still declared a tuple, but the
t_data contents are now bogus, so there is a temporary GETSTRUCT_NEW
that knows it's looking directly at the struct.

Getting to varlen attributes: For one, I think it was mentioned above
that we will need a way to perform a deep copy of structs that contain
pointers to varlen fields. I imagine keeping track of the attributes
length will come up for some types and might be tricky. And before
that, the catalog machinery will need some preprocessor tricks to
declare pointers in the structs.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment