Re: NAMEDATALEN increase because of non-latin languages - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: NAMEDATALEN increase because of non-latin languages
Date
Msg-id 20220623101155.3dljtwradu7eik6g@jrouhaud
Whole thread Raw
In response to Re: NAMEDATALEN increase because of non-latin languages  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: NAMEDATALEN increase because of non-latin languages
List pgsql-hackers
Hi,

On Fri, Jun 03, 2022 at 01:28:16PM +0700, John Naylor wrote:
> 
> I wanted to revive this thread to summarize what was discussed and get
> a sense of next steps we could take.

Thanks for reviving this topic!

> The idea that gained the most traction is to make identifiers
> variable-length in the catalogs, which has the added benefit of
> reducing memory in syscaches in the common case. That presented two
> challenges:
> 
> 1. That would require putting the name physically closer to the end of
> the column list. To make this less annoying for users, we'd need to
> separate physical order from display order (at least/at first only for
> system catalogs). This would require:
> 
> - changing star expansion in SELECTs (expandRTE etc)
> - adjusting pg_dump, \d, etc

I tried to work on a physical / logical attribute position patch (more at the
end of the mail).  I based it on Álvaro's feedback at [1], which is:

- rename current pg_attribute.attnum to attphysnum
- add a new pg_attribute.attnum for the now logical position
- tweak a few places to use attnum rather than attphysnum (which however isn't
  what I did)

The idea, IIUC, is to make a physical position reordering entirely transparent
to clients.

However, once you change the star expansion, things start to subtly break in a
lot of places.  I try to dig in a bit to see if things could be made to work
the same but it opened a whole can of worms.  So before spending too much time
on something that could be a non-starter I'd like to get some feedback on the
general approach.

While some problem wouldn't happen if we restricted the feature to system
catalogs only (e.g. with renamed / dropped attributes, inheritance...), a lot
would still exist and would have to be dealt with initially.  However I'm not
sure what behavior would be wanted or acceptable, especially if we want
something that can eventually be used for user relations too, with possibly
dynamic positions.

I'll describe some of those problems, and just to make things easier I will use
a normal table "ab" with 2 attributes, a and b, with their physical / logical
position reversed.

Obviously

SELECT * FROM ab

should return a and b in that order.  The aliases should also probably match
the logical position, as in:

SELECT * FROM ab ab(aa, bb)

should probably map aa to a and bb to b.

But should COPY FROM / COPY TO / INSERT INTO use the logical position too if
not column list is provided?  I'd think so, but it goes further from the
original "only handle star expansion" idea.

And should record_in / record_out use the logical position, as in:
SELECT ab::text FROM ab / SELECT (a, b)::ab;

I would think not, as relying on a possibly dynamic order could break things if
you store the results somewhere, but YMMV.

Note that there a variations of that, like
SELECT ab::ab FROM (SELECT * FROM ab) ab;

But those should probably work as intended (for now it doesn't) as you can't
store a bare record, and storing a plain "ab" type wouldn't be problematic with
dynamic changes.


Then, what about joinrels expansion?  I learned that the column ordering rules
are far from being obvious, and I didn't find those in the documentation (note
that I don't know if that something actually described in the SQL standard).
So for instance, if a join is using an explicit USING clause rather than an ON
clause, the merged columns are expanded first, so:

SELECT * FROM ab ab1 JOIN ab ab2 USING (b)

should unexpectedly expand to (b, a, a).  Is this order a strict requirement?

For now I sort the expanded columns by (varno, attnum) but that's clearly a
noticeable change.

Another problem (that probably wouldn't be a problem for system catalogs) is
that defaults are evaluated in the physical position.  This example from the
regression test will clearly have a different behavior if the columns are in a
different physical order:

 CREATE TABLE INSERT_TBL (
     x INT DEFAULT nextval('insert_seq'),
     y TEXT DEFAULT '-NULL-',
     z INT DEFAULT -1 * currval('insert_seq'),
     CONSTRAINT INSERT_TBL_CON CHECK (x >= 3 AND y <> 'check failed' AND x < 8),
     CHECK (x + z = 0));

But changing the behavior to rely on the logical position seems quite
dangerous.


I'm also attaching some few patches as a basis for an implementation
discussion.  Those are just POC level prototypes with the bare minimum changes
to try to have a behavior based on the logical position, and absolutely no
effort to optimize anything.  I'm also not familiar with all the impacted
places, so I'm definitely not claiming that this is a good approach.  Any
suggestion on a better approach is welcome.

0001 is the s/attphysnum/attnum/ suggested by Álvaro.  Here again it's the bare
minimum to make the code compile, a lot of the code still reference attnum in
some ways (attnameAttnum(), variables like attnums and so on) that would need
to be cleaned up.

0002 is the addition of the new pg_attribute.attnum with some changes trying to
make things work, but no infrastructure to change it anywhere.

The relevant changes are there.  In expandNSItemAttrs and some other places,
the difference is that the used TupleDesc is first copied and sorted by logical
position, and then the attphysnum attributes are used rather than a loop
counter, or the targetlist are sorted and then the resno" are computed .  Other
places (only the relation columns aliases I think for now) also builds an array
for the physical / logical mappings,.

I also changed psql to display the column in logical position, and emit an
extra line with the physical position in the verbose mode, but that's a clearly
poor design which would need a lot more thoughts.  No changes in pg_dump or any
other tool.

0003 adds some infrastructure to specify the logical position for system
catalogs.  It also moves relname and attname near the end of the non-varlena
part of pg_class / pg_attribute (just before the end so no change is needed in
the *_FIXED_PART_SIZE macros)

I also added some patches I used for testing:

0004 automatically reverses relation attributes order and preserves the
original logical position, so you can run the whole regression tests and see
what breaks.

0005 adds an ORDER (column_name, [...]) clause in the simplest CREATE TABLE
command form to declare the logical order and manually test things, something
like:

CREATE TABLE ab (b int, a int) ORDER (a, b);

[1] https://www.postgresql.org/message-id/202108181639.xjuovrpwgkr2@alvherre.pgsql

Attachment

pgsql-hackers by date:

Previous
From: Pavel Borisov
Date:
Subject: Re: Custom tuplesorts for extensions
Next
From: John Naylor
Date:
Subject: Re: some aspects of our qsort might not be ideal