Thread: Re: Improving default column names/aliases of subscript text expressions

Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> SELECT data['a'], data['b'], data['c'] FROM tj;

> Gives the following output:

>  data │ data  │    data
> ──────┼───────┼────────────
>  123  │ "abc" │ [123, 456]

> I'd much rather have it output:

>  a    │ b     │    c
> ──────┼───────┼────────────
>  123  │ "abc" │ [123, 456]

I dunno, this seems to be putting an undue amount of emphasis on
one very specific usage pattern.  Why should it matter whether
the subscripts are string literals or not?  What will happen when
ruleutils decides to dump those expressions with the implicit
cast to text being less implicit?

regression=# create view v1 as
SELECT data['a'] FROM tj;
CREATE VIEW
regression=# \d+ v1
                             View "public.v1"
 Column | Type  | Collation | Nullable | Default | Storage  | Description
--------+-------+-----------+----------+---------+----------+-------------
 data   | jsonb |           |          |         | extended |
View definition:
 SELECT data['a'::text] AS data
   FROM tj;

I'm also wondering how such a rule interacts with string literals
that don't get resolved as text:

regression=# create table t2 (data int[]);
CREATE TABLE
regression=# insert into t2 values(array[1,2,3,4]);
INSERT 0 1
regression=# select data['2'], data[3] from t2;
 data | data
------+------
    2 |    3
(1 row)

> 1. Change the default alias/eref to be more useful for subscripts
> (probably with a GUC to enable/disable this for people depending on
> the old names)

This would be a GUC that changes query semantics, which is a
thing we've learned the hard way is evil.  (It would not be
any less evil in one of the other implementation approaches.)
I think that to do something like this, we'd have to make a
considered decision that the new way is so much better than
the old that it's okay to break some people's queries.
I don't say that bar is unreachable, but it seems high.

One way to lower the bar would be to make the change affect
fewer cases, like maybe only JSON subscripts.  That points
towards your idea of making the subscript implementation
responsible.  However, if memory serves we pick the column
aliases before doing parse analysis, which'd make it hard
to know which data type is involved.  I do not really recall
why it's done that way or whether it'd be practical to change.

            regards, tom lane



Re: Improving default column names/aliases of subscript text expressions

From
Jelte Fennema-Nio
Date:
On Mon, 16 Dec 2024 at 16:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I dunno, this seems to be putting an undue amount of emphasis on
> one very specific usage pattern. Why should it matter whether
> the subscripts are string literals or not?  What will happen when
> ruleutils decides to dump those expressions with the implicit
> cast to text being less implicit?

Thanks for the quick response. To be clear, the attached patch was
just a quick POC, it definitely needs more work if it's the direction
I should pursue. Indeed adding support for explicit casts seems like a
nice thing.

> regression=# create view v1 as
> SELECT data['a'] FROM tj;
> CREATE VIEW
> regression=# \d+ v1
>                              View "public.v1"
>  Column | Type  | Collation | Nullable | Default | Storage  | Description
> --------+-------+-----------+----------+---------+----------+-------------
>  data   | jsonb |           |          |         | extended |
> View definition:
>  SELECT data['a'::text] AS data
>    FROM tj;

Are you sure you ran this with my patch? When I do this it actually
fills in the "better" alias just fine, which makes sense because when
the view query is parsed it doesn't have the cast yet. The cast only
gets added by ruleutils, which also adds the explicit alias.

> create view v1 as SELECT data['a'] FROM tj;
CREATE VIEW
> \d+ v1;
                             View "public.v1"
 Column │ Type  │ Collation │ Nullable │ Default │ Storage  │ Description
────────┼───────┼───────────┼──────────┼─────────┼──────────┼─────────────
 a      │ jsonb │           │          │         │ extended │
View definition:
 SELECT data['a'::text] AS a
   FROM tj;

> I'm also wondering how such a rule interacts with string literals
> that don't get resolved as text:
>
> regression=# create table t2 (data int[]);
> CREATE TABLE
> regression=# insert into t2 values(array[1,2,3,4]);
> INSERT 0 1
> regression=# select data['2'], data[3] from t2;
>  data | data
> ------+------
>     2 |    3
> (1 row)

So what would you want here? Do you want these columns to be called 2
and 3? That shouldn't be too difficult to implement, but that seemed
like it would possibly go a bit too far.

> This would be a GUC that changes query semantics, which is a
> thing we've learned the hard way is evil.  (It would not be
> any less evil in one of the other implementation approaches.)

Hmm, okay. I didn't know that.

> I think that to do something like this, we'd have to make a
> considered decision that the new way is so much better than
> the old that it's okay to break some people's queries.
> I don't say that bar is unreachable, but it seems high.

Totally fair, that's part of the reason I limited this to only string
literals, to effectively only impact JSON (which only had subscript
support since PG14) and not arrays.

One thing that I didn't see you explicitly say: Do you agree that the
new column names are actually better than the old ones?

> One way to lower the bar would be to make the change affect
> fewer cases, like maybe only JSON subscripts.  That points
> towards your idea of making the subscript implementation
> responsible.  However, if memory serves we pick the column
> aliases before doing parse analysis, which'd make it hard
> to know which data type is involved.  I do not really recall
> why it's done that way or whether it'd be practical to change.

Oh, that does sound annoying indeed. I'll take a look at how hard that
would be to change.



Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Mon, 16 Dec 2024 at 16:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> View definition:
>> SELECT data['a'::text] AS data
>> FROM tj;

> Are you sure you ran this with my patch?

No, sorry, I was just illustrating the behavior with HEAD.
The important part of this is not the assigned alias
but the visible cast.

>> regression=# select data['2'], data[3] from t2;
>>  data | data
>> ------+------
>>     2 |    3
>> (1 row)

> So what would you want here? Do you want these columns to be called 2
> and 3?

No!!

> One thing that I didn't see you explicitly say: Do you agree that the
> new column names are actually better than the old ones?

No, I'm not at all convinced of that.  For these examples
I'd prefer something like "data_a", "data_b", etc.

That approach might also make it more palatable to process integer
literals this way (i.e. "data_2" etc), though I am not sure we want
to do that because of the increased blast radius.

            regards, tom lane



Re: Improving default column names/aliases of subscript text expressions

From
Jelte Fennema-Nio
Date:
On Mon, 16 Dec 2024 at 19:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> No, sorry, I was just illustrating the behavior with HEAD.
> The important part of this is not the assigned alias
> but the visible cast.

Then I don't think I understand what you're trying to say. While I
think it would be good to not have an explicit cast impact the
explicit casts in a subscript as well, it does seem like a rather
niche edge case for people to write a query where that would matter.
If you're doing such explicit casts, you probably also want to set an
explicit alias. Case in point being ruleutils, where it explicitizes
both the implicit cast, and the implicit column alias.

> > So what would you want here? Do you want these columns to be called 2
> > and 3?
>
> No!!

Good, then we agree on that at least.

> > One thing that I didn't see you explicitly say: Do you agree that the
> > new column names are actually better than the old ones?
>
> No, I'm not at all convinced of that.  For these examples
> I'd prefer something like "data_a", "data_b", etc.

I did consider that naming scheme as well, but there are a few reasons:
1. That same naming scheme holds just as well for fields of composite
types. It seems inconsistent to only do it for subscripts. Our logic
now is to take the last field name in a series of field names. My POC
patch basically extends that to be the last field name OR
string-literal subscript. (to be clear changing the default column
names for fields of composite types like this seems out of the
question to me with regards to the amount of impact).
2. There's a hard limit of 63 characters in a column name due to
NAMEDATALEN, so putting the whole path in there won't fit in case of
somewhat long subscript names.
3. Even if we'd have unlimited length, now you end up with common
prefixes if you select multiple fields.
4. For the custom type that I'm implementing the subscripting for, I
really don't want such a prefix.

> That approach might also make it more palatable to process integer
> literals this way (i.e. "data_2" etc), though I am not sure we want
> to do that because of the increased blast radius.

I agree, that would be nice, but I don't think that's worth the
additional blast radius.



Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> 4. For the custom type that I'm implementing the subscripting for, I
> really don't want such a prefix.

Hm.  If we made this behavior type-specific then you could have what
you want without having to break any existing expectations at all.

However, after thinking a little longer I seem to recall that we've
previously looked into the idea of deriving the default aliases from
the post-parse-analysis tree.  We gave up because there were too many
cases where the behavior would change, or at least it looked unduly
painful to prevent that.  For instance, something that looks like a
function call in the raw tree could parse to half a dozen different
kinds of nodes.  So I'm not sure how we get there from here.

            regards, tom lane



Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Mon, 16 Dec 2024 at 21:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, after thinking a little longer I seem to recall that we've
>> previously looked into the idea of deriving the default aliases from
>> the post-parse-analysis tree.  We gave up because there were too many
>> cases where the behavior would change, or at least it looked unduly
>> painful to prevent that.

> I think you remember wrong, or things have changed drastically since.
> Because it only required fairly minimal changes to base the column
> name on the transformed expression, see the attached POC.

Oh, well if you're willing to cheat like that, sure ;-).  I was
speaking of replacing the existing logic with something that looked
only at the post-analysis tree.

I dunno, this is so obviously a single-purpose kluge that it's hard
to call it anything but a kluge.  I'm not convinced this is better
than writing out "SELECT data['a'] AS a, data['b'] AS b, ...".

In particular, it seems like what's going on here is that you
are using extensible subscripting because that's what's available,
but what you really wish you had is extensible field selection.
If you could write "SELECT (data).a, (data).b, ..." then the
existing FigureColname heuristics would do what you want already.
I know we kicked that idea around a little in the past, but
nobody has looked into it seriously.

            regards, tom lane



Re: Improving default column names/aliases of subscript text expressions

From
Jelte Fennema-Nio
Date:
On Mon, 16 Dec 2024 at 21:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oh, well if you're willing to cheat like that, sure ;-).  I was
> speaking of replacing the existing logic with something that looked
> only at the post-analysis tree.

Yeah, alright. That's not really something that I think we can do
without introducing some unintended behavioural changes.

> I dunno, this is so obviously a single-purpose kluge that it's hard
> to call it anything but a kluge.  I'm not convinced this is better
> than writing out "SELECT data['a'] AS a, data['b'] AS b, ...".

To be clear, the benefit of not having to add the alias manually gets
more meaningful when subscripts are not a single character like
data['first_name'], then both the additional typing gets more
significant as well as the chance for accidental typos in the alias.

Regarding it being a single purpose kludge. I don't really see a big
problem with doing certain specific column naming on the transformed
expression, and if that doesn't find a good name then we fall back to
the battle-tested naming based on the untransformed expression. You
even said that previously people wanted to improve certain other
naming too using the transformed expression, so it sounds like it's
not even single-purpose. So instead of seeing this as a kludge, I'd
look at it as the only way forward from the current situation we're
in, without having to worry about breaking unintended things.

> In particular, it seems like what's going on here is that you
> are using extensible subscripting because that's what's available,
> but what you really wish you had is extensible field selection.
> If you could write "SELECT (data).a, (data).b, ..." then the
> existing FigureColname heuristics would do what you want already.

There's a lot of extensibility that I would like to have in the
parser. But yes, extensible field selection would be very nice for the
things I'm working on. Although those parentheses don't look very
user-friendly, but I guess [1] would probably resolve that.

> I know we kicked that idea around a little in the past, but
> nobody has looked into it seriously.

Yeah, that's definitely an area I plan to look more seriously into
soon-ish. But that's a much bigger project, and this seemed like a
fairly easy win, both for what I'm working on[2] and for the built in
json indexing.

[1]: https://www.postgresql.org/message-id/8bb3af8a-796c-440f-b775-d05437b75e6f@eisentraut.org
[2]: https://github.com/duckdb/pg_duckdb