Thread: Re: Improving default column names/aliases of subscript text expressions
Jelte Fennema-Nio <postgres@jeltef.nl> writes: > SELECT data['a'], data['b'], data['c'] FROM tj; > Gives the following output: > data │ data │ data > ──────┼───────┼──────────── > 123 │ "abc" │ [123, 456] > I'd much rather have it output: > a │ b │ c > ──────┼───────┼──────────── > 123 │ "abc" │ [123, 456] I dunno, this seems to be putting an undue amount of emphasis on one very specific usage pattern. Why should it matter whether the subscripts are string literals or not? What will happen when ruleutils decides to dump those expressions with the implicit cast to text being less implicit? regression=# create view v1 as SELECT data['a'] FROM tj; CREATE VIEW regression=# \d+ v1 View "public.v1" Column | Type | Collation | Nullable | Default | Storage | Description --------+-------+-----------+----------+---------+----------+------------- data | jsonb | | | | extended | View definition: SELECT data['a'::text] AS data FROM tj; I'm also wondering how such a rule interacts with string literals that don't get resolved as text: regression=# create table t2 (data int[]); CREATE TABLE regression=# insert into t2 values(array[1,2,3,4]); INSERT 0 1 regression=# select data['2'], data[3] from t2; data | data ------+------ 2 | 3 (1 row) > 1. Change the default alias/eref to be more useful for subscripts > (probably with a GUC to enable/disable this for people depending on > the old names) This would be a GUC that changes query semantics, which is a thing we've learned the hard way is evil. (It would not be any less evil in one of the other implementation approaches.) I think that to do something like this, we'd have to make a considered decision that the new way is so much better than the old that it's okay to break some people's queries. I don't say that bar is unreachable, but it seems high. One way to lower the bar would be to make the change affect fewer cases, like maybe only JSON subscripts. That points towards your idea of making the subscript implementation responsible. However, if memory serves we pick the column aliases before doing parse analysis, which'd make it hard to know which data type is involved. I do not really recall why it's done that way or whether it'd be practical to change. regards, tom lane
Re: Improving default column names/aliases of subscript text expressions
From
Jelte Fennema-Nio
Date:
On Mon, 16 Dec 2024 at 16:34, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I dunno, this seems to be putting an undue amount of emphasis on > one very specific usage pattern. Why should it matter whether > the subscripts are string literals or not? What will happen when > ruleutils decides to dump those expressions with the implicit > cast to text being less implicit? Thanks for the quick response. To be clear, the attached patch was just a quick POC, it definitely needs more work if it's the direction I should pursue. Indeed adding support for explicit casts seems like a nice thing. > regression=# create view v1 as > SELECT data['a'] FROM tj; > CREATE VIEW > regression=# \d+ v1 > View "public.v1" > Column | Type | Collation | Nullable | Default | Storage | Description > --------+-------+-----------+----------+---------+----------+------------- > data | jsonb | | | | extended | > View definition: > SELECT data['a'::text] AS data > FROM tj; Are you sure you ran this with my patch? When I do this it actually fills in the "better" alias just fine, which makes sense because when the view query is parsed it doesn't have the cast yet. The cast only gets added by ruleutils, which also adds the explicit alias. > create view v1 as SELECT data['a'] FROM tj; CREATE VIEW > \d+ v1; View "public.v1" Column │ Type │ Collation │ Nullable │ Default │ Storage │ Description ────────┼───────┼───────────┼──────────┼─────────┼──────────┼───────────── a │ jsonb │ │ │ │ extended │ View definition: SELECT data['a'::text] AS a FROM tj; > I'm also wondering how such a rule interacts with string literals > that don't get resolved as text: > > regression=# create table t2 (data int[]); > CREATE TABLE > regression=# insert into t2 values(array[1,2,3,4]); > INSERT 0 1 > regression=# select data['2'], data[3] from t2; > data | data > ------+------ > 2 | 3 > (1 row) So what would you want here? Do you want these columns to be called 2 and 3? That shouldn't be too difficult to implement, but that seemed like it would possibly go a bit too far. > This would be a GUC that changes query semantics, which is a > thing we've learned the hard way is evil. (It would not be > any less evil in one of the other implementation approaches.) Hmm, okay. I didn't know that. > I think that to do something like this, we'd have to make a > considered decision that the new way is so much better than > the old that it's okay to break some people's queries. > I don't say that bar is unreachable, but it seems high. Totally fair, that's part of the reason I limited this to only string literals, to effectively only impact JSON (which only had subscript support since PG14) and not arrays. One thing that I didn't see you explicitly say: Do you agree that the new column names are actually better than the old ones? > One way to lower the bar would be to make the change affect > fewer cases, like maybe only JSON subscripts. That points > towards your idea of making the subscript implementation > responsible. However, if memory serves we pick the column > aliases before doing parse analysis, which'd make it hard > to know which data type is involved. I do not really recall > why it's done that way or whether it'd be practical to change. Oh, that does sound annoying indeed. I'll take a look at how hard that would be to change.
Jelte Fennema-Nio <postgres@jeltef.nl> writes: > On Mon, 16 Dec 2024 at 16:34, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> View definition: >> SELECT data['a'::text] AS data >> FROM tj; > Are you sure you ran this with my patch? No, sorry, I was just illustrating the behavior with HEAD. The important part of this is not the assigned alias but the visible cast. >> regression=# select data['2'], data[3] from t2; >> data | data >> ------+------ >> 2 | 3 >> (1 row) > So what would you want here? Do you want these columns to be called 2 > and 3? No!! > One thing that I didn't see you explicitly say: Do you agree that the > new column names are actually better than the old ones? No, I'm not at all convinced of that. For these examples I'd prefer something like "data_a", "data_b", etc. That approach might also make it more palatable to process integer literals this way (i.e. "data_2" etc), though I am not sure we want to do that because of the increased blast radius. regards, tom lane
Re: Improving default column names/aliases of subscript text expressions
From
Jelte Fennema-Nio
Date:
On Mon, 16 Dec 2024 at 19:32, Tom Lane <tgl@sss.pgh.pa.us> wrote: > No, sorry, I was just illustrating the behavior with HEAD. > The important part of this is not the assigned alias > but the visible cast. Then I don't think I understand what you're trying to say. While I think it would be good to not have an explicit cast impact the explicit casts in a subscript as well, it does seem like a rather niche edge case for people to write a query where that would matter. If you're doing such explicit casts, you probably also want to set an explicit alias. Case in point being ruleutils, where it explicitizes both the implicit cast, and the implicit column alias. > > So what would you want here? Do you want these columns to be called 2 > > and 3? > > No!! Good, then we agree on that at least. > > One thing that I didn't see you explicitly say: Do you agree that the > > new column names are actually better than the old ones? > > No, I'm not at all convinced of that. For these examples > I'd prefer something like "data_a", "data_b", etc. I did consider that naming scheme as well, but there are a few reasons: 1. That same naming scheme holds just as well for fields of composite types. It seems inconsistent to only do it for subscripts. Our logic now is to take the last field name in a series of field names. My POC patch basically extends that to be the last field name OR string-literal subscript. (to be clear changing the default column names for fields of composite types like this seems out of the question to me with regards to the amount of impact). 2. There's a hard limit of 63 characters in a column name due to NAMEDATALEN, so putting the whole path in there won't fit in case of somewhat long subscript names. 3. Even if we'd have unlimited length, now you end up with common prefixes if you select multiple fields. 4. For the custom type that I'm implementing the subscripting for, I really don't want such a prefix. > That approach might also make it more palatable to process integer > literals this way (i.e. "data_2" etc), though I am not sure we want > to do that because of the increased blast radius. I agree, that would be nice, but I don't think that's worth the additional blast radius.
Jelte Fennema-Nio <postgres@jeltef.nl> writes: > 4. For the custom type that I'm implementing the subscripting for, I > really don't want such a prefix. Hm. If we made this behavior type-specific then you could have what you want without having to break any existing expectations at all. However, after thinking a little longer I seem to recall that we've previously looked into the idea of deriving the default aliases from the post-parse-analysis tree. We gave up because there were too many cases where the behavior would change, or at least it looked unduly painful to prevent that. For instance, something that looks like a function call in the raw tree could parse to half a dozen different kinds of nodes. So I'm not sure how we get there from here. regards, tom lane
Jelte Fennema-Nio <postgres@jeltef.nl> writes: > On Mon, 16 Dec 2024 at 21:09, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> However, after thinking a little longer I seem to recall that we've >> previously looked into the idea of deriving the default aliases from >> the post-parse-analysis tree. We gave up because there were too many >> cases where the behavior would change, or at least it looked unduly >> painful to prevent that. > I think you remember wrong, or things have changed drastically since. > Because it only required fairly minimal changes to base the column > name on the transformed expression, see the attached POC. Oh, well if you're willing to cheat like that, sure ;-). I was speaking of replacing the existing logic with something that looked only at the post-analysis tree. I dunno, this is so obviously a single-purpose kluge that it's hard to call it anything but a kluge. I'm not convinced this is better than writing out "SELECT data['a'] AS a, data['b'] AS b, ...". In particular, it seems like what's going on here is that you are using extensible subscripting because that's what's available, but what you really wish you had is extensible field selection. If you could write "SELECT (data).a, (data).b, ..." then the existing FigureColname heuristics would do what you want already. I know we kicked that idea around a little in the past, but nobody has looked into it seriously. regards, tom lane
Re: Improving default column names/aliases of subscript text expressions
From
Jelte Fennema-Nio
Date:
On Mon, 16 Dec 2024 at 21:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Oh, well if you're willing to cheat like that, sure ;-). I was > speaking of replacing the existing logic with something that looked > only at the post-analysis tree. Yeah, alright. That's not really something that I think we can do without introducing some unintended behavioural changes. > I dunno, this is so obviously a single-purpose kluge that it's hard > to call it anything but a kluge. I'm not convinced this is better > than writing out "SELECT data['a'] AS a, data['b'] AS b, ...". To be clear, the benefit of not having to add the alias manually gets more meaningful when subscripts are not a single character like data['first_name'], then both the additional typing gets more significant as well as the chance for accidental typos in the alias. Regarding it being a single purpose kludge. I don't really see a big problem with doing certain specific column naming on the transformed expression, and if that doesn't find a good name then we fall back to the battle-tested naming based on the untransformed expression. You even said that previously people wanted to improve certain other naming too using the transformed expression, so it sounds like it's not even single-purpose. So instead of seeing this as a kludge, I'd look at it as the only way forward from the current situation we're in, without having to worry about breaking unintended things. > In particular, it seems like what's going on here is that you > are using extensible subscripting because that's what's available, > but what you really wish you had is extensible field selection. > If you could write "SELECT (data).a, (data).b, ..." then the > existing FigureColname heuristics would do what you want already. There's a lot of extensibility that I would like to have in the parser. But yes, extensible field selection would be very nice for the things I'm working on. Although those parentheses don't look very user-friendly, but I guess [1] would probably resolve that. > I know we kicked that idea around a little in the past, but > nobody has looked into it seriously. Yeah, that's definitely an area I plan to look more seriously into soon-ish. But that's a much bigger project, and this seemed like a fairly easy win, both for what I'm working on[2] and for the built in json indexing. [1]: https://www.postgresql.org/message-id/8bb3af8a-796c-440f-b775-d05437b75e6f@eisentraut.org [2]: https://github.com/duckdb/pg_duckdb