Thread: Improving inferred query column names

Improving inferred query column names

From
Andres Freund
Date:
Hi,

A common annoyance when writing ad-hoc analytics queries is column naming once
aggregates are used.

Useful column names:
SELECT reads, writes FROM pg_stat_io;
column names: reads, writes

Not useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sum

So i often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;


Of course we can't infer useful column names for everything, but for something
like this, it should't be too hard to do better. E.g. by combining the
function name with the column name in the argument, if a single plain column
is the argument.

I think on a green field it'd be clearly better to do something like the
above.  What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying on
inferred column names

Can anybody think of a good way out of that? It's not like that problem is
going to go away at some point...

Greetings,

Andres Freund



Re: Improving inferred query column names

From
Vladimir Churyukin
Date:
That is a good idea for simple cases, I'm just curious how it would look like for more complex cases (you can have all kinds of expressions as parameters for aggregate function calls).
If it works only for simple cases, I think it would be confusing and not very helpful.
Wouldn't it make more sense to just deduplicate the names by adding numerical postfixes, like sum_1, sum_2?
For backwards compatibility I guess you can have a GUC flag controlling that behavior that can be set into backwards compatibility mode if required.
The previous functionality can be declared deprecated and removed (with the flag) once the current version becomes unsupported. 
(or with a different deprecation policy, I'm not sure what is the general rule for breaking changes and deprecation currently).
If there is a clearly defined deprecation policy and a backwards compatibility option, it should be good, no? Just my 2 cents.

-Vladimir Churyukin

On Sat, Feb 11, 2023 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

A common annoyance when writing ad-hoc analytics queries is column naming once
aggregates are used.

Useful column names:
SELECT reads, writes FROM pg_stat_io;
column names: reads, writes

Not useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sum

So i often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;


Of course we can't infer useful column names for everything, but for something
like this, it should't be too hard to do better. E.g. by combining the
function name with the column name in the argument, if a single plain column
is the argument.

I think on a green field it'd be clearly better to do something like the
above.  What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying on
inferred column names

Can anybody think of a good way out of that? It's not like that problem is
going to go away at some point...

Greetings,

Andres Freund


Re: Improving inferred query column names

From
Corey Huinker
Date:
On Sat, Feb 11, 2023 at 3:47 PM Vladimir Churyukin <vladimir@churyukin.com> wrote:
For backwards compatibility I guess you can have a GUC flag controlling that behavior that can be set into backwards compatibility mode if required.
The previous functionality can be declared deprecated and removed (with the flag) once the current version becomes unsupported. 

Seems more like a per-session setting than a GUC.

Here's a suggestion off the top of my head.

We create a session setting inferred_column_name_template.

The template takes a formatting directive %N which is just a counter

SET inferred_column_name_template = 'col_%N'

which would give you col_1, col_2, regardless of what kind of expression the columns were

We could introduce another directive, %T

SET inferred_column_name_template = '%T_%N'

which prints the datatype short name of the column. In this case, %N would increment per datatype, so text_1, integer_1, text_2, timestamptz_1, text_3

Getting fancier, we could introduce something less datatype centric, %F

SET inferred_column_name_template = '%F_%N'

Which would walk the following waterfall and stop on the first match
   1. The datatype short name if the expression is explicitly casted (either CAST or ::)
   2. the name of the function if the outermost expression was a function (aggregate, window, or scalar),  so sum_1, substr_1
   3. 'case' if the outermost expression was  case
   4. 'expr' if the expression was effectively an operator (  SELECT 3+4, 'a'  || 'b' etc)
   5. the datatype short name for anything that doesn't match any of the previous, and for explicit casts

Keeping track of all the %N counters could get silly, so maybe a %P which is simply the numeric column position of the column, so your result set would go like:  id, name, col_3, last_login, col_5.

We would have to account for the case where the user left either %N or %P out of the template, so one of them would be an implied suffix if both were absent, or we maybe go with 

SET inferred_column_name_prefix = '%F_';
SET inferred_column_name_counter = 'position';   /* position, counter, per_type_counter */

Or we just cook up a few predefined naming schemes, and let the user pick from those.

One caution I have is that I have seen several enterprise app database designs that have lots of user-customizable columns with names like varchar1, numeric4, etc. Presumably the user would know their environment and not pick a confusing template.

Re: Improving inferred query column names

From
Peter Eisentraut
Date:
On 11.02.23 20:24, Andres Freund wrote:
> Not useful column names:
> SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
> column names: sum, sum
> 
> So i often end up manually writing:
> SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;
> 
> Of course we can't infer useful column names for everything, but for something
> like this, it should't be too hard to do better. E.g. by combining the
> function name with the column name in the argument, if a single plain column
> is the argument.
> 
> I think on a green field it'd be clearly better to do something like the
> above.  What does give me pause is that it seems quite likely to break
> existing queries, and to a lesser degree, might break applications relying on
> inferred column names
> 
> Can anybody think of a good way out of that? It's not like that problem is
> going to go away at some point...

I think we should just do it and not care about what breaks.  There has 
never been any guarantee about these.

FWIW, "most" other SQL implementations appear to generate column names like

SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: "SUM(reads)", "SUM(writes)"

(various capitalization of course).

We had a colleague look into this a little while ago, and it got pretty 
tedious to implement this for all the expression types.  And, you know, 
the bikeshedding ...

But I'm all in favor of improving this.




Re: Improving inferred query column names

From
"David G. Johnston"
Date:
On Mon, Feb 20, 2023 at 8:08 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 11.02.23 20:24, Andres Freund wrote:
>
> I think on a green field it'd be clearly better to do something like the
> above.  What does give me pause is that it seems quite likely to break
> existing queries, and to a lesser degree, might break applications relying on
> inferred column names
>
> Can anybody think of a good way out of that? It's not like that problem is
> going to go away at some point...

I think we should just do it and not care about what breaks.  There has
never been any guarantee about these.


I'm going to toss a -1 into the ring but if this does go through a strong request that it be disabled via a GUC.  The ugliness of that option is why we shouldn't do this.

Defacto reality is still a reality we are on the hook for.

I too find the legacy design choice to be annoying but not so much that changing it seems like a good idea.

David J.

Re: Improving inferred query column names

From
Peter Eisentraut
Date:
On 20.02.23 16:17, David G. Johnston wrote:
>     I think we should just do it and not care about what breaks.  There has
>     never been any guarantee about these.
> 
> 
> I'm going to toss a -1 into the ring but if this does go through a 
> strong request that it be disabled via a GUC.  The ugliness of that 
> option is why we shouldn't do this.
> 
> Defacto reality is still a reality we are on the hook for.
> 
> I too find the legacy design choice to be annoying but not so much that 
> changing it seems like a good idea.

Well, a small backward compatibility GUC might not be too cumbersome.




Re: Improving inferred query column names

From
Andres Freund
Date:
Hi,

On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:
> On 11.02.23 20:24, Andres Freund wrote:
> I think we should just do it and not care about what breaks.  There has
> never been any guarantee about these.
> 
> FWIW, "most" other SQL implementations appear to generate column names like
> 
> SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
> column names: "SUM(reads)", "SUM(writes)"

Hm, personally I don't like leaving in parens in the names, that makes it
unnecessarily hard to reference the columns.  sum_reads imo is more usable
than than "SUM(reads)".


> We had a colleague look into this a little while ago, and it got pretty
> tedious to implement this for all the expression types.

Hm, any chance that colleague could be pointed at this discussion and chime
in? It doesn't immediately look that hard to do substantially better than
today. Of course there's an approximately endless amount of effort that could
be poured into this, but even some fairly basic improvements seem like a big
win.


> And, you know, the bikeshedding ...

Indeed. I already started above :)

Greetings,

Andres Freund



Re: Improving inferred query column names

From
Andres Freund
Date:
Hi,

On 2023-02-11 12:47:04 -0800, Vladimir Churyukin wrote:
> That is a good idea for simple cases, I'm just curious how it would look
> like for more complex cases (you can have all kinds of expressions as
> parameters for aggregate function calls).
> If it works only for simple cases, I think it would be confusing and not
> very helpful.

I don't think it needs to be perfect to be helpful.


> Wouldn't it make more sense to just deduplicate the names by adding
> numerical postfixes, like sum_1, sum_2?

That'd be considerably worse than what we do today imo, because any reordering
/ added aggregate would lead to everything else changing as well.


Greetings,

Andres Freund



Re: Improving inferred query column names

From
Vladimir Churyukin
Date:


On Wed, Feb 22, 2023, 12:40 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-02-11 12:47:04 -0800, Vladimir Churyukin wrote:
> That is a good idea for simple cases, I'm just curious how it would look
> like for more complex cases (you can have all kinds of expressions as
> parameters for aggregate function calls).
> If it works only for simple cases, I think it would be confusing and not
> very helpful.

I don't think it needs to be perfect to be helpful.


It doesn't need to be perfect, but it needs to be consistent. So far you proposed a rule to replace () with _. What is the plan for expressions, how to convert them to names (with deduplication I guess?, because there could be 2 similar expressions mapped to the same name potentially).


> Wouldn't it make more sense to just deduplicate the names by adding
> numerical postfixes, like sum_1, sum_2?

That'd be considerably worse than what we do today imo, because any reordering
/ added aggregate would lead to everything else changing as well.


Ok, that I kinda agree with. Not necessarily worse overall, but worse for some cases. Well, the proposal above about keeping the names exactly the same as the full expressions is probably the best we can do then. It will take care of possible duplications and won't be position-sensitive. And will be consistent. The only issue is somewhat unusual column names that you will have to use quotes to refer to. But is that a real issue?

-Vladimir Churyukin

Re: Improving inferred query column names

From
Tom Lane
Date:
Vladimir Churyukin <vladimir@churyukin.com> writes:
> It doesn't need to be perfect, but it needs to be consistent. So far you
> proposed a rule to replace () with _. What is the plan for expressions, how
> to convert them to names (with deduplication I guess?, because there could
> be 2 similar expressions mapped to the same name potentially).

I do not think we need to do anything for arbitrary expressions.
The proposal so far was just to handle a function call wrapped
around something else by converting to the function name followed
by whatever we'd emit for the something else.  You cannot realistically
handle, say, operator expressions without emitting names that will
require quoting, which doesn't seem attractive.

And no, deduplication isn't on the table at all here.

            regards, tom lane



Re: Improving inferred query column names

From
Andres Freund
Date:
Hi,

On 2023-02-22 16:38:51 -0500, Tom Lane wrote:
> Vladimir Churyukin <vladimir@churyukin.com> writes:
> > It doesn't need to be perfect, but it needs to be consistent. So far you
> > proposed a rule to replace () with _. What is the plan for expressions, how
> > to convert them to names (with deduplication I guess?, because there could
> > be 2 similar expressions mapped to the same name potentially).
> 
> I do not think we need to do anything for arbitrary expressions.

Exactly. It's not like they have a useful name today.  Nor are they unique.


> The proposal so far was just to handle a function call wrapped
> around something else by converting to the function name followed
> by whatever we'd emit for the something else.

Just to showcase that better, what I think we're discussing is changing:

SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──────┬────────┬──────────┐
│ sum  │  sum   │ ?column? │
├──────┼────────┼──────────┤
│ 2774 │ 257896 │        2 │
└──────┴────────┴──────────┘
(1 row)

to

SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──────────────┬───────────────┬──────────┐
│ sum_relpages │ sum_reltuples │ ?column? │
├──────────────┼───────────────┼──────────┤
│         2774 │        257896 │        2 │
└──────────────┴───────────────┴──────────┘
(1 row)


> You cannot realistically
> handle, say, operator expressions without emitting names that will
> require quoting, which doesn't seem attractive.

Well, it doesn't require much to be better than "?column?", which already
requires quoting...

We could just do something like printing <left>_<funcname>_<right>. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.

Whether that's worth it, or whether column name lengths would be too painful,
IDK.


> And no, deduplication isn't on the table at all here.

+1

Greetings,

Andres Freund



Re: Improving inferred query column names

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-02-22 16:38:51 -0500, Tom Lane wrote:
>> The proposal so far was just to handle a function call wrapped
>> around something else by converting to the function name followed
>> by whatever we'd emit for the something else.

> SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
> ┌──────────────┬───────────────┬──────────┐
> │ sum_relpages │ sum_reltuples │ ?column? │
> ├──────────────┼───────────────┼──────────┤

So far so good, but what about multi-argument functions?
Do we do "f_x_y_z", and truncate wherever?  How well will this
work with nested function calls?

>> You cannot realistically
>> handle, say, operator expressions without emitting names that will
>> require quoting, which doesn't seem attractive.

> Well, it doesn't require much to be better than "?column?", which already
> requires quoting...

I think the point of "?column?" is to use something that nobody's going
to want to reference that way, quoted or otherwise.  The SQL spec says
(in SQL:2021, it's 7.16 <query specification> syntax rule 18) that if the
column expression is anything more complex than a simple column reference
(or SQL parameter reference, which I think we don't support) then the
column name is implementation-dependent, which is standards-ese for
"here be dragons".

BTW, SQL92 and SQL99 had a further constraint:

            c) Otherwise, the <column name> of the i-th column of the <query
              specification> is implementation-dependent and different
              from the <column name> of any column, other than itself, of
              a table referenced by any <table reference> contained in the
              SQL-statement.

We never tried to implement that literally, and now I'm glad we didn't
bother, because recent spec versions only say "implementation-dependent",
full stop.  In any case, the spec is clearly in the camp of "don't depend
on these column names".

> We could just do something like printing <left>_<funcname>_<right>. So
> something like avg(reltuples / relpages) would end up as
> avg_reltuples_float48div_relpages.
> Whether that's worth it, or whether column name lengths would be too painful,
> IDK.

I think you'd soon be hitting NAMEDATALEN limits ...

>> And no, deduplication isn't on the table at all here.

> +1

I remembered while looking at the spec that duplicate column names
in SELECT output are not only allowed but *required* by the spec.
If you write, say, "SELECT 1 AS x, 2 AS x, ..." then the column
names of those two columns are both "x", no wiggle room at all.
So I see little point in trying to deduplicate generated names,
even aside from the points you made.

            regards, tom lane



Re: Improving inferred query column names

From
Joe Conway
Date:
On 2/22/23 23:03, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> We could just do something like printing <left>_<funcname>_<right>. So
>> something like avg(reltuples / relpages) would end up as
>> avg_reltuples_float48div_relpages.
>> Whether that's worth it, or whether column name lengths would be too painful,
>> IDK.
> 
> I think you'd soon be hitting NAMEDATALEN limits ...

<flameproof_suit>

Probably an unpalatable idea, but if we did something like 
md5('avg(reltuples / relpages)') for the column name, it would be 
(reasonably) unique and deterministic. Not pretty, but possibly useful 
in some cases.

</flameproof_suit>

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improving inferred query column names

From
Peter Eisentraut
Date:
On 22.02.23 21:38, Andres Freund wrote:
> On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:
>> On 11.02.23 20:24, Andres Freund wrote:
>> I think we should just do it and not care about what breaks.  There has
>> never been any guarantee about these.
>>
>> FWIW, "most" other SQL implementations appear to generate column names like
>>
>> SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
>> column names: "SUM(reads)", "SUM(writes)"
> Hm, personally I don't like leaving in parens in the names, that makes it
> unnecessarily hard to reference the columns.  sum_reads imo is more usable
> than than "SUM(reads)".

If you want something without special characters, the example you gave 
is manageable, but what are you going to do with

SELECT a, b, a * b, a / b FROM ...

or

SELECT a, b, SUM(a * b) FROM ...

and so on.  What would be the actual rule to produce the output you want?

I think a question here is what "usable" means in this context.

If you want a name that you can refer to (in a client API, for example), 
you should give it a name explicitly.

I think the uses for the automatic names are that they look pretty and 
meaningful in visual output (psql, pgadmin, etc.).  In that context, I 
think it is ok to use special characters without limitation, since you 
are just going to look at the thing, not type it back in.