Thread: Re: improve EXPLAIN for wide tables
Sami Imseih <samimseih@gmail.com> writes: > Looking further into this improvement, I started to question if it is > necessary to make columns unique for EXPLAIN purposes? Yes, otherwise references to them elsewhere in the plan will be ambiguous. It looks like your proposal tries to dodge that by unique-ifying in some cases but not others, which strikes me as a totally random and confusing thing to do. Is there any reason to think that 52c707483 wasn't a sufficient response to this issue? regards, tom lane
> > Looking further into this improvement, I started to question if it is > > necessary to make columns unique for EXPLAIN purposes? > Yes, otherwise references to them elsewhere in the plan will be > ambiguous. Explain will qualify the column name with the table name such as the simple example below with the experimental patch. I have not been able to find cases, except for the failed test case that involves "?column?", in which the plan will result in ambiguous column names. postgres=# create table test (id int, id2 int); CREATE TABLE postgres=# create table test2 (id int, id2 int); CREATE TABLE postgres=# explain select * from test, test2 where test.id = test2.id order by 1; QUERY PLAN --------------------------------------------------------------------- Merge Join (cost=317.01..711.38 rows=25538 width=16) Merge Cond: (test.id = test2.id) -> Sort (cost=158.51..164.16 rows=2260 width=8) Sort Key: test.id -> Seq Scan on test (cost=0.00..32.60 rows=2260 width=8) -> Sort (cost=158.51..164.16 rows=2260 width=8) Sort Key: test2.id -> Seq Scan on test2 (cost=0.00..32.60 rows=2260 width=8) (8 rows) postgres=# explain verbose select * from test, test2 where test.id = test2.id order by 1; QUERY PLAN ---------------------------------------------------------------------------- Merge Join (cost=317.01..711.38 rows=25538 width=16) Output: test.id, test.id2, test2.id, test2.id2 Merge Cond: (test.id = test2.id) -> Sort (cost=158.51..164.16 rows=2260 width=8) Output: test.id, test.id2 Sort Key: test.id -> Seq Scan on public.test (cost=0.00..32.60 rows=2260 width=8) Output: test.id, test.id2 -> Sort (cost=158.51..164.16 rows=2260 width=8) Output: test2.id, test2.id2 Sort Key: test2.id -> Seq Scan on public.test2 (cost=0.00..32.60 rows=2260 width=8) Output: test2.id, test2.id2 Query Identifier: 2523359249885908438 (14 rows) > It looks like your proposal tries to dodge that by unique-ifying > in some cases but not others, which strikes me as a totally > random and confusing thing to do. The work is required for things like deparsing views because Postgres has to construct the underlying sql definition and part of that is it has to make columns unique to generate a valid sql, such as the case here [1]. For planning purposes, the column name must already be unambiguous. right? > Is there any reason to think that 52c707483 wasn't a sufficient > response to this issue? > > regards, tom lane In the case reported, similar to the one in the earlier attached experiment.sql, even with the hash table optimization, there is still significant overhead from the unique-ifying work. for such users, this can impact real-world performance particularly if they have enabled extensions such as auto_explain. [1] https://github.com/postgres/postgres/blob/master/src/test/regress/sql/create_view.sql#L550-L565 Regards, Sami
Sami Imseih <samimseih@gmail.com> writes: >>> Looking further into this improvement, I started to question if it is >>> necessary to make columns unique for EXPLAIN purposes? >> Yes, otherwise references to them elsewhere in the plan will be >> ambiguous. > Explain will qualify the column name with the table name such as the simple > example below with the experimental patch. But if the column names are ambiguous within the same RTE, how does table-qualification fix that? And it's within-the-same-RTE that we're concerned with here. > I have not been able to find cases, > except for the failed test case that involves "?column?", in which the > plan will result in ambiguous column names. It only takes one case to mean we have to deal with it ;-). But I'm fairly sure that there are many other cases, since the parser doesn't restrict the output names of a sub-SELECT to be unique. >> Is there any reason to think that 52c707483 wasn't a sufficient >> response to this issue? > In the case reported, similar to the one in the earlier attached > experiment.sql, even with the hash table optimization, there is still > significant > overhead from the unique-ifying work. for such users, this can impact real-world > performance particularly if they have enabled extensions such as auto_explain. This detail-free claim doesn't persuade me that we have to do something more. (And I'm not buying the idea that people should expect auto_explain to be free, anyway.) regards, tom lane
I had a thought about this: I don't think EXPLAIN is ever required to print the names of join alias variables (since the planner flattens all join alias variables to some kind of expression over their underlying columns). So we could skip assigning column names to join RTEs at all, if we know that it's EXPLAIN rather than view/rule decompilation. That might let us skip all the mess around unique-ifying JOIN USING column names, too. regards, tom lane
Sami Imseih <samimseih@gmail.com> writes: >> It only takes one case to mean we have to deal with it ;-). But I'm >> fairly sure that there are many other cases, since the parser doesn't >> restrict the output names of a sub-SELECT to be unique. > good point. I see the error in my original line of thinking now. > In fact, it's this simple to prove that we still need to unique-ify > something like this subquery is valid: > select * from (select 1 a, 2 a) as s Actually, I was expecting you to cite that as a counterexample ;-) because EXPLAIN doesn't show the sub-select's column names in such cases: =# explain verbose select * from (select 1 a, 2 a) as s; QUERY PLAN ------------------------------------------ Result (cost=0.00..0.01 rows=1 width=8) Output: 1, 2 (2 rows) You need something where we don't elide the SubqueryScan node to show there's an issue, for example: =# explain verbose select * from (select 1 a, 2 b, 3 b limit 4) as s where a < 3; QUERY PLAN ------------------------------------------------------- Subquery Scan on s (cost=0.00..0.02 rows=1 width=12) Output: s.a, s.b, s.b_1 Filter: (s.a < 3) -> Limit (cost=0.00..0.01 rows=1 width=12) Output: 1, 2, 3 -> Result (cost=0.00..0.01 rows=1 width=12) Output: 1, 2, 3 (7 rows) > I suspect that we can also skip RTE_RELATION, since columns must > be unique, but I am not sure it's worth the extra effort. At least my test > does not show any real benefit. Yeah, I was thinking of that too. Seems like it ought to be a noticeable improvement if the join case is. > I am attaching a patch that deals with the RTE_JOIN case. I'll take a look. Thanks for the test demonstrating that this makes a visible performance difference. regards, tom lane
I wrote: > Sami Imseih <samimseih@gmail.com> writes: >> I am attaching a patch that deals with the RTE_JOIN case. > I'll take a look. Thanks for the test demonstrating that > this makes a visible performance difference. Pushed with some simplification: we don't need a new flag, because none of the callers of set_simple_column_names need it to do anything with join RTEs. This is better anyway because set_relation_column_names' comment explicitly says it is not for join RTEs, and now we don't use it on them ever. I poked at the question of whether it's worth skipping unique-ification for relation RTEs, and I came to the same conclusion as you: it doesn't seem to be. The related code is down in the noise according to "perf" once we skip join RTEs. I think the reason the join RTEs are so expensive for this is that the upper ones get very wide in join nests like the example query. regards, tom lane