Re: Eager aggregation, take 3 - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Eager aggregation, take 3 |
Date | |
Msg-id | CA+TgmoYQhSnR4zxBuG3VsCFNb094KY-ePQM1g2J9pa-1RvObbg@mail.gmail.com Whole thread Raw |
In response to | Re: Eager aggregation, take 3 (Richard Guo <guofenglinux@gmail.com>) |
List | pgsql-hackers |
On Wed, Nov 6, 2024 at 3:22 AM Richard Guo <guofenglinux@gmail.com> wrote: > Yeah, ordered aggregates could be a blocker. I think it might be best > to prevent the use of eager aggregation if root->numOrderedAggs > 0 > for now. > > I've been thinking about the window functions case, as Jian He also > mentioned it some time ago. It seems that the window function's > argument(s), as well as the PARTITION BY expression(s), are supposed > to appear in the GROUP BY clause or be used in an aggregate function. > And window functions are applied after the aggregation. So it seems > that there is no problem with window functions. But maybe I'm wrong. > > I hadn't considered the RLS case before, but I think you're right. To > simplify things, maybe for now we can just prevent pushing down the > aggregation if the query applies some RLS policy, by checking > query->hasRowSecurity. Particularly for the RLS case, I think we should be reluctant to disable the optimization entirely just because there might be a problem. We have existing infrastructure to keep security quals from being applied too late, and I suspect it's mostly applicable to this situation. Therefore, I suspect it might not be very much work to allow this optimization even when RLS is in use, as long as it wouldn't actually cause a violation of the RLS rules. If, upon investigation, you find some reason why we can't assess accurately whether pushing down a specific aggregate is a problem, then the approach that you propose is reasonable, but I think the question should be investigated. I don't like the idea of giving up on RLS-using queries completely without even trying to figure out how difficult it would be to do the right thing. I have similar but weaker feelings about ordered aggregates. Consider: explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3 where t1.id = t2.id and t2.id = t3.id group by 1; We can't partially aggregate t2, but we could partially aggregate t2 join t3. So this case is a lot like: explain select t1.id, array_agg(t2.v + t3.o) from t1, t2, t3 where t1.id = t2.id and t2.id = t3.id group by 1; I don't know whether the patch handles the second case correctly right now, but that certainly seems like a case that has to work. We must be able to determine in such a case that the partial aggregate has to be above the t2-t3 join. And if we can determine that, then why can't basically the same logic handle the first case? There are certainly some differences. The first case not only needs the aggregate to be above the t2-t3 join but also needs the input data to be sorted, so we don't get the right behavior for ordered aggregates just by using the contents of the ORDER BY clause to decide at what level the partial aggregate can be applied. On the other hand, if we're looking at paths for (T2 JOIN T3) to build paths for PartialAgg(T2 join T3), the stipulation that we need to use ordered paths or sorting doesn't make the code very much more complicated. I'm open to the conclusion that this is too much complexity but I'd rather not dismiss it instantly. Regarding window functions, you've said a few times now that you don't see the problem, but the more I think about it, the more obvious it seems to me that there are BIG problems. Consider this example from the documentation: SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary; I get a query plan like this: WindowAgg (cost=83.46..104.37 rows=1200 width=72) -> Sort (cost=83.37..86.37 rows=1200 width=40) Sort Key: depname -> Seq Scan on empsalary (cost=0.00..22.00 rows=1200 width=40) Already we see warning signs here. The WindowAgg node needs the input rows to be ordered, because it's going to average the salary for each group of rows with the same depname. So we have the same kinds of issues that we do for ordered aggregates, at the very least. But window aggregates are not just ordering-sensitive. They are also empowered to look at other rows in the frame. Consider the following example: create table names (n text); insert into names values ('Tom'), ('Dick'), ('Harry'); select n, lag(n, 1) over () from names; The result is: n | lag -------+------ Tom | Dick | Tom Harry | Dick I think it is pretty obvious that if any form of partial aggregation had been applied here, it would be impossible to correctly evaluate lag(). Or am I missing something? -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: