Re: Todo: Teach planner to evaluate multiple windows in the optimal order - Mailing list pgsql-hackers

From Ankit Kumar Pandey
Subject Re: Todo: Teach planner to evaluate multiple windows in the optimal order
Date
Msg-id e428fc75-3c56-7cfa-1ec8-5f57e96b1ced@gmail.com
Whole thread Raw
In response to Re: Todo: Teach planner to evaluate multiple windows in the optimal order  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Todo: Teach planner to evaluate multiple windows in the optimal order
List pgsql-hackers
On 07/01/23 17:28, David Rowley wrote:
> Your email client seems to be adding additional vertical space to your
> emails. I've removed the additional newlines in the quotes. Are you
> able to fix the client so it does not do that?
I have adjusted my mail client, hope it is better now?
> On Sun, 8 Jan 2023 at 00:10, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> >
> > On 07/01/23 09:58, David Rowley wrote:
> > > You also don't seem to be considering the fact that the query might
> > > have a DISTINCT clause.
> >
> > Major reason for this was that I am not exactly aware of what distinct
> > clause means (especially in
> >
> > context of window functions) and how it is different from other
> > sortClauses (partition, order by, group).
>
> I'm talking about the query's DISTINCT clause. i.e SELECT DISTINCT.
> If you look in the grouping_planner() function, you'll see that
> create_distinct_paths() is called between create_window_paths() and
> create_ordered_paths().

Yes just saw this and got what you meant.


> > Yes, this is a fair point. Multiple sort is actually beneficial in cases
> > like this, perhaps limits clause and runCondition should be no op too?
>
> I'm not sure what you mean by "no op". Do you mean not apply the optimization?
Yes, no op = no optimization. Sorry I didn't mention it before.
> > > I think the patch should also be using pathkeys_contained_in() and
> > > Lists of pathkeys rather than concatenating lists of SortGroupClauses
> > > together.  That should allow things to work correctly when a given
> > > pathkey has become redundant due to either duplication or a Const in
> > > the Eclass.
> >
> > Make sense, I actually duplicated that logic from
> > make_pathkeys_for_window. We should make this changes there as well because
> > if we have SELECT rank() OVER (PARTITION BY a ORDER BY a)
> > (weird example but you get the idea), it leads to duplicates in
> > window_sortclauses.
>
> It won't lead to duplicate pathkeys. Look in
> make_pathkeys_for_sortclauses() and what pathkey_is_redundant() does.
> Notice that it checks if the pathkey already exists in the list before
> appending.

Okay I see this, pathkey_is_redundant is much more smarter as well.

Replacing list_concat_copy with list_concat_unique in 
make_pathkeys_for_window

won't be of much benefit.

> > Agree with runConditions part but for limit clause, row reduction happens
> > at the last, so whether we use patched version or master version,
> > none of sorts would benefit/degrade from that, right?
>
> Maybe you're right. Just be aware that a sort done for a query with an
> ORDER BY LIMIT will perform a top-n sort. top-n sorts only need to
> store the top-n tuples and that can significantly reduce the memory
> required for sorting perhaps resulting in the sort fitting in memory
> rather than spilling out to disk.
>
> You might want to test this by having the leading sort column as an
> INT, and then the 2nd one as a long text column of maybe around two
> kilobytes. Make all the leading column values the same so that the
> comparison for the text column is always performed. Make the LIMIT
> small compared to the total number of rows, that should test the worse
> case.  Check the performance with and without the limitCount != NULL
> part of the patch that disables the optimization for LIMIT.

I checked this. For limit <<< total number of rows, top-n sort was

performed but when I changed limit to higher value (or no limit),

quick sort was performed.

Top-n sort was twice as fast.

Also, tested (first) patch version vs master, top-n sort

was twice as fast there as well (outputs mentioned below).

Current patch version (with limit excluded for optimizations)

explain (analyze ,costs off) select count(*) over (order by id) from tt 
order by id, name limit 1;
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
  Limit (actual time=1.718..1.719 rows=1 loops=1)
    ->  Incremental Sort (actual time=1.717..1.717 rows=1 loops=1)
          Sort Key: id, name
          Presorted Key: id
          Full-sort Groups: 1  Sort Method: top-N heapsort  Average 
Memory: 25kB  Peak Memory: 25kB
          ->  WindowAgg (actual time=0.028..0.036 rows=6 loops=1)
                ->  Sort (actual time=0.017..0.018 rows=6 loops=1)
                      Sort Key: id
                      Sort Method: quicksort  Memory: 25kB
                      ->  Seq Scan on tt (actual time=0.011..0.012 
rows=6 loops=1)
  Planning Time: 0.069 ms
  Execution Time: 1.799 ms

Earlier patch(which included limit clause for optimizations)

explain (analyze ,costs off) select count(*) over (order by id) from tt 
order by id, name limit 1;
                                  QUERY PLAN
----------------------------------------------------------------------------
  Limit (actual time=3.766..3.767 rows=1 loops=1)
    ->  WindowAgg (actual time=3.764..3.765 rows=1 loops=1)
          ->  Sort (actual time=3.749..3.750 rows=6 loops=1)
                Sort Key: id, name
                Sort Method: quicksort  Memory: 25kB
                ->  Seq Scan on tt (actual time=0.011..0.013 rows=6 loops=1)
  Planning Time: 0.068 ms
  Execution Time: 3.881 ms

I am just wondering though, why can we not do top-N sort

in optimized version if we include limit clause? Is top-N sort is 
limited to non strict sorting or

cases last operation before limit is sort? .

> > Is it okay
> > to add comments in test cases? I don't see it much on existing cases
> > so kind of reluctant to add but it makes intentions much more clear.
>
> I think tests should always have a comment to state what they're
> testing. Not many people seem to do that, unfortunately. The problem
> with not stating what the test is testing is that if, for example, the
> test is checking that the EXPLAIN output is showing a Sort, what if at
> some point in the future someone adjusts some costing code and the
> plan changes to an Index Scan.  If there's no comment to state that
> we're looking for a Sort plan, then the author of the patch that's
> adjusting the costs might just think it's ok to change the expected
> plan to an Index Scan.   I've seen this problem occur even when the
> comments *do* exist. There's just about no hope of such a test
> continuing to do what it's meant to if the comments don't exist.

Thanks for clarifying this out, I will freely add comments if that helps

to explain things better.


-- 
Regards,
Ankit Kumar Pandey




pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: [BUG] Logical replica crash if there was an error in a function.
Next
From: Dean Rasheed
Date:
Subject: Re: MERGE ... WHEN NOT MATCHED BY SOURCE