
From Alexander Korotkov
Subject Re: PoC: Partial sort
Msg-id CAPpHfdudKEit-5yCWrbRgTOuhNoEL3ZpxupvxNO_qfCGR=r4Hw@mail.gmail.com
In response to Re: PoC: Partial sort  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
Hi, Peter!

I finally went over your review.

On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
Explain output
-------------------

A query like your original test query looks like this for me:

postgres=# explain analyze select * from test order by v1, v2 limit 100;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=429.54..434.53 rows=100 width=16) (actual time=15125.819..22414.038 rows=100 loops=1)
   ->  Partial sort  (cost=429.54..50295.52 rows=1000000 width=16) (actual time=15125.799..22413.297 rows=100 loops=1)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066 rows=151 loops=1)
 Planning time: 0.948 ms
 Execution time: 22414.895 ms
(8 rows)

I thought about it for a while, and decided that you have the basic
shape of the explain output right here. I see where you are going by
having the sort node drive things.

I don't think the node should be called "Partial sort", though. I
think that this is better presented as just a "Sort" node with a
presorted key.

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts
per group, which is the reality. As you said, it's good to have this
be high, because then the sort operations don't need to do too many
comparisons, which could be expensive.

I agree with your notes. In the attached version of the patch, the explain output is revised as you proposed.
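For illustration, the plan for the query above now looks roughly like this (costs are taken from the plan quoted above, timings omitted; this output is only a sketch):

 Limit  (cost=429.54..434.53 rows=100 width=16)
   ->  Sort  (cost=429.54..50295.52 rows=1000000 width=16)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Groups: 2
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.43 rows=1000000 width=16)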
 
Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the 2nd and last group in this example) used a top-N heapsort.

Is it really worth using a top-N heapsort to avoid sorting the last
little bit of tuples in the last group? Maybe we should never push
down to an individual sort operation (we have one
tuplesort_performsort() per group) that it should be bounded. Our
quicksort falls back on insertion sort in the event of only having 7
elements (at that level of recursion), so having this almost always
use quicksort may be no bad thing.

Even if you don't like that, the "Sort Method" shown above is just
misleading. I wonder, also, if you need to be more careful about
whether or not "Memory" is really the high watermark, as opposed to
the memory used by the last sort operation of many. There could be
many large tuples in one grouping, for example. Note that the current
code will not show any "Memory" in explain analyze for cases that have
memory freed before execution is done, which this is not consistent
with. Maybe that's not so important. Unsure.

trace_sort output shows that these queries often use a large number of
tiny individual sorts. Maybe that's okay, or maybe we should make it
clearer that the sorts are related. I now use trace_sort a lot.

With partial sort we run multiple sorts in the same node. Ideally, we should provide some information aggregated over all the runs.
This situation looks very similar to a subplan that is called multiple times. I checked how that currently works.

# explain analyze select (select sum(x.i) from (select i from generate_series(1,t * 1000) i order by i desc) x) from generate_series(1, 20, 1) t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series t  (cost=0.00..74853.92 rows=1000 width=4) (actual time=0.777..83.498 rows=20 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=74.83..74.84 rows=1 width=4) (actual time=4.173..4.173 rows=1 loops=20)
           ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=2.822..3.361 rows=10500 loops=20)
                 Sort Key: i.i
                 Sort Method: quicksort  Memory: 1706kB
                 ->  Function Scan on generate_series i  (cost=0.01..10.01 rows=1000 width=4) (actual time=0.499..1.106 rows=10500 loops=20)
 Planning time: 0.080 ms
 Execution time: 83.625 ms
(9 rows)

# explain analyze select (select sum(x.i) from (select i from generate_series(1,t * 1000) i order by i desc) x) from generate_series(20, 1, -1) t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series t  (cost=0.00..74853.92 rows=1000 width=4) (actual time=11.414..86.127 rows=20 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=74.83..74.84 rows=1 width=4) (actual time=4.305..4.305 rows=1 loops=20)
           ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=2.944..3.486 rows=10500 loops=20)
                 Sort Key: i.i
                 Sort Method: quicksort  Memory: 71kB
                 ->  Function Scan on generate_series i  (cost=0.01..10.01 rows=1000 width=4) (actual time=0.527..1.125 rows=10500 loops=20)
 Planning time: 0.080 ms
 Execution time: 86.165 ms
(9 rows)

In the case of a subplan, explain analyze gives us information only about the last subplan run. This makes me uneasy. On the one hand, it's probably OK for partial sort to behave like a subplan and show information only about the last sort run. On the other hand, we need a better solution for this in the general case.

 
Abbreviated Keys
-----------------------

It could be very bad for performance that the first non-presorted key
uses abbreviated keys. There needs to be a way to tell tuplesort to
not waste its time with them, just as there currently is for bounded
(top-N heapsort) sorts. They're almost certainly the wrong way to go,
unless you have huge groups (where partial sorting is unlikely to win
in the first place).

Agree. I made tuplesort skip abbreviated keys when the sort has a presorted prefix.
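The idea is essentially the trick tuplesort_begin_heap() already uses to restrict abbreviation to the leading column. A minimal sketch of the approach, assuming the number of presorted columns is passed down to tuplesort (the skipCols parameter here is only illustrative, not necessarily the exact patch code):

    for (i = 0; i < nkeys; i++)
    {
        SortSupport sortKey = state->sortKeys + i;

        sortKey->ssup_cxt = CurrentMemoryContext;
        sortKey->ssup_collation = sortCollations[i];
        sortKey->ssup_nulls_first = nullsFirstFlags[i];
        sortKey->ssup_attno = attNums[i];

        /*
         * Abbreviation pays off only for one big sort.  With a presorted
         * prefix (skipCols > 0) every group is expected to be small, so
         * don't authorize abbreviation at all in that case.
         */
        sortKey->abbreviate = (i == 0 && skipCols == 0);

        PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
    }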
 
Other issues in executor
--------------------------------

This is sort of an optimizer issue, but code lives in execAmi.c.
Assert is redundant here:

+               case T_Sort:
+                       /* We shouldn't reach here without having plan node */
+                       Assert(node);
+                       /* With skipCols sort node holds only last bucket */
+                       if (node && ((Sort *)node)->skipCols == 0)
+                               return true;
+                       else
+                               return false;

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

I don't like it either, but I didn't find a better solution without a significant rework of the planner.
However, "Upper planner pathification" by Tom Lane seems to include such a rework: the sort will likely become a separate path node there.
Then ExecMaterializesOutput() could read the parameters of that path node.
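For reference, once the redundant Assert is dropped, the T_Sort case there reduces to something like this (sketch only, still with the Plan argument I'd like to get rid of):

    bool
    ExecMaterializesOutput(NodeTag plantype, Plan *node)
    {
        switch (plantype)
        {
            case T_Material:
            case T_FunctionScan:
            case T_CteScan:
            case T_WorkTableScan:
                return true;

            case T_Sort:
                /* With skipCols the Sort node holds only the current group */
                return ((Sort *) node)->skipCols == 0;

            default:
                break;
        }

        return false;
    }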
 
There is similar assert/pointer test code within
ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
I have concerns about the way the determination of a sort's ability to
do stuff like be scanned backwards is now made dynamic, which this new
code demonstrates:

        /*
+        * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+        * or EXEC_FLAG_MARK, because we hold only current bucket in
+        * tuplesortstate.
+        */
+       Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+                                                EXEC_FLAG_BACKWARD |
+                                                EXEC_FLAG_MARK)) == 0);
+

I need to think some more about this general issue.

It has to be dynamic if we want to keep full sort and partial sort in the same node: if the properties of full sort and partial sort differ and they share the same node, then these properties of the Sort node have to be dynamic.
An alternative idea is to make the Sort node fall back to a full sort whenever it sees any of the flags above, but I'm not sure that is right. In some cases it might be cheaper to partial sort and then materialize than to fall back to a full sort.

Misc. issues
----------------

_readSort() needs READ_INT_FIELD().  _outSort() similarly needs
WRITE_INT_FIELD(). You've mostly missed this stuff.

Please be more careful about this. It's always a good idea to run the
regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
time, which tends to highlight these problems.
 
Fixed. I've tried "#define COPY_PARSE_PLAN_TREES"; the regression tests now pass with it.

tuplesort.h should not include sortsupport.h. It's a modularity
violation, and besides which is unnecessary. Similarly, pathkeys.c
should not include optimizer/cost.h.

Fixed. 

What is this?

+               if (inner_cheapest_total && inner_cheapest_total->pathtype == T_Sort)
+                       elog(ERROR, "Sort");

It's just a piece of debugging junk. Deleted.
 

Optimizer
-------------

I am not an expert on the optimizer, but I do have some feedback.

* cost_sort() needs way way more comments.  Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

I've added some comments.
 
* pathkeys_useful_for_ordering() now looks like a private convenience
wrapper for the new public function pathkeys_common().  I think that
comments should make this quite clear.

Exactly. An explicit comment about that has been added.
 
* compare_bifractional_path_costs() should live beside
compare_fractional_path_costs() and be public, I think. The existing
compare_fractional_path_costs() also only has a small number of
possible clients, and is still not static.

Now compare_bifractional_path_costs() lives beside compare_fractional_path_costs() and is public.
 
* Think it's not okay that there are new arguments, such as the
"tuples" argument for get_cheapest_fractional_path_for_pathkeys().

It seems a bad sign, design-wise, that a new argument of "PlannerInfo
*root" was added at end, for the narrow purpose of passing stuff to
estimate number of groups for the benefit of this patch.  ISTM that
grouping_planner() caller should do the
work itself as and when it alone needs to.

Now grouping_planner() calls a separate function, estimate_pathkeys_groups(), which is responsible for estimating the number of groups.
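Roughly, it collects one member expression per equivalence class from the presorted prefix of the pathkeys and feeds them to estimate_num_groups(). A simplified sketch of the idea (not the exact patch code; the pathkey-to-expression translation is simplified here):

    static double
    estimate_pathkeys_groups(PlannerInfo *root, List *pathkeys,
                             int n_presorted, double input_rows)
    {
        List       *exprs = NIL;
        ListCell   *lc;
        int         i = 0;

        foreach(lc, pathkeys)
        {
            PathKey    *pathkey = (PathKey *) lfirst(lc);
            EquivalenceMember *member;

            if (i++ >= n_presorted)
                break;

            /* take an arbitrary member expression of the equivalence class */
            member = (EquivalenceMember *) linitial(pathkey->pk_eclass->ec_members);
            exprs = lappend(exprs, member->em_expr);
        }

        return estimate_num_groups(root, exprs, input_rows, NULL);
    }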
 
* New loop within get_cheapest_fractional_path_for_pathkeys() requires
far more explanation.

Explain theory behind derivation of compare_bifractional_path_costs()
fraction arguments, please. I think there might be simple heuristics
like this elsewhere in the optimizer or selfuncs.c, but you need to
share why you did things that way in the code.
 
The idea is that since partial sort fetches data group by group, it has to fetch more data than a fully presorted path would.
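For illustration only (not the exact formula in the patch): with 1,000,000 rows, about 10,000 distinct values of the presorted key and LIMIT 100,

    rows per group             ~ 1000000 / 10000 = 100
    fully presorted path reads ~ 100 rows  -> fraction ~ 100 / 1000000
    partial sort path reads    ~ 200 rows  -> fraction ~ 200 / 1000000
                                 (it must finish whole groups before returning anything)

so the two paths have to be compared at different fractions of their output, which is why compare_bifractional_path_costs() takes a separate fraction argument for each path.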

* Within planner.c, "partial_sort_path" variable name in
grouping_planner() might be called something else.

Its purpose isn't clear. Also, the way that you mix path costs from
the new get_cheapest_fractional_path_for_pathkeys() into the new
cost_sort() needs to be explained in detail (as I already said,
cost_sort() is currently very under-documented).

I've tried to make it clearer. partial_sort_path is renamed to presorted_path.
 
Obviously the optimizer part of this needs the most work -- no
surprises there. I wonder if we cover all useful cases? I haven't yet
got around to using "#define OPTIMIZER_DEBUG" to see what's really
going on, which seems essential to understanding what is really
happening, but it looks like merge append paths can currently use the
optimization, whereas unique paths cannot. Have you thought about
that?

Unique paths can sometimes use this optimization.

# create table test as (select id, random() as v from generate_series(1,1000000) id);
# create index test_v_idx on test(v);

# explain select distinct v, id from test;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Unique  (cost=0.47..55104.41 rows=1000000 width=12)
   ->  Sort  (cost=0.47..50104.41 rows=1000000 width=12)
         Sort Key: v, id
         Presorted Key: v
         ->  Index Scan using test_v_idx on test  (cost=0.42..47604.41 rows=1000000 width=12)
(5 rows)

# explain select distinct id, v from test;
                                QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=132154.34..139654.34 rows=1000000 width=12)
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: id, v
         ->  Seq Scan on test  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

But it depends on the attribute order. I could work this case out too, but I would prefer to get a simpler version committed first. I've already thrown the merge join optimization away for the sake of simplicity.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
