Thread: Limited Sort
So I have a quick prototype of this and in fact it handles the common use case of a paged web page sorted on non-indexed columns very well. If you have only a small limit like most web pages often avoids external sorts and produces results 10-20x faster or more. Obviously by raising the size of the input data set and lowering the limit you can get an arbitrarily large speedup. The changes in tuplesort are really minimal. I refactored heapify since both the regular external sort and the bounded sort needed it, but if it hadn't been for that I think only about 25 lines would actually be added to tuplesort and only a handful changed. The part that's still bugging me is how nodeLimit communicates to nodeSort. What I've done for now is below. It bothers me that nodeLimit is peeking into nodeSort though. In particular it seems short sighted since nodeSort is far from the only node that could benefit from this information. If there's a HashAggregate for example it could stop creating new hash entries when the limit is reached. So it seems worthwhile having some sort of abstract API where other nodes could handle the information if they want without having to teach nodeLimit about every other node type that can. Index: src/backend/executor/nodeLimit.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeLimit.c,v retrieving revision 1.27 diff -c -r1.27 nodeLimit.c *** src/backend/executor/nodeLimit.c 26 Jul 2006 19:31:50 -0000 1.27 --- src/backend/executor/nodeLimit.c 14 Sep 2006 14:53:39 -0000 *************** *** 270,275 **** --- 270,295 ---- /* Reset position to start-of-scan */ node->position = 0; node->subSlot = NULL; + + /* This is a bit of a kluge + * + * I'm not aware of any abstract way to communicate between the two nodes. + * Perhaps I should add a new method in nodeSort like ExecAdviseSort and + * have a dispatcher in ExecNode that only has one branch. It isn't likely + * to be abstract enough to handle other forms of "advice" though. + */ + if (!node->noCount && (IsA(outerPlanState(node), SortState))) + { + SortState *sortState = (SortState*)outerPlanState(node); + int64 tuples_needed = node->count + node->offset; + + /* check for overflow */ + if (tuples_needed < 0) + return; + + sortState->bounded = true; + sortState->bound = tuples_needed; + } } I also think it might be easy to do what Martin suggested and trim the tapes to the limit as well. I can't see a big use case for this since you would have to have an extremely large input set to make it kick in before the final merge but there's no reason not to do it either. I don't think it adds any significant complexity. And lastly I find the idea of putting attention into OLAP functionality interesting. Does anyone have any ideas on where to start with that? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Mon, 2006-09-18 at 14:26 +0100, Gregory Stark wrote: > And lastly I find the idea of putting attention into OLAP functionality > interesting. Does anyone have any ideas on where to start with that? In SQL:2003 the OLAP functionality mostly relies on the concept of sorted Windows over which "Windowed Aggregates" are defined. Windows can be defined as ( CURRENT ROW | ( UNBOUNDED | n ) PRECEDING )[( UNBOUNDED | n ) FOLLOWING] e.g. current situation: CURRENT ROW running average over last 10 rows10 PRECEDING running totalUNBOUNDED PRECEDING The logic is set for the Window and doesn't make much sense to put that in the aggregate. I imagine that we would do this by allowing the sort results to be held as an internal cursor, so we can rewind by the appropriate number. If preceding sort was an external sort, then we use a tuplestore, modified so that it acts as a fixed-length FIFO queue so we retain lookback capability. n FOLLOWING would be emulated by fetching n rows ahead so that the cursor was set at the correct place before starting calculation. UNBOUNDED PRECEDING can be handled by caching the output of the last final function in the query state, so that we do not need to have a lookback capability at all in that case. (This may need to have an additional function type definition, since it won't work with all functions). e.g. select val, sum(val) over (order by val rows unbounded preceding) from foo; val running_SUM_val 1 1 we cache final_func output =1 2 3 =2+1, we then cache it again=3 3 6 =3+3, we then cache it again=6 4 10 =4+6 OLAP is a lot more complex than I've made out, but the only other thing we need in tuplesort.c to support Windowed aggregates is:ORDER BY foo (NULLS FIRST | NULLS LAST) which we cannot yet specify. I'm not planning to work on this just yet, but there's always next year. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
When we was talking about optimizing 'ORDER BY .. LIMIT' with Oleg and Alvaro at conference, we was thinking to make new executor's node - Partial Sort. And planner may choose which nodes to use: nodeSort and nodeLimit or nodePartialSort. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: > When we was talking about optimizing 'ORDER BY .. LIMIT' with Oleg and Alvaro > at conference, we was thinking to make new executor's node - Partial Sort. And > planner may choose which nodes to use: nodeSort and nodeLimit or > nodePartialSort. That's an interesting option too. Certainly it preserves the kind of separation between nodes that we have currently. It also might make costing the plan more practical. The downside would be that we would start having a proliferation of nodes. Several node types can take advantage of a limit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com