Thread: Windowing Function Patch Review -> Performance Comparison.
All,

This is my first patch review for PostgreSQL. I did submit a patch last commit fest (Boyer-Moore), so I feel I should review one this commit fest. I'm quite new to PostgreSQL, so please don't rely on me totally. I'll do my best. Heikki is also reviewing this patch, which makes me feel better.

My aim is to give the author as much feedback as quickly as possible. For this reason I'm going to break my review down into the following topics:

1. Does the patch apply cleanly?
2. Any compiler warnings?
3. Do the results follow the SQL standard?
4. Performance comparison: does it perform better than alternate ways of doing things (self joins, sub-queries etc.)?
5. Performance, no comparison: how does it perform with larger tables?

Things I probably won't attempt to review: source code — best practices, making use of existing APIs etc. I'd rather leave that for Heikki and possibly others that join in reviewing this patch. It's not that I'm too lazy, just that I don't feel I know the source well enough. Plus it's a complex patch.

Really I should follow my list in order, but I'm going to do number 4 first in order to get some quick feedback to the author. I've created some "real world" tests where windowing functions will be useful. I created some tables and populated them with data, then wrote two queries for each test: one making use of windowing functions, the other using a method without windowing functions.

Test Results:

Test     Normal   Windowing   UOM          Increase %
Test 1   498.00     578.00    Trans/Sec        16.06%
Test 2   336.00     481.00    Trans/Sec        43.15%
Test 3     1.30       8.45    Trans/Sec       550.00%
Test 4   424.00     629.00    Trans/Sec        48.35%
Test 5     8.89   31052.69    Trans/Hour   349114.85%
Test 6   253.00     305.00    Trans/Sec        20.55%

(Please see the attached document for the actual tests.)

Note: the above results depend greatly on the data set. Most of my tests use a very small volume of data; tests 3 and 5 use more data than the other tests. It's quite obvious that the more data there is in my tests, the bigger the margin between the two methods becomes. I originally ran test 3 with 40000 rows to simulate a large marathon, but the "normal query" was going to take hours, so I reduced the rows to 10000.

Observations:

Tests 3 and 5 did not seem to make use of an index to get a sorted list of results. I disabled enable_seqscan but the planner still failed to choose index_scan. Is there any reason for this? Perhaps I'm missing something. Hitoshi, can you take a look at this?

Tests:

Please see the attached file. Perhaps there were more efficient ways to write certain queries; I just couldn't think of them...

Please let me know if you feel I should be conducting the review in another way.

David.
Attachment
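[Editor's sketch: the two query styles David is benchmarking compute the same answer with very different amounts of work. A toy Python illustration (names and data are hypothetical, not from the attached tests): the window form makes one sorted pass, while the window-free subquery form re-scans the table per row, which is why the gap widens with data volume.]

```python
# Toy data: (runner, finish_time) rows, as in a marathon-results table.
rows = [("ann", 195), ("bob", 210), ("col", 210), ("dan", 230)]

# Window-function style: one sorted pass,
# like DENSE_RANK() OVER (ORDER BY time).
def dense_rank_window(rows):
    out, rank, prev = [], 0, None
    for runner, t in sorted(rows, key=lambda r: r[1]):
        if t != prev:
            rank += 1
            prev = t
        out.append((runner, t, rank))
    return out

# Window-free style: a correlated subquery per row,
# COUNT(DISTINCT t2.time) WHERE t2.time < t1.time — O(n^2) work.
def dense_rank_subquery(rows):
    return sorted(
        ((runner, t, len({t2 for _, t2 in rows if t2 < t}) + 1)
         for runner, t in rows),
        key=lambda r: r[1],
    )

print(dense_rank_window(rows))
# → [('ann', 195, 1), ('bob', 210, 2), ('col', 210, 2), ('dan', 230, 3)]
print(dense_rank_subquery(rows))   # same result, quadratic cost
```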
Here is another way to solve the "big marathon" problem without window functions (and many other kinds of "windowing" queries, especially those that do not specify "rows preceding" etc.).

It could be considered a very dirty hack, however it could give you an insight into the performance of the "windowed" query with an index scan instead of a seqscan.

create function var_set (text, text) returns text
as '
  select set_config (''public.''||$2||pg_backend_pid(), $1, false);
' LANGUAGE 'sql';

create function var_get (text) returns text
as '
  select current_setting(''public.''||$1||pg_backend_pid());
' LANGUAGE 'sql';

create operator >>> (procedure = var_set, leftarg = text, rightarg = text);
create operator <<< (procedure = var_get, rightarg = text);

-- init values
select '' >>> 'prev_time', '0' >>> 'dense_rank';

-- marathon query
select *
  from (
    select (((case when time::text = <<< 'prev_time' then 0 else 1 end)
             + (<<< 'dense_rank')::int4)::text >>> 'dense_rank')::int4 as position,
           runnerid, time
      from big_marathon
     order by time
  ) results
 where position = 2;

Best regards,
Vladimir Sitnikov
Just a small correction: there should be time::text >>> 'prev_time' for the calculations to be correct:

select *
  from (
    select (((case when time::text = <<< 'prev_time' then 0 else 1 end)
             + (<<< 'dense_rank')::int4)::text >>> 'dense_rank')::int4 as position,
           runnerid, time,
           time::text >>> 'prev_time'
      from big_marathon
     order by time
  ) results
 where position = 2;

-- meter_readings
select '' >>> 'lag';

select date,
       reading::numeric
         - (case lag when '' then null else lag end)::numeric as used
  from (
    select date, <<< 'lag' as lag, reading::text >>> 'lag' as reading
      from meter_readings
     order by date
  ) as t
 order by used asc nulls last
 limit 1;

Best regards,
Vladimir Sitnikov
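[Editor's sketch: Vladimir's trick works because the inner ORDER BY pushes rows through the select list in sorted order while the session variables carry running state from row to row. The same control flow, in illustrative Python with `session` standing in for set_config/current_setting (toy data, hypothetical names):]

```python
# Session-variable store, standing in for PostgreSQL's
# set_config()/current_setting() custom GUC variables.
session = {}

def var_set(value, name):      # the >>> operator; returns the value, like set_config
    session[name] = value
    return value

def var_get(name):             # the <<< operator
    return session[name]

rows = [("ann", "195"), ("bob", "210"), ("col", "210"), ("dan", "230")]

# -- init values
var_set("", "prev_time"); var_set("0", "dense_rank")

# -- marathon query: dense rank accumulated row by row over the sorted scan
results = []
for runner, t in sorted(rows, key=lambda r: r[1]):
    step = 0 if t == var_get("prev_time") else 1
    position = int(var_set(str(step + int(var_get("dense_rank"))), "dense_rank"))
    var_set(t, "prev_time")    # the fix from the follow-up correction
    results.append((position, runner, t))

second_place = [r for r in results if r[0] == 2]
print(second_place)   # → [(2, 'bob', '210'), (2, 'col', '210')]
```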
2008/11/2 David Rowley <dgrowley@gmail.com>:
> Observations:
>
> Tests 3 and 5 did not seem to make use of an index to get a sorted list of
> results. I disabled enable_seqscan but the planner still failed to choose
> index_scan. Is there any reason for this? Perhaps I'm missing something.
> Hitoshi, can you take a look at this?

Ah, good point. Maybe it's because I haven't paid attention to choosing index_scan for the upper sort node. I just put the sort node in whatever the lower node is, so the information that a sort node sits above may need to be pushed down to the scan-choice process. Could someone point out to me how to do it, or which part of the existing code would be a good guide?

> Tests:
>
> Please see attached file. Perhaps there were more efficient ways for
> certain queries, I just couldn't think of them...
>
> Please let me know if you feel I should be conducting the review in
> another way.

Thanks for your test. Though I didn't post it publicly, I've also tested real problems, and it performed better than I thought. If you can afford it, could you add self-join cases? Like:

-- normal
SELECT t1.id, t1.grp, count(t2) + 1 AS row_number
FROM t t1
LEFT JOIN t t2 ON t1.grp = t2.grp AND t1.id > t2.id
GROUP BY t1.id, t1.grp;

-- windowing
SELECT id, grp, row_number() OVER (PARTITION BY grp ORDER BY id)
FROM t;

Regards,

--
Hitoshi Harada
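[Editor's sketch: the two formulations Hitoshi asks to compare should agree. A quick Python check of the semantics (illustrative toy data; in SQL the self join needs care — e.g. an outer join — so the first row of each group, which matches no smaller id, is not lost):]

```python
# Toy data: (id, grp) rows.
t = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "b")]

# Windowing: row_number() OVER (PARTITION BY grp ORDER BY id),
# computed as one sorted pass with a per-group counter.
def row_number_window(t):
    out, counters = [], {}
    for id_, grp in sorted(t):
        counters[grp] = counters.get(grp, 0) + 1
        out.append((id_, grp, counters[grp]))
    return out

# Self join: count, for each row, the rows of the same group with a
# smaller id, plus one — one scan of t per row of t.
def row_number_selfjoin(t):
    return sorted(
        (id1, g1, sum(1 for id2, g2 in t if g2 == g1 and id2 < id1) + 1)
        for id1, g1 in t
    )

print(row_number_window(t))
# → [(1, 'a', 1), (2, 'a', 2), (3, 'a', 3), (4, 'b', 1), (5, 'b', 2)]
```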
Hitoshi Harada wrote:
> Thanks for your test. Though I didn't post it publicly, I've also tested
> real problems, and it performed better than I thought. If you can afford
> it, could you add self-join cases?

Ok, I did self joins where I could. I don't know if it's possible for all of them.

Test     Sub query   Self join   Vladimir   Windowing   UOM          Increase %
Test 1      498.00        N/A        N/A      578.00    Trans/Sec        16.06%
Test 2      336.00     424.00     182.78      481.00    Trans/Sec        13.44%
Test 3        1.30       7.59       1.90        8.45    Trans/Sec        11.33%
Test 4      424.00     361.00     182.00      629.00    Trans/Sec        48.35%
Test 5        8.89        N/A    5844.16    31052.69    Trans/Hour      431.35%
Test 6      253.00        N/A        N/A      305.00    Trans/Sec        20.55%

See attached for details. The Increase % column is now:

Window / max(Sub query, Self join, Vladimir) - 1

Vladimir, I've included your tests too. I understand that people will probably use this method, as sometimes there is little choice to get the performance that is required.

Hitoshi Harada wrote:
> 2008/11/2 David Rowley <dgrowley@gmail.com>:
>> Observations:
>>
>> Tests 3 and 5 did not seem to make use of an index to get a sorted list
>> of results. I disabled enable_seqscan but the planner still failed to
>> choose index_scan. Is there any reason for this? Perhaps I'm missing
>> something. Hitoshi, can you take a look at this?
>
> Ah, good point. Maybe it's because I haven't paid attention to choosing
> index_scan for the upper sort node. I just put the sort node in whatever
> the lower node is, so the information that a sort node sits above may need
> to be pushed down to the scan-choice process. Could someone point out to
> me how to do it, or which part of the existing code would be a good guide?

I know you need to wait for an answer about this, so I'd like to delay any further performance tests until that's sorted out, as it should affect the performance of larger tables quite a bit.

David.
Attachment
2008/11/2 David Rowley <dgrowley@gmail.com>:
> Hitoshi Harada wrote:
>> Ah, good point. Maybe it's because I haven't paid attention to choosing
>> index_scan for the upper sort node. I just put the sort node in whatever
>> the lower node is, so the information that a sort node sits above may
>> need to be pushed down to the scan-choice process. Could someone point
>> out to me how to do it, or which part of the existing code would be a
>> good guide?
>
> I know you need to wait for an answer about this, so I'd like to delay any
> further performance tests until that's sorted out, as it should affect the
> performance of larger tables quite a bit.

I found out how to do it, though only for the case you gave. Thinking about planner optimization of Window nodes (and their attached Sort nodes), we must consider the execution order of more than one node. In this test case we take care of only one window, but there may be more window/sort node sets, and it is too difficult to choose the best execution order including the lower index scan, merge joins in sub-queries, and sort-based GROUP BY. So I didn't touch that complicated planner jungle. I rewrote the patch so that only the bottom window's sort can consider an index scan. Deeper optimizations are beyond my capability.

Attached is a delta patch against the last one. Also see the git diff:

http://git.postgresql.org/?p=~davidfetter/window_functions/.git;a=commitdiff;h=bbba638f721a7e1d11cb3ee6af3bc1d7d3c11aa8;hp=48b73ee574779a14a3c36d373d8544d59a5b8b46

Regards,

--
Hitoshi Harada
Attachment
Hitoshi Harada wrote:
>>>> Tests 3 and 5 did not seem to make use of an index to get a sorted list
>>>> of results. I disabled enable_seqscan but the planner still failed to
>>>> choose index_scan. Is there any reason for this? Perhaps I'm missing
>>>> something. Hitoshi, can you take a look at this?
>>>
>>> Ah, good point. Maybe it's because I haven't paid attention to choosing
>>> index_scan for the upper sort node. [...]
>>
>> I know you need to wait for an answer about this, so I'd like to delay
>> any further performance tests until that's sorted out, as it should
>> affect the performance of larger tables quite a bit.
>
> I found out how to do it, though only for the case you gave. Thinking
> about planner optimization of Window nodes (and their attached Sort
> nodes), we must consider the execution order of more than one node. In
> this test case we take care of only one window, but there may be more
> window/sort node sets, and it is too difficult to choose the best
> execution order including the lower index scan, merge joins in
> sub-queries, and sort-based GROUP BY. So I didn't touch that complicated
> planner jungle. I rewrote the patch so that only the bottom window's sort
> can consider an index scan. Deeper optimizations are beyond my capability.

I've just looked into what some other implementations do. Sybase seems to do exactly what you've done: it only looks at the first window clause in the query. Oracle seems to use the index regardless of the position of the window clause.

To me, what you've done seems fine for now. Perhaps something could be done later to improve on this. Maybe someone else has ideas about how to do it?

It seems quite similar to SELECT MAX(idxcol), MAX(idxcol2), where the planner often makes use of two indexes when available, yet that case is probably far simpler as there is always just one row. Costing would likely be more complex with the windowing functions version.

Good work. I'll continue with more benchmarks soon.

David.
Hitoshi Harada wrote:
> I found out how to do it, though only for the case you gave. [...] I
> rewrote the patch so that only the bottom window's sort can consider an
> index scan. Deeper optimizations are beyond my capability.

Sorry for the delay on this. I've updated the benchmark results using the new patch you sent today. I did a dump and re-load of the tables; since some of the numbers are randomly generated, I wouldn't want to compare them to the old results for any of the tests. This is a completely new list with CVS head as of this morning.

Test     Sub Query   Self Join   Vladimir   Windowing   UOM   Window over Best alt
Test 1     504           N/A        N/A       568       TPS        12.7%
Test 2     340.9         425        182       450.38    TPS         6.0%
Test 3       1.304         8.12       1.963     7.52    TPS        -7.4%
Test 4     422           365        195       630       TPS        49.3%
Test 5       8.874        N/A      5825     31203       TPH       435.6%
Test 6     251           N/A        N/A       300       TPS        19.5%

Only tests 3 and 5 made use of the index scan. Performance dropped slightly on test 3, but there's not much point in paying much attention to that, since we're probably close to the crossover point between a seqscan and an index scan, where the planner's decision is not as critical.

Certain self-join methods I used don't implement the exact requirements I've stated at the top of the test. For example, the meter-reading self join requires no days to be missed.

Maybe multi-window optimisation is one for 8.5's TODO.

I've attached the test scripts.

David.
Attachment
Hitoshi Harada wrote:
> I found out how to do it, though only for the case you gave. [...] I
> rewrote the patch so that only the bottom window's sort can consider an
> index scan. Deeper optimizations are beyond my capability.

After more playing around with a few queries and testing performance on larger tables, I discovered something strange in the plan for this query:

david=# explain select date, lag(date,1) over (order by date) from meter_readings order by date;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Sort  (cost=1038.73..1063.74 rows=10001 width=4)
   Sort Key: date
   ->  Window  (cost=0.00..374.27 rows=10001 width=4)
         ->  Index Scan using meter_readings_pkey on meter_readings  (cost=0.00..299.27 rows=10001 width=4)
(4 rows)

Is the final sort still required? Is the data not already sorted by the window?

The table I was testing on spilled the sort to disk; I would probably not have noticed otherwise.

David.
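[Editor's sketch of why the outer sort looks redundant: a Window node whose window is ordered on a key emits its rows in that key order, so an outer ORDER BY on the same key re-sorts already-sorted data. In illustrative Python terms (toy data, not the executor's actual code):]

```python
rows = [4, 1, 3, 2]

def lag_window(rows):
    """lag(value, 1) OVER (ORDER BY value): the Window node sorts its
    input once and then emits rows in that sorted order."""
    ordered = sorted(rows)
    return [(v, ordered[i - 1] if i > 0 else None)
            for i, v in enumerate(ordered)]

out = lag_window(rows)
# The window's output is already ordered on its sort key, so a final
# ORDER BY on the same key would only re-sort sorted data.
assert [v for v, _ in out] == sorted(rows)
print(out)   # → [(1, None), (2, 1), (3, 2), (4, 3)]
```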
2008/11/10 David Rowley <dgrowley@gmail.com>:
> After more playing around with a few queries and testing performance on
> larger tables, I discovered something strange in the plan for this query:
>
> david=# explain select date, lag(date,1) over (order by date) from
> meter_readings order by date;
>                                  QUERY PLAN
> ----------------------------------------------------------------------
>  Sort  (cost=1038.73..1063.74 rows=10001 width=4)
>    Sort Key: date
>    ->  Window  (cost=0.00..374.27 rows=10001 width=4)
>          ->  Index Scan using meter_readings_pkey on meter_readings
>              (cost=0.00..299.27 rows=10001 width=4)
> (4 rows)
>
> Is the final sort still required? Is the data not already sorted by the
> window?

Oh, I forgot to mention it. This behavior is also fixed, and it now works without a sort above the window. I don't remember at all why I did it that way, and there's no comment around that code, but the regression tests show there is no problem without it.

Regards,

--
Hitoshi Harada
Hitoshi Harada wrote:
>> david=# explain select date, lag(date,1) over (order by date) from
>> meter_readings order by date;
>> [EXPLAIN output above]
>>
>> Is the final sort still required? Is the data not already sorted by the
>> window?
>
> Oh, I forgot to mention it. This behavior is also fixed, and it now works
> without a sort above the window. I don't remember at all why I did it that
> way, and there's no comment around that code, but the regression tests
> show there is no problem without it.

Fantastic news! That will speed up my few test queries quite a bit; the sort was spilling to disk, so performance was dropping considerably. This might be a good time to say that the hardware I'm testing on is ultra slow: a 600 MHz Pentium III with only 28MB of shared buffers.

I've done some more performance tests with the patch. Realising that I really need to compare the performance with something else, I decided to have a query process a large table with and then without a windowing function, just to see how much the query slows down. Of course there is going to be an overhead in using a tuple store; I just wanted to see how much. My results are not really very interesting, so I'll include them at the bottom of the email for those who want to see.

Casting my mind back to when the patch always did a sort before a window with an ORDER BY, even when a btree index was there, it's really come a long way. I've little idea how difficult it would be to implement better planning for the following. If it's difficult, then it's probably better to wait for the patch to be committed first, or maybe it's something for 8.5.

SELECT department,
       SUM(salary),
       ROW_NUMBER() OVER (ORDER BY department),
       SUM(SUM(salary)) OVER (ORDER BY department DESC)
FROM employees
GROUP BY department
ORDER BY department;

This query performs more sorts than are really needed. I suppose the most efficient way to process it would be to process the 2nd window first, then the 1st, before returning the results in the same order as the 1st window. Currently the plan looks like:

                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Sort  (cost=1.33..1.34 rows=3 width=9)
   Sort Key: department
   ->  Window  (cost=1.25..1.31 rows=3 width=9)
         ->  Sort  (cost=1.25..1.26 rows=3 width=9)
               Sort Key: department
               ->  Window  (cost=1.17..1.23 rows=3 width=9)
                     ->  Sort  (cost=1.17..1.18 rows=3 width=9)
                           Sort Key: department
                           ->  HashAggregate  (cost=1.10..1.15 rows=3 width=9)
                                 ->  Seq Scan on employees  (cost=0.00..1.06 rows=6 width=9)

Maybe it's possible to sort the processing order of the windows based on their ORDER BY clauses, putting any that match the ORDER BY of the final results last. I've not looked into this in much detail, but currently I cannot see any scenario where it would be bad.

What do you think?
David.

CREATE TABLE bigtable (
  id SERIAL NOT NULL PRIMARY KEY,
  timestamp TIMESTAMP NOT NULL
);

-- about 383MB of data
INSERT INTO bigtable (timestamp)
SELECT NOW() + (CAST(RANDOM() * 10 AS INT) || ' secs')::INTERVAL
FROM generate_series(1,10000000);

CREATE INDEX bigtable_timestamp_idx ON bigtable (timestamp);
VACUUM ANALYZE bigtable;

-- base line test
david=# SELECT COUNT(*) FROM (select id,timestamp from bigtable order by id) t;
  count
----------
 10000000
(1 row)

Time: 72862.238 ms

-- lag test
david=# SELECT COUNT(*) FROM (select id,LAG(timestamp,1) OVER (order by id) from bigtable order by id) t;
  count
----------
 10000000
(1 row)

Time: 257713.350 ms

-- ntile test
david=# SELECT COUNT(*) FROM (select id,NTILE(10) OVER (order by id) from bigtable order by id) t;
  count
----------
 10000000
(1 row)

Time: 183131.425 ms
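[Editor's sketch: the window-reordering idea above can be made concrete with a toy sort counter (hypothetical, not the planner's real cost model). Evaluating the DESC-ordered window first lets the final output order fall out of the last window's sort, saving one sort.]

```python
def sorts_needed(window_orders, final_order):
    """Count the sorts a naive executor performs when windows are
    evaluated in the given sequence: one sort per window whose order
    differs from the current row order, plus a final sort if the last
    window's order differs from the query's ORDER BY."""
    sorts, current = 0, None
    for order in window_orders:
        if order != current:
            sorts += 1
            current = order
    if final_order is not None and final_order != current:
        sorts += 1
    return sorts

# Query: ... ROW_NUMBER() OVER (ORDER BY department),
#            SUM(SUM(salary)) OVER (ORDER BY department DESC)
#        ... ORDER BY department
asc, desc = ("department", "asc"), ("department", "desc")

print(sorts_needed([asc, desc], asc))   # → 3  (textual order: the current plan)
print(sorts_needed([desc, asc], asc))   # → 2  (DESC window evaluated first)
```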
2008/11/15 David Rowley <dgrowley@gmail.com>:
> Fantastic news! That will speed up my few test queries quite a bit; the
> sort was spilling to disk, so performance was dropping considerably. This
> might be a good time to say that the hardware I'm testing on is ultra
> slow: a 600 MHz Pentium III with only 28MB of shared buffers.
>
> I've done some more performance tests with the patch. Realising that I
> really need to compare the performance with something else, I decided to
> have a query process a large table with and then without a windowing
> function, just to see how much the query slows down. Of course there is
> going to be an overhead in using a tuple store; I just wanted to see how
> much. My results are not really very interesting, so I'll include them at
> the bottom of the email for those who want to see.

Thanks for your test code. I attach the results of your test, with another query added. The added row_number() query uses a row-buffer strategy that doesn't hold a tuplestore but only scans and returns rows. So the difference between the row_number() case (44 sec) and the ntile() case (61 sec) roughly shows how much overhead the tuplestore adds. I had supposed the row_number() case would be even more efficient, but still we can see it works as an optimization.

> Casting my mind back to when the patch always did a sort before a window
> with an ORDER BY, even when a btree index was there, it's really come a
> long way. I've little idea how difficult it would be to implement better
> planning for the following. If it's difficult, then it's probably better
> to wait for the patch to be committed first, or maybe it's something for
> 8.5.

Yeah, the planner around GROUP BY, ORDER BY, DISTINCT, index scans, and windows is quite complicated. I think I can manage to do it if there's enough time left, but first the basic design should be qualified by core hackers. I am waiting for subsequent review and responses from Heikki and others.

> SELECT department,
>        SUM(salary),
>        ROW_NUMBER() OVER (ORDER BY department),
>        SUM(SUM(salary)) OVER (ORDER BY department DESC)
> FROM employees
> GROUP BY department
> ORDER BY department;
>
> This query performs more sorts than are really needed. I suppose the most
> efficient way to process it would be to process the 2nd window first, then
> the 1st, before returning the results in the same order as the 1st window.
>
> Currently the plan looks like:
>
> [EXPLAIN output quoted above]
>
> Maybe it's possible to sort the processing order of the windows based on
> their ORDER BY clauses, putting any that match the ORDER BY of the final
> results last. I've not looked into this in much detail, but currently I
> cannot see any scenario where it would be bad.
>
> What do you think?

The patch doesn't have the infrastructure to relocate window nodes yet. When relocating them we must re-number wfunc->winref, so it's a bit of work, though possible. Also, I can imagine fixing only the case above, but I'm afraid that without a good overall sketch of the planner, side effects would come up. For instance, what if the window query is a subquery of a grouping with a sort strategy that assumes its subplan doesn't sort? Maybe in that case the upper GROUP BY wouldn't notice the sort-key change caused by relocating the window node in the subplan.
Regards,

--
Hitoshi Harada

$ uname -srpio
Linux 2.6.9-55.0.9.ELsmp i686 i386 GNU/Linux
# cpuinfo
Intel(R) Xeon(R) CPU X3210 @ 2.13GHz

sample=# explain analyze select id, timestamp from bigtable order by id;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using bigtable_pkey on bigtable  (cost=0.00..286808.13 rows=10000000 width=12) (actual time=0.021..11996.336 rows=10000000 loops=1)
 Total runtime: 20924.200 ms
(2 rows)

sample=# explain analyze select id, row_number() OVER (order by id) from bigtable order by id;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Window  (cost=0.00..361808.13 rows=10000000 width=4) (actual time=0.080..35265.786 rows=10000000 loops=1)
   ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..286808.13 rows=10000000 width=4) (actual time=0.075..13279.216 rows=10000000 loops=1)
 Total runtime: 44067.815 ms
(3 rows)

sample=# explain analyze select id, LAG(timestamp,1) over (order by id) from bigtable order by id;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Window  (cost=0.00..411808.13 rows=10000000 width=12) (actual time=30240.715..70439.057 rows=10000000 loops=1)
   ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..286808.13 rows=10000000 width=12) (actual time=0.077..15302.919 rows=10000000 loops=1)
 Total runtime: 79658.314 ms
(3 rows)

sample=# explain analyze select id, ntile(10) OVER (order by id) from bigtable order by id;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Window  (cost=0.00..386808.13 rows=10000000 width=4) (actual time=25158.467..52250.418 rows=10000000 loops=1)
   ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..286808.13 rows=10000000 width=4) (actual time=0.075..13088.061 rows=10000000 loops=1)
 Total runtime: 61275.279 ms
(3 rows)