Thread: "micro bucket sort" ...
hello all ... i am bugged with a small issue which is basically like this ... test=# create table t_test as select x, x % 5 as y from generate_series(1, 1000000) AS x; SELECT test=# create index idx_aaaaa on t_test (x) ; CREATE INDEX test=# ANALYZE ; ANALYZE test=# explain analyze select * from t_test order by x; QUERYPLAN ------------------------------------------------------------------------------------------------------------------------------------Index Scanusing idx_aaaaa on t_test (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.057..311.832 rows=1000000 loops=1)Totalruntime: 392.943 ms (2 rows) we know that we get sorted output from the index and thus we do the index traversal here ... if you add a condition to the sorting you will naturally get a sort in postgres because y is clearly now known to be sorted. test=# explain analyze select * from t_test order by x, y; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------Sort (cost=141431.84..143931.84 rows=1000000 width=8) (actual time=1086.014..1271.257 rows=1000000 loops=1) Sort Key: x, y SortMethod: external sort Disk: 17608kB -> Seq Scan on t_test (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.024..143.474rows=1000000 loops=1)Total runtime: 1351.848 ms (5 rows) same with limit ... test=# explain analyze select * from t_test order by x, y limit 20; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------Limit (cost=41034.64..41034.69 rows=20 width=8) (actual time=317.939..317.943 rows=20 loops=1) -> Sort (cost=41034.64..43534.64rows=1000000 width=8) (actual time=317.934..317.936 rows=20 loops=1) Sort Key: x, y Sort Method: top-N heapsort Memory: 26kB -> Seq Scan on t_test (cost=0.00..14425.00 rows=1000000 width=8) (actualtime=0.019..144.109 rows=1000000 loops=1)Total runtime: 317.995 ms (6 rows) now, the problem is: i cannot easily create additional indexes as i have too many possible "second" conditions here. what makes it even more funny: i don't have enough space to do the resort of the entire thing (X TB). so, a more expensive index traversal is my only option. my question is: is there already a concept out there to make this work or does anybody know of a patch out there addressingan issue like that? some idea is heavily appreciated. it seems our sort key infrastructure is not enough for this. many thanks, hans -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
On Wed, 2010-08-11 at 14:21 +0200, Hans-Jürgen Schönig wrote: > my question is: is there already a concept out there to make this work > or does anybody know of a patch out there addressing an issue like > that? > some idea is heavily appreciated. it seems our sort key infrastructure > is not enough for this. Already discussed as a "partial sort". Thinks its on the TODO. SMOP. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010: > same with limit ... > > > test=# explain analyze select * from t_test order by x, y limit 20; But if you put the limit in a subquery which is ordered by the known-indexed condition, it is very fast: alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y; QUERY PLAN ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────Sort (cost=1.24..1.29 rows=20 width=8) (actual time=0.252..0.296 rows=20 loops=1) Sort Key: t_test.x, t_test.y Sort Method: quicksort Memory: 26kB -> Limit (cost=0.00..0.61 rows=20 width=8) (actual time=0.051..0.181 rows=20 loops=1) -> Index Scan using idx_aaaaa on t_test (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.046..0.098rows=20 loops=1)Total runtime: 0.425 ms (6 filas) I guess it boils down to being able to sort a smaller result set. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010: >> test=# explain analyze select * from t_test order by x, y limit 20; > But if you put the limit in a subquery which is ordered by the > known-indexed condition, it is very fast: > alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y; That's not guaranteed to give you the right 20 rows, though. Consider the case where there are > 20 rows having the minimal x value. regards, tom lane
as tom pointed out - this is not possible. there is no limit 20 in my case - i just used it to indicate that limiting does not make the index scan possible which itdoes in some other cases. the partial sort thing simon pointed out is what is needed at this point. many thanks, hans On Aug 11, 2010, at 5:29 PM, Alvaro Herrera wrote: > Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010: > >> same with limit ... >> >> >> test=# explain analyze select * from t_test order by x, y limit 20; > > But if you put the limit in a subquery which is ordered by the > known-indexed condition, it is very fast: > > alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y; > QUERY PLAN > ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── > Sort (cost=1.24..1.29 rows=20 width=8) (actual time=0.252..0.296 rows=20 loops=1) > Sort Key: t_test.x, t_test.y > Sort Method: quicksort Memory: 26kB > -> Limit (cost=0.00..0.61 rows=20 width=8) (actual time=0.051..0.181 rows=20 loops=1) > -> Index Scan using idx_aaaaa on t_test (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.046..0.098rows=20 loops=1) > Total runtime: 0.425 ms > (6 filas) > > > I guess it boils down to being able to sort a smaller result set. > > -- > Álvaro Herrera <alvherre@commandprompt.com> > The PostgreSQL Company - Command Prompt, Inc. > PostgreSQL Replication, Consulting, Custom Development, 24x7 support > -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de
2010/8/12 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>: > as tom pointed out - this is not possible. > there is no limit 20 in my case - i just used it to indicate that limiting does not make the index scan possible whichit does in some other cases. I came up with this: explain analyze select * from (select * from t_test order by x limit all)s order by x, y limit 20; which uses index scan for column x and top-N heapsort for outer ORDER BY, though it's slower than "ORDER BY x LIMIT 20" case. If it chooses external sort for "ORDER BY x, y" LIMIT ALL likely wins while looses if quicksort is chosen. Regards, -- Hitoshi Harada
2010/8/11 Hans-Jürgen Schönig <postgres@cybertec.at>: > now, the problem is: i cannot easily create additional indexes as i have too many possible "second" conditions here. > Is it just me or is this description of the problem not very specific? Can you give more examples of your queries and explain what kind of plan you're looking for for them? -- greg