Thread: "micro bucket sort" ...

"micro bucket sort" ...

From
Hans-Jürgen Schönig
Date:
hello all ...

i am bugged with a small issue which is basically like this ...

test=# create table t_test as select x, x % 5 as y from generate_series(1, 1000000) AS x;
SELECT
test=# create index idx_aaaaa on t_test (x) ;
CREATE INDEX
test=# ANALYZE ;
ANALYZE
test=# explain analyze select * from t_test order by x;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_aaaaa on t_test  (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.057..311.832 rows=1000000 loops=1)
 Total runtime: 392.943 ms
(2 rows)

we know that we get sorted output from the index and thus we do the index traversal here ...
if you add a second column to the sort you will naturally get a sort node in postgres, because y is clearly not known to be
sorted within each value of x.

test=# explain analyze select * from t_test order by x, y;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=141431.84..143931.84 rows=1000000 width=8) (actual time=1086.014..1271.257 rows=1000000 loops=1)
   Sort Key: x, y
   Sort Method:  external sort  Disk: 17608kB
   ->  Seq Scan on t_test  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.024..143.474 rows=1000000 loops=1)
 Total runtime: 1351.848 ms
(5 rows)


same with limit ...


test=# explain analyze select * from t_test order by x, y limit 20;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=41034.64..41034.69 rows=20 width=8) (actual time=317.939..317.943 rows=20 loops=1)
   ->  Sort  (cost=41034.64..43534.64 rows=1000000 width=8) (actual time=317.934..317.936 rows=20 loops=1)
         Sort Key: x, y
         Sort Method:  top-N heapsort  Memory: 26kB
         ->  Seq Scan on t_test  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.019..144.109 rows=1000000 loops=1)
 Total runtime: 317.995 ms
(6 rows)

now, the problem is: i cannot easily create additional indexes as i have too many possible "second" conditions here.
what makes it even funnier: i don't have enough disk space to re-sort the entire thing (X TB).
so, a more expensive index traversal is my only option.
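
for illustration, the per-combination workaround that is ruled out here would be a composite index (hypothetical name idx_x_y, just a sketch, not something that exists in my setup):

-- hypothetical: covers exactly one ORDER BY combination (x, y); with many
-- possible "second" columns there would have to be one such index per combination
create index idx_x_y on t_test (x, y);

-- with idx_x_y in place this ordering could come straight from the index,
-- with no separate sort step
explain analyze select * from t_test order by x, y;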

my question is: is there already a concept out there to make this work, or does anybody know of a patch out there
addressing an issue like that?
any idea is highly appreciated. it seems our sort key infrastructure is not enough for this.
many thanks,
    hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: "micro bucket sort" ...

From
Simon Riggs
Date:
On Wed, 2010-08-11 at 14:21 +0200, Hans-Jürgen Schönig wrote:
> my question is: is there already a concept out there to make this work
> or does anybody know of a patch out there addressing an issue like
> that?
> any idea is highly appreciated. it seems our sort key infrastructure
> is not enough for this.

Already discussed as a "partial sort". Think it's on the TODO.

SMOP.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services



Re: "micro bucket sort" ...

From
Alvaro Herrera
Date:
Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010:

> same with limit ...
> 
> 
> test=# explain analyze select * from t_test order by x, y limit 20;

But if you put the limit in a subquery which is ordered by the
known-indexed condition, it is very fast:

alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y;
                                          QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=1.24..1.29 rows=20 width=8) (actual time=0.252..0.296 rows=20 loops=1)
   Sort Key: t_test.x, t_test.y
   Sort Method:  quicksort  Memory: 26kB
   ->  Limit  (cost=0.00..0.61 rows=20 width=8) (actual time=0.051..0.181 rows=20 loops=1)
         ->  Index Scan using idx_aaaaa on t_test  (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.046..0.098 rows=20 loops=1)
 Total runtime: 0.425 ms
(6 filas)


I guess it boils down to being able to sort a smaller result set.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: "micro bucket sort" ...

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010:
>> test=# explain analyze select * from t_test order by x, y limit 20;

> But if you put the limit in a subquery which is ordered by the
> known-indexed condition, it is very fast:

> alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y;

That's not guaranteed to give you the right 20 rows, though.  Consider
the case where there are > 20 rows having the minimal x value.
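
A quick sketch of that failure case (hypothetical table t_demo, for illustration only):

-- hypothetical data: 100 rows all share the minimal x value
create table t_demo as
    select 0 as x, g as y from generate_series(1, 100) as g
    union all
    select g, g from generate_series(1, 1000) as g;
create index idx_demo_x on t_demo (x);
analyze t_demo;

-- the inner query may return any 20 of the 100 rows with x = 0, in no
-- particular y order, so sorting only those 20 afterwards is not
-- guaranteed to produce the 20 smallest (x, y) pairs
select * from (select * from t_demo order by x limit 20) f order by x, y;

-- correct answer for comparison
select * from t_demo order by x, y limit 20;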
        regards, tom lane


Re: "micro bucket sort" ...

From
PostgreSQL - Hans-Jürgen Schönig
Date:
as tom pointed out - this is not possible.
there is no limit 20 in my case - i just used it to indicate that limiting does not make the index scan possible here,
which it does in some other cases.
the partial sort thing simon pointed out is what is needed at this point.
many thanks,
    hans



On Aug 11, 2010, at 5:29 PM, Alvaro Herrera wrote:

> Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010:
>
>> same with limit ...
>>
>>
>> test=# explain analyze select * from t_test order by x, y limit 20;
>
> But if you put the limit in a subquery which is ordered by the
> known-indexed condition, it is very fast:
>
> alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y;
>                                          QUERY PLAN
> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
> Sort  (cost=1.24..1.29 rows=20 width=8) (actual time=0.252..0.296 rows=20 loops=1)
>   Sort Key: t_test.x, t_test.y
>   Sort Method:  quicksort  Memory: 26kB
>   ->  Limit  (cost=0.00..0.61 rows=20 width=8) (actual time=0.051..0.181 rows=20 loops=1)
>         ->  Index Scan using idx_aaaaa on t_test  (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.046..0.098 rows=20 loops=1)
> Total runtime: 0.425 ms
> (6 filas)
>
>
> I guess it boils down to being able to sort a smaller result set.
>
> --
> Álvaro Herrera <alvherre@commandprompt.com>
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



Re: "micro bucket sort" ...

From
Hitoshi Harada
Date:
2010/8/12 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
> as tom pointed out - this is not possible.
> there is no limit 20 in my case - i just used it to indicate that limiting does not make the index scan possible here,
> which it does in some other cases.

I came up with this:

explain analyze select * from (select * from t_test order by x limit all) s order by x, y limit 20;

which uses an index scan for column x and a top-N heapsort for the outer ORDER
BY, though it is slower than the plain "ORDER BY x LIMIT 20" case. Where the
planner would otherwise pick an external sort for "ORDER BY x, y", the LIMIT ALL
form likely wins; where a quicksort is chosen, it loses.


Regards,


--
Hitoshi Harada


Re: "micro bucket sort" ...

From
Greg Stark
Date:
2010/8/11 Hans-Jürgen Schönig <postgres@cybertec.at>:
> now, the problem is: i cannot easily create additional indexes as i have too many possible "second" conditions here.
>

Is it just me, or is this description of the problem not very specific?
Can you give more examples of your queries and explain what kind of
plan you're hoping to get for them?

--
greg