Re: the best way to get the first record from each group - Mailing list pgsql-sql
From: PFC
Subject: Re: the best way to get the first record from each group
Date:
Msg-id: opsluwr7fpth1vuj@musicbox
In response to: the best way to get the first record from each group ("q2005" <q2005@tpg.com.au>)
Responses: Re: the best way to get the first record from each group
List: pgsql-sql
I don't really grok your field names so I'll use an easier example:

CREATE TABLE groups
(
    group_id    SERIAL PRIMARY KEY,
    group_name  TEXT NULL
) WITHOUT OIDS;

CREATE TABLE people
(
    user_id     SERIAL PRIMARY KEY,
    group_id    INTEGER NOT NULL REFERENCES groups(group_id),
    score       INTEGER NOT NULL
) WITHOUT OIDS;

CREATE INDEX people_scored ON people( group_id, score );

... put 1K rows into groups, vacuum analyze
... put 128K rows into people, vacuum analyze

So you want the user in each group with the highest score (or the lowest subno... that's the same).

0-- DISTINCT ON
----------------------------------------------------------------------------------------
SELECT DISTINCT ON (group_id) group_id, user_id, score FROM people ORDER BY group_id, score;

 Unique  (cost=0.00..4968.17 rows=996 width=12) (actual time=0.144..539.667 rows=1000 loops=1)
   ->  Index Scan using people_scored on people  (cost=0.00..4640.49 rows=131072 width=12) (actual time=0.141..454.893 rows=131072 loops=1)
 Total runtime: 540.212 ms
----------------------------------------------------------------------------------------
It works but is about the slowest thing imaginable: index-scanning (or sorting) the entire table. DISTINCT ON can be very convenient nonetheless! Whole-table scan => disqualified.

1-- min(), max()
max() will give you the score but it won't give you the user_id, so you have to resort to a trick just like you did. And it does a seq scan => disqualified.

2-- custom aggregate
You could possibly write an aggregate which takes a row from people as an argument, or an ARRAY[score,user_id], and which acts like max() and returns the ARRAY[score,user_id] with the highest score, so you can have its user_id. As array comparison works as expected in pgsql, you could use max(); unfortunately for you, max() does not work on integer arrays (why is that so?), so this solution needs you to write a custom aggregate (a rough sketch is given a bit further down, after the single-group example). Note that this would still need a seq scan, and could be slower than the DISTINCT ON => disqualified.

3-- subquery
The problem is that you'll need a list of your groups. If you don't have a table, you'll have to extract them from the table people with a (SELECT group_id FROM people GROUP BY group_id), which is a sequential scan. So I'll presume there is a groups table, which is why I put a 'REFERENCES groups(group_id)' in the table declaration above.

To get the best score in a given group (here group_id 5) we write:
----------------------------------------------------------------------------------------
SELECT user_id FROM people WHERE group_id=5 ORDER BY group_id DESC, score DESC LIMIT 1;

 Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.054..0.055 rows=1 loops=1)
   ->  Index Scan Backward using people_scored on people  (cost=0.00..480.02 rows=130 width=12) (actual time=0.051..0.051 rows=1 loops=1)
         Index Cond: (group_id = 5)
 Total runtime: 0.143 ms
----------------------------------------------------------------------------------------
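Since we're at it, here is the rough, untested sketch of that custom aggregate I mentioned above (the names array_greater and array_max are made up for the example; it relies on the element-wise integer-array comparison mentioned earlier):

----------------------------------------------------------------------------------------
-- Untested sketch: a max()-like aggregate over ARRAY[score, user_id] pairs.
-- array_greater keeps the larger of two int[] values (element-wise comparison,
-- so the score decides first and the user_id only breaks ties).
CREATE OR REPLACE FUNCTION array_greater( INTEGER[], INTEGER[] ) RETURNS INTEGER[]
LANGUAGE SQL IMMUTABLE AS
'SELECT CASE WHEN $1 IS NULL OR $2 > $1 THEN $2 ELSE $1 END';

CREATE AGGREGATE array_max
(
    basetype = INTEGER[],
    sfunc    = array_greater,
    stype    = INTEGER[]
);

-- Still one seq scan over people, but user_id and score come back together:
SELECT group_id, array_max(ARRAY[score, user_id]) AS top  -- top[1] = score, top[2] = user_id
FROM people GROUP BY group_id;
----------------------------------------------------------------------------------------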
To get the best scores for all groups we apply this single-group SELECT to each group. You see now why we need a groups table to precalculate the groups.
----------------------------------------------------------------------------------------
SELECT g.group_id,
    (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id
FROM groups g;

 Seq Scan on groups g  (cost=0.00..3702.48 rows=1000 width=4) (actual time=0.079..18.942 rows=1000 loops=1)
   SubPlan
     ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.014..0.014 rows=1 loops=1000)
           ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.011..0.011 rows=1 loops=1000)
                 Index Cond: (group_id = $0)
 Total runtime: 19.475 ms
----------------------------------------------------------------------------------------

Note that the subselect here can only yield ONE column, so another join comes in to get the score:

-- Take 1
----------------------------------------------------------------------------------------
SELECT * FROM people
WHERE user_id IN
    (SELECT
        (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id
     FROM groups g);

 Nested Loop  (cost=21.19..10418.45 rows=1000 width=12) (actual time=29.851..87.289 rows=1000 loops=1)
   ->  HashAggregate  (cost=17.50..3704.98 rows=1000 width=4) (actual time=29.789..32.174 rows=1000 loops=1)
         ->  Seq Scan on groups g  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.119..27.982 rows=1000 loops=1)
               SubPlan
                 ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.023..0.023 rows=1 loops=1000)
                       ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.020..0.020 rows=1 loops=1000)
                             Index Cond: (group_id = $0)
   ->  Index Scan using people_pkey on people  (cost=3.69..6.70 rows=1 width=12) (actual time=0.020..0.022 rows=1 loops=1000)
         Index Cond: (people.user_id = (subplan))
         SubPlan
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.027..0.027 rows=1 loops=1000)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.024..0.024 rows=1 loops=1000)
                       Index Cond: (group_id = $0)
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (never executed)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (never executed)
                       Index Cond: (group_id = $0)
 Total runtime: 88.039 ms
----------------------------------------------------------------------------------------

-- Take 2
----------------------------------------------------------------------------------------
SELECT p.group_id, p.user_id, p.score
FROM people p,
    (SELECT g.group_id,
        (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id
     FROM groups g) as top
WHERE p.user_id = top.user_id;

 Nested Loop  (cost=3.69..10415.95 rows=1000 width=12) (actual time=0.200..57.387 rows=1000 loops=1)
   ->  Seq Scan on groups g  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.051..1.556 rows=1000 loops=1)
   ->  Index Scan using people_pkey on people p  (cost=3.69..6.70 rows=1 width=12) (actual time=0.015..0.016 rows=1 loops=1000)
         Index Cond: (p.user_id = (subplan))
         SubPlan
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.032..0.032 rows=1 loops=1000)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.029..0.029 rows=1 loops=1000)
                       Index Cond: (group_id = $0)
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (never executed)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (never executed)
                       Index Cond: (group_id = $0)
 Total runtime: 58.004 ms
----------------------------------------------------------------------------------------
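(Just to spell out that one-column limitation: you could of course fetch the score with a second, separate correlated subselect instead of a join, something like the untested sketch below, but then the top-1 index scan runs twice per group, which is exactly what the ARRAY trick in Take 3 avoids.)

----------------------------------------------------------------------------------------
-- Untested sketch: one correlated subselect per wanted column.
-- Works, but does the top-1 index scan twice for every group.
SELECT g.group_id,
    (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id,
    (SELECT score   FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as score
FROM groups g;
----------------------------------------------------------------------------------------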
Why the Limit plan is displayed twice in the plans above, with the second one marked 'never executed', I have no idea. Now let's kill that JOIN thanks to a neat postgres feature: ARRAY[score,user_id].

-- Take 3
----------------------------------------------------------------------------------------
SELECT group_id, top.us[1] as user_id, top.us[2] as score
FROM
    (SELECT g.group_id,
        (SELECT ARRAY[user_id,score] FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as us
     FROM groups g) as top;

 Seq Scan on groups g  (cost=0.00..7389.95 rows=1000 width=4) (actual time=0.109..39.410 rows=1000 loops=1)
   SubPlan
     ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.014..0.014 rows=1 loops=1000)
           ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.011..0.011 rows=1 loops=1000)
                 Index Cond: (group_id = $0)
     ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.016..0.017 rows=1 loops=1000)
           ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.014..0.014 rows=1 loops=1000)
                 Index Cond: (group_id = $0)
 Total runtime: 39.956 ms
----------------------------------------------------------------------------------------
Nice! However, here is something fishy: using "top.us[1], top.us[2]" actually executes the subselect TWICE!!! Can anybody point the reason to me? It doubles the query execution time, as the outer SELECT is just a dumb wrapper. Even with this, it is more than an order of magnitude faster than the DISTINCT ON. I hope I haven't messed it up and that you get the same result?

Note that the subselect here can only yield ONE row, so you can't select the first N, only the first 1... unless you make it yield a 2-dimension array (with array_accum). It's doable, but there is a nicer solution...

----------------------------------------------------------------------------------------
Stored procs:

-- Take 1
----------------------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION topn( INTEGER ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _nrows  ALIAS FOR $1;
    _gid    groups;
    _p      people;
BEGIN
    FOR _gid IN SELECT * FROM groups LOOP
        FOR _p IN SELECT * FROM people WHERE group_id=_gid.group_id ORDER BY group_id DESC, score DESC LIMIT _nrows LOOP
            RETURN NEXT _p;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$;

This being a stored proc, we can specify the number of records to return in each group. Luxury...

test=# EXPLAIN ANALYZE SELECT * FROM topn(1);
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Function Scan on topn  (cost=0.00..12.50 rows=1000 width=12) (actual time=90.759..91.927 rows=1000 loops=1)
 Total runtime: 92.371 ms
(2 rows)

Note, 92 ms to do 1K SELECTs, not that bad!

The top 3 per group:

test=# SELECT * FROM topn(3) LIMIT 10;
 user_id | group_id | score
---------+----------+-------
  120224 |        1 |   980
   53244 |        1 |   977
  101540 |        1 |   971
   73022 |        2 |   994
   52085 |        2 |   992
   43474 |        2 |   990
   94110 |        3 |   982
   96491 |        3 |   975
   96387 |        3 |   964
   89660 |        4 |   993
(10 rows)

Restraining the set of groups queried inside the function would improve speed a lot.
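For instance, something like this untested variant (the name topn_in_groups and its INTEGER[] parameter are made up for the example) would only look at the groups you pass in:

----------------------------------------------------------------------------------------
-- Untested sketch: same loop as topn(), restricted to a caller-supplied list of group ids.
CREATE OR REPLACE FUNCTION topn_in_groups( INTEGER, INTEGER[] ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _nrows  ALIAS FOR $1;
    _gids   ALIAS FOR $2;
    _gid    groups;
    _p      people;
BEGIN
    FOR _gid IN SELECT * FROM groups WHERE group_id = ANY (_gids) LOOP
        FOR _p IN SELECT * FROM people WHERE group_id=_gid.group_id ORDER BY group_id DESC, score DESC LIMIT _nrows LOOP
            RETURN NEXT _p;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$;

-- e.g. top 3 of groups 1, 2 and 3 only:
-- SELECT * FROM topn_in_groups( 3, ARRAY[1,2,3] );
----------------------------------------------------------------------------------------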
-- Take 2
----------------------------------------------------------------------------------------
Let's get rid of the groups table. This one simulates an index skip scan on the people table to extract the groups as it goes.

CREATE OR REPLACE FUNCTION topn_nogroups( INTEGER ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _nrows  ALIAS FOR $1;
    _gid    INTEGER;
    _p      people;
BEGIN
    -- get lowest group_id
    _gid := group_id FROM people ORDER BY group_id ASC LIMIT 1;
    WHILE _gid IS NOT NULL LOOP
        FOR _p IN SELECT * FROM people WHERE group_id=_gid ORDER BY group_id DESC, score DESC LIMIT _nrows LOOP
            RETURN NEXT _p;
        END LOOP;
        -- get next group_id
        _gid := group_id FROM people WHERE group_id > _gid ORDER BY group_id ASC LIMIT 1;
    END LOOP;
    RETURN;
END;
$$;

test=# SELECT * FROM topn_nogroups(4) LIMIT 10;
 user_id | group_id | score
---------+----------+-------
  120224 |        1 |   980
   53244 |        1 |   977
  101540 |        1 |   971
   69530 |        1 |   966
   73022 |        2 |   994
   52085 |        2 |   992
   43474 |        2 |   990
    4022 |        2 |   989
   94110 |        3 |   982
   96491 |        3 |   975
(10 rows)

test=# EXPLAIN ANALYZE SELECT * FROM topn_nogroups(1);
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Function Scan on topn_nogroups  (cost=0.00..12.50 rows=1000 width=12) (actual time=152.869..154.049 rows=1000 loops=1)
 Total runtime: 154.493 ms

-- Take 3
----------------------------------------------------------------------------------------
If we accept returning only one row per group, we can remove one query per loop iteration.

CREATE OR REPLACE FUNCTION top_nogroups( ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _gid    INTEGER;
    _p      people;
BEGIN
    -- get highest group_id
    _gid := group_id FROM people ORDER BY group_id DESC LIMIT 1;
    IF _gid IS NULL THEN
        RETURN;
    END IF;
    _gid := _gid + 1;
    LOOP
        SELECT INTO _p * FROM people WHERE group_id<_gid ORDER BY group_id DESC, score DESC LIMIT 1;
        IF NOT FOUND THEN
            RETURN;
        END IF;
        _gid = _p.group_id;
        RETURN NEXT _p;
    END LOOP;
    RETURN;
END;
$$;

test=# EXPLAIN ANALYZE SELECT * FROM top_nogroups();
---------------------------------------------------------------------------------------------------------------------
 Function Scan on top_nogroups  (cost=0.00..12.50 rows=1000 width=12) (actual time=59.086..60.319 rows=1000 loops=1)
 Total runtime: 60.712 ms
(2 rows)

Performance is about the same level as the first query with the subselect (even though that subselect is executed twice), BUT you don't need the groups table. You might like that.

I just love postgres. Think you have a problem, it has an elegant solution. It even has some features it doesn't have; you just have to look under the carpet and you find them (like the index skip scan). Anyway, it was fun to experiment with that!
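P.S.: a quick sanity check you could run (not run here): the two one-row-per-group functions should agree, modulo the arbitrary row picked when several users tie on the top score:

----------------------------------------------------------------------------------------
-- Should return zero rows if both functions pick the same top row per group
-- (ties on the top score can legitimately make them differ).
SELECT user_id, group_id, score FROM topn(1)
EXCEPT
SELECT user_id, group_id, score FROM top_nogroups();
----------------------------------------------------------------------------------------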