Re: the best way to get the first record from each group - Mailing list pgsql-sql
From: PFC
Subject: Re: the best way to get the first record from each group
Date:
Msg-id: opsluwr7fpth1vuj@musicbox
In response to: the best way to get the first record from each group ("q2005" <q2005@tpg.com.au>)
Responses: Re: the best way to get the first record from each group
List: pgsql-sql
I don't really grok your field names so I'll use an easier example:

CREATE TABLE groups
(
    group_id    SERIAL PRIMARY KEY,
    group_name  TEXT NULL
) WITHOUT OIDS;

CREATE TABLE people
(
    user_id     SERIAL PRIMARY KEY,
    group_id    INTEGER NOT NULL REFERENCES groups(group_id),
    score       INTEGER NOT NULL
) WITHOUT OIDS;

CREATE INDEX people_scored ON people( group_id, score );

... put 1K rows into groups, vacuum analyze
... put 128K rows into people, vacuum analyze

So you want the user in each group with the highest score (or the lowest subno... that's the same).

0-- DISTINCT ON
----------------------------------------------------------------------------------------
SELECT DISTINCT ON (group_id) group_id, user_id, score FROM people ORDER BY group_id, score;

 Unique  (cost=0.00..4968.17 rows=996 width=12) (actual time=0.144..539.667 rows=1000 loops=1)
   ->  Index Scan using people_scored on people  (cost=0.00..4640.49 rows=131072 width=12) (actual time=0.141..454.893 rows=131072 loops=1)
 Total runtime: 540.212 ms
----------------------------------------------------------------------------------------
It works but is about the slowest thing imaginable: index-scanning (or sorting) the entire table. DISTINCT ON can be very convenient nonetheless! Whole-table scan => disqualified.

1-- min(), max()
max() will give you the score but it won't give you the user_id, so you have to resort to a trick just like you did. And it does a seq scan => disqualified.

2-- custom aggregate
You could possibly write an aggregate which takes a row from people as an argument, or an ARRAY[score,user_id], and which acts like max() and returns the ARRAY[score,user_id] with the highest score, so you can have its user_id. As array comparison works as expected in pgsql, you could use max(); unfortunately for you, max() does not work on integer arrays (why is that so?), so this solution needs you to write a custom aggregate (a rough sketch is given a bit further down, after the single-group example). Note that this would still need a seq scan, and could be slower than the DISTINCT ON => disqualified.

3-- subquery
The problem is that you'll need a list of your groups. If you don't have a table, you'll have to extract them from the table people with a (SELECT group_id FROM people GROUP BY group_id), which is a sequential scan. So I'll presume there is a groups table, which is why I put a 'REFERENCES groups(group_id)' in the table declaration above.

To get the best score in a given group (here group_id 5) we write:
----------------------------------------------------------------------------------------
SELECT user_id FROM people WHERE group_id=5 ORDER BY group_id DESC, score DESC LIMIT 1;

 Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.054..0.055 rows=1 loops=1)
   ->  Index Scan Backward using people_scored on people  (cost=0.00..480.02 rows=130 width=12) (actual time=0.051..0.051 rows=1 loops=1)
         Index Cond: (group_id = 5)
 Total runtime: 0.143 ms
----------------------------------------------------------------------------------------
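Since we're at it, here is the rough, untested sketch of that custom aggregate I mentioned above (the names array_greater and array_max are made up for the example; it relies on the element-wise integer-array comparison mentioned earlier):

----------------------------------------------------------------------------------------
-- Untested sketch: a max()-like aggregate over ARRAY[score, user_id] pairs.
-- array_greater keeps the larger of two int[] values (element-wise comparison,
-- so the score decides first and the user_id only breaks ties).
CREATE OR REPLACE FUNCTION array_greater( INTEGER[], INTEGER[] ) RETURNS INTEGER[]
LANGUAGE SQL IMMUTABLE AS
'SELECT CASE WHEN $1 IS NULL OR $2 > $1 THEN $2 ELSE $1 END';

CREATE AGGREGATE array_max
(
    basetype = INTEGER[],
    sfunc    = array_greater,
    stype    = INTEGER[]
);

-- Still one seq scan over people, but user_id and score come back together:
SELECT group_id, array_max(ARRAY[score, user_id]) AS top  -- top[1] = score, top[2] = user_id
FROM people GROUP BY group_id;
----------------------------------------------------------------------------------------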
To get the best scores for all groups we apply this single-group SELECT to each group. You see now why we need a groups table to precalculate the groups.
----------------------------------------------------------------------------------------
SELECT g.group_id,
    (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id
FROM groups g;

 Seq Scan on groups g  (cost=0.00..3702.48 rows=1000 width=4) (actual time=0.079..18.942 rows=1000 loops=1)
   SubPlan
     ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.014..0.014 rows=1 loops=1000)
           ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.011..0.011 rows=1 loops=1000)
                 Index Cond: (group_id = $0)
 Total runtime: 19.475 ms
----------------------------------------------------------------------------------------

Note that the subselect here can only yield ONE column, so another join comes in to get the score:

-- Take 1
----------------------------------------------------------------------------------------
SELECT * FROM people
WHERE user_id IN
    (SELECT
        (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id
     FROM groups g);

 Nested Loop  (cost=21.19..10418.45 rows=1000 width=12) (actual time=29.851..87.289 rows=1000 loops=1)
   ->  HashAggregate  (cost=17.50..3704.98 rows=1000 width=4) (actual time=29.789..32.174 rows=1000 loops=1)
         ->  Seq Scan on groups g  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.119..27.982 rows=1000 loops=1)
               SubPlan
                 ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.023..0.023 rows=1 loops=1000)
                       ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.020..0.020 rows=1 loops=1000)
                             Index Cond: (group_id = $0)
   ->  Index Scan using people_pkey on people  (cost=3.69..6.70 rows=1 width=12) (actual time=0.020..0.022 rows=1 loops=1000)
         Index Cond: (people.user_id = (subplan))
         SubPlan
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.027..0.027 rows=1 loops=1000)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.024..0.024 rows=1 loops=1000)
                       Index Cond: (group_id = $0)
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (never executed)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (never executed)
                       Index Cond: (group_id = $0)
 Total runtime: 88.039 ms
----------------------------------------------------------------------------------------

-- Take 2
----------------------------------------------------------------------------------------
SELECT p.group_id, p.user_id, p.score
FROM people p,
    (SELECT g.group_id,
        (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id
     FROM groups g) as top
WHERE p.user_id = top.user_id;

 Nested Loop  (cost=3.69..10415.95 rows=1000 width=12) (actual time=0.200..57.387 rows=1000 loops=1)
   ->  Seq Scan on groups g  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.051..1.556 rows=1000 loops=1)
   ->  Index Scan using people_pkey on people p  (cost=3.69..6.70 rows=1 width=12) (actual time=0.015..0.016 rows=1 loops=1000)
         Index Cond: (p.user_id = (subplan))
         SubPlan
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.032..0.032 rows=1 loops=1000)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.029..0.029 rows=1 loops=1000)
                       Index Cond: (group_id = $0)
           ->  Limit  (cost=0.00..3.69 rows=1 width=12) (never executed)
                 ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (never executed)
                       Index Cond: (group_id = $0)
 Total runtime: 58.004 ms
----------------------------------------------------------------------------------------
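(Just to spell out that one-column limitation: you could of course fetch the score with a second, separate correlated subselect instead of a join, something like the untested sketch below, but then the top-1 index scan runs twice per group, which is exactly what the ARRAY trick in Take 3 avoids.)

----------------------------------------------------------------------------------------
-- Untested sketch: one correlated subselect per wanted column.
-- Works, but does the top-1 index scan twice for every group.
SELECT g.group_id,
    (SELECT user_id FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as user_id,
    (SELECT score   FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as score
FROM groups g;
----------------------------------------------------------------------------------------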
Why the Limit plan is displayed twice in the plans above, with the second one marked 'never executed', I have no idea. Now let's kill that JOIN thanks to a neat postgres feature: ARRAY[score,user_id].

-- Take 3
----------------------------------------------------------------------------------------
SELECT group_id, top.us[1] as user_id, top.us[2] as score
FROM
    (SELECT g.group_id,
        (SELECT ARRAY[user_id,score] FROM people WHERE group_id=g.group_id ORDER BY group_id DESC, score DESC LIMIT 1) as us
     FROM groups g) as top;

 Seq Scan on groups g  (cost=0.00..7389.95 rows=1000 width=4) (actual time=0.109..39.410 rows=1000 loops=1)
   SubPlan
     ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.014..0.014 rows=1 loops=1000)
           ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.011..0.011 rows=1 loops=1000)
                 Index Cond: (group_id = $0)
     ->  Limit  (cost=0.00..3.69 rows=1 width=12) (actual time=0.016..0.017 rows=1 loops=1000)
           ->  Index Scan Backward using people_scored on people  (cost=0.00..486.75 rows=132 width=12) (actual time=0.014..0.014 rows=1 loops=1000)
                 Index Cond: (group_id = $0)
 Total runtime: 39.956 ms
----------------------------------------------------------------------------------------
Nice! However, here is something fishy: using "top.us[1], top.us[2]" actually executes the subselect TWICE!!! Can anybody point the reason to me? It doubles the query execution time, as the outer SELECT is just a dumb wrapper. Even with this, it is more than an order of magnitude faster than the DISTINCT ON. I hope I haven't messed it up and that you get the same result?

Note that the subselect here can only yield ONE row, so you can't select the first N, only the first 1... unless you make it yield a 2-dimension array (with array_accum). It's doable, but there is a nicer solution...

----------------------------------------------------------------------------------------
Stored procs:

-- Take 1
----------------------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION topn( INTEGER ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _nrows  ALIAS FOR $1;
    _gid    groups;
    _p      people;
BEGIN
    FOR _gid IN SELECT * FROM groups LOOP
        FOR _p IN SELECT * FROM people WHERE group_id=_gid.group_id ORDER BY group_id DESC, score DESC LIMIT _nrows LOOP
            RETURN NEXT _p;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$;

This being a stored proc, we can specify the number of records to return in each group. Luxury...

test=# EXPLAIN ANALYZE SELECT * FROM topn(1);
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Function Scan on topn  (cost=0.00..12.50 rows=1000 width=12) (actual time=90.759..91.927 rows=1000 loops=1)
 Total runtime: 92.371 ms
(2 rows)

Note, 92 ms to do 1K SELECTs, not that bad!

The top 3 per group:

test=# SELECT * FROM topn(3) LIMIT 10;
 user_id | group_id | score
---------+----------+-------
  120224 |        1 |   980
   53244 |        1 |   977
  101540 |        1 |   971
   73022 |        2 |   994
   52085 |        2 |   992
   43474 |        2 |   990
   94110 |        3 |   982
   96491 |        3 |   975
   96387 |        3 |   964
   89660 |        4 |   993
(10 rows)

Restraining the set of groups queried inside the function would improve speed a lot.
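For instance, something like this untested variant (the name topn_in_groups and its INTEGER[] parameter are made up for the example) would only look at the groups you pass in:

----------------------------------------------------------------------------------------
-- Untested sketch: same loop as topn(), restricted to a caller-supplied list of group ids.
CREATE OR REPLACE FUNCTION topn_in_groups( INTEGER, INTEGER[] ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _nrows  ALIAS FOR $1;
    _gids   ALIAS FOR $2;
    _gid    groups;
    _p      people;
BEGIN
    FOR _gid IN SELECT * FROM groups WHERE group_id = ANY (_gids) LOOP
        FOR _p IN SELECT * FROM people WHERE group_id=_gid.group_id ORDER BY group_id DESC, score DESC LIMIT _nrows LOOP
            RETURN NEXT _p;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$;

-- e.g. top 3 of groups 1, 2 and 3 only:
-- SELECT * FROM topn_in_groups( 3, ARRAY[1,2,3] );
----------------------------------------------------------------------------------------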
-- Take 2
----------------------------------------------------------------------------------------
Let's get rid of the groups table. This one simulates an index skip scan on the people table to extract the groups as it goes.

CREATE OR REPLACE FUNCTION topn_nogroups( INTEGER ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _nrows  ALIAS FOR $1;
    _gid    INTEGER;
    _p      people;
BEGIN
    -- get lowest group_id
    _gid := group_id FROM people ORDER BY group_id ASC LIMIT 1;
    WHILE _gid IS NOT NULL LOOP
        FOR _p IN SELECT * FROM people WHERE group_id=_gid ORDER BY group_id DESC, score DESC LIMIT _nrows LOOP
            RETURN NEXT _p;
        END LOOP;
        -- get next group_id
        _gid := group_id FROM people WHERE group_id > _gid ORDER BY group_id ASC LIMIT 1;
    END LOOP;
    RETURN;
END;
$$;

test=# SELECT * FROM topn_nogroups(4) LIMIT 10;
 user_id | group_id | score
---------+----------+-------
  120224 |        1 |   980
   53244 |        1 |   977
  101540 |        1 |   971
   69530 |        1 |   966
   73022 |        2 |   994
   52085 |        2 |   992
   43474 |        2 |   990
    4022 |        2 |   989
   94110 |        3 |   982
   96491 |        3 |   975
(10 rows)

test=# EXPLAIN ANALYZE SELECT * FROM topn_nogroups(1);
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Function Scan on topn_nogroups  (cost=0.00..12.50 rows=1000 width=12) (actual time=152.869..154.049 rows=1000 loops=1)
 Total runtime: 154.493 ms

-- Take 3
----------------------------------------------------------------------------------------
If we accept returning only one row per group, we can remove one query per loop iteration.

CREATE OR REPLACE FUNCTION top_nogroups( ) RETURNS SETOF people LANGUAGE PLPGSQL AS
$$
DECLARE
    _gid    INTEGER;
    _p      people;
BEGIN
    -- get highest group_id
    _gid := group_id FROM people ORDER BY group_id DESC LIMIT 1;
    IF _gid IS NULL THEN
        RETURN;
    END IF;
    _gid := _gid + 1;
    LOOP
        SELECT INTO _p * FROM people WHERE group_id<_gid ORDER BY group_id DESC, score DESC LIMIT 1;
        IF NOT FOUND THEN
            RETURN;
        END IF;
        _gid = _p.group_id;
        RETURN NEXT _p;
    END LOOP;
    RETURN;
END;
$$;

test=# EXPLAIN ANALYZE SELECT * FROM top_nogroups();
---------------------------------------------------------------------------------------------------------------------
 Function Scan on top_nogroups  (cost=0.00..12.50 rows=1000 width=12) (actual time=59.086..60.319 rows=1000 loops=1)
 Total runtime: 60.712 ms
(2 rows)

Performance is about the same level as the first query with the subselect (even though that subselect is executed twice), BUT you don't need the groups table. You might like that.

I just love postgres. Think you have a problem, it has an elegant solution. It even has some features it doesn't have; you just have to look under the carpet and you find them (like the index skip scan). Anyway, it was fun to experiment with that!
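P.S.: a quick sanity check you could run (not run here): the two one-row-per-group functions should agree, modulo the arbitrary row picked when several users tie on the top score:

----------------------------------------------------------------------------------------
-- Should return zero rows if both functions pick the same top row per group
-- (ties on the top score can legitimately make them differ).
SELECT user_id, group_id, score FROM topn(1)
EXCEPT
SELECT user_id, group_id, score FROM top_nogroups();
----------------------------------------------------------------------------------------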