Nested query performance issue - Mailing list pgsql-performance

From Glenn Maynard
Subject Nested query performance issue
Date
Msg-id d18085b50904081409xcd4dc21rbc5ea6f71a61dd1a@mail.gmail.com
Whole thread Raw
Responses Re: Nested query performance issue  (Віталій Тимчишин <tivv00@gmail.com>)
Re: Nested query performance issue  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-performance
(This is related to an earlier post on -sql.)

I'm querying for the N high scores for each game, with two tables:
scores and games.

CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL,
game_id INTEGER REFERENCES game (id));
-- test data: 1000 games, 100000 scores
INSERT INTO game (id) select generate_series(1,1000);
INSERT INTO score (game_id, score) select game.id, random() from game,
generate_series(1,100);
CREATE INDEX score_idx1 ON score (game_id, score desc);
ANALYZE;

This query retrieves the single highest score for each game, but
doesn't allow any more than that--I can't get the top five scores for
each game.  However, it's very fast: on the above test data, it runs
in 25ms on my system.  With 1000000 scores, it takes 40ms.

SELECT s.* FROM score s
WHERE s.id IN (
   -- Get the high scoring score ID for each game:
   SELECT
   (
       -- Get the high score for game g:
       SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY
s2.score DESC LIMIT 1
   )
   FROM game g
);


This rewrite allows getting the top N scores.  Unfortunately, this one
takes 950ms for the same data.  With 1000000 scores, it takes 14800ms.

SELECT s.* FROM score s, game g
WHERE s.game_id = g.id AND
 s.id IN (
    SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score
DESC LIMIT 1
 );


This seems simple: for each game, search for the highest score, and
then scan the tree to get the next N-1 highest scores.  The first
version does just that, but the second one is doing a seq scan over
score.

I do want to be able to use a LIMIT higher than 1, which only works
with the second form.  Any suggestions of how to get the efficient
scanning of the first while being able to use a LIMIT greater than 1?

(It'd even be faster to make several calls to the first version,
varying an OFFSET to get each high score--but that's terrible.)

--
Glenn Maynard

pgsql-performance by date:

Previous
From: Dimitri Fontaine
Date:
Subject: Re: Best replication solution?
Next
From: Віталій Тимчишин
Date:
Subject: Re: Nested query performance issue