Thread: Nested query performance issue

Nested query performance issue

From

Glenn Maynard

Date:

08 April 2009, 18:09:28

(This is related to an earlier post on -sql.)

I'm querying for the N high scores for each game, with two tables:
scores and games.

CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL,
game_id INTEGER REFERENCES game (id));
-- test data: 1000 games, 100000 scores
INSERT INTO game (id) select generate_series(1,1000);
INSERT INTO score (game_id, score) select game.id, random() from game,
generate_series(1,100);
CREATE INDEX score_idx1 ON score (game_id, score desc);
ANALYZE;

This query retrieves the single highest score for each game, but
doesn't allow any more than that--I can't get the top five scores for
each game.  However, it's very fast: on the above test data, it runs
in 25ms on my system.  With 1000000 scores, it takes 40ms.

SELECT s.* FROM score s
WHERE s.id IN (
   -- Get the high scoring score ID for each game:
   SELECT
   (
       -- Get the high score for game g:
       SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY
s2.score DESC LIMIT 1
   )
   FROM game g
);


This rewrite allows getting the top N scores.  Unfortunately, this one
takes 950ms for the same data.  With 1000000 scores, it takes 14800ms.

SELECT s.* FROM score s, game g
WHERE s.game_id = g.id AND
 s.id IN (
    SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score
DESC LIMIT 1
 );


This seems simple: for each game, search for the highest score, and
then scan the tree to get the next N-1 highest scores.  The first
version does just that, but the second one is doing a seq scan over
score.

I do want to be able to use a LIMIT higher than 1, which only works
with the second form.  Any suggestions of how to get the efficient
scanning of the first while being able to use a LIMIT greater than 1?

(It'd even be faster to make several calls to the first version,
varying an OFFSET to get each high score--but that's terrible.)

--
Glenn Maynard

Re: Nested query performance issue

From

Віталій Тимчишин

Date:

08 April 2009, 18:30:54

2009/4/9 Glenn Maynard <glennfmaynard@gmail.com>

(This is related to an earlier post on -sql.)

I'm querying for the N high scores for each game, with two tables:
scores and games.

CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL,
game_id INTEGER REFERENCES game (id));
-- test data: 1000 games, 100000 scores
INSERT INTO game (id) select generate_series(1,1000);
INSERT INTO score (game_id, score) select game.id, random() from game,
generate_series(1,100);
CREATE INDEX score_idx1 ON score (game_id, score desc);
ANALYZE;

How about

select s1.*
from score s1 join score s2 on s1.game_id=s2.game_id and s2.score >= s1.score
group by s1.*
having count(s2.*) <= N

Note: you can have problems if you have same scores - you will loose last group that overlap N

In any case, you don't need to join game since all you need is game_id you already have in score.

P.S. EXPLAIN ANALYZE could help

Best regards, Vitalii Tymchyshyn

Re: Nested query performance issue

From

Glenn Maynard

Date:

08 April 2009, 18:55:00

(I didn't notice that I ended up with "score.score" in this test case.  Oops.)

2009/4/8 Віталій Тимчишин <tivv00@gmail.com>:
> How about
>
> select s1.*
> from score s1 join score s2 on s1.game_id=s2.game_id and s2.score >=
> s1.score
> group by s1.*
> having count(s2.*) <= N

I can see what this is doing, but I'm getting:

ERROR:  could not identify an ordering operator for type score
HINT:  Use an explicit ordering operator or modify the query.

I'm not sure why; if I replace s1.* and s2.* with s1.id and s2.id it
works, but then I only get IDs.

Unfortunately, with N = 1 this takes 8100ms (vs. 950ms and 25ms)...

--
Glenn Maynard

Re: Nested query performance issue

From

Віталій Тимчишин

Date:

09 April 2009, 06:25:36

OK, got to my postgres. Here you are:

create or replace function explode_array(in_array anyarray) returns setof anyelement as
$$
    select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
$$
language sql immutable;

SELECT s.* FROM score s
WHERE s.id IN (
select
-- Get the high scoring score ID for each game:
explode_array(ARRAY(
      -- Get the high score for game g:
      SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY
s2.score DESC LIMIT 5
))
FROM game g
);

It takes ~64ms for me

Best regards, Vitaliy Tymchyshyn

Re: Nested query performance issue

From

Heikki Linnakangas

Date:

09 April 2009, 08:29:30

Glenn Maynard wrote:
> This rewrite allows getting the top N scores.  Unfortunately, this one
> takes 950ms for the same data.  With 1000000 scores, it takes 14800ms.
>
> SELECT s.* FROM score s, game g
> WHERE s.game_id = g.id AND
>  s.id IN (
>     SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score
> DESC LIMIT 1
>  );

You don't really need the join with game here, simplifying this into:

SELECT s.* FROM score s
WHERE s.id IN (
     SELECT s2.id FROM score s2 WHERE s2.game_id=s.game_id ORDER BY
s2.score
DESC LIMIT 1
);

I don't think it makes it any faster, though.

You can also do this in a very nice and clean fashion using the upcoming
PG 8.4 window functions:

SELECT * FROM (
   SELECT s.*, rank() OVER (PARTITION BY s.game_id ORDER BY score DESC)
AS rank FROM score s
) AS sub WHERE rank <= 5;

but I'm not sure how much faster it is. At least here on my laptop it
does a full index scan on score, which may or may not be faster than
just picking the top N values for each game using the index.

> This seems simple: for each game, search for the highest score, and
> then scan the tree to get the next N-1 highest scores.  The first
> version does just that, but the second one is doing a seq scan over
> score.

You can do that approach with a SQL function:

CREATE FUNCTION topnscores(game_id int , n int) RETURNS SETOF score
LANGUAGE SQL AS $$
SELECT * FROM score s WHERE s.game_id = $1 ORDER BY score DESC LIMIT $2
$$;

SELECT (sub.ts).id, (sub.ts).score, (sub.ts).game_id
FROM (SELECT topnscores(g.id, 5) ts FROM game g) sub;

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Nested query performance issue

From

Glenn Maynard

Date:

09 April 2009, 20:42:21

On Thu, Apr 9, 2009 at 7:29 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> SELECT s.* FROM score s, game g
>> WHERE s.game_id = g.id AND
>>  s.id IN (
>>    SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score
>> DESC LIMIT 1
>>  );
>
> You don't really need the join with game here, simplifying this into:
>
> SELECT s.* FROM score s
> WHERE s.id IN (
>    SELECT s2.id FROM score s2 WHERE s2.game_id=s.game_id ORDER BY s2.score
> DESC LIMIT 1
> );
>
> I don't think it makes it any faster, though.

It's about 10% faster for me.  I'm surprised the planner can't figure
out that this join is redundant.

> SELECT * FROM (
>  SELECT s.*, rank() OVER (PARTITION BY s.game_id ORDER BY score DESC) AS
> rank FROM score s
> ) AS sub WHERE rank <= 5;
>
> but I'm not sure how much faster it is. At least here on my laptop it does a
> full index scan on score, which may or may not be faster than just picking
> the top N values for each game using the index.

I'll definitely check this out when 8.4 is released.

> You can do that approach with a SQL function:
>
> CREATE FUNCTION topnscores(game_id int , n int) RETURNS SETOF score LANGUAGE
> SQL AS $$
> SELECT * FROM score s WHERE s.game_id = $1 ORDER BY score DESC LIMIT $2
> $$;
>
> SELECT (sub.ts).id, (sub.ts).score, (sub.ts).game_id
> FROM (SELECT topnscores(g.id, 5) ts FROM game g) sub;
("as ts", for anyone trying this at home)

Thanks--this one runs in 32ms, which seems about right compared
against the original fast LIMIT 1 version.

I see a slight improvement if I mark the function stable: 31.9ms to
31.2; minor but consistent.  Just out of curiosity, any explanations
for this difference?  I don't see any change in the resulting query
plan, but the plan doesn't enter the function call.

--
Glenn Maynard

Re: Nested query performance issue

From

Glenn Maynard

Date:

09 April 2009, 20:42:45

2009/4/9 Віталій Тимчишин <tivv00@gmail.com>:
> create or replace function explode_array(in_array anyarray) returns setof
> anyelement as
> $$
>     select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
> $$
> language sql immutable;

I tried using an ARRAY like this, but didn't quite figure out the
explode_array piece.

--
Glenn Maynard

Re: Nested query performance issue

From

Greg Smith

Date:

09 April 2009, 20:59:55

On Thu, 9 Apr 2009, tiv00 wrote:

> create or replace function explode_array(in_array anyarray) returns setof anyelement as
> $$
>     select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
> $$
> language sql immutable;

Note that you can make this function a bit more general by using
array_lower as the bottom bound:

create or replace function explode_array(in_array anyarray) returns setof anyelement as
$$
      select ($1)[s] from generate_series
        (array_lower($1, 1), array_upper($1, 1)) as s;
$$
language sql immutable;

While you won't run into them in most situations, it is possible to create
arrays where the lower bound isn't 1 by using the subscript syntax.  The
example in the manual even shows that somewhat odd possibilities like
assigning something to "myarray[-2:7]" works.

As already pointed out, once you're in 8.4 the windowing functions might
be a better fit here, but 8.4 does have "unnest" built-in that replaces
the need to code this sort of thing yourself.  You might want to name this
function accordingly to match that upcoming standard (or not, depending on
whether you want to avoid or be reminding of the potential for using the
built-in).  See
http://www.depesz.com/index.php/2008/11/14/waiting-for-84-array-aggregate-and-array-unpacker/
for some examples.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Nested query performance issue

From

Glenn Maynard

Date:

10 April 2009, 03:11:32

On Thu, Apr 9, 2009 at 7:29 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> CREATE FUNCTION topnscores(game_id int , n int) RETURNS SETOF score LANGUAGE
> SQL AS $$
> SELECT * FROM score s WHERE s.game_id = $1 ORDER BY score DESC LIMIT $2
> $$;
>
> SELECT (sub.ts).id, (sub.ts).score, (sub.ts).game_id
> FROM (SELECT topnscores(g.id, 5) ts FROM game g) sub;

The inner query:

SELECT topnscores(g.id, 5) ts FROM game g

http://www.postgresql.org/docs/8.3/static/xfunc-sql.html says this is
deprecated (though no deprecation warning is being generated):

> Currently, functions returning sets can also be called in the select list of a query. For each row that the query
generatesby itself, the function returning set is invoked, and an output row is generated for each element of the
function'sresult set. Note, however, that this capability is deprecated and might be removed in future releases. 

It doesn't say how else to write this, though, and it's not obvious to
me.  "SELECT ts.* FROM topnscores(g.id, 5) AS ts, game g" doesn't work
("function expression in FROM cannot refer to other relations of same
query level").  Is there an equivalent way to do this so I won't have
deprecation looming over my back?  I'm likely to become very dependent
on this pattern.

--
Glenn Maynard

Re: Nested query performance issue

From

Tom Lane

Date:

10 April 2009, 10:42:32

Glenn Maynard <glennfmaynard@gmail.com> writes:
> http://www.postgresql.org/docs/8.3/static/xfunc-sql.html says this is
> deprecated (though no deprecation warning is being generated):

>> Currently, functions returning sets can also be called in the select list of a query. For each row that the query
generatesby itself, the function returning set is invoked, and an output row is generated for each element of the
function'sresult set. Note, however, that this capability is deprecated and might be removed in future releases. 

The way to parse that is "we don't like this and we will get rid of it
if we can ever figure out a good substitute".  Right now there is no
100% substitute, so it stays.  (In fact, 8.4 will extend the feature so
it works in cases that don't work today, like for PL functions.)

There are, however, good reasons not to like it, such as the rather
questionable behavior if there's more than one SRF in the same select
list.  Don't complain if you run into that wart.

            regards, tom lane

Re: Nested query performance issue

From

Matthew Wakeling

Date:

14 April 2009, 06:33:32

On Thu, 9 Apr 2009, Glenn Maynard wrote:
> On Thu, Apr 9, 2009 at 7:29 AM, Heikki Linnakangas wrote:
>>> SELECT s.* FROM score s, game g
>>> WHERE s.game_id = g.id AND
>>>  s.id IN (
>>>    SELECT s2.id FROM score s2 WHERE s2.game_id=g.id ORDER BY s2.score
>>> DESC LIMIT 1
>>>  );
>>
>> You don't really need the join with game here, simplifying this into:
>>
>> SELECT s.* FROM score s
>> WHERE s.id IN (
>>    SELECT s2.id FROM score s2 WHERE s2.game_id=s.game_id ORDER BY s2.score
>> DESC LIMIT 1
>> );
>>
>> I don't think it makes it any faster, though.
>
> It's about 10% faster for me.  I'm surprised the planner can't figure
> out that this join is redundant.

Because the join isn't redundant? You're making the assumption that for
every score.game_id there is exactly one game.id that matches. Of course,
you may have a unique constraint and foreign key/trigger that ensures
this.

Matthew

--
 The third years are wandering about all worried at the moment because they
 have to hand in their final projects. Please be sympathetic to them, say
 things like "ha-ha-ha", but in a sympathetic tone of voice
                                        -- Computer Science Lecturer

Re: Nested query performance issue

From

Glenn Maynard

Date:

14 April 2009, 07:04:25

On Tue, Apr 14, 2009 at 5:33 AM, Matthew Wakeling <matthew@flymine.org> wrote:
>> It's about 10% faster for me.  I'm surprised the planner can't figure
>> out that this join is redundant.
>
> Because the join isn't redundant? You're making the assumption that for
> every score.game_id there is exactly one game.id that matches. Of course,
> you may have a unique constraint and foreign key/trigger that ensures this.

That's the definition of the tables I gave.

CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY); -- pk implies unique
CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL,
game_id INTEGER REFERENCES game (id));

(I don't think it makes any difference to whether this can be
optimized, but adding NOT NULL back to game_id doesn't change it,
either.)

--
Glenn Maynard

Re: Nested query performance issue

From

Merlin Moncure

Date:

14 April 2009, 10:46:35

2009/4/9 Віталій Тимчишин <tivv00@gmail.com>:
> OK, got to my postgres. Here you are:
>
> create or replace function explode_array(in_array anyarray) returns setof
> anyelement as
> $$
>     select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
> $$
> language sql immutable;
>

in 8.4, this will be replaced by the built in 'unnest'.  Also we have array_agg.

merlin