SELECT DISTINCT Performance Issue - Mailing list pgsql-performance

From K C Lau
Subject SELECT DISTINCT Performance Issue
Date
Msg-id 6.2.1.2.0.20050602190756.05ab2c70@localhost
Whole thread Raw
Responses Re: SELECT DISTINCT Performance Issue  (PFC <lists@boutiquenumerique.com>)
Re: SELECT DISTINCT Performance Issue  (George Essig <george.essig@gmail.com>)
List pgsql-performance
Hi All,

We are testing PostgreSQL 8.0.3 on MS Windows for porting an OLTP system
from MS SqlServer.

We got a major performance issue which seems to boil down to the following
type of query:

select DISTINCT ON (PlayerID) PlayerID,AtDate from Player  where
PlayerID='22220' order by PlayerID desc, AtDate desc;
The Player table has primary key (PlayerID, AtDate) representing data over
time and the query gets the latest data for a player.

With enable_seqscan forced off (which I'm not sure if that should be done
for a production system), the average query still takes a very long time to
return a record:

esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from
Player
  where PlayerID='22220' order by PlayerID desc, AtDate desc;
  Unique  (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000
rows=1 loops=1)
    ->  Index Scan Backward using pk_player on player  (cost=0.00..2505.55
rows=8
43 width=23) (actual time=0.000..187.000 rows=1227 loops=1)
          Index Cond: ((playerid)::text = '22220'::text)
  Total runtime: 187.000 ms

It appears that all the 1227 data records for that player were searched,
even when doing a backward index scan. I would presume that, after locating
the index for the highest AtDate, only the first data record needs to be
retrieved.

The following summary of tests seems to confirm my observation. They were
done on a quiet system (MS Windows 2000 Server, P4 3.0GHz with
Hyperthreading, 1GB Memory, PostgreSQL shared_buffers = 50000), starting
with a test database before doing a vacuum:

set enable_seqscan = off;
select        Total runtime: 187.000 ms
again:        Total runtime: 78.000 ms
vacuum analyze verbose player;
select        Total runtime: 47.000 ms
again:        Total runtime: 47.000 ms
reindex table player;
select        Total runtime: 78.000 ms
again:        Total runtime: 63.000 ms
cluster pk_player on player;
select        Total runtime: 16.000 ms
again:        Total runtime: 0.000 ms
set enable_seqscan = on;
analyze verbose player;
select        Total runtime: 62.000 ms
again:        Total runtime: 78.000 ms

Previously, we have also tried to use LIMIT 1 instead of DISTINCT, but the
performance was no better:
select PlayerID,AtDate from Player where PlayerID='22220' order by PlayerID
desc, AtDate desc LIMIT 1

Any clue or suggestions would be most appreciated. If you need further info
or the full explain logs, please let me know.

Regards,
KC.



pgsql-performance by date:

Previous
From: matthew@tocr.com
Date:
Subject: Re: How to avoid database bloat
Next
From: stig erikson
Date:
Subject: Re: How to avoid database bloat