Thread: SELECT DISTINCT performance issue
Hi All, We are testing PostgreSQL 8.0.3 on MS Windows for porting an OLTP system from MS SqlServer. We got a major performance issue which seems to boil down to the following type of query: select DISTINCT ON (PlayerID) PlayerID,AtDate from Player where PlayerID='22220' order by PlayerID desc, AtDate desc; The Player table has primary key (PlayerID, AtDate) representing data over time and the query gets the latest data for a player. With enable_seqscan forced off (which I'm not sure if that should be done for a production system), the average query still takes a very long time to return a record: esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from Player where PlayerID='22220' order by PlayerID desc, AtDate desc; Unique (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000 rows=1 loops=1) -> Index Scan Backward using pk_player on player (cost=0.00..2505.55 rows=8 43 width=23) (actual time=0.000..187.000 rows=1227 loops=1) Index Cond: ((playerid)::text = '22220'::text) Total runtime: 187.000 ms It appears that all the 1227 data records for that player were searched, even when doing a backward index scan. I would presume that, after locating the index for the highest AtDate, only the first data record needs to be retrieved. The following summary of tests seems to confirm my observation, as the query returns quickly only after the table was clustered. The tests were done on a quiet system (MS Windows 2000 Server, P4 3.0GHz with Hyperthreading, 1GB Memory, PostgreSQL shared_buffers = 50000), starting with a test database before doing a vacuum: set enable_seqscan = off; select Total runtime: 187.000 ms again: Total runtime: 78.000 ms vacuum analyze verbose player; select Total runtime: 47.000 ms again: Total runtime: 47.000 ms reindex table player; select Total runtime: 78.000 ms again: Total runtime: 63.000 ms cluster pk_player on player; select Total runtime: 16.000 ms again: Total runtime: 0.000 ms set enable_seqscan = on; analyze verbose player; select Total runtime: 62.000 ms again: Total runtime: 78.000 ms Previously, we have also tried to use LIMIT 1 instead of DISTINCT, but the performance was no better: select PlayerID,AtDate from Player where PlayerID='22220' order by PlayerID desc, AtDate desc LIMIT 1 Any clue or suggestions would be most appreciated. If you need further info or the full explain logs, please let me know. Regards, KC Lau. ps. This problem probably should go to pgsql-performance mailing list, but I have sent this email to pgsql-performance@postgresql.org 3 times over the last 2 days and all of them seemed to have lost, even though I am receiving emails from both lists.
K C Lau <kclau60@netvigator.com> writes: > esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from > Player > where PlayerID='22220' order by PlayerID desc, AtDate desc; > Unique (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000 > rows=1 loops=1) > -> Index Scan Backward using pk_player on player (cost=0.00..2505.55 > rows=8 > 43 width=23) (actual time=0.000..187.000 rows=1227 loops=1) > Index Cond: ((playerid)::text = '22220'::text) > Total runtime: 187.000 ms > It appears that all the 1227 data records for that player were searched, > even when doing a backward index scan. I would presume that, after locating > the index for the highest AtDate, only the first data record needs to be > retrieved. If you'd said LIMIT 1, it indeed would have stopped sooner. Since you did not, it had to scan for more outputs. regards, tom lane
At 13:06 05/06/05, Tom Lane wrote: >K C Lau <kclau60@netvigator.com> writes: > > esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from > > Player > > where PlayerID='22220' order by PlayerID desc, AtDate desc; > > Unique (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000 > > rows=1 loops=1) > > -> Index Scan Backward using pk_player on player (cost=0.00..2505.55 > > rows=8 > > 43 width=23) (actual time=0.000..187.000 rows=1227 loops=1) > > Index Cond: ((playerid)::text = '22220'::text) > > Total runtime: 187.000 ms > > > It appears that all the 1227 data records for that player were searched, > > even when doing a backward index scan. I would presume that, after > locating > > the index for the highest AtDate, only the first data record needs to be > > retrieved. > >If you'd said LIMIT 1, it indeed would have stopped sooner. Since you >did not, it had to scan for more outputs. > > regards, tom lane I added LIMIT 1 to the query and it worked fine for a direct query on the table. However, our system uses views and joined views extensively and I still got the performance problem after many tries and finally modifying the view to: create or replace view VCurPlayer (...) as select a.... from Player a where a.AtDate = (select b.AtDate from Player b where b.PlayerID = a.PlayerID and b.AtDate = (select DISTINCT ON (c.PlayerID) c.AtDate from Player c where c.PlayerID = b.PlayerID and c.PlayerID = a.PlayerID order by c.PlayerID desc, c.AtDate desc LIMIT 1) order by b.PlayerID desc, b.AtDate desc LIMIT 1); esdt=> explain analyze select PlayerID,AtDate,version from VCurPlayer where PlayerID='22220'; Index Scan using pk_player on player a (cost=0.00..3969606927.13 rows=59 width =27) (actual time=328.000..328.000 rows=1 loops=1) Index Cond: ((playerid)::text = '22220'::text) Filter: ((atdate)::text = ((subplan))::text) SubPlan -> Limit (cost=0.00..337263.45 rows=1 width=23) (actual time=0.180..0.180 rows=1 loops=1743) -> Index Scan Backward using pk_player on player b (cost=0.00..1652 5909.12 rows=49 width=23) (actual time=0.180..0.180 rows=1 loops=1743) Index Cond: ((playerid)::text = ($1)::text) Filter: ((atdate)::text = ((subplan))::text) SubPlan -> Limit (cost=0.00..1697.53 rows=1 width=23) (actual time= 0.072..0.072 rows=1 loops=1743) -> Unique (cost=0.00..1697.53 rows=1 width=23) (actua l time=0.063..0.063 rows=1 loops=1743) -> Index Scan Backward using pk_player on player c (cost=0.00..1695.32 rows=885 width=23) (actual time=0.063..0.063 rows=1 loop s=1743) Index Cond: (((playerid)::text = ($0)::text ) AND ((playerid)::text = ($1)::text)) Total runtime: 328.000 ms Two subqueries are needed, whereas a single subquery would return no rows. Please note that when we first ported our system to PostgreSQL 7.1.1 a few years ago, we used LIMIT 1 (without DISTINCT) and we did not encounter this performance problem. Our client is currently using MS Windows (using SELECT TOP 1 construct) with no such problem either. Any suggestions? Regards, KC.