Re: SELECT DISTINCT performance issue - Mailing list pgsql-general

From K C Lau
Subject Re: SELECT DISTINCT performance issue
Date
Msg-id 6.2.1.2.0.20050606121632.02c346d0@localhost
Whole thread Raw
In response to Re: SELECT DISTINCT performance issue  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
At 13:06 05/06/05, Tom Lane wrote:
>K C Lau <kclau60@netvigator.com> writes:
> > esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from
> > Player
> >   where PlayerID='22220' order by PlayerID desc, AtDate desc;
> >   Unique  (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000
> > rows=1 loops=1)
> >     ->  Index Scan Backward using pk_player on player  (cost=0.00..2505.55
> > rows=8
> > 43 width=23) (actual time=0.000..187.000 rows=1227 loops=1)
> >           Index Cond: ((playerid)::text = '22220'::text)
> >   Total runtime: 187.000 ms
>
> > It appears that all the 1227 data records for that player were searched,
> > even when doing a backward index scan. I would presume that, after
> locating
> > the index for the highest AtDate, only the first data record needs to be
> > retrieved.
>
>If you'd said LIMIT 1, it indeed would have stopped sooner.  Since you
>did not, it had to scan for more outputs.
>
>                         regards, tom lane

I added LIMIT 1 to the query and it worked fine for a direct query on the
table.

However, our system uses views and joined views extensively and I still got
the performance problem after many tries and finally modifying the view to:

create or replace view VCurPlayer (...) as
select a.... from Player a where a.AtDate =
(select b.AtDate from Player b where b.PlayerID = a.PlayerID and b.AtDate =
(select DISTINCT ON (c.PlayerID) c.AtDate from Player c
  where c.PlayerID = b.PlayerID and c.PlayerID = a.PlayerID
  order by c.PlayerID desc, c.AtDate desc LIMIT 1)
  order by b.PlayerID desc, b.AtDate desc LIMIT 1);

esdt=> explain analyze select PlayerID,AtDate,version from VCurPlayer
  where PlayerID='22220';

  Index Scan using pk_player on player a  (cost=0.00..3969606927.13 rows=59
width
=27) (actual time=328.000..328.000 rows=1 loops=1)
    Index Cond: ((playerid)::text = '22220'::text)
    Filter: ((atdate)::text = ((subplan))::text)
    SubPlan
      ->  Limit  (cost=0.00..337263.45 rows=1 width=23) (actual
time=0.180..0.180
  rows=1 loops=1743)
            ->  Index Scan Backward using pk_player on player
b  (cost=0.00..1652
5909.12 rows=49 width=23) (actual time=0.180..0.180 rows=1 loops=1743)
                  Index Cond: ((playerid)::text = ($1)::text)
                  Filter: ((atdate)::text = ((subplan))::text)
                  SubPlan
                    ->  Limit  (cost=0.00..1697.53 rows=1 width=23) (actual
time=
0.072..0.072 rows=1 loops=1743)
                          ->  Unique  (cost=0.00..1697.53 rows=1 width=23)
(actua
l time=0.063..0.063 rows=1 loops=1743)
                                ->  Index Scan Backward using pk_player on
player
  c  (cost=0.00..1695.32 rows=885 width=23) (actual time=0.063..0.063
rows=1 loop
s=1743)
                                      Index Cond: (((playerid)::text =
($0)::text
) AND ((playerid)::text = ($1)::text))
  Total runtime: 328.000 ms

Two subqueries are needed, whereas a single subquery would return no rows.

Please note that when we first ported our system to PostgreSQL 7.1.1 a few
years ago, we used LIMIT 1 (without DISTINCT) and we did not encounter this
performance problem. Our client is currently using MS Windows (using SELECT
TOP 1 construct) with no such problem either.

Any suggestions?

Regards, KC.


pgsql-general by date:

Previous
From: Michael Fuhr
Date:
Subject: Re: survey of the postgresql communiity
Next
From: John Sidney-Woollett
Date:
Subject: Download mirrors not found?