Re: sequential scan on select distinct - Mailing list pgsql-performance

From Pierre-Frédéric Caillaud
Subject Re: sequential scan on select distinct
Date
Msg-id opsff1mmp8cq72hf@musicbox
Whole thread Raw
In response to sequential scan on select distinct  (Ole Langbehn <ole@freiheit.com>)
Responses Re: sequential scan on select distinct
Re: sequential scan on select distinct
List pgsql-performance
    You could try :

    explain analyze select "land" from "customer_dim" group by "land";
    It will be a lot faster but I can't make it use the index on my machine...

    Example :

    create table dummy as (select id, id%255 as number from a large table
with 1M rows);
    so we have a table with 256 (0-255) disctinct "number" values.

--------------------------------------------------------------------------------
=> explain analyze select distinct number from dummy;
  Unique  (cost=69.83..74.83 rows=200 width=4) (actual
time=13160.490..14414.004 rows=255 loops=1)
    ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual
time=13160.483..13955.792 rows=1000000 loops=1)
          Sort Key: number
          ->  Seq Scan on dummy  (cost=0.00..20.00 rows=1000 width=4)
(actual time=0.052..1759.145 rows=1000000 loops=1)
  Total runtime: 14442.872 ms

=>    Horribly slow because it has to sort 1M rows for the Unique.

--------------------------------------------------------------------------------
=> explain analyze select number from dummy group by number;
  HashAggregate  (cost=22.50..22.50 rows=200 width=4) (actual
time=1875.214..1875.459 rows=255 loops=1)
    ->  Seq Scan on dummy  (cost=0.00..20.00 rows=1000 width=4) (actual
time=0.107..1021.014 rows=1000000 loops=1)
  Total runtime: 1875.646 ms

=>    A lot faster because it HashAggregates instead of sorting (but still
seq scan)

--------------------------------------------------------------------------------
Now :
create index dummy_idx on dummy(number);
Let's try again.
--------------------------------------------------------------------------------
explain analyze select distinct number from dummy;
  Unique  (cost=0.00..35301.00 rows=200 width=4) (actual
time=0.165..21781.732 rows=255 loops=1)
    ->  Index Scan using dummy_idx on dummy  (cost=0.00..32801.00
rows=1000000 width=4) (actual time=0.162..21154.752 rows=1000000 loops=1)
  Total runtime: 21782.270 ms

=> Index scan the whole table. argh. I should have ANALYZized.
--------------------------------------------------------------------------------
explain analyze select number from dummy group by number;
  HashAggregate  (cost=17402.00..17402.00 rows=200 width=4) (actual
time=1788.425..1788.668 rows=255 loops=1)
    ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4)
(actual time=0.048..960.063 rows=1000000 loops=1)
  Total runtime: 1788.855 ms
=>    Still the same...
--------------------------------------------------------------------------------
Let's make a function :
The function starts at the lowest number and advances to the next number
in the index until they are all exhausted.


CREATE OR REPLACE FUNCTION sel_distinct()
    RETURNS SETOF INTEGER
    LANGUAGE plpgsql
    AS '
DECLARE
    pos INTEGER;
BEGIN
    SELECT INTO pos number FROM dummy ORDER BY number ASC LIMIT 1;
    IF NOT FOUND THEN
        RAISE NOTICE ''no records.'';
        RETURN;
    END IF;

    LOOP
        RETURN NEXT pos;
        SELECT INTO pos number FROM dummy WHERE number>pos ORDER BY number ASC
LIMIT 1;
        IF NOT FOUND THEN
            RETURN;
        END IF;
    END LOOP;
END;
';

explain analyze select * from sel_distinct();
  Function Scan on sel_distinct  (cost=0.00..12.50 rows=1000 width=4)
(actual time=215.472..215.696 rows=255 loops=1)
  Total runtime: 215.839 ms

    That's better !
--------------------------------------------------------------------------------
Why not use DESC instead of ASC ?

CREATE OR REPLACE FUNCTION sel_distinct()
    RETURNS SETOF INTEGER
    LANGUAGE plpgsql
    AS '
DECLARE
    pos INTEGER;
BEGIN
    SELECT INTO pos number FROM dummy ORDER BY number DESC LIMIT 1;
    IF NOT FOUND THEN
        RAISE NOTICE ''no records.'';
        RETURN;
    END IF;

    LOOP
        RETURN NEXT pos;
        SELECT INTO pos number FROM dummy WHERE number<pos ORDER BY number DESC
LIMIT 1;
        IF NOT FOUND THEN
            RETURN;
        END IF;
    END LOOP;
END;
';

explain analyze select * from sel_distinct();
  Function Scan on sel_distinct  (cost=0.00..12.50 rows=1000 width=4)
(actual time=13.500..13.713 rows=255 loops=1)
  Total runtime: 13.857 ms

    Hum hum ! Again, a lot better !
    Index scan backwards seems a lot faster than index scan forwards. Why, I
don't know, but here you go from 15 seconds to 14 milliseconds...

    I don't know WHY (oh why) postgres does not use this kind of strategy
when distinct'ing an indexed field... Anybody got an idea ?



pgsql-performance by date:

Previous
From: "Alban Médici (NetCentrex)"
Date:
Subject: stats on cursor and query execution troubleshooting
Next
From: Alan Stange
Date:
Subject: Re: Excessive context switching on SMP Xeons