Re: How slow is distinct - 2nd - Mailing list pgsql-sql
From | Michael Contzen |
---|---|
Subject | Re: How slow is distinct - 2nd |
Date | |
Msg-id | 3DA17C30.70D81D14@dohle.com |
In response to | How slow is distinct - 2nd ("Michael Contzen" <Michael.Contzen@dohle.com>) |
List | pgsql-sql |
Bruno Wolff III wrote:
>
> On Tue, Oct 01, 2002 at 14:18:50 +0200,
>   Michael Contzen <Michael.Contzen@dohle.com> wrote:
> > Here the table:
> >
> > mc=# \d egal
> >          Table "public.egal"
> >  Column |  Type   | Modifiers
> > --------+---------+-----------
> >  i      | integer |
> >
> > mc=# select count(*) from egal;
> >   count
> > ---------
> >  7227744
> > (1 row)
> >
> > mc=# select count(distinct i) from egal;
> >  count
> > -------
> >     67
> > (1 row)
>
> This suggests that the best way to do this is with a hash instead of a sort.
>
> If you have lots of memory you might try increasing the sort memory size.

Hello,

OK, sort_mem was still set to the default (=1024). I've increased it to
sort_mem=10240, which results in (same machine, same data, etc.):

    time echo "select distinct i from egal;"|psql mc >/dev/null

    real    2m30.667s
    user    0m0.000s
    sys     0m0.010s

If I set sort_mem=1024000:

    time echo "select distinct i from egal;"|psql mc >/dev/null

    real    0m52.274s
    user    0m0.020s
    sys     0m0.000s

Wow, compared to nearly 5 minutes before, this is quite a good speedup.

But: all the work could be done in memory, as the processor load shows
(output of top, which showed the following the whole time):

      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
     8310 postgres  17   0  528M 528M  2712 R    99.9 13.5   0:11 postmaster

Even though it now runs nearly 5 times faster than before (with 1M of
memory), Postgres is still 8 times slower than Oracle. Increasing sort_mem
further to 4096000 doesn't reduce the time, as the CPU load cannot be
increased any more :-)

But increasing the memory in that way is not really a solution: normally not
all the data fits into memory. In our application I guess 10%.

Oracle has even less memory:

      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
     8466 oracle    14   0 11052  10M 10440 R    99.9  0.2   0:01 oracle

(that is 10M of session memory, plus a 32M shared memory pool not shown here).

This shows me that Oracle uses a quite different algorithm for this task.
Maybe it uses some hashing-like algorithm first, without sorting beforehand.
I don't know Oracle well enough; perhaps this is the "sort unique" step in
the planner's output.

I think Postgres first sorts all the data, which produces temporary data of
the same size as the input that needs to be written to disk at least once,
and only after that does Postgres do the unique operation, right?

If I can run any more tests on Oracle or Postgres, let me know.

Kind regards,

Michael Contzen
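
The two strategies discussed in this message (sort everything, then remove duplicates, versus a single hashing pass) can be sketched outside the database. What follows is only an illustration in Python, not PostgreSQL's or Oracle's actual implementation; the function names and the scaled-down synthetic data standing in for the table "egal" are assumptions made for the example.

```python
import random
from itertools import groupby


def distinct_by_sort(rows):
    """Sort every row, then collapse runs of equal neighbours.

    The sort has to hold (or spill to disk) all n rows, even when
    only a handful of distinct values exist.
    """
    ordered = sorted(rows)                 # O(n log n) time, O(n) space
    return [key for key, _ in groupby(ordered)]


def distinct_by_hash(rows):
    """Single pass that remembers each value the first time it is seen.

    Memory grows with the number of distinct values d, not with n.
    """
    seen = set()
    for value in rows:                     # O(n) time, O(d) space
        seen.add(value)
    return sorted(seen)                    # sort only the d survivors


# Hypothetical stand-in for "egal": many rows, exactly 67 distinct values.
# The row count is scaled down from the 7,227,744 rows in the post.
random.seed(0)
rows = [n % 67 for n in range(200_000)]
random.shuffle(rows)

by_sort = distinct_by_sort(rows)
by_hash = distinct_by_hash(rows)
```

Both functions return the same 67 values. The difference is in the working set: the first must order all 200,000 rows before it can discard anything, while the second never holds more than the 67 distinct values, which would be consistent with a process that needs very little memory for this query.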