Re: How slow is distinct - 2nd - Mailing list pgsql-sql

From Michael Contzen
Subject Re: How slow is distinct - 2nd
Date
Msg-id 3DA17C30.70D81D14@dohle.com
Whole thread Raw
In response to How slow is distinct - 2nd  ("Michael Contzen" <Michael.Contzen@dohle.com>)
List pgsql-sql
Bruno Wolff III schrieb:
> 
> On Tue, Oct 01, 2002 at 14:18:50 +0200,
>   Michael Contzen <Michael.Contzen@dohle.com> wrote:
> > Here the table:
> >
> > mc=# \d egal
> >      Table "public.egal"
> >  Column |  Type   | Modifiers
> > --------+---------+-----------
> >  i      | integer |
> >
> > mc=# select count(*) from egal;
> >   count
> > ---------
> >  7227744
> > (1 row)
> >
> > mc=# select count(distinct i) from egal;
> >  count
> > -------
> >     67
> > (1 row)
> 
> This suggests that the best way to do this is with a hash instead of a sort.
> 
> If you have lots of memory you might try increasing the sort memory size.

Hello,

ok, sort_mem was still set to the default (=1024). I've increased it to  sort_mem=10240
which results to: (same machine, same data, etc.)

time echo "select distinct i from egal;"|psql mc >/dev/null

real    2m30.667s
user    0m0.000s
sys     0m0.010s

If I set sort_mem=1024000:

time echo "select distinct i from egal;"|psql mc >/dev/null
real    0m52.274s
user    0m0.020s
sys     0m0.000s

wow, in comparison to nearly 5 minutes before this is quite good
speedup.

But: 

All the work could be done in memory as the processor load shows (output
of top, which shows the following output during all the time)

 PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND8310 postgres  17   0  528M 528M  2712 R    99.9
13.5  0:11 postmaster
 

Even it nearly performs 5 times faster than before with 1M memory,
postgres is still
8 times slower than oracle. Further increasing of sort_mem to 4096000
doesn't reduce the
time, as the cpu load cannot increased any more :-)

But increasing the memory in that way is not realy a solution: Normaly
not all the data
fits into memory. In our application I guess 10%.

Oracle has even less memory:
 PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND8466 oracle    14   0 11052  10M 10440 R    99.9
0.2  0:01 oracle
 

(this 10M session memory plus 32M shared memory pool not shown here).

This shows to me, that oracle uses a quite different algorithm for this
task. May be it
uses some hashing-like algorithm first without sorting before. I don't
know oracle enough, perhaps this is that "sort unique" step in the
planners output. 
I think, first Postgres sorts all the data, which results to temporary
data of the same
size than before and which needs to be written to disk at least once,
and after that postgres does the unique operation, right? 

If I can do any more tests to oracle or postgres, let me know.

Kind regards,

Michael Contzen


pgsql-sql by date:

Previous
From: "Josh Berkus"
Date:
Subject: Re: [NOVICE] update question
Next
From: "Ian Harding"
Date:
Subject: Re: Temporary tables and indexes