Thread: How slow is distinct - 2nd

How slow is distinct - 2nd

From
"Michael Contzen"
Date:
Hello,

I posted some observations about the performance of Postgres some weeks
ago.
The problem with the poor performance of "select distinct" still exists,
but I have tried to work out some reproducible results in a less
complicated way than in my first postings.

'select distinct' performs about 30 times faster on Oracle than on
Postgres.

Well, only 'select distinct' is slow. All other parts of our application
perform quite well on Postgres (even with really large amounts of data).


(All tests on Postgres 7.3b1 and Oracle 9.0.1.3.0.
Hardware: 4 x 1800 MHz Xeon, 3 x 100 GB IDE software RAID 0,
SuSE Linux, 2.4.18 SMP kernel)


Here is the table:

mc=# \d egal
     Table "public.egal"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer |

mc=# select count(*) from egal;
  count
---------
 7227744
(1 row)

mc=# select count(distinct i) from egal;
 count
-------
    67
(1 row)

(The integers have values between 0 and 99, are an extract of our
warehouse, and are distributed nearly randomly.)

The last query, count(distinct), takes 1m30s, while Oracle needs 0m7s
for the same query and data. Getting the distinct rows and measuring the
time results in:

time echo "select distinct i from egal;"|psql >/dev/null

real    4m52.108s
user    0m0.000s
sys     0m0.010s

Here I don't understand the difference in performance between "select
distinct i" and "select count(distinct i)"; Oracle consistently takes 7s
for either query:

time echo "select distinct i from egal;"|sqlplus xxxx/yyyy >/dev/null

real    0m6.979s
user    0m0.020s
sys     0m0.030s

In the first case (count(distinct)) Postgres produces a temporary file
of 86 MB in pgsql_tmp; in the second case (select distinct) the
temp file grows to 260 MB. (This is even larger than the table size
of egal, which is 232 MB.)

I think the planner doesn't have many choices for this query, so it produces:
mc=# explain select distinct (i) from egal;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=1292118.29..1328257.01 rows=722774 width=4)
   ->  Sort  (cost=1292118.29..1310187.65 rows=7227744 width=4)
         Sort Key: i
         ->  Seq Scan on egal  (cost=0.00..104118.44 rows=7227744 width=4)
(4 rows)


Which looks similar to Oracle's plan:
QPLAN
---------------
  SORT UNIQUE
    TABLE ACCESS FULL EGAL

Our problem is that things get worse as the table size increases, as
described in my first mails (especially when the temp file is split
into parts). Reorganizing/normalizing the data is no solution, because
our application is just for analyzing the data.

I tried out the trigger idea: inserting a value into a table with a
unique key only if it doesn't already exist. It works, but select
distinct is still a much faster solution.
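
Roughly, the idea looks like this (a sketch only; the table, function,
and trigger names are made up for illustration, and it assumes the
plpgsql language is installed):

create table egal_distinct (i integer primary key);

create function track_distinct_i() returns trigger as '
begin
    -- insert the value only if it is not already known
    perform 1 from egal_distinct where i = new.i;
    if not found then
        insert into egal_distinct (i) values (new.i);
    end if;
    return new;
end;
' language 'plpgsql';

create trigger egal_track_distinct
    after insert on egal
    for each row execute procedure track_distinct_i();

Afterwards the distinct values can be read straight out of
egal_distinct; presumably it is the per-row trigger overhead during
loading that makes this slower overall.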

Another way I tried is a PL/Tcl function which stores all the distinct
values in an associative array in memory; it is by far the fastest
solution, but returning the distinct values gives me problems (I
shouldn't say it, but I write them into a file and "copy" them back
in). And it's not very nice, not as nice as "select distinct" :)
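
The core of it is just this (an illustrative sketch; this variant only
counts the distinct values, returning them as rows is the part that
gives me trouble):

create function count_distinct_i() returns integer as '
    # scan the table once and collect every value as a key of a
    # Tcl associative array, i.e. an in-memory hash table
    spi_exec -array row "select i from egal" {
        set seen($row(i)) 1
    }
    return [array size seen]
' language 'pltcl';

With only 67 distinct values the array stays tiny, which is why this
runs so much faster than the sort.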

Kind regards,

Michael Contzen




Re: How slow is distinct - 2nd

From
Tom Lane
Date:
"Michael Contzen" <Michael.Contzen@dohle.com> writes:
> [ select distinct takes a long time on 7+ million rows ]

What do you have sort_mem set to?  The default value is mighty small,
and that would translate directly to poor performance in DISTINCT.

Still though, the speed differential against Oracle is surprising,
given that they don't seem to be using a fundamentally different
implementation.  Time to get out the profiler ...
        regards, tom lane


Re: How slow is distinct - 2nd

From
Bruno Wolff III
Date:
On Tue, Oct 01, 2002 at 14:18:50 +0200, Michael Contzen <Michael.Contzen@dohle.com> wrote:
> Here the table:
> 
> mc=# \d egal
>      Table "public.egal"
>  Column |  Type   | Modifiers 
> --------+---------+-----------
>  i      | integer | 
> 
> mc=# select count(*) from egal;
>   count  
> ---------
>  7227744
> (1 row)
> 
> mc=# select count(distinct i) from egal;
>  count 
> -------
>     67
> (1 row)

This suggests that the best way to do this is with a hash instead of a sort.

If you have lots of memory you might try increasing the sort memory size.
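
For example, a query like

    select i from egal group by i;

returns the same 67 rows, and an executor that computed the GROUP BY
with an in-memory hash table (one entry per distinct value) would never
have to spill to disk. As far as I know 7.3 still sorts for GROUP BY as
well, so take this as the idea rather than a workaround.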


Re: How slow is distinct - 2nd

From
Michael Contzen
Date:
Bruno Wolff III wrote:
> 
> On Tue, Oct 01, 2002 at 14:18:50 +0200,
>   Michael Contzen <Michael.Contzen@dohle.com> wrote:
> > Here the table:
> >
> > mc=# \d egal
> >      Table "public.egal"
> >  Column |  Type   | Modifiers
> > --------+---------+-----------
> >  i      | integer |
> >
> > mc=# select count(*) from egal;
> >   count
> > ---------
> >  7227744
> > (1 row)
> >
> > mc=# select count(distinct i) from egal;
> >  count
> > -------
> >     67
> > (1 row)
> 
> This suggests that the best way to do this is with a hash instead of a sort.
> 
> If you have lots of memory you might try increasing the sort memory size.

Hello,

ok, sort_mem was still set to the default (=1024). I've increased it to
sort_mem=10240, which results in the following (same machine, same data, etc.):

time echo "select distinct i from egal;"|psql mc >/dev/null

real    2m30.667s
user    0m0.000s
sys     0m0.010s
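
(For reference, one way to change the setting is per session, e.g.

    set sort_mem = 10240;   -- the value is in kilobytes

or permanently in postgresql.conf.)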

If I set sort_mem=1024000:

time echo "select distinct i from egal;"|psql mc >/dev/null
real    0m52.274s
user    0m0.020s
sys     0m0.000s

wow, in comparison to nearly 5 minutes before this is quite a good
speedup.

But: 

All the work could be done in memory, as the processor load shows (top
displays the following during the whole run):

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 8310 postgres  17   0  528M 528M  2712 R    99.9 13.5   0:11 postmaster

Even though it runs nearly 5 times faster than before with
sort_mem=1024000, Postgres is still 8 times slower than Oracle. Further
increasing sort_mem to 4096000 doesn't reduce the time, as the CPU load
cannot increase any more :-)

But increasing the memory in that way is not really a solution:
normally not all the data fits into memory. In our application I guess
10% does.

Oracle uses even less memory:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 8466 oracle    14   0 11052  10M 10440 R    99.9  0.2   0:01 oracle

(That is 10 MB of session memory, plus a 32 MB shared memory pool not
shown here.)

This suggests to me that Oracle uses a quite different algorithm for
this task. Maybe it uses some hashing-like algorithm first, without
sorting; I don't know Oracle well enough, perhaps that is what the
"SORT UNIQUE" step in its plan output means.
I think Postgres first sorts all the data, which results in temporary
data of the same size as before and which needs to be written to disk
at least once, and only after that does Postgres perform the unique
operation, right?

If I can do any more tests on Oracle or Postgres, let me know.

Kind regards,

Michael Contzen