Thread: How slow is distinct - 2nd
Hello,

I posted some observations about the performance of Postgres a few weeks ago. The problem with the poor performance of "select distinct" still exists, but I have tried to work out some reproducible results in a less complicated way than in my first postings.

'select distinct' performs about 30 times faster on Oracle than on Postgres. Only 'select distinct' is slow; all other parts of our application perform quite well on Postgres, even with really large amounts of data.

(All tests on Postgres 7.3b1 and Oracle 9.0.1.3.0. Hardware: 4 * 1800 MHz Xeon, 3 * 100 GB IDE software RAID-0, SuSE Linux with a 2.4.18 SMP kernel.)

Here is the table:

mc=# \d egal
     Table "public.egal"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer |

mc=# select count(*) from egal;
  count
---------
 7227744
(1 row)

mc=# select count(distinct i) from egal;
 count
-------
    67
(1 row)

(The integers have values between 0 and 99, are an extract of our warehouse, and are distributed nearly randomly.)

The last query - count(distinct) - takes 1m30s, while Oracle needs 0m7s for the same query and data. Getting the distinct rows and measuring the time gives:

time echo "select distinct i from egal;" | psql > /dev/null

real    4m52.108s
user    0m0.000s
sys     0m0.010s

Here I don't understand the difference in performance between "select distinct i" and "select count(distinct i)" - Oracle constantly takes 7s for either query:

time echo "select distinct i from egal;" | sqlplus xxxx/yyyy > /dev/null

real    0m6.979s
user    0m0.020s
sys     0m0.030s

In the first case (count(distinct)) Postgres produces a temporary file of 86 MB in pgsql_tmp; in the second case (select distinct) the temp file grows to 260 MB, which is even larger than the table size of egal (232 MB).

I think the planner does not have many choices for this query, so it ends up with:

mc=# explain select distinct (i) from egal;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=1292118.29..1328257.01 rows=722774 width=4)
   ->  Sort  (cost=1292118.29..1310187.65 rows=7227744 width=4)
         Sort Key: i
         ->  Seq Scan on egal  (cost=0.00..104118.44 rows=7227744 width=4)
(4 rows)

which looks similar to Oracle's plan:

QPLAN
---------------
SORT UNIQUE
  TABLE ACCESS FULL EGAL

Our problem is that things get worse as the table size increases, as described in my first mails (especially when the temp file is split into parts). Reorganizing/normalizing the data is no solution, because our application is just for analyzing the data.

I tried out the trigger idea: inserting a value into a table with a unique key only if it does not already exist (a simplified sketch follows below). It works, but ... select distinct is a much faster solution.

Another way I tried is a PL/Tcl function which stores all the different values in an associative array in memory - by far the fastest solution - but returning the different values gives me trouble (I shouldn't say it, but I write them to a file and "copy" them back in). And it's not very nice - not as nice as "select distinct" :)

Kind regards,

Michael Contzen
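P.S.: A simplified sketch of the trigger approach mentioned above. The side table, function, and trigger names are just made up for the example, and it assumes PL/pgSQL is installed in the database:

CREATE TABLE egal_distinct (i integer PRIMARY KEY);

CREATE OR REPLACE FUNCTION egal_track_distinct() RETURNS trigger AS '
BEGIN
    -- remember the value only if it has not been recorded before
    PERFORM 1 FROM egal_distinct WHERE i = NEW.i;
    IF NOT FOUND THEN
        INSERT INTO egal_distinct (i) VALUES (NEW.i);
    END IF;
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER egal_track_distinct_trg
    BEFORE INSERT ON egal
    FOR EACH ROW EXECUTE PROCEDURE egal_track_distinct();

-- afterwards the distinct values can be read without scanning egal at all
SELECT i FROM egal_distinct;

The obvious cost is the extra lookup on every insert, which is why loading the data slows down so much compared with running a plain "select distinct" afterwards.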
"Michael Contzen" <Michael.Contzen@dohle.com> writes: > [ select distinct takes a long time on 7+ million rows ] What do you have sort_mem set to? The default value is mighty small, and that would translate directly to poor performance in DISTINCT. Still though, the speed differential against Oracle is surprising, given that they don't seem to be using a fundamentally different implementation. Time to get out the profiler ... regards, tom lane
On Tue, Oct 01, 2002 at 14:18:50 +0200,
  Michael Contzen <Michael.Contzen@dohle.com> wrote:
> Here is the table:
>
> mc=# \d egal
>      Table "public.egal"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  i      | integer |
>
> mc=# select count(*) from egal;
>   count
> ---------
>  7227744
> (1 row)
>
> mc=# select count(distinct i) from egal;
>  count
> -------
>     67
> (1 row)

This suggests that the best way to do this is with a hash instead of a sort.

If you have lots of memory you might try increasing the sort memory size.
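As a sketch of the hash idea: the equivalent GROUP BY form of the query only needs to keep the 67 distinct values in memory, instead of sorting all 7 million rows. Whether the planner can actually use a hash-based aggregate for it depends on the PostgreSQL version; 7.3 still sorts for this query, hash aggregation only arrived in later releases:

SELECT i FROM egal GROUP BY i;

-- compare the plan against the DISTINCT form; on versions with hash
-- aggregation this can avoid the Sort + Unique steps entirely
EXPLAIN SELECT i FROM egal GROUP BY i;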
Bruno Wolff III wrote:
>
> On Tue, Oct 01, 2002 at 14:18:50 +0200,
>   Michael Contzen <Michael.Contzen@dohle.com> wrote:
> > Here is the table:
> >
> > mc=# \d egal
> >      Table "public.egal"
> >  Column |  Type   | Modifiers
> > --------+---------+-----------
> >  i      | integer |
> >
> > mc=# select count(*) from egal;
> >   count
> > ---------
> >  7227744
> > (1 row)
> >
> > mc=# select count(distinct i) from egal;
> >  count
> > -------
> >     67
> > (1 row)
>
> This suggests that the best way to do this is with a hash instead of a sort.
>
> If you have lots of memory you might try increasing the sort memory size.

Hello,

OK, sort_mem was still set to the default (1024). I have increased it to sort_mem = 10240, which results in (same machine, same data, etc.):

time echo "select distinct i from egal;" | psql mc > /dev/null

real    2m30.667s
user    0m0.000s
sys     0m0.010s

If I set sort_mem = 1024000:

time echo "select distinct i from egal;" | psql mc > /dev/null

real    0m52.274s
user    0m0.020s
sys     0m0.000s

Wow, compared to nearly 5 minutes before this is quite a good speedup. But all the work could be done in memory, as the processor load shows (output of top, which looks like this the whole time):

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM   TIME COMMAND
 8310 postgres  17   0  528M  528M  2712 R    99.9 13.5   0:11 postmaster

Even though it runs nearly 5 times faster than before with 1 MB of sort memory, Postgres is still 8 times slower than Oracle. Increasing sort_mem further to 4096000 does not reduce the time, as the CPU load cannot be increased any more :-)

But increasing the memory in that way is not really a solution: normally not all the data fits into memory - in our application I guess about 10% does. Oracle uses even less memory:

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM   TIME COMMAND
 8466 oracle    14   0 11052   10M 10440 R    99.9  0.2   0:01 oracle

(that is, 10 MB of session memory plus the 32 MB shared memory pool not shown here).

This shows me that Oracle uses a quite different algorithm for this task. Maybe it uses some hashing-like algorithm, without sorting first. I don't know Oracle well enough; perhaps that is what the "SORT UNIQUE" step in its plan output means.

I think Postgres first sorts all the data, which produces temporary data of the same size as the input and has to be written to disk at least once, and only after that does the unique operation, right?

If I can do any more tests on Oracle or Postgres, let me know.

Kind regards,

Michael Contzen
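A further workaround, not tried in this thread but worth sketching for data like this (67 distinct values in 7 million rows), is to step from one distinct value to the next via an index instead of sorting anything. The sketch below assumes an index on egal(i) exists, uses an illustrative PL/pgSQL set-returning function (which needs PostgreSQL 7.3 or later), and the function name is made up for the example:

CREATE INDEX egal_i_idx ON egal (i);

CREATE OR REPLACE FUNCTION egal_distinct_vals() RETURNS SETOF integer AS '
DECLARE
    v integer;
BEGIN
    -- smallest value first, then repeatedly the next value above the
    -- current one; each step is one short index scan instead of a sort
    SELECT INTO v i FROM egal ORDER BY i LIMIT 1;
    WHILE v IS NOT NULL LOOP
        RETURN NEXT v;
        SELECT INTO v i FROM egal WHERE i > v ORDER BY i LIMIT 1;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;

SELECT * FROM egal_distinct_vals();

With only 67 distinct values this does 68 cheap index probes in total, so it avoids both the full sort and the large temporary files, at the price of maintaining the extra index.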