Home > mailing lists

Re: Group by more efficient than distinct? - Mailing list pgsql-performance

From	PFC
Subject	Re: Group by more efficient than distinct?
Date	April 21, 2008 20:29:57
Msg-id	op.t9zif2yocigqcu@apollo13.peufeu.com Whole thread Raw
In response to	Re: Group by more efficient than distinct? (Francisco Reyes <lists@stringsutils.com>)
Responses	Re: Group by more efficient than distinct?
List	pgsql-performance

Tree view

On Sun, 20 Apr 2008 17:15:36 +0200, Francisco Reyes
<lists@stringsutils.com> wrote:

> PFC writes:
>
>> - If you process up to some percentage of your RAM worth of data,
>> hashing  is going to be a lot faster
>
> Thanks for the excellent breakdown and explanation. I will try and get
> sizes of the tables in question and how much memory the machines have.

    Actually, the memory used by the hash depends on the number of distinct
values, not the number of rows which are processed...
    Consider :

SELECT a GROUP BY a
SELECT a,count(*) GROUP BY a

    In both cases the hash only holds discinct values. So if you have 1
million rows to process but only 10 distinct values of "a", the hash will
only contain those 10 values (and the counts), so it will be very small
and fast, it will absorb a huge seq scan without problem. If however, you
have (say) 100 million distinct values for a, using a hash would be a bad
idea. As usual, divide the size of your RAM by the number of concurrent
connections or something.
    Note that "a" could be a column, several columns, anything, the size of
the hash will be proportional to the number of distinct values, ie. the
number of rows returned by the query, not the number of rows processed
(read) by the query. Same with hash joins etc, that's why when you join a
very small table to a large one Postgres likes to use seq scan + hash join
on the small table.


>>     - If you need DISTINCT ON, well, you're stuck with the Sort
>>     - So, for the time being, you can replace DISTINCT with GROUP BY...
>
> Have seen a few of those already on some code (new job..) so for those
> it is a matter of having a good disk subsystem?

    Depends on your RAM, sorting in RAM is always faster than sorting on disk
of course, unless you eat all the RAM and trash the other processes.
Tradeoffs...

pgsql-performance by date:

From: Alvaro Herrera
Date: 21 April 2008, 13:54:33
Subject: Re: Re: [HACKERS] [COMMITTERS] pgsql: Fix TransactionIdIsCurrentTransactionId() to use binary search

From: Mark Mielke
Date: 21 April 2008, 20:50:29
Subject: Re: Group by more efficient than distinct?

Re: Group by more efficient than distinct? - Mailing list pgsql-performance

Previous

Next