Thread: Really SLOW using GROUP BY ...!?

Really SLOW using GROUP BY ...!?

From
Hervé Piedvache
Date:
Hi,

I would like to know if someone has a solution for me ...

I have a table with, as of today, about 2.8 million records.
The table has good indexes, in our case on idcond and points!

When I do:
select sum(points) from gains;

With Oracle: 8 sec
With PostgreSQL: 10 sec

OK, for this it's fine ...

But when I do:
select sum(points) from gains group by idcond;

With Oracle: 22 sec
With PostgreSQL: about 3 minutes!!!

I did a VACUUM ANALYZE of the table gains just before testing ...
What can I do to optimize that result?

Why is GROUP BY so slow??

Thanks for your help!

Regards,
--
Hervé

RE: Really SLOW using GROUP BY ...!?

From
"Mikheev, Vadim"
Date:
> But when I do:
> select sum(points) from gains group by idcond;
>
> With Oracle: 22 sec
> With PostgreSQL: about 3 minutes!!!

Try

select sum(points) from gains where idcond >= _minimum_id_cond_value_
group by idcond;

to see if forced index usage will help. Unfortunately, PG will try to
sort the result before grouping anyway, but this trick may help
somewhat. Also, use the -S 2048 (or more) backend argument to increase
the sort memory size.

Vadim

Re: Really SLOW using GROUP BY ...!?

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> Try
> select sum(points) from gains where idcond >= _minimum_id_cond_value_
> group by idcond;

> to see if forced index usage will help. Unfortunately, PG will try to
> sort the result before grouping anyway,

I beg your pardon?

regression=# set enable_seqscan TO on;
SET VARIABLE
regression=# explain select sum(unique1) from tenk1 group by hundred;
NOTICE:  QUERY PLAN:

Aggregate  (cost=997.39..1047.39 rows=1000 width=8)
  ->  Group  (cost=997.39..1022.39 rows=10000 width=8)
        ->  Sort  (cost=997.39..997.39 rows=10000 width=8)
              ->  Seq Scan on tenk1  (cost=0.00..333.00 rows=10000 width=8)

EXPLAIN
regression=# set enable_seqscan TO off;
SET VARIABLE
regression=# explain select sum(unique1) from tenk1 group by hundred;
NOTICE:  QUERY PLAN:

Aggregate  (cost=0.00..1119.88 rows=1000 width=8)
  ->  Group  (cost=0.00..1094.88 rows=10000 width=8)
        ->  Index Scan using tenk1_hundred on tenk1  (cost=0.00..1069.88 rows=10000 width=8)

EXPLAIN

Unfortunately neither of these plans is likely to be especially speedy
on ~3 million rows.  The index scan will just thrash the disk, unless
the table has been clustered recently --- and given the deficiencies of
our CLUSTER implementation, I'd hesitate to recommend using it.
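
(For reference, the commands would look something like this, assuming
an index on the grouping column; the index name is illustrative. Keep
in mind that CLUSTER recreates the table, and the table's other
indexes do not survive the process:)

CREATE INDEX gains_idcond_idx ON gains (idcond);
CLUSTER gains_idcond_idx ON gains;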

I have a personal TODO item to see about implementing group + aggregate
with a hash table of active aggregate values, per a suggestion recently
from devik@cdi.cz.  That would allow this query to be done with a
sequential scan and no sort, which is probably what Oracle is doing.
Won't happen for 7.1 though ...
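
In plan terms, the Sort and Group steps would disappear: one sequential
pass keeps a running sum per distinct idcond in an in-memory hash table
and emits the totals at the end. Hypothetically (this is not EXPLAIN
output from any existing release), the plan would collapse to:

Aggregate (hash on idcond)
  ->  Seq Scan on gains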

            regards, tom lane

RE: Really SLOW using GROUP BY ...!?

From
"Mikheev, Vadim"
Date:
> I beg your pardon?

Got it - I didn't know about enable_seqscan -:)

> Unfortunately neither of these plans is likely to be especially speedy
> on ~3 million rows.  The index scan will just thrash the disk, unless
> the table has been clustered recently --- and given the
> deficiencies of our CLUSTER implementation, I'd hesitate to recommend
> using it.

Easy to try - why not do it.

> I have a personal TODO item to see about implementing group +
> aggregate with a hash table of active aggregate values, per a
> suggestion recently from devik@cdi.cz.  That would allow this
> query to be done with a sequential scan and no sort, which is
> probably what Oracle is doing. Won't happen for 7.1 though ...

Well, definitely a good approach. But for the moment, increasing the
sort memory is the only hope, and it should help.

Vadim

Re: Really SLOW using GROUP BY ...!?

From
Hervé Piedvache
Date:
Hi,

"Mikheev, Vadim" a écrit :
>
> > But when I do:
> > select sum(points) from gains group by idcond;
> >
> > With Oracle: 22 sec
> > With PostgreSQL: about 3 minutes!!!
>
> Try
>
> select sum(points) from gains where idcond >= _minimum_id_cond_value_
> group by idcond;
>
> to see if forced index usage will help. Unfortunately, PG will try to
> sort the result before grouping anyway, but this trick may help
> somewhat. Also, use the -S 2048 (or more) backend argument to increase
> the sort memory size.

I'm using -S 512000 ;))

The result I get is not good ;)

pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

:(

So I VACUUM ANALYZE my table ...

Then I get exactly the same result: about 3 min!

And my table has 2729276 records.

The EXPLAIN of your query gives:
NOTICE:  QUERY PLAN:

Aggregate  (cost=488140.30..501773.03 rows=272655 width=8)
  ->  Group  (cost=488140.30..494956.67 rows=2726547 width=8)
        ->  Sort  (cost=488140.30..488140.30 rows=2726547 width=8)
              ->  Seq Scan on gains  (cost=0.00..62890.95 rows=2726547 width=8)


And with my usual query:

NOTICE:  QUERY PLAN:

Aggregate  (cost=481763.55..495409.93 rows=272928 width=8)
  ->  Group  (cost=481763.55..488586.74 rows=2729276 width=8)
        ->  Sort  (cost=481763.55..481763.55 rows=2729276 width=8)
              ->  Seq Scan on gains  (cost=0.00..56067.76 rows=2729276 width=8)


It seems to have no effect!?

--
Hervé Piedvache

Elma Ingenierie Informatique
6, rue du Faubourg Saint-Honoré
F-75008 - Paris - France
http://www.elma.fr
Tel: +33-1-44949901
Fax: +33-1-44949902
Email: herve@elma.fr

RE: Really SLOW using GROUP BY ...!?

From
"Mikheev, Vadim"
Date:
> > somewhat. Also, use the -S 2048 (or more) backend argument to increase
> > the sort memory size.
>
> I'm using -S 512000 ;))

512MB?! Isn't that too much?

> The result I get is not good ;)
>
> pqReadData() -- backend closed the channel unexpectedly.
> This probably means the backend terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> :(
>
> So I VACUUM ANALYZE my table ...
>
> Then I get exactly the same result: about 3 min!

Seems we have to wait for the new aggregate + group implementation, sorry -:(

Vadim

Re: Really SLOW using GROUP BY ...!?

From
Hervé Piedvache
Date:
Hi Tom,

Tom Lane wrote:
>
> Unfortunately neither of these plans is likely to be especially speedy
> on ~3 million rows.  The index scan will just thrash the disk, unless
> the table has been clustered recently --- and given the deficiencies of
> our CLUSTER implementation, I'd hesitate to recommend using it.

Sorry, but I don't understand ... are you telling me not to use the
CLUSTER implementation?
What is the risk of using it?

What can I do to solve my slow GROUP BY problem? Just wait for you to
implement the feature you mentioned above ... ?

GROUP BY is a standard SQL construct; what can I do to work around this
problem? Some other SQL method?

Thanks for your reply,


> I have a personal TODO item to see about implementing group + aggregate
> with a hash table of active aggregate values, per a suggestion recently
> from devik@cdi.cz.  That would allow this query to be done with a
> sequential scan and no sort, which is probably what Oracle is doing.
> Won't happen for 7.1 though ...
>
>                         regards, tom lane

Regards,
--
Hervé