Re: Low performance SUM and GROUP BY on large database - Mailing list pgsql-performance

From Sergio Charpinel Jr.
Subject Re: Low performance SUM and GROUP BY on large database
Date
Msg-id AANLkTin5Qo_mFZuhOqVVZFobSHYp2-lzY4Cg8YKfUxYc@mail.gmail.com
In response to Re: Low performance SUM and GROUP BY on large database  ("Sergio Charpinel Jr." <sergiocharpinel@gmail.com>)
Responses Re: Low performance SUM and GROUP BY on large database
List pgsql-performance
Hi,

One more question, about the behavior of two specific queries: if I add "AND (ip_dst = x.x.x.x)", the planner chooses a different plan and the query takes much more time. Both queries use a WHERE clause on stamp_inserted. Why this behavior?

=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto, bytes, packets, flows FROM "acct_2010_26" WHERE "stamp_inserted">='2010-06-28 09:07' AND "stamp_inserted"<'2010-06-29 08:07' AND (ip_dst = '8.8.8.8') ORDER BY bytes DESC LIMIT 50 OFFSET 0;
                                                                                                     QUERY PLAN                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=496332.56..496332.69 rows=50 width=50) (actual time=125390.523..125390.540 rows=50 loops=1)
   ->  Sort  (cost=496332.56..496351.35 rows=7517 width=50) (actual time=125390.520..125390.525 rows=50 loops=1)
         Sort Key: bytes
         ->  Index Scan using acct_2010_26_pk on acct_2010_26  (cost=0.00..495848.62 rows=7517 width=50) (actual time=0.589..125385.680 rows=1011 loops=1)
               Index Cond: ((stamp_inserted >= '2010-06-28 09:07:00'::timestamp without time zone) AND (stamp_inserted < '2010-06-29 08:07:00'::timestamp without time zone) AND (ip_dst = '8.8.8.8'::inet))
 Total runtime: 125390.711 ms
(6 rows)


=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto, bytes, packets, flows FROM "acct_2010_26" WHERE "stamp_inserted">='2010-06-28 09:07' AND "stamp_inserted"<'2010-06-29 08:07' ORDER BY bytes DESC LIMIT 50 OFFSET 0;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..268.25 rows=50 width=50) (actual time=0.150..70.780 rows=50 loops=1)
   ->  Index Scan Backward using ibytes_acct_2010_26 on acct_2010_26  (cost=0.00..133240575.70 rows=24835384 width=50) (actual time=0.149..70.762 rows=50 loops=1)
         Filter: ((stamp_inserted >= '2010-06-28 09:07:00'::timestamp without time zone) AND (stamp_inserted < '2010-06-29 08:07:00'::timestamp without time zone))
 Total runtime: 70.830 ms
(4 rows)
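Comparing the plans, my guess is that without the ip_dst filter the planner can walk the bytes index backward and stop as soon as it has 50 rows, while with the filter it has to collect every matching row through the primary key before sorting by bytes. Would an index covering both filter columns help? Something like (index name is just a guess):

=> CREATE INDEX acct_2010_26_ipdst_stamp ON acct_2010_26 (ip_dst, stamp_inserted);

That way the ~1000 matching rows could be fetched directly instead of scanning the whole time range on acct_2010_26_pk.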


Thanks in advance.

2010/6/23 Sergio Charpinel Jr. <sergiocharpinel@gmail.com>
Craig, Russel,

I appreciate your help.

Thanks.

2010/6/22 Russell Smith <mr-russ@pws.com.au>

On 22/06/10 00:42, Sergio Charpinel Jr. wrote:
> Hi,
>
[snip]
>
> => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto,SUM("bytes"),SUM("packets"),SUM("flows") FROM
> "acct_2010_25" WHERE "stamp_inserted">='2010-06-20 10:10' AND
> "stamp_inserted"<'2010-06-21 10:10' GROUP BY ip_src, port_src, ip_dst,
> port_dst, tcp_flags, ip_proto order by SUM(bytes) desc LIMIT 50 OFFSET 0;
>
>                QUERY PLAN
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=3998662.81..3998662.94 rows=50 width=50) (actual
> time=276981.107..276981.133 rows=50 loops=1)
>    ->  Sort  (cost=3998662.81..4001046.07 rows=953305 width=50)
> (actual time=276981.105..276981.107 rows=50 loops=1)
>          Sort Key: sum(bytes)
>          ->  GroupAggregate  (cost=3499863.27..3754872.33 rows=953305
> width=50) (actual time=165468.257..182677.580 rows=8182616 loops=1)
>                ->  Sort  (cost=3499863.27..3523695.89 rows=9533049
> width=50) (actual time=165468.022..168908.828 rows=9494165 loops=1)
>                      Sort Key: ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto

You are having to sort and aggregate a large number of rows before you
can get the top 50.  That's about 9 million rows in this case, and at
width 50 that is a 400MB+ sort.  It's going to be slow because the sort
has to happen on disk unless you bump sort memory up to around 500MB
(a bad idea).  So unless you have really fast storage for temporary
files, it's going to take a while; the few minutes you are experiencing
at the moment is probably not too bad.

I'm sure improvements have been made in this area since 8.1, and if you
are able to upgrade to 8.4, which is also offered by CentOS 5 now, you
might get a benefit there.  I can't remember the specific changes, but I
believe sorting speed has improved, and on newer versions EXPLAIN
ANALYZE will also tell you whether each sort ran on disk or in memory.

>                      ->  Seq Scan on acct_2010_25
>  (cost=0.00..352648.10 rows=9533049 width=50) (actual
> time=0.038..50860.391 rows=9494165 loops=1)
>                            Filter: ((stamp_inserted >= '2010-06-20
> 10:10:00'::timestamp without time zone) AND (stamp_inserted <
> '2010-06-21 10:10:00'::timestamp without time zone))
>  Total runtime: 278791.661 ms
>  (9 rows)
>
> Another one just summing bytes (still low):
>
> => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto,SUM("bytes") FROM "acct_2010_25" WHERE
> "stamp_inserted">='2010-06-20 10:10' AND "stamp_inserted"<'2010-06-21
> 10:10' GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags,
> ip_proto LIMIT 50 OFFSET 0;
>
>             QUERY PLAN
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=3395202.50..3395213.12 rows=50 width=42) (actual
> time=106261.359..106261.451 rows=50 loops=1)
>    ->  GroupAggregate  (cost=3395202.50..3602225.48 rows=974226
> width=42) (actual time=106261.357..106261.435 rows=50 loops=1)
>          ->  Sort  (cost=3395202.50..3419558.14 rows=9742258 width=42)
> (actual time=106261.107..106261.169 rows=176 loops=1)
>                Sort Key: ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto
>                ->  Seq Scan on acct_2010_25  (cost=0.00..367529.72
> rows=9742258 width=42) (actual time=0.073..8058.598 rows=9494165 loops=1)
>                      Filter: ((stamp_inserted >= '2010-06-20
> 10:10:00'::timestamp without time zone) AND (stamp_inserted <
> '2010-06-21 10:10:00'::timestamp without time zone))
>  Total runtime: 109911.882 ms
>  (7 rows)
>
>
> The server has 2 Intel(R) Xeon(R) CPU  E5430 @ 2.66GHz and 16GB RAM.
> I'm using PostgreSQL 8.1.18 default config from Centos 5.5 (just
> increased checkpoint_segments to 50).

Checkpoint segments won't help you here: that setting governs how fast
writes to the database can happen, and these are read-only queries.

>
> What can I change to increase performance?

Increasing sort memory (work_mem) will give you speed benefits even
though you are still going to disk.  I don't know how much spare memory
you have, but trying values between 8MB and 128MB may be useful just
for these specific query runs.  If you can afford 512MB for each of the
two sorts, go for that, but it's dangerous, as mentioned, due to the
risk of using more RAM than you have: work_mem allocates that amount of
memory per sort, not per query or per connection.
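For a one-off run you can raise it for the current session only, rather
than in postgresql.conf; note that 8.1 takes the value in kilobytes, and
the numbers below are only illustrative:

=> SET work_mem = 131072;   -- 128MB, for this session only
=> -- rerun the slow query with EXPLAIN ANALYZE here
=> RESET work_mem;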

If you are running these queries all the time, a summary table that
produces these reports on a regular basis, maybe daily or even hourly,
would be useful.  Basically, the sheer amount of information that needs
to be processed and sorted is what's taking all the time here.
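A rough sketch of what I mean (the summary table name and the hourly
granularity are just illustrative):

=> CREATE TABLE acct_2010_25_hourly AS
     SELECT date_trunc('hour', stamp_inserted) AS stamp_hour,
            ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto,
            SUM(bytes) AS bytes, SUM(packets) AS packets, SUM(flows) AS flows
       FROM acct_2010_25
      GROUP BY date_trunc('hour', stamp_inserted),
            ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto;

Your reports would then query the much smaller hourly table and only
re-aggregate a handful of rows per group.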

Regards

Russell



--
Sergio Roberto Charpinel Jr.



--
Sergio Roberto Charpinel Jr.
