Thread: Low performance SUM and GROUP BY on large database

Low performance SUM and GROUP BY on large database

From
"Sergio Charpinel Jr."
Date:
Hi,

I'm getting low performance on SUM and GROUP BY queries.
How can I improve my database to speed up such queries?

Here is my table schema:
=> \d acct_2010_25
                                             Table "public.acct_2010_25"
     Column     |            Type             |                             Modifiers                              
----------------+-----------------------------+------------------------------------------------------------------------
 ip_src         | inet                        | not null default '0.0.0.0'::inet
 ip_dst         | inet                        | not null default '0.0.0.0'::inet
 as_src         | bigint                      | not null default 0
 as_dst         | bigint                      | not null default 0
 port_src       | integer                     | not null default 0
 port_dst       | integer                     | not null default 0
 tcp_flags      | smallint                    | not null default 0
 ip_proto       | smallint                    | not null default 0
 packets        | integer                     | not null
 flows          | integer                     | not null default 0
 bytes          | bigint                      | not null
 stamp_inserted | timestamp without time zone | not null default '0001-01-01 00:00:00 BC'::timestamp without time zone
 stamp_updated  | timestamp without time zone | 
Indexes:
    "acct_2010_25_pk" PRIMARY KEY, btree (stamp_inserted, ip_src, ip_dst, port_src, port_dst, ip_proto)
    "ibytes_acct_2010_25" btree (bytes)

Here is one example query (I could also add the flow and packet fields to the PK):

=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto,SUM("bytes"),SUM("packets"),SUM("flows") FROM "acct_2010_25" WHERE "stamp_inserted">='2010-06-20 10:10' AND "stamp_inserted"<'2010-06-21 10:10' GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto order by SUM(bytes) desc LIMIT 50 OFFSET 0;
                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3998662.81..3998662.94 rows=50 width=50) (actual time=276981.107..276981.133 rows=50 loops=1)
   ->  Sort  (cost=3998662.81..4001046.07 rows=953305 width=50) (actual time=276981.105..276981.107 rows=50 loops=1)
         Sort Key: sum(bytes)
         ->  GroupAggregate  (cost=3499863.27..3754872.33 rows=953305 width=50) (actual time=165468.257..182677.580 rows=8182616 loops=1)
               ->  Sort  (cost=3499863.27..3523695.89 rows=9533049 width=50) (actual time=165468.022..168908.828 rows=9494165 loops=1)
                     Sort Key: ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto
                     ->  Seq Scan on acct_2010_25  (cost=0.00..352648.10 rows=9533049 width=50) (actual time=0.038..50860.391 rows=9494165 loops=1)
                           Filter: ((stamp_inserted >= '2010-06-20 10:10:00'::timestamp without time zone) AND (stamp_inserted < '2010-06-21 10:10:00'::timestamp without time zone))
 Total runtime: 278791.661 ms
(9 rows)

Another one, just summing bytes (still slow):

=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto,SUM("bytes") FROM "acct_2010_25" WHERE "stamp_inserted">='2010-06-20 10:10' AND "stamp_inserted"<'2010-06-21 10:10' GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto LIMIT 50 OFFSET 0;
                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3395202.50..3395213.12 rows=50 width=42) (actual time=106261.359..106261.451 rows=50 loops=1)
   ->  GroupAggregate  (cost=3395202.50..3602225.48 rows=974226 width=42) (actual time=106261.357..106261.435 rows=50 loops=1)
         ->  Sort  (cost=3395202.50..3419558.14 rows=9742258 width=42) (actual time=106261.107..106261.169 rows=176 loops=1)
               Sort Key: ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto
               ->  Seq Scan on acct_2010_25  (cost=0.00..367529.72 rows=9742258 width=42) (actual time=0.073..8058.598 rows=9494165 loops=1)
                     Filter: ((stamp_inserted >= '2010-06-20 10:10:00'::timestamp without time zone) AND (stamp_inserted < '2010-06-21 10:10:00'::timestamp without time zone))
 Total runtime: 109911.882 ms
(7 rows)


The server has two Intel(R) Xeon(R) E5430 CPUs @ 2.66GHz and 16GB RAM.
I'm using PostgreSQL 8.1.18 with the default config from CentOS 5.5 (I only increased checkpoint_segments to 50).

What can I change to increase performance?

Thanks in advance.

Cheers.

--
Sergio Roberto Charpinel Jr.

Re: Low performance SUM and GROUP BY on large database

From
Craig Ringer
Date:
On 21/06/10 22:42, Sergio Charpinel Jr. wrote:
> Hi,
>
> I'm getting low performance on SUM and GROUP BY queries.
> How can I improve my database to perform such queries.

>                ->  Sort  (cost=3499863.27..3523695.89 rows=9533049
> width=50) (actual time=165468.022..168908.828 rows=9494165 loops=1)
>                      Sort Key: ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto
>                      ->  Seq Scan on acct_2010_25  (cost=0.00..352648.10
> rows=9533049 width=50) (actual time=0.038..50860.391 rows=9494165 loops=1)
>                            Filter: ((stamp_inserted >= '2010-06-20
> 10:10:00'::timestamp without time zone) AND (stamp_inserted <
> '2010-06-21 10:10:00'::timestamp without time zone))

Provide an index on at least (ip_src,port_src,ip_dst,port_dst). If you
frequently do other queries that only want some of that information you
could create several individual indexes for those columns instead, as Pg
will combine them for a query, but that is much less efficient than an
index across all four columns.

CREATE INDEX ip_peers_idx ON acct_2010_25(ip_src,port_src,ip_dst,port_dst);

Every index added costs you insert/update/delete speed, so try to find
the smallest/simplest index that gives you acceptable performance.

> Another one just summing bytes (still low):
>
> => explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags,
> ip_proto,SUM("bytes") FROM "acct_2010_25" WHERE
> "stamp_inserted">='2010-06-20 10:10' AND "stamp_inserted"<'2010-06-21
> 10:10' GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto
> LIMIT 50 OFFSET 0;

Same deal. You have no suitable index, so Pg has to do a sequential scan
of the table. Since you appear to query on stamp_inserted a lot, you
should index it.
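
Spelled out as DDL, the suggestion might look like this (the index name and column order are illustrative; tailor them to your most common queries, and note that on 8.1 a plain CREATE INDEX blocks writes to the table while it builds):

```sql
-- Composite index covering the stamp_inserted range filter plus the
-- GROUP BY columns, so the range scan avoids a full sequential scan.
CREATE INDEX acct_2010_25_stamp_peers_idx
    ON acct_2010_25 (stamp_inserted, ip_src, port_src, ip_dst, port_dst);
```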

--
Craig Ringer

Re: Low performance SUM and GROUP BY on large database

From
Russell Smith
Date:
On 22/06/10 00:42, Sergio Charpinel Jr. wrote:
> Hi,
>
[snip]
>
> => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto,SUM("bytes"),SUM("packets"),SUM("flows") FROM
> "acct_2010_25" WHERE "stamp_inserted">='2010-06-20 10:10' AND
> "stamp_inserted"<'2010-06-21 10:10' GROUP BY ip_src, port_src, ip_dst,
> port_dst, tcp_flags, ip_proto order by SUM(bytes) desc LIMIT 50 OFFSET 0;
>
>                QUERY PLAN
>
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=3998662.81..3998662.94 rows=50 width=50) (actual
> time=276981.107..276981.133 rows=50 loops=1)
>    ->  Sort  (cost=3998662.81..4001046.07 rows=953305 width=50)
> (actual time=276981.105..276981.107 rows=50 loops=1)
>          Sort Key: sum(bytes)
>          ->  GroupAggregate  (cost=3499863.27..3754872.33 rows=953305
> width=50) (actual time=165468.257..182677.580 rows=8182616 loops=1)
>                ->  Sort  (cost=3499863.27..3523695.89 rows=9533049
> width=50) (actual time=165468.022..168908.828 rows=9494165 loops=1)
>                      Sort Key: ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto

You have to sort and aggregate a large number of rows before you can
get the top 50.  That's 9 million rows in this case, at width 50 a
400MB+ sort.  That's going to be slow, because the sort will spill to
disk unless you bump sort memory up to 500MB or so (a bad idea).  So
unless you have really fast storage for temporary files it's going to
take a while; the roughly 2.5 minutes the sort is taking at the moment
is probably not too bad.

I'm sure improvements have been made in this area since 8.1, and if you
are able to upgrade to 8.4, which CentOS 5 now also offers, you might
see a benefit.  I can't remember the specific changes, but I believe
sorting speed has improved, and EXPLAIN ANALYZE will also give you more
information about whether sorting happens on disk or in memory.

>                      ->  Seq Scan on acct_2010_25
>  (cost=0.00..352648.10 rows=9533049 width=50) (actual
> time=0.038..50860.391 rows=9494165 loops=1)
>                            Filter: ((stamp_inserted >= '2010-06-20
> 10:10:00'::timestamp without time zone) AND (stamp_inserted <
> '2010-06-21 10:10:00'::timestamp without time zone))
>  Total runtime: 278791.661 ms
> (9 rows)
>
> Another one just summing bytes (still low):
>
> => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto,SUM("bytes") FROM "acct_2010_25" WHERE
> "stamp_inserted">='2010-06-20 10:10' AND "stamp_inserted"<'2010-06-21
> 10:10' GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags,
> ip_proto LIMIT 50 OFFSET 0;
>
>             QUERY PLAN
>
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=3395202.50..3395213.12 rows=50 width=42) (actual
> time=106261.359..106261.451 rows=50 loops=1)
>    ->  GroupAggregate  (cost=3395202.50..3602225.48 rows=974226
> width=42) (actual time=106261.357..106261.435 rows=50 loops=1)
>          ->  Sort  (cost=3395202.50..3419558.14 rows=9742258 width=42)
> (actual time=106261.107..106261.169 rows=176 loops=1)
>                Sort Key: ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto
>                ->  Seq Scan on acct_2010_25  (cost=0.00..367529.72
> rows=9742258 width=42) (actual time=0.073..8058.598 rows=9494165 loops=1)
>                      Filter: ((stamp_inserted >= '2010-06-20
> 10:10:00'::timestamp without time zone) AND (stamp_inserted <
> '2010-06-21 10:10:00'::timestamp without time zone))
>  Total runtime: 109911.882 ms
> (7 rows)
>
>
> The server has 2 Intel(R) Xeon(R) CPU  E5430 @ 2.66GHz and 16GB RAM.
> I'm using PostgreSQL 8.1.18 default config from Centos 5.5 (just
> increased checkpoint_segments to 50).

Checkpoint segments won't help you here: that setting only affects how
fast writes to the database can happen.

>
> What can I change to increase performance?

Increasing sort memory (work_mem) will give you speed benefits even
though you are going to disk.  I don't know how much spare memory you
have, but trying values between 8MB and 128MB may be useful just for
these specific query runs.  If you can afford 512MB for each of the two
sorts, go for that, but it's dangerous as mentioned, due to the risk of
using more RAM than you have: work_mem allocates that amount of memory
per sort.
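
For a one-off report you can raise work_mem just for the session rather than globally, something like this (the value is only an example; 8.1 takes the setting in kilobytes, as unit suffixes like '128MB' only arrived in later releases):

```sql
-- Raise sort memory for this session only, run the report, then reset.
SET work_mem = 131072;  -- 128MB, applied per sort operation

-- ... run the big GROUP BY report here ...

RESET work_mem;
```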

If you are running these queries all the time, a summary table that
pre-computes these reports on a regular basis, maybe daily or even
hourly, would be useful.  Basically, the large amount of information
that needs to be processed and sorted is what's taking all the time
here.
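
A minimal sketch of such a summary table (the table name, hourly granularity, and time range below are my own assumptions; you'd run the INSERT from cron or a similar scheduler once per completed hour):

```sql
-- Hourly pre-aggregation of the traffic table (illustrative schema).
CREATE TABLE acct_2010_25_hourly (
    stamp_hour timestamp NOT NULL,
    ip_src     inet      NOT NULL,
    port_src   integer   NOT NULL,
    ip_dst     inet      NOT NULL,
    port_dst   integer   NOT NULL,
    tcp_flags  smallint  NOT NULL,
    ip_proto   smallint  NOT NULL,
    bytes      bigint    NOT NULL,
    packets    bigint    NOT NULL,
    flows      bigint    NOT NULL
);

-- Roll up the hour that just finished; reports then scan this much
-- smaller table instead of 9M+ raw rows.
INSERT INTO acct_2010_25_hourly
SELECT date_trunc('hour', stamp_inserted),
       ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto,
       SUM(bytes), SUM(packets), SUM(flows)
  FROM acct_2010_25
 WHERE stamp_inserted >= '2010-06-20 10:00'
   AND stamp_inserted <  '2010-06-20 11:00'
 GROUP BY 1, 2, 3, 4, 5, 6, 7;
```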

Regards

Russell

Re: Low performance SUM and GROUP BY on large database

From
"Sergio Charpinel Jr."
Date:
Craig, Russel,

I appreciate your help.

Thanks.

2010/6/22 Russell Smith <mr-russ@pws.com.au>
On 22/06/10 00:42, Sergio Charpinel Jr. wrote:
[snip]



--
Sergio Roberto Charpinel Jr.

Re: Low performance SUM and GROUP BY on large database

From
"Sergio Charpinel Jr."
Date:
Hi,

One more question, about the behavior of two specific queries: if I add "AND (ip_dst = x.x.x.x)", the planner uses another plan and the query takes much more time. Both queries use a WHERE clause. Why this behavior?

=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto, bytes, packets, flows FROM "acct_2010_26" WHERE "stamp_inserted">='2010-06-28 09:07' AND "stamp_inserted"<'2010-06-29 08:07' AND (ip_dst = '8.8.8.8') ORDER BY bytes DESC LIMIT 50 OFFSET 0;
                                                                                                     QUERY PLAN                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=496332.56..496332.69 rows=50 width=50) (actual time=125390.523..125390.540 rows=50 loops=1)
   ->  Sort  (cost=496332.56..496351.35 rows=7517 width=50) (actual time=125390.520..125390.525 rows=50 loops=1)
         Sort Key: bytes
         ->  Index Scan using acct_2010_26_pk on acct_2010_26  (cost=0.00..495848.62 rows=7517 width=50) (actual time=0.589..125385.680 rows=1011 loops=1)
               Index Cond: ((stamp_inserted >= '2010-06-28 09:07:00'::timestamp without time zone) AND (stamp_inserted < '2010-06-29 08:07:00'::timestamp without time zone) AND (ip_dst = '8.8.8.8'::inet))
 Total runtime: 125390.711 ms
(6 rows)


=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto, bytes, packets, flows FROM "acct_2010_26" WHERE "stamp_inserted">='2010-06-28 09:07' AND "stamp_inserted"<'2010-06-29 08:07' ORDER BY bytes DESC LIMIT 50 OFFSET 0;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..268.25 rows=50 width=50) (actual time=0.150..70.780 rows=50 loops=1)
   ->  Index Scan Backward using ibytes_acct_2010_26 on acct_2010_26  (cost=0.00..133240575.70 rows=24835384 width=50) (actual time=0.149..70.762 rows=50 loops=1)
         Filter: ((stamp_inserted >= '2010-06-28 09:07:00'::timestamp without time zone) AND (stamp_inserted < '2010-06-29 08:07:00'::timestamp without time zone))
 Total runtime: 70.830 ms
(4 rows)


Thanks in advance.

2010/6/23 Sergio Charpinel Jr. <sergiocharpinel@gmail.com>
[snip]



--
Sergio Roberto Charpinel Jr.

Re: Low performance SUM and GROUP BY on large database

From
Robert Haas
Date:
On Tue, Jun 29, 2010 at 7:59 AM, Sergio Charpinel Jr.
<sergiocharpinel@gmail.com> wrote:
> One more question about two specifics query behavior: If I add "AND (ip_dst
> = x.x.x.x)", it uses another plan and take a much more time. In both of
> them, I'm using WHERE clause. Why this behavior?

With either query, the planner is choosing to scan backward through
the acct_2010_26_pk index to get the rows in descending order by the
"bytes" column.  It keeps scanning until it finds 50 rows that match
the WHERE clause.  With just the criteria on stamp_inserted, matches
are pretty common, so it doesn't have to scan very far before finding
50 suitable rows.  But when you add the ip_dst = 'x.x.x.x' criterion,
suddenly a much smaller percentage of the rows match and so it has to
read much further into the index before it finds 50 that do.

A second index on just the ip_dst column might help a lot - then it
could consider index-scanning for the matching rows and sorting them
afterwards.
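
As DDL, that second index might look like this (the index name is my own; on 8.1 the build will block writes to the table, so time it accordingly):

```sql
-- Single-column index so the planner can fetch just the rows for one
-- ip_dst and sort that small set by bytes, instead of scanning the
-- bytes index backward past millions of non-matching rows.
CREATE INDEX acct_2010_26_ip_dst_idx ON acct_2010_26 (ip_dst);
```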

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company