Thread: Performance Issues

Performance Issues

From

"Shaun Grannis"

Date:

25 April 2002, 08:59:16

Hi,

Ive found some performance issues with Postgres that Im hoping people on this list can help resolve. Were working with a 65 million record table that includes year of birth (data type INT). To count the frequency of dates in the table, it takes 2 hours 26 minutes to execute. (Theres an approximately 100-year range of dates in the 65 million records).

# EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;

NOTICE: QUERY PLAN:

Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)

-> Group (cost=16397434.27..16560491.48 rows=65222884 width=4)

-> Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)

-> Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)

I can count data from the flat text data file with this Perl script:

#!/usr/bin/perl

# script to count YB frequencies in flat data file

open (IN, "$ARGV[0]");

open (OUT, ">$ARGV[0]\_cnt");

while (<IN>) {

chomp;

$years{$_}++;

}

foreach $key (keys %years) {

print OUT "$key,$years{$key}\n";

}

The Perl script takes *1 minute*, 31 seconds to run. Why is there such a discrepancy in times? Ive noticed that the Postgres count() function takes what seems to be longer than it should in other cases as well. For instance, counting the frequency of last names in the same 65 million record table took *1 hour* and 31 minutes:

# EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;

NOTICE: QUERY PLAN:

Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)

-> Group (cost=19538149.27..19701206.48 rows=65222884 width=19)

-> Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)

-> Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)

The last name (ln) and the year of birth (yb) is indexed, but that shouldnt matter because its doing a sequential scan, correct? Am I running into the limitations of Postgres? Wed like to eventually get this system into production, but if we cant get Postgres to count() faster, we may not be able to use it.

Heres the data_table schema:

# \d data_table

Table "data_table"

Column | Type | Modifiers

--------+---------------+-----------

ss | character(9) |

ln | character(15) |

fn | character(15) |

mi | character(1) |

ns | character(15) |

lny | character(15) |

fny | character(15) |

sny | character(15) |

g | character(1) |

mb | integer |

db | integer |

yb | integer |

md | integer |

dd | integer |

yd | integer |

Indexes: ssdi_ss_idx

ssdi_ln_idx

Were working with Postgres v 7.2. The machine is a dual-processor Athlon MP1900 (Tyan Tiger board) with 3GB of PC2100 DDR RAM, and 3-80GB IBM 120GXP hard drives configured in a software RAID 0 Array running under RedHat Linux v. 7.2.

Weve VACUUM ANALYZEd the tables after creating the indices. Is there something Im missing here?

Thanks for you suggestions.

Re: Performance Issues

From

Bruno Wolff III

Date:

25 April 2002, 12:20:28

On Tue, Apr 23, 2002 at 02:16:46PM -0500,
  Shaun Grannis <shaun_grannis@hotmail.com> wrote:
> Hi,
>
>
>
> I've found some performance issues with Postgres that I'm hoping people on this list can help resolve.  We're working
witha 65 million record table that includes year of birth (data type INT). To count the frequency of dates in the
table,it takes 2 hours 26 minutes to execute. (There's an approximately 100-year range of dates in the 65 million
records).

I believe that issue is the postgres doesn't know how many different values
to expect, so that rather than set up buckets and add one to the appropiate
bucket when you encouter a value (which is what your perl script does),
it sorts all of the like values together and counts them up one group at
a time. The sort is probably what is killing your performance.

There has been some discussion on this topic on one of the postgres
lists in the last couple of days.

Re: Performance Issues

From

Stephan Szabo

Date:

25 April 2002, 12:42:45

> # EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
>
> NOTICE:  QUERY PLAN:
>
> Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
>
>   ->  Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
>
>        ->  Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
>
>             ->  Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
>
>
>
> The last name (ln) and the year of birth (yb) is indexed, but that
> shouldn't matter because it's doing a sequential scan, correct?  Am I
> running into the limitations of Postgres?  We'd like to eventually get

I didn't see anything about your settings in postgresql.conf, but
increasing the sort_mem parameter may help that really expensive sort
step.  I think the desired fix for this would probably be the TODO entry
on hash based aggregates but that's still in the future...

Re: Performance Issues

From

Jean-Michel POURE

Date:

28 April 2002, 16:09:05

Le Mardi 23 Avril 2002 21:16, Shaun Grannis a écrit :
> Thanks for you suggestions.

Dear Shaun,

The only solution I see is to create a pseudo-count function.
For this, you need to :

- add fire_trigger (timestamp) field on data_table with index.

- create a datefreq_table table which will be used
  as a pseudo-counter with two fields :
  datefreq_date (int4) and datefreq_count (int8).
  Both fields should be indexed.

- create a trigger on after update or insert of data_table
  to increase datefreq_count by 1.

- create a trigger on after delete of data_table
  to decrease date_freq_count by 1.

- create the needed records in datefreq_table (100 as you said).

To inizialize the system, enter:
UPDATE * FROM data_table
SET fire_trigger ='now'
WHERE fire_trigger IS NULL;
LIMIT 1000

This will inizilize the system for 10000 records. I don't know how much time
it will take. Probably 10 seconds on your hardware. Mesure the time needed to
inizialize the whole system.

Once inizialized, the system will give immediate answers using the
pseudo-counter table.

SELECT datefreq_count
FROM datefreq_table
WHERE datefreq_date = foo;

The only drawback of such a system is that you will not be able to DROP or
INSERT or even UPDATE data as fast as if there was no trigger. it will only
be a problem if your data changes very often.

Even on Oracle or DB2, you would have to use data-mining features to be able
to manage such amounts of data. Here, on PostgreSQL, you can do it manually
using PLpgSQL.

Cheers,
Jean-Michel