Thread: Performance Issues with count()
Hi, I've found some performance issues with Postgres that I'm hoping
people on this list can help resolve. We're working with a 65 million
record table that includes year of birth (data type INT). To count the
frequency of dates in the table, it takes 2 hours 26 minutes to
execute. (There's an approximately 100-year range of dates in the 65
million records.)

# EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;

NOTICE:  QUERY PLAN:
Aggregate  (cost=16397434.27..16723548.69 rows=6522288 width=4)
  ->  Group  (cost=16397434.27..16560491.48 rows=65222884 width=4)
        ->  Sort  (cost=16397434.27..16397434.27 rows=65222884 width=4)
              ->  Seq Scan on data_table  (cost=0.00..2368620.84 rows=65222884 width=4)

I can count data from the flat text data file with this Perl script:

#!/usr/bin/perl
# script to count YB frequencies in flat data file
open (IN, "$ARGV[0]");
open (OUT, ">$ARGV[0]_cnt");
while (<IN>) {
    chomp;
    $years{$_}++;
}
foreach $key (keys %years) {
    print OUT "$key,$years{$key}\n";
}

The Perl script takes *1 minute*, 31 seconds to run. Why is there such
a discrepancy in times? I've noticed that the Postgres count()
function takes what seems to be "longer than it should" in other cases
as well. For instance, counting the frequency of last names in the
same 65 million record table took *1 hour* and 31 minutes:

# EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;

NOTICE:  QUERY PLAN:
Aggregate  (cost=19538149.27..19864263.69 rows=6522288 width=19)
  ->  Group  (cost=19538149.27..19701206.48 rows=65222884 width=19)
        ->  Sort  (cost=19538149.27..19538149.27 rows=65222884 width=19)
              ->  Seq Scan on data_table  (cost=0.00..2324610.84 rows=65222884 width=19)

The last name (ln) and the year of birth (yb) are indexed, but that
shouldn't matter because it's doing a sequential scan, correct? Am I
running into the limitations of Postgres? We'd like to eventually get
this system into production, but if we can't get Postgres to count()
faster, we may not be able to use it.
Here's the data_table schema:

# \d data_table

          Table "data_table"
 Column |     Type      | Modifiers
--------+---------------+-----------
 ss     | character(9)  |
 ln     | character(15) |
 fn     | character(15) |
 mi     | character(1)  |
 ns     | character(15) |
 lny    | character(15) |
 fny    | character(15) |
 sny    | character(15) |
 g      | character(1)  |
 mb     | integer       |
 db     | integer       |
 yb     | integer       |
 md     | integer       |
 dd     | integer       |
 yd     | integer       |
Indexes: ssdi_ss_idx
         ssdi_ln_idx

We're working with Postgres v 7.2. The machine is a dual-processor
Athlon MP1900 (Tyan Tiger board) with 3GB of PC2100 DDR RAM and three
80GB IBM 120GXP hard drives configured in a software RAID 0 array,
running under RedHat Linux v 7.2.

We've VACUUM ANALYZEd the tables after creating the indices. Is there
something I'm missing here?

Thanks for your suggestions.

Shaun Grannis

--
_______________________________________________
Sign-up for your own FREE Personalized E-mail at Email.com
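As an aside, the hash-counting approach of the Perl script above translates directly to other scripting languages. Here is a minimal sketch of the same streaming count in Python (the file-handling details are illustrative, not part of the original script):

```python
import sys
from collections import Counter

def count_frequencies(lines):
    # One hash-table entry per distinct value, just like the Perl %years hash.
    counts = Counter()
    for line in lines:
        counts[line.rstrip("\n")] += 1
    return counts

if __name__ == "__main__" and len(sys.argv) > 1:
    # Mirrors the Perl script: read <file>, write <file>_cnt as "value,count".
    with open(sys.argv[1]) as infile:
        freqs = count_frequencies(infile)
    with open(sys.argv[1] + "_cnt", "w") as outfile:
        for value, n in sorted(freqs.items()):
            outfile.write(f"{value},{n}\n")
```

Like the Perl version, this does a single pass over the input with memory proportional to the number of distinct values, which is why it is so much faster than a sort-based plan when there are only ~100 groups.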
asdf asdasfa wrote:

[text/html is unsupported, treating like TEXT/PLAIN]

> <DIV><FONT face=Arial size=2>
> <P class=MsoNormal style="MARGIN: 0in 0in 0pt"><FONT face="Times New Roman" size=3>Hi,</FONT></P>

First a little hint: MIME mail with plain text and HTML is usually
unwanted on mailing lists, because it increases the bandwidth for
absolutely no benefit at all. Plus, a lot of people don't even read
HTML mail, so HTML-only mail like yours isn't the smartest way to get
access to the developers' knowledge here.

Anyway, the problem is that your Perl script is optimized for your
purpose, while PostgreSQL has to use generic algorithms that work in
ANY situation to solve the problem.

In particular, the Perl script uses a hash table with one entry for
each group. Now what happens if the input data is spread out and
contains 1G groups? It'll simply blow away your script because it runs
out of memory. This behaviour is unacceptable for a database system,
so as you see in the EXPLAIN output, PostgreSQL sorts and groups the
input data in temporary files before counting the rows. Due to that,
the PostgreSQL solution to the problem requires a gazillion IO
operations, but it'll work whatever the input data is, given that
there is enough disk space.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Tue, Apr 23, 2002 at 07:25:34PM -0400, Jan Wieck wrote:
> In particular, the Perl script uses a hash table with one
> entry for each group. Now what happens if the input data is
> spread out and contains 1G groups? It'll simply blow away
> your script because it runs out of memory. This behaviour is
> unacceptable for a database system, so as you see in the
> Explain output, PostgreSQL sorts and groups the input data in
> temporary files before counting the rows. Due to that, the
> PostgreSQL solution to the problem requires a gazillion of IO
> operations, but it'll work whatever the input data is, given
> that there is enough disk space.

I've always had a little problem with this setup. Sure, if you're
using 1 billion groups then you have a problem, but if you are
grouping into only a small number of groups, the sorting takes forever
compared to using a hash. Currently the planner believes that the
output of a group will be 1/10th of the input, but I think the square
root would be a better estimate. Here we regularly summarise 2 million
rows into 6 groups, and it's almost faster to do 6 sequential scans
than it is to do the sort/group.

Besides, the running-out-of-memory argument is silly, because if
sorting moves out to disk when it gets too big, you can do the same
with a hash table.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Canada, Mexico, and Australia form the Axis of Nations That
> Are Actually Quite Nice But Secretly Have Nasty Thoughts About America
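Martijn's point can be illustrated with a toy sketch (this is pure illustration, not PostgreSQL internals): hash aggregation and sort-based grouping return identical counts, but the hash pass touches each row once and keeps memory proportional to the number of groups, while the sort pass must order every row first.

```python
from collections import Counter
from itertools import groupby

def hash_group_count(values):
    # One pass; memory scales with the number of distinct groups.
    return dict(Counter(values))

def sort_group_count(values):
    # Roughly what the 7.2 plan does: sort all rows, then count adjacent runs.
    # Cost scales with the number of rows, however few groups there are.
    return {key: sum(1 for _ in run) for key, run in groupby(sorted(values))}
```

With 2 million rows and 6 groups, the hash needs only 6 entries, while the sort still has to order all 2 million rows; both produce the same answer.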
Thanks for all of the useful comments.

Per recommendation, I set enable_seqscan=0 and re-ran the query. The
time to count() yb went from 2 hours 26 minutes to 2 hours 17 minutes.
That variation of 9 minutes is likely related to the load on the
machine at the time.

Others have suggested the "fix" is in the future. Stephan Szabo wrote:

"I didn't see anything about your settings in postgresql.conf, but
increasing the sort_mem parameter may help that really expensive sort
step. I think the desired fix for this would probably be the TODO
entry on hash based aggregates but that's still in the future..."

The "non-default" postgresql.conf settings are as follows:

shared_buffers = 240000   # uses ~2GB of shared mem
sort_mem = 512
wal_files = 64
enable_seqscan = 0        # per a recommendation
enable_indexscan = true
enable_tidscan = true
enable_sort = true
enable_nestloop = true
enable_mergejoin = true
enable_hashjoin = true

I think our work-around for now will be to SELECT the column we wish
to analyze into a flat file and then run a Perl script to do the
actual counting.

Thanks again for the feedback,

Shaun Grannis

----- Original Message -----
From: Michael Loftis <mloftis@wgops.com>
Date: Wed, 24 Apr 2002 10:14:04 -0700
To: asdf asdasfa <sjg@email.com>
Subject: Re: [GENERAL] Performance Issues with count()

> Can you humour me and
> set enable_seqscan=0
> and retry the query?
>
> Thanks :)
>
> S Grannis wrote:
>
> > Hi,
> >
> > I've found some performance issues with Postgres that
> > I'm hoping people on this list can help resolve. We're
> > working with a 65 million record table that includes year
> > of birth (data type INT). To count the frequency of dates
> > in the table, it takes 2 hours 26 minutes to execute.
> > (There's an approximately 100-year range of dates in the
> > 65 million records).
> >
> > # EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
> >
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)
> >   -> Group (cost=16397434.27..16560491.48 rows=65222884 width=4)
> >     -> Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)
> >       -> Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)
> >
> > I can count data from the flat text data file with this
> > Perl script:
> >
> > #!/usr/bin/perl
> > # script to count YB frequencies in flat data file
> > open (IN, "$ARGV[0]");
> > open (OUT, ">$ARGV[0]_cnt");
> > while (<IN>) {
> >     chomp;
> >     $years{$_}++;
> > }
> > foreach $key (keys %years) {
> >     print OUT "$key,$years{$key}\n";
> > }
> >
> > The Perl script takes *1 minute*, 31 seconds to run.
> > Why is there such a discrepancy in times? I've noticed
> > that the Postgres count() function takes what seems to
> > be "longer than it should" in other cases as well. For
> > instance, counting the frequency of last names in the
> > same 65 million record table took *1 hour* and 31
> > minutes:
> >
> > # EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
> >   -> Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
> >     -> Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
> >       -> Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
> >
> > The last name (ln) and the year of birth (yb) are
> > indexed, but that shouldn't matter because it's doing a
> > sequential scan, correct? Am I running into the
> > limitations of Postgres? We'd like to eventually get this
> > system into production, but if we can't get Postgres to
> > count() faster, we may not be able to use it.
> >
> > Here's the data_table schema:
> >
> > # \d data_table
> >
> >           Table "data_table"
> >  Column |     Type      | Modifiers
> > --------+---------------+-----------
> >  ss     | character(9)  |
> >  ln     | character(15) |
> >  fn     | character(15) |
> >  mi     | character(1)  |
> >  ns     | character(15) |
> >  lny    | character(15) |
> >  fny    | character(15) |
> >  sny    | character(15) |
> >  g      | character(1)  |
> >  mb     | integer       |
> >  db     | integer       |
> >  yb     | integer       |
> >  md     | integer       |
> >  dd     | integer       |
> >  yd     | integer       |
> > Indexes: ssdi_ss_idx
> >          ssdi_ln_idx
> >
> > We're working with Postgres v 7.2. The machine is a
> > dual-processor Athlon MP1900 (Tyan Tiger board) with
> > 3GB of PC2100 DDR RAM and three 80GB IBM 120GXP hard drives
> > configured in a software RAID 0 array, running under
> > RedHat Linux v 7.2.
> >
> > We've VACUUM ANALYZE'd the tables after creating the
> > indices. Is there something I'm missing here?
> >
> > Thanks for your suggestions.
> >
> > Shaun Grannis
On Thu, 25 Apr 2002 13:54:40 -0500, "S Grannis" <sjg@email.com> wrote:
> enable_seqscan = 0 # per a recommendation

This is a bad idea -- I'd recommend you leave this set to 'true',
particularly since it doesn't help your particular case.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
Index the yb column. It might help the sort and group run better.

create index data_table_yb on data_table (yb);
vacuum analyze;
EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
SELECT yb, count(1) FROM data_table GROUP BY yb;

Also try count(*) instead; it counts every row, regardless of NULL
values.

S Grannis wrote:

> Thanks for all of the useful comments.
>
> Per recommendation, I set enable_seqscan=0 and re-ran the query. The
> time to count() yb went from 2 hours 26 minutes to 2 hours 17
> minutes. That variation of 9 minutes is likely related to the load
> on the machine at the time.
>
> Others have suggested the "fix" is in the future.
>
> Stephan Szabo wrote:
> "I didn't see anything about your settings in postgresql.conf,
> but increasing the sort_mem parameter may help that really
> expensive sort step. I think the desired fix for this would
> probably be the TODO entry on hash based aggregates but that's
> still in the future..."
>
> The "non-default" postgresql.conf settings are as follows:
>
> shared_buffers = 240000 # uses ~2GB of shared mem
> sort_mem = 512
> wal_files = 64
> enable_seqscan = 0 # per a recommendation
> enable_indexscan = true
> enable_tidscan = true
> enable_sort = true
> enable_nestloop = true
> enable_mergejoin = true
> enable_hashjoin = true
>
> I think our work-around for now will be to SELECT the column we wish
> to analyze into a flat file and then run a Perl script to do the
> actual counting.
>
> Thanks again for the feedback,
>
> Shaun Grannis
>
> ----- Original Message -----
> From: Michael Loftis <mloftis@wgops.com>
> Date: Wed, 24 Apr 2002 10:14:04 -0700
> To: asdf asdasfa <sjg@email.com>
> Subject: Re: [GENERAL] Performance Issues with count()
>
>> Can you humour me and
>> set enable_seqscan=0
>> And retry the query?
>>
>> Thanks :)
>>
>> S Grannis wrote:
>>
>>> Hi,
>>>
>>> I've found some performance issues with Postgres that
>>> I'm hoping people on this list can help resolve.
>>> We're working with a 65 million record table that includes year
>>> of birth (data type INT). To count the frequency of dates
>>> in the table, it takes 2 hours 26 minutes to execute.
>>> (There's an approximately 100-year range of dates in the
>>> 65 million records).
>>>
>>> # EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
>>>
>>> NOTICE: QUERY PLAN:
>>> Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)
>>>   -> Group (cost=16397434.27..16560491.48 rows=65222884 width=4)
>>>     -> Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)
>>>       -> Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)
>>>
>>> I can count data from the flat text data file with this
>>> Perl script:
>>>
>>> #!/usr/bin/perl
>>> # script to count YB frequencies in flat data file
>>> open (IN, "$ARGV[0]");
>>> open (OUT, ">$ARGV[0]_cnt");
>>> while (<IN>) {
>>>     chomp;
>>>     $years{$_}++;
>>> }
>>> foreach $key (keys %years) {
>>>     print OUT "$key,$years{$key}\n";
>>> }
>>>
>>> The Perl script takes *1 minute*, 31 seconds to run.
>>> Why is there such a discrepancy in times? I've noticed
>>> that the Postgres count() function takes what seems to
>>> be "longer than it should" in other cases as well. For
>>> instance, counting the frequency of last names in the
>>> same 65 million record table took *1 hour* and 31
>>> minutes:
>>>
>>> # EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
>>> NOTICE: QUERY PLAN:
>>> Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
>>>   -> Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
>>>     -> Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
>>>       -> Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
>>>
>>> The last name (ln) and the year of birth (yb) are
>>> indexed, but that shouldn't matter because it's doing a
>>> sequential scan, correct? Am I running into the
>>> limitations of Postgres?
>>> We'd like to eventually get this
>>> system into production, but if we can't get Postgres to
>>> count() faster, we may not be able to use it.
>>>
>>> Here's the data_table schema:
>>>
>>> # \d data_table
>>>
>>>           Table "data_table"
>>>  Column |     Type      | Modifiers
>>> --------+---------------+-----------
>>>  ss     | character(9)  |
>>>  ln     | character(15) |
>>>  fn     | character(15) |
>>>  mi     | character(1)  |
>>>  ns     | character(15) |
>>>  lny    | character(15) |
>>>  fny    | character(15) |
>>>  sny    | character(15) |
>>>  g      | character(1)  |
>>>  mb     | integer       |
>>>  db     | integer       |
>>>  yb     | integer       |
>>>  md     | integer       |
>>>  dd     | integer       |
>>>  yd     | integer       |
>>> Indexes: ssdi_ss_idx
>>>          ssdi_ln_idx
>>>
>>> We're working with Postgres v 7.2. The machine is a
>>> dual-processor Athlon MP1900 (Tyan Tiger board) with
>>> 3GB of PC2100 DDR RAM, and 3-80GB IBM 120GXP hard drives
>>> configured in a software RAID 0 Array running under
>>> RedHat Linux v. 7.2.
>>>
>>> We've VACUUM ANALYZE'd the tables after creating the
>>> indices. Is there something I'm missing here?
>>>
>>> Thanks for your suggestions.
>>>
>>> Shaun Grannis
On Thu, 25 Apr 2002, S Grannis wrote:

> Others have suggested the "fix" is in the future.
>
> Stephan Szabo wrote:
> "I didn't see anything about your settings in postgresql.conf,
> but increasing the sort_mem parameter may help that really
> expensive sort step. I think the desired fix for this would
> probably be the TODO entry on hash based aggregates but that's
> still in the future..."
>
> The "non-default" postgresql.conf settings are as follows:
>
> shared_buffers = 240000 # uses ~2GB of shared mem
> sort_mem = 512

It might be interesting to try with sort_mem turned up an order of
magnitude to see if that helps at all. I doubt it'll be enough, but
it's probably worth trying.

> I think our work-around for now will be to SELECT the column we wish
> to analyze into a flat file and then run a Perl script to do the
> actual counting.

Another option would be a C function that runs a query via SPI, does
the hashed count, and sticks the result in a temporary table, but that
may be a bit excessive.
I wonder if using a summary table could help you here... For example,
suppose you have a table

big_table(id, ..., interesting_value)

and you want to provide counts of interesting_value quickly. The plan
is to create another table

summ_table(interesting_value, ..., its_count)

and keep it up to date with big_table via triggers. Then any query of
the form:

SELECT interesting_value, count(*)
FROM big_table
WHERE ...
GROUP BY interesting_value;

can be answered by:

SELECT interesting_value, its_count
FROM summ_table
WHERE ...;

which is generally *much* faster. (Clearly a complete count is easy to
answer quickly too...)

I must say that I have not used this technique in PostgreSQL (I have
used it in DB2 and Oracle). However, this issue is typical for large
databases of all flavours (i.e. it's too slow to scan and count a lot
of values for each query... so you do it once and save the results for
future reference).

In addition, the rule system in PostgreSQL *might* be amenable to
providing a sort of 'query rewrite' to automatically make (some)
queries on big_table go to summ_table instead... (I have not tried
this, but you never know until you try.)

Best wishes,

Mark

P.S.: The trigger code to keep big_table and summ_table in sync is the
hard bit... but there are no doubt many folks on this list, including
me, who will happily help you out here.
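The bookkeeping Mark describes can be modelled outside the database to see the idea at a glance. This is an illustrative sketch only (real triggers would be written in PL/pgSQL; all names here are hypothetical):

```python
class SummarisedTable:
    """Toy model of big_table plus a trigger-maintained summ_table."""

    def __init__(self):
        self.rows = []       # stands in for big_table
        self.summary = {}    # stands in for summ_table: value -> its_count

    def insert(self, value):
        self.rows.append(value)
        # what an ON INSERT trigger would do to summ_table
        self.summary[value] = self.summary.get(value, 0) + 1

    def delete(self, value):
        self.rows.remove(value)
        # what an ON DELETE trigger would do to summ_table
        self.summary[value] -= 1
        if self.summary[value] == 0:
            del self.summary[value]

    def count(self, value):
        # answered from the summary table: no scan of big_table needed
        return self.summary.get(value, 0)
```

The point is that each insert or delete pays a small constant cost to keep the summary current, so the count query itself never has to scan the big table.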
On Thursday 25 April 2002 at 20:54, S Grannis wrote:

> I think our work-around for now will be to SELECT the column we wish
> to analyze into a flat file and then run a Perl script to do the
> actual counting.

Dear all,

I wrote a small howto to address S Grannis's performance questions
about the count() function. The idea is to create and maintain a
pseudo-count table using triggers and PL/pgSQL. Unfortunately, I could
not test the PL/pgSQL scripts on 16 million records for lack of disk
space (only on 1 million records). Code is included to generate fake
test data. Could someone help me test the howto on 16 million records?

Thank you for your feedback,

Cheers,

Jean-Michel

*************************************************************************
Performance HOWTO - pseudo counter example
This document is released under the PostgreSQL license
*************************************************************************

This tutorial demonstrates how to create fast pseudo-counters in
PostgreSQL using PL/pgSQL and triggers.

1) Performance background

This small howto was inspired by a reply to an email on
pgsql-general@postgresql.org complaining about PostgreSQL speed. The
user needs to run COUNT statements on a large database of 65,000,000
records. The table structure is basically as follows:

CREATE TABLE "data" (
    "data_oid" serial8,
    "data_yd" int4
);

In our example, data_yd is a year value between 1950 and 2050. The
user needs to run the following query:

SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year;

where foo_year is a year between 1950 and 2050. The query takes more
than two hours to execute on a dual-processor computer running
PostgreSQL and GNU/Linux.

The proposed solution creates a pseudo-counter on PostgreSQL using
PL/pgSQL and triggers. The aim is to return a result in 0.005 second.
Initialisation of the pseudo-counter table itself should take less
than 30 minutes.
2) INSTALLATION

a) Database creation

Open a terminal window and connect as the 'postgres' user:

root@localhost# su postgres

Create an empty database:

postgres@localhost> psql template1
template1=# CREATE DATABASE pseudo_counter;
template1=# \q

b) PL/pgSQL declaration

PL/pgSQL is compiled by default in PostgreSQL, but you must enable it
on the database itself:

postgres@localhost> createlang plpgsql pseudo_counter

c) Data table

We first need to create the table structure:

CREATE TABLE "data" (
    "data_oid" serial8,
    "data_yd" int4,
    "data_counterenabled" bool DEFAULT 'f'
) WITH OIDS;

CREATE INDEX data_yd_idx ON data USING btree (data_yd);
CREATE INDEX data_counter_idx ON data USING btree (data_counterenabled);

And create a PL/pgSQL function to add fake records:

CREATE FUNCTION "init_fakedata"("int8", "int4", "int4") RETURNS "bool" AS '
DECLARE
    iLoop int4;
    tStart timestamp;
BEGIN
    tStart = now();
    IF ($1 > 0) AND ($2 > 0) AND ($3 > 0) AND ($3 > $2) THEN
        FOR iLoop IN 1 .. $1 LOOP
            INSERT INTO data (data_yd)
                VALUES (int8(random() * ($3 - $2) + $2));
        END LOOP;
        RETURN ''t'';
    ELSE
        RETURN ''f'';
    END IF;
END;
' LANGUAGE 'plpgsql';

To insert 16 million records with a year range between 1950 and 2050,
enter:

SELECT init_fakedata(16000000, 1950, 2050);

which should take a while... and fill more than 3.2 GB on disk.

If you cannot wait that long, insert 126,953 records for testing:

SELECT init_fakedata(126953, 1950, 2050);

This takes 40s on my server. Then, repeat 9 times:

INSERT INTO data (data_yd) SELECT data_yd FROM data;

Each pass doubles the table, producing 126,953 * 2^9 = 64,999,936 fake
records much more quickly, as no random function is used. Enter:

CHECKPOINT;
VACUUM ANALYSE;

to flush the data cache and update statistics.

d) Pseudo-count table

Now, let's create a pseudo-count table.
CREATE TABLE "pseudo_count" (
    "count_oid" serial,
    "count_year" int4,
    "count_value" int4 DEFAULT 0
) WITH OIDS;

CREATE INDEX count_value_idx ON pseudo_count USING btree (count_value);
CREATE INDEX count_year_idx ON pseudo_count USING btree (count_year);

... and initialize it with the required data (values in the 1950-2050
range):

CREATE FUNCTION "init_pseudocount"("int4", "int4") RETURNS "bool" AS '
DECLARE
    iLoop int4;
BEGIN
    IF ($1 > 0) AND ($2 > 0) AND ($2 >= $1) THEN
        FOR iLoop IN $1 .. $2 LOOP
            INSERT INTO pseudo_count (count_year) VALUES (iLoop);
        END LOOP;
        RETURN ''t'';
    ELSE
        RETURN ''f'';
    END IF;
END;
' LANGUAGE 'plpgsql';

Example:

SELECT init_pseudocount(1950, 2050);

will create the required records for years 1950 to 2050.

e) PL/pgSQL function and trigger

The pseudo-count is maintained by a single trigger 'tg_data' running
the 'tg_data()' function. TG_OP is used to catch the trigger context
('UPDATE' or 'DELETE').

CREATE FUNCTION "tg_data"() RETURNS "opaque" AS '
DECLARE
    rec record;
BEGIN
    IF (TG_OP = ''UPDATE'') THEN
        IF (new.data_counterenabled = ''t'') AND (old.data_counterenabled = ''f'') THEN
            UPDATE pseudo_count SET count_value = count_value + 1
                WHERE count_year = new.data_yd AND count_value >= 0;
        END IF;
        IF (new.data_counterenabled = ''f'') AND (old.data_counterenabled = ''t'') THEN
            UPDATE pseudo_count SET count_value = count_value - 1
                WHERE count_year = new.data_yd AND count_value > 0;
        END IF;
        IF (old.data_yd <> new.data_yd) THEN
            UPDATE pseudo_count SET count_value = count_value - 1
                WHERE count_year = old.data_yd AND count_value > 0;
            UPDATE pseudo_count SET count_value = count_value + 1
                WHERE count_year = new.data_yd AND count_value >= 0;
        END IF;
    END IF;
    IF (TG_OP = ''DELETE'') THEN
        UPDATE pseudo_count SET count_value = count_value - 1
            WHERE count_year = old.data_yd AND count_value >= 0;
    END IF;
    IF (TG_OP = ''UPDATE'') THEN
        RETURN new;
    ELSE
        RETURN old;
    END IF;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data" FOR EACH ROW
    EXECUTE PROCEDURE tg_data();

3) USAGE

a) Initialisation

Set the "data_counterenabled" flag on:

UPDATE data SET data_counterenabled = 't' WHERE data_counterenabled = 'f';

b) Run pseudo-count queries

Instead of:

SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year;

you can now run:

SELECT count_value FROM pseudo_count WHERE count_year = foo_year;

The answer comes in 0.005 second.

c) Limits

Before loading large amounts of data, the triggers on table 'data'
should be dropped and recreated afterwards.
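To sanity-check the trigger logic above, its branch structure can be mirrored in a small model. This is only an illustration of the howto's semantics, not PL/pgSQL itself:

```python
class PseudoCounter:
    """Mirrors the UPDATE/DELETE branches of the tg_data() trigger."""

    def __init__(self, years):
        # init_pseudocount: one zeroed counter per year
        self.counts = {y: 0 for y in years}

    def on_update(self, old_yd, new_yd, old_enabled, new_enabled):
        # same three IF branches as the trigger's UPDATE case
        if new_enabled and not old_enabled:
            self.counts[new_yd] += 1     # row becomes counted
        if old_enabled and not new_enabled:
            self.counts[new_yd] -= 1     # row stops being counted
        if old_yd != new_yd:
            self.counts[old_yd] -= 1     # row moved to another year
            self.counts[new_yd] += 1

    def on_delete(self, old_yd):
        # the trigger's DELETE case
        self.counts[old_yd] -= 1
```

Walking through the lifecycle of a single row (enable, change year, delete) shows how the counters stay consistent with what a full COUNT over the data table would return.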