Thread: Performance HOWTO - pseudo counter example

Performance HOWTO - pseudo counter example

From
Jean-Michel POURE
Date:
29 April 2002, 11:54:08
Le Jeudi 25 Avril 2002 20:54, S Grannis a écrit :
> I think our work-around for now will be to SELECT the column we wish to
> analyze into a flat file and then run a Perl script to do the actual
> counting.

Dear all,

I wrote a small howto to solve S Grannis performance questions on Count()
function. The idea is to create and maintain a pseudo-count table using
triggers and PLpgSQL.

Unfortunately, I could not test the PLpgSQL scripts in 16 Million records for
lack of space reason (only on 1 Million records). Code is included to
generate fake test data. Could someone help me test the howto on 16 million
records?

Thank you for your feedback,
Cheers,
Jean-Michel

*************************************************************************
Performance HOWTO - pseudo counter example
This document is released under PostgreSQL license
*************************************************************************

This tutorial demonstrates how to create fast pseudo-counters in PostgreSQL
using PLpgSQL and triggers.

1) Performance background
This small howto is insprired in reply to an email on
pgsql-general@postgresql.org complaining about PostgreSQL speed.

The user needs to run COUNT statements on a large database of 65.000.000
records. The table structure is basically as follows:

CREATE TABLE "data" (
  "data_oid" serial8,
  "data_yd" int4
);
In our example, data_yd is a year value between 1950 and 2050.

The user needs to run the following query:
SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year.
where foo_year is a date between 1950 and 2050.

The query takes more than two hours to execute on a double-processor computer
running PostgreSQL and GNU/Linux. The proposed solution creates a
pseudo-counter on PostgreSQL using PLpgSQL and triggers. The aim is to return
a result in 0.005 second. Initilisation itself of the pseudo-counter table
should take less than 30 minutes.

1) INSTALLATION
a) Database creation

Open a terminal windows, connect as 'postgres' user:
root@localhost>su postgres

Create an empty database:
postgresql@localhost>psql template1;
template1=\CREATE DATABASE pseudo_counter;
template1=\q

b) PLpgSQL declaration

PLpgSQL is compiled by default in PostgreSQL. But you should enable PLpgSQL on
the database itself:

postgresql@localhost>CREATELANG plpgsql pseudo_counter

c) Data table

We first need to create the table stucture:
CREATE TABLE "data" (
  "data_oid" serial8,
  "data_yd" int4,
  "data_counterenabled" bool DEFAULT 'f'
) WITH OIDS;
CREATE INDEX data_yd_idx ON data USING btree (data_yd);
CREATE INDEX data_counter_idx ON data USING btree (data_counterenabled);

And create a PLpgSQL function to add fake records:

CREATE FUNCTION "init_fakedata"("int8", "int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;
tStart timestamp ;

BEGIN
tStart = now ();
IF ($1>0) AND ($2 >0) AND ($3 >0) AND ($3>$2) THEN
  FOR iLoop in 1 .. $1 LOOP
    INSERT INTO data (data_yd)
    VALUES (
       int8 (random () * ($3-$2) +$2)
    );
  END LOOP;
  RETURN ''t'';
ELSE
  RETURN ''f'';
END IF;
END;
'  LANGUAGE 'plpgsql';

To insert 16 million records with a year range between 1950 and 2050, enter:
SELECT init_fakedata(16000000, 1950, 2050); which should take a while... and
fill mor than 3.2 Gb on disc.

If you cannot wait that long :

For testing, insert 126.953 records  :
SELECT init_fakedata(126953, 1950, 2050);
This takes 40s on my server.

Then, repeat 8 times:
INSERT INTO data (data_yd)
SELECT data_yd FROM data;

This should produce 64999936 fake records more quickly as no random function
is
used.

Enter:
CHECKPOINT;
VACUUM ANALYSE;
to clear data cache and update statistics.

d) Pseudo-count table
Now, let's create a pseudo-count table.
CREATE TABLE "pseudo_count" (
  "count_oid" serial,
  "count_year" int4,
  "count_value" int4 DEFAULT 0
) WITH OIDS;
CREATE INDEX count_value_idx ON pseudo_count USING btree (count_value);
CREATE INDEX count_year_idx ON pseudo_count USING btree (count_year);

... and initialize it with the required data (values in the 1950 - 2050 range)
:
CREATE FUNCTION "init_pseudocount"("int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;

BEGIN
IF (($1>0) AND ($2>0) AND ($2>=$1)) THEN
  FOR iLoop in $1 .. $2 LOOP
    INSERT INTO pseudo_count (count_year)
    VALUES (iLoop);
  END LOOP;
  RETURN ''t'';
ELSE
  RETURN ''f'';
END IF;
END;
'  LANGUAGE 'plpgsql';

Example :
SELECT init_pseudocount(1950, 2050) ;
will create the required records for years 1900 to 2100.


e) PLpgSQL function and trigger

Pseudo count is handled using a single trigger 'tg_data' running 'tg_data()'
function. TG_OP is used to catch the trigger context ('insert', 'update' or
'delete').

CREATE FUNCTION "tg_data"() RETURNS "opaque" AS 'DECLARE
rec record;
BEGIN

IF (TG_OP=''UPDATE'') THEN
 IF (new.data_counterenabled = ''t'') AND (old.data_counterenabled = ''f'')
THEN
   UPDATE pseudo_count
   SET count_value = count_value +1
   WHERE count_year = new.data_yd
   AND count_value >= 0;
 END IF;

 IF (new.data_counterenabled = ''f'') AND (old.data_counterenabled = ''t'')
THEN
   UPDATE pseudo_count
   SET count_value = count_value -1
   WHERE count_year = new.data_yd
   AND count_value > 0;
 END IF;

 IF (old.data_yd <> new.data_yd) THEN
   UPDATE pseudo_count
   SET count_value = count_value -1
   WHERE count_year = old.data_yd
   AND count_value > 0;

   UPDATE pseudo_count
   SET count_value = count_value + 1
   WHERE count_year = new.data_yd
   AND count_value >= 0 ;
  END IF;
END IF;

IF (TG_OP=''DELETE'')  THEN
  UPDATE pseudo_count
  SET count_value = count_value - 1
  WHERE count_year = old.data_yd
   AND count_value >= 0 ;
END IF;

IF (TG_OP=''UPDATE'') THEN
  RETURN new;
ELSE
  RETURN old;
END IF;

END;
'  LANGUAGE 'plpgsql';

CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data" FOR EACH
ROW EXECUTE PROCEDURE tg_data();

2) USAGE

a) Initialisation
Set "pseudo_countenabled" flag on:

UPDATE data
SET data_counterenabled = 't'
WHERE data_counterenabled = 'f'

b) Run pseudo-count queries

Instead of :
SELECT COUNT (data_yd)
FROM data
WHERE data_yd = foo_year.

you now can run:
SELECT count_value
FROM pseudo_count
WHERE pseudo_date = foo_year

The anwer comes in 0.005 second.

c) Limits
Before loading large amount of data, triggers on table 'data' should be
dropped and recreated afterwards.