Le Jeudi 25 Avril 2002 20:54, S Grannis a écrit :
> I think our work-around for now will be to SELECT the column we wish to
> analyze into a flat file and then run a Perl script to do the actual
> counting.
Dear all,
I wrote a small howto to solve S Grannis performance questions on Count()
function. The idea is to create and maintain a pseudo-count table using
triggers and PLpgSQL.
Unfortunately, I could not test the PLpgSQL scripts in 16 Million records for
lack of space reason (only on 1 Million records). Code is included to
generate fake test data. Could someone help me test the howto on 16 million
records?
Thank you for your feedback,
Cheers,
Jean-Michel
*************************************************************************
Performance HOWTO - pseudo counter example
This document is released under PostgreSQL license
*************************************************************************
This tutorial demonstrates how to create fast pseudo-counters in PostgreSQL
using PLpgSQL and triggers.
1) Performance background
This small howto is insprired in reply to an email on
pgsql-general@postgresql.org complaining about PostgreSQL speed.
The user needs to run COUNT statements on a large database of 65.000.000
records. The table structure is basically as follows:
CREATE TABLE "data" (
"data_oid" serial8,
"data_yd" int4
);
In our example, data_yd is a year value between 1950 and 2050.
The user needs to run the following query:
SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year.
where foo_year is a date between 1950 and 2050.
The query takes more than two hours to execute on a double-processor computer
running PostgreSQL and GNU/Linux. The proposed solution creates a
pseudo-counter on PostgreSQL using PLpgSQL and triggers. The aim is to return
a result in 0.005 second. Initilisation itself of the pseudo-counter table
should take less than 30 minutes.
1) INSTALLATION
a) Database creation
Open a terminal windows, connect as 'postgres' user:
root@localhost>su postgres
Create an empty database:
postgresql@localhost>psql template1;
template1=\CREATE DATABASE pseudo_counter;
template1=\q
b) PLpgSQL declaration
PLpgSQL is compiled by default in PostgreSQL. But you should enable PLpgSQL on
the database itself:
postgresql@localhost>CREATELANG plpgsql pseudo_counter
c) Data table
We first need to create the table stucture:
CREATE TABLE "data" (
"data_oid" serial8,
"data_yd" int4,
"data_counterenabled" bool DEFAULT 'f'
) WITH OIDS;
CREATE INDEX data_yd_idx ON data USING btree (data_yd);
CREATE INDEX data_counter_idx ON data USING btree (data_counterenabled);
And create a PLpgSQL function to add fake records:
CREATE FUNCTION "init_fakedata"("int8", "int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;
tStart timestamp ;
BEGIN
tStart = now ();
IF ($1>0) AND ($2 >0) AND ($3 >0) AND ($3>$2) THEN
FOR iLoop in 1 .. $1 LOOP
INSERT INTO data (data_yd)
VALUES (
int8 (random () * ($3-$2) +$2)
);
END LOOP;
RETURN ''t'';
ELSE
RETURN ''f'';
END IF;
END;
' LANGUAGE 'plpgsql';
To insert 16 million records with a year range between 1950 and 2050, enter:
SELECT init_fakedata(16000000, 1950, 2050); which should take a while... and
fill mor than 3.2 Gb on disc.
If you cannot wait that long :
For testing, insert 126.953 records :
SELECT init_fakedata(126953, 1950, 2050);
This takes 40s on my server.
Then, repeat 8 times:
INSERT INTO data (data_yd)
SELECT data_yd FROM data;
This should produce 64999936 fake records more quickly as no random function
is
used.
Enter:
CHECKPOINT;
VACUUM ANALYSE;
to clear data cache and update statistics.
d) Pseudo-count table
Now, let's create a pseudo-count table.
CREATE TABLE "pseudo_count" (
"count_oid" serial,
"count_year" int4,
"count_value" int4 DEFAULT 0
) WITH OIDS;
CREATE INDEX count_value_idx ON pseudo_count USING btree (count_value);
CREATE INDEX count_year_idx ON pseudo_count USING btree (count_year);
... and initialize it with the required data (values in the 1950 - 2050 range)
:
CREATE FUNCTION "init_pseudocount"("int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;
BEGIN
IF (($1>0) AND ($2>0) AND ($2>=$1)) THEN
FOR iLoop in $1 .. $2 LOOP
INSERT INTO pseudo_count (count_year)
VALUES (iLoop);
END LOOP;
RETURN ''t'';
ELSE
RETURN ''f'';
END IF;
END;
' LANGUAGE 'plpgsql';
Example :
SELECT init_pseudocount(1950, 2050) ;
will create the required records for years 1900 to 2100.
e) PLpgSQL function and trigger
Pseudo count is handled using a single trigger 'tg_data' running 'tg_data()'
function. TG_OP is used to catch the trigger context ('insert', 'update' or
'delete').
CREATE FUNCTION "tg_data"() RETURNS "opaque" AS 'DECLARE
rec record;
BEGIN
IF (TG_OP=''UPDATE'') THEN
IF (new.data_counterenabled = ''t'') AND (old.data_counterenabled = ''f'')
THEN
UPDATE pseudo_count
SET count_value = count_value +1
WHERE count_year = new.data_yd
AND count_value >= 0;
END IF;
IF (new.data_counterenabled = ''f'') AND (old.data_counterenabled = ''t'')
THEN
UPDATE pseudo_count
SET count_value = count_value -1
WHERE count_year = new.data_yd
AND count_value > 0;
END IF;
IF (old.data_yd <> new.data_yd) THEN
UPDATE pseudo_count
SET count_value = count_value -1
WHERE count_year = old.data_yd
AND count_value > 0;
UPDATE pseudo_count
SET count_value = count_value + 1
WHERE count_year = new.data_yd
AND count_value >= 0 ;
END IF;
END IF;
IF (TG_OP=''DELETE'') THEN
UPDATE pseudo_count
SET count_value = count_value - 1
WHERE count_year = old.data_yd
AND count_value >= 0 ;
END IF;
IF (TG_OP=''UPDATE'') THEN
RETURN new;
ELSE
RETURN old;
END IF;
END;
' LANGUAGE 'plpgsql';
CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data" FOR EACH
ROW EXECUTE PROCEDURE tg_data();
2) USAGE
a) Initialisation
Set "pseudo_countenabled" flag on:
UPDATE data
SET data_counterenabled = 't'
WHERE data_counterenabled = 'f'
b) Run pseudo-count queries
Instead of :
SELECT COUNT (data_yd)
FROM data
WHERE data_yd = foo_year.
you now can run:
SELECT count_value
FROM pseudo_count
WHERE pseudo_date = foo_year
The anwer comes in 0.005 second.
c) Limits
Before loading large amount of data, triggers on table 'data' should be
dropped and recreated afterwards.