Re: Another Plpgsql trigger example - summary table - Mailing list pgsql-patches

From Bruce Momjian
Subject Re: Another Plpgsql trigger example - summary table
Date
Msg-id 200501040408.j0448O200354@candle.pha.pa.us
Whole thread Raw
In response to Another Plpgsql trigger example - summary table  (Mark Kirkwood <markir@coretech.co.nz>)
Responses Re: Another Plpgsql trigger example - summary table  (Mark Kirkwood <markir@coretech.co.nz>)
List pgsql-patches
Wow, yea, that is long.  Not sure where that should go.

---------------------------------------------------------------------------

Mark Kirkwood wrote:
> I have always thought that an example of how to maintain a summary table
> via triggers would be nice... but until the other day, had not attempted
> to do one, so contributing a simplified version seemed like a good thing
> to do.
>
> I have made the example pretty much self contained, which unfortunately
> means it is longish.  This made me wonder about  its placement (i.e in
> plpgsql examples). It could go in a 'Data warehousing' chapter - if we
> had one....
>
> Any suggestions welcome.
>
> best wishes
>
> Mark
>
> P.s : use is made of a schema from Ralph Kimball's "The Data  Warehouse
> Toolkit" -  I mailed him to check it was ok (and it was, in fact he was
> quite pleased)

> --- plpgsql.sgml.orig    2004-12-29 15:48:53.089973005 +1300
> +++ plpgsql.sgml    2004-12-29 12:43:50.000000000 +1300
> @@ -2632,6 +2632,306 @@
>  ;
>  </programlisting>
>     </example>
> +
> +   <para>
> +    An area where triggers can be useful is maintaining a summary table
> +    of another table. The resulting summary can be used in place of the
> +    original table for certain queries - with often vastly reduced run
> +    times.
> +   </para>
> +
> +   <para>
> +    This technique is commonly used in Data Warehousing, where the tables
> +    of measured or observed data (called fact tables) can be extremely large.
> +   </para>
> +
> +   <example id="plpgsql-trigger-summary-example">
> +    <title>A <application>PL/pgSQL</application> Trigger Procedure For Maintaining A Summary Table</title>
> +
> +    <para>
> +     <xref linkend="plpgsql-trigger-summary-example"> shows an example of a
> +     trigger procedure in <application>PL/pgSQL</application> that maintains
> +     a summary table for a fact table in a data warehouse.
> +    </para>
> +
> +    <para>
> +     The schema detailed here is loosely based on the <emphasis>Grocery Store
> +     </emphasis> example from <emphasis>The Data Warehouse Toolkit</emphasis>
> +     by Ralph Kimball.
> +    </para>
> +
> +<programlisting>
> +--
> +-- Three dimension tables.
> +--
> +CREATE TABLE time_dimension (
> +    time_key                    integer NOT NULL,
> +    day_of_week                 integer NOT NULL,
> +    day_of_month                integer NOT NULL,
> +    month                       integer NOT NULL,
> +    quarter                     integer NOT NULL,
> +    year                        integer NOT NULL
> +);
> +
> +CREATE TABLE product_dimension (
> +    product_key                 integer NOT NULL,
> +    description                 varchar(100) NOT NULL,
> +    brand                       varchar(50) NOT NULL,
> +    catageory                   varchar(20) NOT NULL
> +);
> +
> +CREATE TABLE store_dimension (
> +    store_key                   integer NOT NULL,
> +    store_name                  varchar(100) NOT NULL,
> +    address                     varchar(100) NOT NULL
> +);
> +
> +
> +--
> +-- Sales fact.
> +--
> +CREATE TABLE sales_fact (
> +    time_key                    integer NOT NULL,
> +    product_key                 integer NOT NULL,
> +    store_key                   integer NOT NULL,
> +    amount_sold                 numeric(12,2) NOT NULL,
> +    units_sold                  integer NOT NULL,
> +    amount_cost                 numeric(12,2) NOT NULL
> +);
> +
> +
> +--
> +-- Sales summary.
> +--
> +CREATE TABLE sales_summary_bytime (
> +    time_key                    integer NOT NULL,
> +    amount_sold                 numeric(15,2) NOT NULL,
> +    units_sold                  numeric(12) NOT NULL,
> +    amount_cost                 numeric(15,2) NOT NULL
> +);
> +
> +--
> +-- COPY in data.
> +--
> +COPY time_dimension FROM '/var/dump/time.dat' DELIMITERS ',';
> +COPY product_dimension FROM '/var/dump/prod.dat' DELIMITERS ',';
> +COPY store_dimension FROM '/var/dump/store.dat' DELIMITERS ',';
> +COPY sales_fact FROM '/var/dump/sales.dat' DELIMITERS ',';
> +
> +
> +--
> +-- Create indexes on the dimensions, facts and summary.
> +--
> +CREATE UNIQUE INDEX time_dimension_key ON time_dimension(time_key);
> +CREATE INDEX time_dimension_year ON time_dimension(year);
> +CREATE UNIQUE INDEX product_dimension_key ON product_dimension(product_key);
> +CREATE UNIQUE INDEX store_dimension_key ON store_dimension(store_key);
> +
> +CREATE INDEX sales_fact_time ON sales_fact(time_key);
> +CREATE INDEX sales_fact_product ON sales_fact(product_key);
> +CREATE INDEX sales_fact_store ON sales_fact(store_key);
> +
> +CREATE UNIQUE INDEX sales_summary_bytime_key ON sales_summary_bytime(time_key);
> +
> +
> +--
> +-- Collect statistics for the optimizer.
> +--
> +ANALYZE VERBOSE time_dimension;
> +ANALYZE VERBOSE product_dimension;
> +ANALYZE VERBOSE store_dimension;
> +ANALYZE VERBOSE sales_fact;
> +
> +
> +--
> +-- Pre populate (and collect statistics for) the summary table.
> +--
> +INSERT INTO sales_summary_bytime (
> +            time_key,
> +            amount_sold,
> +            units_sold,
> +            amount_cost)
> +    SELECT  f.time_key,
> +            sum(f.amount_sold),
> +            sum(f.units_sold),
> +            sum(f.amount_cost)
> +    FROM sales_fact f
> +    GROUP BY f.time_key;
> +
> +ANALYZE VERBOSE sales_summary_bytime;
> +
> +
> +--
> +-- Function to amend summarized column(s) on UPDATE, INSERT, DELETE.
> +--
> +CREATE OR REPLACE FUNCTION maint_sales_summary_bytime() RETURNS TRIGGER AS $maint_sales_summary_bytime$
> +    DECLARE
> +        delta_time_key          integer;
> +        delta_amount_sold       numeric(15,2);
> +        delta_units_sold        numeric(12);
> +        delta_amount_cost       numeric(15,2);
> +    BEGIN
> +
> +        -- Work out the increment/decrement amount(s).
> +        IF (TG_OP = 'DELETE') THEN
> +
> +            delta_time_key = OLD.time_key;
> +            delta_amount_sold = -1 * OLD.amount_sold;
> +            delta_units_sold = -1 * OLD.units_sold;
> +            delta_amount_cost = -1 * OLD.amount_cost;
> +
> +        ELSIF (TG_OP = 'UPDATE') THEN
> +
> +            -- forbid updates that change the time_key -
> +            -- (probably not too onerous, as DELETE + INSERT is how most
> +            -- changes will be made).
> +            IF ( OLD.time_key != NEW.time_key) THEN
> +                RAISE EXCEPTION 'Update of time_key : % -> % not allowed', OLD.time_key, NEW.time_key;
> +            END IF;
> +
> +            delta_time_key = OLD.time_key;
> +            delta_amount_sold = NEW.amount_sold - OLD.amount_sold;
> +            delta_units_sold = NEW.units_sold - OLD.units_sold;
> +            delta_amount_cost = NEW.amount_cost - OLD.amount_cost;
> +
> +        ELSIF (TG_OP = 'INSERT') THEN
> +
> +            delta_time_key = NEW.time_key;
> +            delta_amount_sold = NEW.amount_sold;
> +            delta_units_sold = NEW.units_sold;
> +            delta_amount_cost = NEW.amount_cost;
> +
> +        END IF;
> +
> +
> +        -- Update the summary row with the new values.
> +        UPDATE sales_summary_bytime
> +            SET amount_sold = amount_sold + delta_amount_sold,
> +                units_sold = units_sold + delta_units_sold,
> +                amount_cost = amount_cost + delta_amount_cost
> +            WHERE time_key = delta_time_key;
> +
> +
> +        -- There might have been no row with this time_key (e.g new data!).
> +        IF (NOT FOUND) THEN
> +            BEGIN
> +                INSERT INTO sales_summary_bytime (
> +                            time_key,
> +                            amount_sold,
> +                            units_sold,
> +                            amount_cost)
> +                    SELECT  f.time_key,
> +                            sum(f.amount_sold),
> +                            sum(f.units_sold),
> +                            sum(f.amount_cost)
> +                    FROM sales_fact f
> +                    WHERE f.time_key = delta_time_key
> +                    GROUP BY f.time_key;
> +                -- This query can potentially be very expensive if the trigger
> +                -- is created on sales_fact without the time_key indexes.
> +                -- Some care is needed to ensure that this situation does
> +                -- *not* occur.
> +            EXCEPTION
> +                --
> +                -- Catch race condition when two transactions are adding data
> +                -- for a new time_key.
> +                --
> +                WHEN UNIQUE_VIOLATION THEN
> +                    UPDATE sales_summary_bytime
> +                        SET    amount_sold = amount_sold + delta_amount_sold,
> +                            units_sold = units_sold + delta_units_sold,
> +                            amount_cost = amount_cost + delta_amount_cost
> +                        WHERE time_key = delta_time_key;
> +
> +            END;
> +        END IF;
> +        RETURN NULL;
> +
> +    END;
> +$maint_sales_summary_bytime$ LANGUAGE plpgsql;
> +
> +
> +--
> +-- The trigger.
> +--
> +CREATE TRIGGER maint_sales_summary_bytime
> +AFTER INSERT OR UPDATE OR DELETE ON sales_fact
> +    FOR EACH ROW EXECUTE PROCEDURE maint_sales_summary_bytime()
> +;
> +</programlisting>
> +  <para>
> +   The effectiveness of the summary technique can be easily demonstrated. In
> +   the first case shown below, the base fact table is used. In the second the
> +   summary table is substituted. The run times and plans shown are real.
> +  </para>
> +<programlisting>
> +dwexample=#
> +SELECT
> +    d0.quarter,
> +    sum(f.amount_sold)
> +FROM
> +    time_dimension d0,
> +    sales_fact f
> +WHERE d0.time_key = f.time_key
> +AND   d0.year = 2004
> +GROUP BY
> +    d0.quarter
> +;
> + quarter |     sum
> +---------+-------------
> +       2 | 90000000.00
> +       1 | 90000000.00
> +       4 | 90000000.00
> +       3 | 90000000.00
> +(4 rows)
> +
> +Time: 2898.236 ms
> +                                            QUERY PLAN
> +-----------------------------------------------------------------------------------------------------
> + HashAggregate  (cost=53237.46..53237.46 rows=1 width=14)
> +   ->  Nested Loop  (cost=0.00..51447.46 rows=358001 width=14)
> +         ->  Index Scan using time_year on time_dimension d0  (cost=0.00..9.83 rows=358 width=8)
> +               Index Cond: ("year" = 2004)
> +         ->  Index Scan using sales_fact_time on sales_fact f  (cost=0.00..96.72 rows=3757 width=14)
> +               Index Cond: ("outer".time_key = f.time_key)
> +(6 rows)
> +
> +
> +dwexample=#
> +SELECT
> +    d0.quarter,
> +    sum(f.amount_sold)
> +FROM
> +    time_dimension d0,
> +    sales_summary_bytime f
> +WHERE d0.time_key = f.time_key
> +AND   d0.year = 2004
> +GROUP BY
> +    d0.quarter
> +;
> + quarter |     sum
> +---------+-------------
> +       2 | 90000000.00
> +       1 | 90000000.00
> +       4 | 90000000.00
> +       3 | 90000000.00
> +(4 rows)
> +
> +Time: 28.459 ms
> +                                              QUERY PLAN
> +-------------------------------------------------------------------------------------------------------
> + HashAggregate  (cost=260.10..260.10 rows=1 width=14)
> +   ->  Hash Join  (cost=10.72..258.31 rows=358 width=14)
> +         Hash Cond: ("outer".time_key = "inner".time_key)
> +         ->  Seq Scan on sales_summary_bytime f  (cost=0.00..194.00 rows=10000 width=14)
> +         ->  Hash  (cost=9.83..9.83 rows=358 width=8)
> +               ->  Index Scan using time_year on time_dimension d0  (cost=0.00..9.83 rows=358 width=8)
> +                     Index Cond: ("year" = 2004)
> +(7 rows)
> +
> +
> +</programlisting>
> +   </example>
>    </sect1>
>
>    <!-- **** Porting from Oracle PL/SQL **** -->

>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

pgsql-patches by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Implementing SELECT FOR UPDATE [NOWAIT]
Next
From: Bruce Momjian
Date:
Subject: Re: Move get_grosysid() to utils/cache/lsyscache.c