Slow trigger on insert: alternative solutions? - Mailing list pgsql-general

From Guillaume Drolet
Subject Slow trigger on insert: alternative solutions?
Date
Msg-id CAOkiyv75abqw0m1syoimguT7XgBbwkGQXKnj-hkg+JjTXVd8uw@mail.gmail.com
List pgsql-general
Hello,

I need your help speeding up the procedure I will explain below. I am looking for improvements to my method, or for different approaches/ideas that would help in this matter.

I have a set of DBF files that I load into my database using a plpython function and a call to ogr2ogr (http://www.gdal.org/drv_pg.html). Once in a while, I'll have to load updated versions of these tables to get the latest additions and possible corrections to older versions.

In my plpython script, if a table is loaded for the first time, I first load it empty, then I create a trigger function (BEFORE INSERT, FOR EACH ROW) that checks for duplicates on each insert. Depending on the type of data I load, the trigger first checks for equality in a subset of columns (between 1 and 3 columns that act like my primary key(s)); if a match exists, I then check whether all columns are equal between NEW and the matching row from my table. When they are all equal, I return NULL; otherwise I store the rows (i.e. NEW.* and the matching row(s)) in a new table called "duplicate" for further manual investigation. Here's an example for one table:

CREATE OR REPLACE FUNCTION check_naipf_insert()
  RETURNS trigger AS
$$
BEGIN
  IF EXISTS (SELECT 1
             FROM public.naipf
             WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
               AND etage IS NOT DISTINCT FROM NEW.etage) THEN
    IF EXISTS (SELECT 1
               FROM public.naipf
               WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
                 AND etage IS NOT DISTINCT FROM NEW.etage
                 AND type_couv IS NOT DISTINCT FROM NEW.type_couv
                 AND densite IS NOT DISTINCT FROM NEW.densite
                 AND hauteur IS NOT DISTINCT FROM NEW.hauteur
                 AND cl_age IS NOT DISTINCT FROM NEW.cl_age) THEN
      -- exact duplicate: silently skip the insert
      RETURN NULL;
    ELSE
      -- same keys, different attributes: keep both versions for review
      INSERT INTO public.duplic_naipf SELECT NEW.*;
      INSERT INTO public.duplic_naipf
        SELECT *
        FROM public.naipf
        WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
          AND etage IS NOT DISTINCT FROM NEW.etage;
      RETURN NULL;
    END IF;
  END IF;
  RETURN NEW;
END;
$$
LANGUAGE plpgsql VOLATILE COST 100;

CREATE TRIGGER check_insert_naipf
  BEFORE INSERT
  ON public.pet4_naipf
  FOR EACH ROW
  EXECUTE PROCEDURE check_naipf_insert();

(In this case, the duplicate rows that need investigation are rows that may have changed relative to an older version of the DBF file while keeping the same values in what I call their primary keys. They are not declared as real PKs because I don't want to raise errors at load time.)
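As an aside on why each insert is slow: each EXISTS probe in the trigger scans public.naipf unless the key columns are indexed, and as far as I know PostgreSQL's planner cannot use a plain btree index for IS NOT DISTINCT FROM comparisons. A sketch of what I would try, assuming the key columns can be made NOT NULL so that plain "=" is safe (the index name is my own invention):

```sql
-- Hypothetical supporting index for the trigger's key lookup.
CREATE INDEX naipf_key_idx ON public.naipf (id_pet_mes, etage);

-- If id_pet_mes and etage are NOT NULL, "=" has the same semantics as
-- IS NOT DISTINCT FROM and, unlike it, can use the index above:
--   WHERE id_pet_mes = NEW.id_pet_mes AND etage = NEW.etage
```

Whether this applies depends on whether the key columns actually contain NULLs in the real data.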

Once this is done, ogr2ogr is called a second time to load the data. This is quite fast for small tables (tens of thousands of rows, tens of columns), but for large tables it takes forever. For example, I started loading a table with 3.5 million rows and 33 columns last Friday at 3PM, and now, Monday morning at 9AM, only some 3 million rows have been loaded.

My question is: what other approaches would make this procedure faster? How is this kind of task usually implemented in PostgreSQL? Would it be better to load everything with no check and then run some queries to find duplicate rows (even though that would involve more manual work)?
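For what it's worth, here is a set-based sketch of that last idea, assuming the fresh data is first loaded into a staging table with the same layout (I am making up the name naipf_staging; the other table and column names are the ones from my trigger):

```sql
-- Rows whose keys already exist but whose attributes differ:
-- set both versions aside for manual review, as the trigger did.
INSERT INTO public.duplic_naipf
SELECT s.*
FROM naipf_staging s
JOIN public.naipf n
  ON  n.id_pet_mes IS NOT DISTINCT FROM s.id_pet_mes
  AND n.etage      IS NOT DISTINCT FROM s.etage
WHERE (n.type_couv, n.densite, n.hauteur, n.cl_age)
      IS DISTINCT FROM
      (s.type_couv, s.densite, s.hauteur, s.cl_age);

-- Rows whose keys are not present at all: insert them in one pass.
INSERT INTO public.naipf
SELECT s.*
FROM naipf_staging s
WHERE NOT EXISTS (SELECT 1
                  FROM public.naipf n
                  WHERE n.id_pet_mes IS NOT DISTINCT FROM s.id_pet_mes
                    AND n.etage      IS NOT DISTINCT FROM s.etage);
```

Two set-based statements should replace millions of per-row trigger firings; the old matching rows could be copied into duplic_naipf with a similar JOIN if both versions are wanted there.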

Thanks a lot for your help!



