Re: Slow trigger on insert: alternative solutions? - Mailing list pgsql-general

From Adrian Klaver
Subject Re: Slow trigger on insert: alternative solutions?
Date
Msg-id 5519BA7C.8030000@aklaver.com
In response to Re: Slow trigger on insert: alternative solutions?  (Guillaume Drolet <droletguillaume@gmail.com>)
List pgsql-general
On 03/30/2015 01:43 PM, Guillaume Drolet wrote:
> Thanks Adrian,
>
> In option 2, when you say "this is automated in an external Python
> script", do you mean that you use something like psycopg2 to perform
> the queries against the database (e.g. for comparing data in the
> holding table with the older table)?

Yes. Basically I use the dbf module I mentioned previously to read the
DBF files, pull out the data I need, and write it into an in-memory CSV
file, which I then load into the Postgres database using the psycopg2
COPY functions. The files I am working with are not as large as the ones
you mention, so doing all this in memory is workable. The script then
calls a series of user functions in Postgres to do the comparing and
manipulating. At the time I did this, plpythonu was less featured than
it is now, so doing what I wanted made more sense in an external script.
Also, the script pulls the DBF files from elsewhere and I felt more
comfortable doing that outside the database than inside it.
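
In rough outline the script does something like this (just a simplified
sketch, not the actual code; the holding table name, the column handling
and the compare_and_merge_naipf() function are invented for illustration):

import csv
import io

import dbf        # https://pypi.python.org/pypi/dbf
import psycopg2


def load_dbf_into_holding(path, conn, holding_table):
    # Read the DBF file and write its rows into an in-memory CSV buffer.
    buf = io.StringIO()
    writer = csv.writer(buf)
    table = dbf.Table(path)
    table.open()              # the exact open() call varies a bit by dbf version
    try:
        for record in table:
            writer.writerow([record[name] for name in table.field_names])
    finally:
        table.close()
    buf.seek(0)

    with conn.cursor() as cur:
        # Bulk-load the buffer with COPY, much faster than row-by-row INSERTs.
        cur.copy_expert(
            "COPY {} FROM STDIN WITH (FORMAT csv)".format(holding_table), buf)
        # Then hand off to SQL for the set-based old/new comparison
        # (compare_and_merge_naipf() is a made-up name for those user functions).
        cur.execute("SELECT compare_and_merge_naipf()")
    conn.commit()


conn = psycopg2.connect("dbname=mydb")
load_dbf_into_holding("PET4_NAIPF.DBF", conn, "naipf_holding")
conn.close()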

Though more and more I seem to be using pandas (pandas.pydata.org) to do
data conversions. It saves a lot of the steps above. In this case you
would still need to get the data out of the DBF files first.
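
Something along these lines (again only a sketch; pandas has no native
DBF reader, so the rows still have to come out of the DBF file first,
and the connection string and table names are made up):

import pandas as pd
from sqlalchemy import create_engine

# Rows as plain dicts, e.g. produced by iterating a dbf.Table as in the
# previous sketch; the columns follow the naipf example for illustration only.
rows = [
    {"id_pet_mes": 1, "etage": "A", "type_couv": "X"},
    {"id_pet_mes": 2, "etage": "B", "type_couv": "Y"},
]

df = pd.DataFrame(rows)

# Any cleanup/conversion happens in pandas instead of in intermediate
# CSV or SQL steps, e.g. trimming stray whitespace in text columns.
df["etage"] = df["etage"].str.strip()

engine = create_engine("postgresql+psycopg2://user@localhost/mydb")
# Dump into a holding table; the old/new comparison is still done in SQL.
df.to_sql("naipf_holding", engine, if_exists="replace", index=False)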

>
> Thanks.
>
> 2015-03-30 9:53 GMT-04:00 Adrian Klaver <adrian.klaver@aklaver.com>:
>
>     On 03/30/2015 06:04 AM, Guillaume Drolet wrote:
>
>         Hello,
>
>         I need your help speeding up the procedure I will explain
>         below. I am looking for improvements to my method or different
>         approaches/ideas that would help in this matter.
>
>         I have a set of DBF files that I load into my database using a
>         plpython function and a call to ogr2ogr
>         (http://www.gdal.org/drv_pg.html). Once in a while, I'll have
>         to load updated versions of these tables to get the latest
>         additions and possible corrections to older versions.
>
>         In my plpython script, if a table is loaded for the first time,
>         I first load it empty, then I create a trigger function on
>         insert (executed for each row) that will check for duplicates
>         on each insert. Depending on the type of data I load, my
>         trigger first checks for equality in a subset of columns
>         (between 1 and 3 columns that act like my primary key(s)) and
>         if true, I check if all columns are equal between NEW and the
>         matching row from my table. When this condition is true, I
>         return null; else I store the rows (i.e. NEW.* and the matching
>         row(s)) in a new table called "duplicate" for further manual
>         investigation. Here's an example for one table:
>
>         CREATE OR REPLACE FUNCTION check_naipf_insert()
>           RETURNS trigger AS
>         ' BEGIN
>             IF EXISTS (SELECT 1
>                        FROM public.naipf
>                        WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>                        AND etage IS NOT DISTINCT FROM NEW.etage) THEN
>               IF EXISTS (SELECT 1
>                          FROM public.naipf
>                          WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>                          AND etage IS NOT DISTINCT FROM NEW.etage
>                          AND type_couv IS NOT DISTINCT FROM NEW.type_couv
>                          AND densite IS NOT DISTINCT FROM NEW.densite
>                          AND hauteur IS NOT DISTINCT FROM NEW.hauteur
>                          AND cl_age IS NOT DISTINCT FROM NEW.cl_age) THEN
>                 RETURN NULL;
>               ELSE
>                 INSERT INTO public.duplic_naipf SELECT NEW.*;
>                 INSERT INTO public.duplic_naipf
>                   (SELECT * FROM public.naipf
>                    WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>                    AND etage IS NOT DISTINCT FROM NEW.etage);
>                 RETURN NULL;
>               END IF;
>             END IF;
>             RETURN NEW;
>           END; '
>           LANGUAGE plpgsql VOLATILE COST 100;
>
>         CREATE TRIGGER check_insert_naipf
>             BEFORE INSERT
>             ON public.pet4_naipf
>             FOR EACH ROW
>             EXECUTE PROCEDURE check_naipf_insert();
>
>         (in this case, duplicate rows that need investigation are rows
>         that may have changed relative to an older version of the DBF
>         file but have no change in what I call their primary keys,
>         although they are not really PKs since I don't want to raise
>         errors at loading)
>
>         Once this is done, ogr2ogr is called a second time to load the
>         data. It is quite fast for small tables (tens of thousands of
>         rows, tens of columns) but for large tables it takes forever.
>         For example, I started loading a table with 3.5 million rows
>         and 33 columns last Friday at 3 PM, and as of Monday morning
>         at 9 only some 3 million rows have been loaded.
>
>         My question is: what are the other approaches that would make this
>         procedure faster? How is this kind of task usually implemented in
>         postgresql? Would it be better to load everything with no check
>         and then
>         apply some functions to find duplicate rows (although this would
>         involve
>         more manual work)?
>
>
>     I guess it depends on what the end purpose of the above is. If you
>     are just trying to keep relatively up to date information from the
>     DBF sources, would it not be easier just to load them into a new
>     table?
>
>     So, where the existing table is some_dbf_data:
>
>     1) CREATE TABLE new_some_dbf_data(...)
>     2) Dump the DBF file into new_some_dbf_data
>     3) In a transaction, rename/drop some_dbf_data and rename
>     new_some_dbf_data to some_dbf_data (see the sketch just below)
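
For what it is worth, step 3 boils down to something like this (a
psycopg2 sketch only; the connection string is made up):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn:                       # this block runs as a single transaction
    with conn.cursor() as cur:
        # new_some_dbf_data has already been created and bulk-loaded (steps 1-2)
        cur.execute("DROP TABLE IF EXISTS some_dbf_data")
        cur.execute("ALTER TABLE new_some_dbf_data RENAME TO some_dbf_data")
conn.close()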
>
>
>     Option 2 is what I do for a similar procedure:
>
>     1) Dump the DBF data into a holding table.
>     2) Use SQL in function(s) to compare the old and new tables and
>     make the appropriate adjustments. Doing the SQL in bulk is a lot
>     faster than checking each row, or at least that is what I found.
>     In any case, the way you are doing it looks to involve 3.5 million
>     inserts with a trigger action on each, and that is bound to be
>     slow :) (a bulk-comparison sketch follows after this list)
>     3) This is automated in an external Python script.
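
To expand on 2), the comparison can be done with a couple of set-based
statements instead of a per-row trigger. Roughly like this (column names
follow your naipf example but are only illustrative, and the exact merge
rules would have to match your data):

import psycopg2

BULK_COMPARE = """
-- key matches but the other columns differ: set aside for manual review
INSERT INTO duplic_naipf
SELECT h.*
FROM naipf_holding h
JOIN naipf n ON n.id_pet_mes IS NOT DISTINCT FROM h.id_pet_mes
            AND n.etage IS NOT DISTINCT FROM h.etage
WHERE (h.type_couv, h.densite, h.hauteur, h.cl_age)
      IS DISTINCT FROM (n.type_couv, n.densite, n.hauteur, n.cl_age);

-- key not present at all: genuinely new rows, insert them
INSERT INTO naipf
SELECT h.*
FROM naipf_holding h
WHERE NOT EXISTS (
    SELECT 1 FROM naipf n
    WHERE n.id_pet_mes IS NOT DISTINCT FROM h.id_pet_mes
      AND n.etage IS NOT DISTINCT FROM h.etage);
"""

conn = psycopg2.connect("dbname=mydb")
with conn:
    with conn.cursor() as cur:
        # two set-based statements instead of millions of per-row trigger checks
        cur.execute(BULK_COMPARE)
conn.close()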
>
>     Option 3
>
>     Use dbf (https://pypi.python.org/pypi/dbf/0.88.16) and do the
>     comparisons in the DBF files outside Postgres and only import what
>     has changed.
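
A rough sketch of that approach with the dbf package (record access is
shown the same way as above; field names are the illustrative naipf ones
and the exact dbf API may differ slightly between versions):

import dbf

def rows_by_key(path, key_fields):
    # Return {key_tuple: full_row_tuple} for one DBF file.
    table = dbf.Table(path)
    table.open()
    try:
        out = {}
        for record in table:
            row = tuple(record[name] for name in table.field_names)
            key = tuple(record[name] for name in key_fields)
            out[key] = row
        return out
    finally:
        table.close()

old = rows_by_key("naipf_old.dbf", ["id_pet_mes", "etage"])
new = rows_by_key("naipf_new.dbf", ["id_pet_mes", "etage"])

# Only rows that are new or whose contents changed would be imported.
changed = [row for key, row in new.items() if old.get(key) != row]
print(len(changed), "rows to import into Postgres")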
>
>         Thanks a lot for your help!
>
>     --
>     Adrian Klaver
>     adrian.klaver@aklaver.com
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

