Re: Slow trigger on insert: alternative solutions? - Mailing list pgsql-general
From: Adrian Klaver
Subject: Re: Slow trigger on insert: alternative solutions?
Date:
Msg-id: 5519BA7C.8030000@aklaver.com
In response to: Re: Slow trigger on insert: alternative solutions? (Guillaume Drolet <droletguillaume@gmail.com>)
List: pgsql-general
On 03/30/2015 01:43 PM, Guillaume Drolet wrote:
> Thanks Adrian,
>
> In option 2, when you say "this is automated in an external Python
> script", do you mean that you use something like psycopg2 to perform
> the queries on the database (e.g. for comparing data in the holding
> table with the older table)?

Yes. Basically I use the dbf module I mentioned previously to read the
DBF files, pull out the data I need, and write it into an in-memory CSV
file, which I then dump into the Postgres database using the psycopg2
COPY functions. The files I am working with are not as large as the
ones you mention, so doing all this in memory is workable. The script
then calls a series of user functions in Postgres to do the comparing
and manipulating. At the time I did this, plpythonu was less featured
than it is now, so doing what I wanted made more sense in an external
script. Also, the script pulls the DBF files from elsewhere, and I felt
more comfortable doing that outside the database than inside it.

Though more and more I seem to be using pandas (pandas.pydata.org) to
do data conversions; it saves a lot of the steps above. In this case
you would still need to get the data out of the DBF files first.
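Roughly, the read-DBF / in-memory-CSV / COPY sequence looks like this.
A sketch only: the file name, connection settings, and the
naipf_holding table are placeholders I made up, the field names are
borrowed from your trigger example below, and the dbf calls are from
memory, so check them against the module's docs:

import csv
import io

import dbf        # https://pypi.python.org/pypi/dbf
import psycopg2

# Read the DBF file and write the wanted fields to an in-memory CSV.
table = dbf.Table('naipf.dbf')    # placeholder file name
table.open()
buf = io.StringIO()
writer = csv.writer(buf)
for record in table:
    writer.writerow([record.id_pet_mes, record.etage, record.type_couv,
                     record.densite, record.hauteur, record.cl_age])
table.close()
buf.seek(0)

# Load the whole buffer into a holding table with a single COPY.
conn = psycopg2.connect(dbname='mydb')    # placeholder connection
with conn, conn.cursor() as cur:
    cur.copy_expert(
        'COPY naipf_holding (id_pet_mes, etage, type_couv, densite, '
        'hauteur, cl_age) FROM STDIN WITH (FORMAT csv)',
        buf)
conn.close()

One COPY of N rows is far cheaper than N individual INSERTs, which is
most of the win here.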
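The pandas flavor of the same load drops the manual CSV bookkeeping
(same placeholder names as above; DataFrame.to_csv writes straight into
the buffer):

import io

import dbf
import pandas as pd
import psycopg2

cols = ['id_pet_mes', 'etage', 'type_couv', 'densite', 'hauteur',
        'cl_age']

# Pull the DBF records into a DataFrame, then let pandas do the CSV.
table = dbf.Table('naipf.dbf')
table.open()
df = pd.DataFrame([[getattr(record, c) for c in cols] for record in table],
                  columns=cols)
table.close()

buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)
conn = psycopg2.connect(dbname='mydb')
with conn, conn.cursor() as cur:
    cur.copy_expert('COPY naipf_holding FROM STDIN WITH (FORMAT csv)',
                    buf)
conn.close()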
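As for the comparing step (option 2 below), the important part is that
it runs as set-based statements against the holding table rather than a
trigger firing per row. A sketch of the idea, not my actual functions:
public.naipf and public.duplic_naipf are the tables from your trigger,
naipf_holding is the placeholder holding table:

import psycopg2

conn = psycopg2.connect(dbname='mydb')
with conn, conn.cursor() as cur:
    # Key columns match an existing row but the other columns differ:
    # park the incoming row in duplic_naipf for manual review.  (The
    # matching rows already in naipf can be parked the same way with
    # SELECT n.* and the same join.)
    cur.execute("""
        INSERT INTO public.duplic_naipf
        SELECT h.*
        FROM naipf_holding h
        JOIN public.naipf n
          ON n.id_pet_mes IS NOT DISTINCT FROM h.id_pet_mes
         AND n.etage IS NOT DISTINCT FROM h.etage
        WHERE (n.type_couv, n.densite, n.hauteur, n.cl_age)
              IS DISTINCT FROM
              (h.type_couv, h.densite, h.hauteur, h.cl_age)""")
    # No match at all on the key columns: the row is new, keep it.
    cur.execute("""
        INSERT INTO public.naipf
        SELECT h.*
        FROM naipf_holding h
        WHERE NOT EXISTS (
            SELECT 1 FROM public.naipf n
            WHERE n.id_pet_mes IS NOT DISTINCT FROM h.id_pet_mes
              AND n.etage IS NOT DISTINCT FROM h.etage)""")
conn.close()

Two set-based statements like these replace the 3.5 million per-row
trigger executions.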
> Thanks.
>
> 2015-03-30 9:53 GMT-04:00 Adrian Klaver <adrian.klaver@aklaver.com>:
>
>> On 03/30/2015 06:04 AM, Guillaume Drolet wrote:
>>
>>> Hello,
>>>
>>> I need your help speeding up the procedure I will explain below. I
>>> am looking for improvements to my method, or different
>>> approaches/ideas that would help in this matter.
>>>
>>> I have a set of DBF files that I load into my database using a
>>> plpython function and a call to ogr2ogr
>>> (http://www.gdal.org/drv_pg.html). Once in a while, I'll have to
>>> load updated versions of these tables to get the latest additions
>>> and possible corrections to older versions.
>>>
>>> In my plpython script, if a table is loaded for the first time, I
>>> first load it empty, then I create a trigger function on insert
>>> (executed per row) that checks for duplicates on each insert.
>>> Depending on the type of data I load, my trigger first checks for
>>> equality in a subset of columns (between 1 and 3 columns that act
>>> like my primary key(s)), and if true, I check whether all columns
>>> are equal between NEW and the matching row from my table. When this
>>> condition is true, I return null; else I store the rows (i.e. NEW.*
>>> and the matching row(s)) in a new table called "duplicate" for
>>> further manual investigation. Here's an example for one table:
>>>
>>> CREATE OR REPLACE FUNCTION check_naipf_insert()
>>>   RETURNS trigger AS
>>> ' BEGIN
>>>     IF EXISTS (SELECT 1 FROM public.naipf
>>>                WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>>>                  AND etage IS NOT DISTINCT FROM NEW.etage) THEN
>>>       IF EXISTS (SELECT 1 FROM public.naipf
>>>                  WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>>>                    AND etage IS NOT DISTINCT FROM NEW.etage
>>>                    AND type_couv IS NOT DISTINCT FROM NEW.type_couv
>>>                    AND densite IS NOT DISTINCT FROM NEW.densite
>>>                    AND hauteur IS NOT DISTINCT FROM NEW.hauteur
>>>                    AND cl_age IS NOT DISTINCT FROM NEW.cl_age) THEN
>>>         RETURN NULL;
>>>       ELSE
>>>         INSERT INTO public.duplic_naipf SELECT NEW.*;
>>>         INSERT INTO public.duplic_naipf
>>>           (SELECT * FROM public.naipf
>>>            WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
>>>              AND etage IS NOT DISTINCT FROM NEW.etage);
>>>         RETURN NULL;
>>>       END IF;
>>>     END IF;
>>>     RETURN NEW;
>>>   END; '
>>> LANGUAGE plpgsql VOLATILE
>>> COST 100;
>>>
>>> CREATE TRIGGER check_insert_naipf
>>>   BEFORE INSERT ON public.pet4_naipf
>>>   FOR EACH ROW
>>>   EXECUTE PROCEDURE check_naipf_insert();
>>>
>>> (In this case, duplicate rows that need investigation are rows that
>>> may have changed relative to an older version of the DBF file but
>>> show no change in what I call their primary keys. They are not
>>> really PKs, since I don't want to raise errors at loading.)
>>>
>>> Once this is done, ogr2ogr is called a second time to load the
>>> data. It is quite fast for small tables (tens of thousands of rows,
>>> tens of columns), but for large tables it takes forever. For
>>> example, I started loading a table with 3.5 million rows/33 columns
>>> last Friday at 3 PM, and now, Monday morning at 9 AM, some 3
>>> million rows have been loaded.
>>>
>>> My question is: what other approaches would make this procedure
>>> faster? How is this kind of task usually implemented in PostgreSQL?
>>> Would it be better to load everything with no checks and then apply
>>> some functions to find duplicate rows (although this would involve
>>> more manual work)?
>>
>> I guess it depends on what the end purpose of the above is. If you
>> are just trying to keep relatively up-to-date information from the
>> DBF sources, would it not be easier just to load them into a new
>> table?
>>
>> So, where the existing table is some_dbf_data:
>>
>> 1) CREATE TABLE new_some_dbf_data(...)
>> 2) Dump the DBF file into new_some_dbf_data
>> 3) In a transaction, rename/drop some_dbf_data and rename
>>    new_some_dbf_data to some_dbf_data
>>
>> Option 2 is what I do for a similar procedure:
>>
>> 1) Dump the DBF data into a holding table.
>> 2) Use SQL in function(s) to compare the old/new tables and make the
>>    appropriate adjustments. Doing the SQL in bulk is a lot faster
>>    than checking each row, or at least that is what I found. In any
>>    case, the way you are doing it involves 3.5 million inserts with
>>    a trigger action on each; that is bound to be slow :)
>> 3) This is automated in an external Python script.
>>
>> Option 3:
>>
>> Use dbf (https://pypi.python.org/pypi/dbf/0.88.16) and do the
>> comparisons in the DBF files outside Postgres, importing only what
>> has changed.
>>
>>> Thanks a lot for your help!
>>
>> --
>> Adrian Klaver
>> adrian.klaver@aklaver.com

--
Adrian Klaver
adrian.klaver@aklaver.com