Thread: error-tolerant COPY FROM
Hello everyone,

I'm building a PostgreSQL db which will have to get lots of data
from "the outside" (customers, that is). The db has lots of
constraints, and I'm sure that our customers will offer lots of
invalid information. We receive the information in CSV format. My
first thought was to read them into the database with COPY, but
"COPY stops operation at the first error."

What I need is an import where all valid lines from the CSV files
are read into the db, and I also get a logfile for all invalid
lines, stating the line number plus the pg error message so I can
see which constraint was violated.

I can't think of a direct, elegant solution for this, does anyone
have any suggestions? Thanks a lot!
On Fri, 4 Feb 2005 13:32:40 +0100 (CET), Joolz
<joolz@arbodienst-limburg.nl> wrote:
> What I need is an import where all valid lines from the CSV files
> are read into the db, and I also get a logfile for all invalid
> lines, stating the line number plus the pg error message so I can
> see which constraint was violated.

How about a Perl script that uses DBI with autocommit turned on and
loops over the file using (something like)

    while (my @line = split(',', scalar(<>))) { ... }

Perl tracks the line number of the current input file (including
STDIN) in $. (dollar-period). If you get a DBI error:

    warn "Input error at line $.\n";

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
Mike Rylander said:
> How about a Perl script that uses DBI with autocommit turned on and
> loops over the file using (something like)
> 'while (my @line = split(',', scalar(<>))) { ... }'?

Thanks Mike, someone else suggested the same (and helped :)). If it
has to be Perl, so be it, although I'm not a big fan. Do you think
this is possible in Python?
On Feb 4, 2005, at 21:32, Joolz wrote:
> What I need is an import where all valid lines from the CSV files
> are read into the db, and I also get a logfile for all invalid
> lines, stating the line number plus the pg error message so I can
> see which constraint was violated.

I don't know if it's elegant or not, but my approach to this
situation is to read the data into a temp table that does not have
such strict constraints. Once the data is in the database, I process
it further, moving the data that's valid into the appropriate table.
Then I see what's left, and what further processing I need to do to
"fix" the invalid data. Perhaps a similar strategy would work for
you.

Michael Glaesemann
grzm myrealbox com
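For concreteness, a minimal sketch of this staging-table approach in
Python, assuming the psycopg2 driver; the connection string, table
names, columns, and validity checks are all illustrative stand-ins:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # 1. Load everything into a loosely-typed staging table; nothing
    #    can violate a constraint here, so COPY won't abort.
    cur.execute("""
        CREATE TEMP TABLE staging (
            customer_id text,
            amount      text
        )
    """)
    with open("import.csv") as f:
        cur.copy_expert("COPY staging FROM STDIN WITH CSV", f)

    # 2. Move only the rows that will satisfy the real table's
    #    constraints; this WHERE clause stands in for your real checks.
    cur.execute("""
        INSERT INTO customers (customer_id, amount)
        SELECT customer_id::integer, amount::numeric
        FROM staging
        WHERE customer_id ~ '^[0-9]+$'
          AND amount ~ '^[0-9]+(\\.[0-9]+)?$'
    """)
    conn.commit()

    # 3. The staging rows that fail the WHERE clause are the "invalid
    #    data" left to inspect and fix afterwards.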
Michael Glaesemann said:
> I don't know if it's elegant or not, but my approach to this
> situation is to read the data into a temp table that does not have
> such strict constraints. Once the data is in the database, I
> process it further, moving the data that's valid into the
> appropriate table.

I'm afraid this is a bit too indirect IMHO. As I want to know the
line number on which an error occurs, I would have to traverse the
error-tolerant table with LIMIT 1 OFFSET N, and report N when an
error occurs, hoping that the row order is identical to the line
order in the CSV file.
I use a trigger on tables with foreign key references to either
ignore the inserted row or insert an appropriate matching row in the
referenced table, if it does not exist. In the function, I just
raise a notice that I am doing this. This is a simple example:

    create or replace function tgf_insert_gene_id() returns trigger as $$
    declare
        gene_id_chk integer;
    begin
        select into gene_id_chk gene_id from g_main
            where gene_id = NEW.gene_id;
        if (NOT FOUND) THEN
            INSERT into g_main (gene_id, name)
                values (NEW.gene_id, 'INSERTED FOR FOREIGN KEY CHECK');
            RAISE NOTICE 'Inserting a gene ID to meet foreign key requirements';
        END IF;
        return NEW;
    end;
    $$ language plpgsql;

Just make a trigger for the tables of interest (I have several
tables that reference gene_id in g_main) so they can all use the
same trigger function, something like:

    create trigger tg_insert_gene_id
        before insert on g_ug
        for each row execute procedure tgf_insert_gene_id();

This function simply looks up the proposed key (gene_id) in the
foreign key table. If it is NOT FOUND, it inserts the gene_id about
to be inserted into the foreign table (g_main) and then raises a
notice (which goes into the log if you have logging set up to log
notices). Then the function returns and, because we did the insert,
the foreign key check passes. You could just as easily choose NOT to
insert a record that doesn't meet the foreign key check, or even do
an insert into a separate table that collects all of the rows that
don't match a foreign key.

Hope this helps....
Sean

On Feb 4, 2005, at 7:32 AM, Joolz wrote:
> I can't think of a direct, elegant solution for this, does anyone
> have any suggestions? Thanks a lot!
[snip]
> I'm afraid this is a bit too indirect IMHO. As I want to know the
> line number on which an error occurs, I would have to traverse the
> error-tolerant table with LIMIT 1 OFFSET N, and report N when an
> error occurs.

So why don't you insert the line number too in the error-tolerant
table when you do the import? Then you will have the line number all
the time. Not that I understand what you will do with that line
number, once you already have all the data in the table...

Cheers,
Csaba.
Csaba Nagy said:
> So why don't you insert the line number too in the error-tolerant
> table when you do the import? Then you will have the line number
> all the time.

I need the line number to let the customer know which data were
refused, and the corresponding error message to let him know why.
Sean Davis said:
> I use a trigger on tables with foreign key references to either
> ignore the inserted row or insert an appropriate matching row in
> the referenced table, if it does not exist.

Thanks Sean, but in my situation I don't want the database to be so
versatile as to handle all the errors itself; this would create
unnecessary load during all operations on the tables, not only when
running the import. But I'm getting lots of great feedback from the
list, thanks everyone!
On Feb 4, 2005, at 8:30 AM, Joolz wrote:
> Thanks Sean, but in my situation I don't want the database to be so
> versatile as to handle all the errors itself; this would create
> unnecessary load during all operations on the tables, not only when
> running the import.

Just drop the trigger once the table is copied. In my application (a
hundred thousand to a million rows or so per table), using COPY
rather than DBI inserts is an order of magnitude or two faster, even
with the triggers in place.

Sean
On Feb 4, 2005, at 8:27 AM, Joolz wrote:
> I need the line number to let the customer know which data were
> refused, and the corresponding error message to let him know why.

Not to belabor the point, but using a trigger with a RAISED notice
gives the "context" in which the error is raised, like:

    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 58245: "284462 Hs.128382"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 60235: "285936 Hs.174768"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 66339: "347813 Hs.406568"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 114352: "401610 Hs.454283"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 134275: "75875 Mm.300210"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 165798: "441201 Hs.535027"

Sean
On Fri, 4 Feb 2005 13:56:23 +0100 (CET), Joolz
<joolz@arbodienst-limburg.nl> wrote:
> If it has to be Perl, so be it, although I'm not a big fan. Do you
> think this is possible in Python?

Sure. I just suggested Perl since that's my QnD tool of choice.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
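A rough Python equivalent of that Perl/DBI loop, assuming the
psycopg2 driver; the connection string, target table, and column
count are illustrative, and the csv module stands in for Perl's
split(',') so quoted fields are handled properly:

    import csv
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    conn.autocommit = True  # one failed INSERT shouldn't abort the rest

    cur = conn.cursor()
    with open("import.csv") as f, open("rejects.log", "w") as log:
        for lineno, row in enumerate(csv.reader(f), start=1):
            # Structurally broken lines won't even reach the database.
            if len(row) != 2:
                log.write("line %d: expected 2 fields, got %d\n"
                          % (lineno, len(row)))
                continue
            try:
                cur.execute(
                    "INSERT INTO customers (customer_id, amount)"
                    " VALUES (%s, %s)",
                    row,
                )
            except psycopg2.Error as e:
                # Log the line number plus the backend's error message
                # (the violated constraint), then carry on.
                log.write("line %d: %s\n" % (lineno, str(e).strip()))

The log then states exactly which lines were refused and why, at the
cost of one round trip per row.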
joolz@arbodienst-limburg.nl ("Joolz") writes:
> What I need is an import where all valid lines from the CSV files
> are read into the db, and I also get a logfile for all invalid
> lines, stating the line number plus the pg error message so I can
> see which constraint was violated.

First, load it into a table that has really, really _weak_
constraints, but which also adds a sequence column at the end that
gets automagically populated.

- The sequence gets you the "line numbers" you want.

- You then run queries that separate the "good" from the "crud."

This might ultimately turn into having 9 tables:

- The original data, that you NEVER touch again, as the "pristine"
  form to look at if you find a problem;

- 3 tables that collect entries with 3 different sorts of problems
  that mandate discarding the data [well, sending it back for
  retreading...]

- 4 tables that contain entries that could get rewritten in some
  automatic fashion based on 4 policies you discovered

- 1 "final results" table that aggregates those 4 tables along with
  the entries that were OK, which are fine to use as the resulting
  data import that can safely go into your application.

I'm making up numbers, but you hopefully get the idea...

--
"cbbrowne","@","ca.afilias.info" <http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
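A minimal sketch of the sequence-column idea, extending the earlier
staging sketch; again Python with psycopg2, and again the table
name, columns, and the WHERE clause are illustrative stand-ins for
the real validity checks:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # Weakly-constrained table with a sequence column at the end;
    # COPY inserts rows in input order, so line_no records the
    # original CSV line numbers.
    cur.execute("""
        CREATE TABLE raw_import (
            customer_id text,
            amount      text,
            line_no     serial
        )
    """)
    with open("import.csv") as f:
        # List only the data columns so line_no fills itself
        # from its sequence.
        cur.copy_expert(
            "COPY raw_import (customer_id, amount) FROM STDIN WITH CSV",
            f)
    conn.commit()

    # Separate the "good" from the "crud", reporting line numbers
    # back to the customer.
    cur.execute("""
        SELECT line_no, customer_id, amount
        FROM raw_import
        WHERE customer_id !~ '^[0-9]+$'
        ORDER BY line_no
    """)
    for line_no, customer_id, amount in cur.fetchall():
        print("line %d rejected: %s, %s" % (line_no, customer_id, amount))

Unlike the per-row INSERT loop, this keeps COPY's bulk-load speed
while still tying every reject to its original line number.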