Thread: error-tolerant COPY FROM
Hello everyone,

I'm building a PostgreSQL db which will have to get lots of data
from "the outside" (customers, that is). The db has lots of
constraints, and I'm sure that our customers will offer lots of
invalid information. We receive the information in CSV format. My
first thought was to read them into the database with COPY, but
"COPY stops operation at the first error."

What I need is an import where all valid lines from the CSV files
are read into the db, and I also get a logfile for all invalid
lines, stating the line number plus the pg error message so I can
see which constraint was violated.

I can't think of a direct, elegant solution for this, does anyone
have any suggestions? Thanks a lot!
On Fri, 4 Feb 2005 13:32:40 +0100 (CET), Joolz
<joolz@arbodienst-limburg.nl> wrote:
> What I need is an import where all valid lines from the CSV files
> are read into the db, and I also get a logfile for all invalid
> lines, stating the line number plus the pg error message so I can
> see which constraint was violated.

How about a Perl script that uses DBI with autocommit turned on and
loops over the file using (something like)

    while (my @line = split(',', scalar(<>))) { ... }

Perl tracks the line number of the current input file (including
STDIN) in $. (dollar-period). If you get a DBI error:

    warn "Input error at line $.\n";

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
Mike Rylander said:
> How about a Perl script that uses DBI with autocommit turned on and
> loops over the file using (something like)
> 'while (my @line = split(',', scalar(<>))) { ... }'?

Thanks Mike, someone else suggested the same (and helped :)). If it
has to be Perl, so be it, although I'm not a big fan. Do you think
this is possible in Python?
On Feb 4, 2005, at 21:32, Joolz wrote:
> What I need is an import where all valid lines from the CSV files
> are read into the db, and I also get a logfile for all invalid
> lines, stating the line number plus the pg error message so I can
> see which constraint was violated.

I don't know if it's elegant or not, but my approach to this
situation is to read the data into a temp table that does not have
such strict constraints. Once the data is in the database, I process
it further, moving the data that's valid into the appropriate table.
Then I see what's left, and what further processing I need to do to
"fix" the invalid data. Perhaps a similar strategy would work for
you.

Michael Glaesemann
grzm myrealbox com
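For concreteness, a minimal sketch of this staging-table approach in
Python, assuming the psycopg2 driver; the connection string, table
names, columns, and validity checks are all illustrative stand-ins:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # 1. Load everything into a loosely-typed staging table; nothing
    #    can violate a constraint here, so COPY won't abort.
    cur.execute("""
        CREATE TEMP TABLE staging (
            customer_id text,
            amount      text
        )
    """)
    with open("import.csv") as f:
        cur.copy_expert("COPY staging FROM STDIN WITH CSV", f)

    # 2. Move only the rows that will satisfy the real table's
    #    constraints; this WHERE clause stands in for your real checks.
    cur.execute("""
        INSERT INTO customers (customer_id, amount)
        SELECT customer_id::integer, amount::numeric
        FROM staging
        WHERE customer_id ~ '^[0-9]+$'
          AND amount ~ '^[0-9]+(\\.[0-9]+)?$'
    """)
    conn.commit()

    # 3. The staging rows that fail the WHERE clause are the "invalid
    #    data" left to inspect and fix afterwards.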
Michael Glaesemann said:
> I don't know if it's elegant or not, but my approach to this
> situation is to read the data into a temp table that does not have
> such strict constraints. Once the data is in the database, I
> process it further, moving the data that's valid into the
> appropriate table.

I'm afraid this is a bit too indirect IMHO. As I want to know the
line number on which an error occurs, I would have to traverse the
error-tolerant table with LIMIT 1 OFFSET N, and report N when an
error occurs, hoping that the row order is identical to the line
order in the CSV file.
I use a trigger on tables with foreign key references to either
ignore the inserted row or insert an appropriate matching row in the
referenced table, if it does not exist. In the function, I just
raise a notice that I am doing this. This is a simple example:

    create or replace function tgf_insert_gene_id() returns trigger as $$
    declare
        gene_id_chk integer;
    begin
        select into gene_id_chk gene_id from g_main
            where gene_id = NEW.gene_id;
        if (NOT FOUND) THEN
            INSERT into g_main (gene_id, name)
                values (NEW.gene_id, 'INSERTED FOR FOREIGN KEY CHECK');
            RAISE NOTICE 'Inserting a gene ID to meet foreign key requirements';
        END IF;
        return NEW;
    end;
    $$ language plpgsql;

Just make a trigger for the tables of interest (I have several
tables that reference gene_id in g_main) so they can all use the
same trigger function, something like:

    create trigger tg_insert_gene_id
        before insert on g_ug
        for each row execute procedure tgf_insert_gene_id();

This function simply looks up the proposed key (gene_id) in the
foreign key table. If it is NOT FOUND, it inserts the gene_id about
to be inserted into the foreign table (g_main) and then raises a
notice (which goes into the log if you have logging set up to log
notices). Then the function returns and, because we did the insert,
the foreign key check passes. You could just as easily choose NOT to
insert a record that doesn't meet the foreign key check, or even do
an insert into a separate table that collects all of the rows that
don't match a foreign key.

Hope this helps....
Sean

On Feb 4, 2005, at 7:32 AM, Joolz wrote:
> I can't think of a direct, elegant solution for this, does anyone
> have any suggestions? Thanks a lot!
[snip]
> I'm afraid this is a bit too indirect IMHO. As I want to know the
> line number on which an error occurs, I would have to traverse the
> error-tolerant table with LIMIT 1 OFFSET N, and report N when an
> error occurs.

So why don't you insert the line number too in the error-tolerant
table when you do the import? Then you will have the line number all
the time. Not that I understand what you will do with that line
number, once you already have all the data in the table...

Cheers,
Csaba.
Csaba Nagy said:
> So why don't you insert the line number too in the error-tolerant
> table when you do the import? Then you will have the line number
> all the time.

I need the line number to let the customer know which data were
refused, and the corresponding error message to let him know why.
Sean Davis said:
> I use a trigger on tables with foreign key references to either
> ignore the inserted row or insert an appropriate matching row in
> the referenced table, if it does not exist.

Thanks Sean, but in my situation I don't want the database to be so
versatile as to handle all the errors itself; this would create
unnecessary load during all operations on the tables, not only when
running the import. But I'm getting lots of great feedback from the
list, thanks everyone!
On Feb 4, 2005, at 8:30 AM, Joolz wrote:
> Thanks Sean, but in my situation I don't want the database to be so
> versatile as to handle all the errors itself; this would create
> unnecessary load during all operations on the tables, not only when
> running the import.

Just drop the trigger once the table is copied. In my application (a
hundred thousand to a million rows or so per table), using COPY
rather than DBI inserts is an order of magnitude or two faster, even
with the triggers in place.

Sean
On Feb 4, 2005, at 8:27 AM, Joolz wrote:
> I need the line number to let the customer know which data were
> refused, and the corresponding error message to let him know why.

Not to belabor the point, but using a trigger with a RAISED notice
gives the "context" in which the error is raised, like:

    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 58245: "284462 Hs.128382"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 60235: "285936 Hs.174768"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 66339: "347813 Hs.406568"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 114352: "401610 Hs.454283"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 134275: "75875 Mm.300210"
    NOTICE:  Inserting a gene ID to meet foreign key requirements
    CONTEXT:  COPY g_ug, line 165798: "441201 Hs.535027"

Sean
On Fri, 4 Feb 2005 13:56:23 +0100 (CET), Joolz
<joolz@arbodienst-limburg.nl> wrote:
> If it has to be Perl, so be it, although I'm not a big fan. Do you
> think this is possible in Python?

Sure. I just suggested Perl since that's my QnD tool of choice.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
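A rough Python equivalent of that Perl/DBI loop, assuming the
psycopg2 driver; the connection string, target table, and column
count are illustrative, and the csv module stands in for Perl's
split(',') so quoted fields are handled properly:

    import csv
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    conn.autocommit = True  # one failed INSERT shouldn't abort the rest

    cur = conn.cursor()
    with open("import.csv") as f, open("rejects.log", "w") as log:
        for lineno, row in enumerate(csv.reader(f), start=1):
            # Structurally broken lines won't even reach the database.
            if len(row) != 2:
                log.write("line %d: expected 2 fields, got %d\n"
                          % (lineno, len(row)))
                continue
            try:
                cur.execute(
                    "INSERT INTO customers (customer_id, amount)"
                    " VALUES (%s, %s)",
                    row,
                )
            except psycopg2.Error as e:
                # Log the line number plus the backend's error message
                # (the violated constraint), then carry on.
                log.write("line %d: %s\n" % (lineno, str(e).strip()))

The log then states exactly which lines were refused and why, at the
cost of one round trip per row.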
joolz@arbodienst-limburg.nl ("Joolz") writes:
> What I need is an import where all valid lines from the CSV files
> are read into the db, and I also get a logfile for all invalid
> lines, stating the line number plus the pg error message so I can
> see which constraint was violated.

First, load it into a table that has really, really _weak_
constraints, but which also adds a sequence column at the end that
gets automagically populated.

- The sequence gets you the "line numbers" you want.

- You then run queries that separate the "good" from the "crud."

This might ultimately turn into having 9 tables:

- The original data, that you NEVER touch again, as the "pristine"
  form to look at if you find a problem;

- 3 tables that collect entries with 3 different sorts of problems
  that mandate discarding the data [well, sending it back for
  retreading...]

- 4 tables that contain entries that could get rewritten in some
  automatic fashion based on 4 policies you discovered

- 1 "final results" table that aggregates those 4 tables along with
  the entries that were OK, which are fine to use as the resulting
  data import that can safely go into your application.

I'm making up numbers, but you hopefully get the idea...

--
"cbbrowne","@","ca.afilias.info" <http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
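A minimal sketch of the sequence-column idea, extending the earlier
staging sketch; again Python with psycopg2, and again the table
name, columns, and the WHERE clause are illustrative stand-ins for
the real validity checks:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # Weakly-constrained table with a sequence column at the end;
    # COPY inserts rows in input order, so line_no records the
    # original CSV line numbers.
    cur.execute("""
        CREATE TABLE raw_import (
            customer_id text,
            amount      text,
            line_no     serial
        )
    """)
    with open("import.csv") as f:
        # List only the data columns so line_no fills itself
        # from its sequence.
        cur.copy_expert(
            "COPY raw_import (customer_id, amount) FROM STDIN WITH CSV",
            f)
    conn.commit()

    # Separate the "good" from the "crud", reporting line numbers
    # back to the customer.
    cur.execute("""
        SELECT line_no, customer_id, amount
        FROM raw_import
        WHERE customer_id !~ '^[0-9]+$'
        ORDER BY line_no
    """)
    for line_no, customer_id, amount in cur.fetchall():
        print("line %d rejected: %s, %s" % (line_no, customer_id, amount))

Unlike the per-row INSERT loop, this keeps COPY's bulk-load speed
while still tying every reject to its original line number.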