Re: Using the database to validate data - Mailing list pgsql-general

From Jeff Janes
Subject Re: Using the database to validate data
Date
Msg-id CAMkU=1zh4nr3U5ggM90kGLmBSeR_vzRJpAtdOyioRws0AEqDRA@mail.gmail.com
Whole thread Raw
In response to Using the database to validate data  (JPLapham <lapham@jandr.org>)
List pgsql-general
On Thu, Jul 23, 2015 at 5:55 AM, JPLapham <lapham@jandr.org> wrote:
Hello,

I have an application that occasionally performs large batch inserts of user
hand-generated data. Input is a tab delimited file with typically hundreds
to a thousand lines of data.

Because the data is generated by hand, there are always many
transaction-stopping errors in a typical input run. For example, missing
datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
we chose PostgreSQL exactly because of these problems, because of the robust
transactional control, rollback on errors, etc.

My question is the following. I would like to *test* the data input for
integrity in such a way that I can create a report to the user informing
them of exactly where in their input file to correct the problems.

IDEA 1: My first attempt at this was to simply slurp the data into the
database, collect the errors, and then rollback. Of course (as I now know),
this doesn't work because after the first problem, the database reports,
"current transaction is aborted, commands ignored until end of transaction
block". This means that I can only report to the user the location of the
first problem, and then they run the data again, and keep looping through
the process until the data is good, a huge waste of time.

This is how I usually do it, until it become unbearable or an order comes down from on high to do it differently.  If the errors are due to people being sloppy, then it ought to be annoying for them to be sloppy.  Why make it convenient for them?  If the errors are more excusable than just sloppiness, or if the annoyance is more to you than to the people creating the errors, then you have to go on to other methods.
 

IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
This allows me to check for things like NOT NULL and data type issues, but
not violations of UNIQUE within the new data.

Issue a savepoint before each insert, and then issue a "release savepoint" if the insert succeeds or a "rollback savepoint" if it does not.  If you release a savepoint, remember that fact so that at the end you rollback the entire transaction instead of committing it.  I rarely actually resort to this.  It might miss some errors in which one failed row failed for multiple reasons, or where one row would have failed had another row not already failed for a different reason.
 

IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc
is not trivial. It seems ridiculous to do this when I have the *actual*
database available to test against!

I do this one a lot when I can't get away with method 1, and I don't see why it is a nightmare.  Writing queries against the existing database to see if the new proposed keys exist *is* trivial.  You write them in SQL, not a low level language.  

Testing for internal duplication within the new dataset is a bit harder, I usually do that in Perl with a hash.  There might be cases where Perl and PostgreSQL disagree about when two values are equal, but I've almost never run into them in practise.

Check constraints or character encoding issues or typing issues can be harder to deal with.  If those are likely to be a problem, create a temp or unlogged table with the same check constraints as the real table but without the unique or foreign key constraints and see if each row inserts.

Cheers,

Jeff

pgsql-general by date:

Previous
From: Scott Marlowe
Date:
Subject: Re: Delete rule does not prevent truncate
Next
From: Chris Withers
Date:
Subject: pgbench tps drop from 5000 to 37 going from localhost to a server 13ms away