Thread: Using the database to validate data

Using the database to validate data

From
JPLapham
Date:
Hello,

I have an application that occasionally performs large batch inserts of user
hand-generated data. Input is a tab delimited file with typically hundreds
to a thousand lines of data.

Because the data is generated by hand, there are always many
transaction-stopping errors in a typical input run. For example, missing
datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
we chose PostgreSQL exactly because of these problems, because of the robust
transactional control, rollback on errors, etc.

My question is the following. I would like to *test* the data input for
integrity in such a way that I can create a report to the user informing
them of exactly where in their input file to correct the problems.

IDEA 1: My first attempt at this was to simply slurp the data into the
database, collect the errors, and then rollback. Of course (as I now know),
this doesn't work because after the first problem, the database reports,
"current transaction is aborted, commands ignored until end of transaction
block". This means that I can only report to the user the location of the
first problem, and then they run the data again, and keep looping through
the process until the data is good, a huge waste of time.

IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
This allows me to check for things like NOT NULL and data type issues, but
not violations of UNIQUE within the new data.

IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc
is not trivial. It seems ridiculous to do this when I have the *actual*
database available to test against!

Has anyone dealt with this kind of issue before? What are your opinions on
best practice for this? Of course I do not want to actually COMMIT until the
data is perfect!

Thanks for your time!
-Jon





Re: Using the database to validate data

From
Adrian Klaver
Date:
On 07/23/2015 05:55 AM, JPLapham wrote:
> Hello,
>
> I have an application that occasionally performs large batch inserts of user
> hand-generated data. Input is a tab delimited file with typically hundreds
> to a thousand lines of data.
>
> Because the data is generated by hand, there are always many
> transaction-stopping errors in a typical input run. For example, missing
> datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
> mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
> we chose PostgreSQL exactly because of these problems, because of the robust
> transactional control, rollback on errors, etc.
>
> My question is the following. I would like to *test* the data input for
> integrity in such a way that I can create a report to the user informing
> them of exactly where in their input file to correct the problems.
>
> IDEA 1: My first attempt at this was to simply slurp the data into the
> database, collect the errors, and then rollback. Of course (as I now know),
> this doesn't work because after the first problem, the database reports,
> "current transaction is aborted, commands ignored until end of transaction
> block". This means that I can only report to the user the location of the
> first problem, and then they run the data again, and keep looping through
> the process until the data is good, a huge waste of time.
>
> IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
> This allows me to check for things like NOT NULL and data type issues, but
> not violations of UNIQUE within the new data.
>
> IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
> like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc
> is not trivial. It seems ridiculous to do this when I have the *actual*
> database available to test against!

That has already been done:

http://pgloader.io/


>
> Has anyone dealt with this kind of issue before? What are your opinions on
> best practice for this? Of course I do not want to actually COMMIT until the
> data is perfect!
>
> Thanks for your time!
> -Jon
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Using the database to validate data

From
Adrian Klaver
Date:
On 07/23/2015 12:04 PM, Jon Lapham wrote:
> On 07/23/2015 03:02 PM, Adrian Klaver wrote:
>> http://pgloader.io/
>
> Ok, thanks, I'll look into pgloader's data validation abilities.
>
> However, my naive understanding of pgloader is that it is used to
> quickly load data into a database, which is not what I am looking to do.
> I want to validate data integrity *before* putting it into the database.
> If there is a problem with any part of the data, I don't want any of it
> in the database.

I misunderstood; I thought you just wanted information on the rows that
did not get in. pgloader does this by putting the rejected data in
*.dat and the Postgres log of why it was rejected in *.log.

<Thinking out loud, not tested>

I could still see making use of this via --before <file_name>, where
file_name contains a CREATE TEMPORARY TABLE some_table script that
mimics the permanent table. pgloader would then load against the
temporary table, write out any errors and drop the table at the end. It
would not, however, put data into the permanent table even on complete
success. That would require some magic in AFTER LOAD EXECUTE that I
have not come up with yet :)

<Thinking out loud, not tested>
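For what it's worth, a minimal sketch of what such a --before file might contain (untested, and it assumes pgloader runs the load in the same session that executes the script; some_table stands in for the real permanent table):

-- Scratch copy of the target table, used only for validation.
-- LIKE ... INCLUDING ALL copies column types, NOT NULL, CHECK
-- constraints, defaults and indexes (so UNIQUE is enforced), but it
-- does NOT copy foreign-key constraints, so FK problems would still
-- slip through here.
CREATE TEMPORARY TABLE some_table_check
    (LIKE some_table INCLUDING ALL);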
>
> -Jon
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Using the database to validate data

From
Tim Clarke
Date:
Shouldn't be too difficult to import those new rows into one table,
write a procedure that inserts them into the real table one by one and
logs the validation failure, if any - committing good rows and rolling
back bad. In fact, you could then write the failures to a third table
with completely relaxed (or no) validation.
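A rough PL/pgSQL sketch of that idea (untested; the staging table, the real table and the error-log table are made-up names, just to show the shape of it):

CREATE OR REPLACE FUNCTION load_staging_rows() RETURNS void AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM staging ORDER BY line_no LOOP
        BEGIN
            INSERT INTO real_table (col_a, col_b)
            VALUES (r.col_a::integer, r.col_b);
        EXCEPTION WHEN OTHERS THEN
            -- the EXCEPTION block acts as a per-row savepoint, so one
            -- bad row does not abort the rest of the run
            INSERT INTO load_errors (line_no, message)
            VALUES (r.line_no, SQLERRM);
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;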

Tim Clarke

On 23/07/15 21:48, Adrian Klaver wrote:
> On 07/23/2015 12:04 PM, Jon Lapham wrote:
>> On 07/23/2015 03:02 PM, Adrian Klaver wrote:
>>> http://pgloader.io/
>>
>> Ok, thanks, I'll look into pgloader's data validation abilities.
>>
>> However, my naive understanding of pgloader is that it is used to
>> quickly load data into a database, which is not what I am looking to do.
>> I want to validate data integrity *before* putting it into the database.
>> If there is a problem with any part of the data, I don't want any of it
>> in the database.
>
> I misunderstood, I thought you just wanted  information on the rows
> that did not get in. pgloader does this by including the rejected data
> in *.dat and the Postgres log of why it was rejected in *.log.
>
> <Thinking out loud, not tested>
>
> I could still see making use of this by using the --before
> <file_name>, where file_name contains a CREATE TEMPORARY TABLE
> some_table script that mimics the permanent table. Then it would load
> against the temporary table, write out any errors and then drop the
> table at the end. This would not put data into the permanent table on
> complete success though. That would require some magic in AFTER LOAD
> EXECUTE that I have not come up with yet:)
>
> <Thinking out loud, not tested>
>>
>> -Jon
>>
>
>



Re: Using the database to validate data

From
林士博
Date:
If I am following correctly, you can do it in your application as follows.
1. Begin a transaction.
2. Insert each row, catching any database exception and saving the
exception message and whatever other information you need to an array.
3. At the end, the array holds all the information about the bad data,
   if there is any, and you can then decide whether to roll back or commit.

By the way, this is about programming rather than PostgreSQL.

2015-07-24 5:58 GMT+09:00 Tim Clarke <tim.clarke@manifest.co.uk>:
Shouldn't be too difficult to import those new rows into one table,
write a procedure that inserts them into the real table one by one and
logs the validation failure if any - committing good rows and rolling
back bad. In fact if you could then write the failures to a third table
with a completely relaxed (or no) validation?

Tim Clarke

On 23/07/15 21:48, Adrian Klaver wrote:
> On 07/23/2015 12:04 PM, Jon Lapham wrote:
>> On 07/23/2015 03:02 PM, Adrian Klaver wrote:
>>> http://pgloader.io/
>>
>> Ok, thanks, I'll look into pgloader's data validation abilities.
>>
>> However, my naive understanding of pgloader is that it is used to
>> quickly load data into a database, which is not what I am looking to do.
>> I want to validate data integrity *before* putting it into the database.
>> If there is a problem with any part of the data, I don't want any of it
>> in the database.
>
> I misunderstood, I thought you just wanted  information on the rows
> that did not get in. pgloader does this by including the rejected data
> in *.dat and the Postgres log of why it was rejected in *.log.
>
> <Thinking out loud, not tested>
>
> I could still see making use of this by using the --before
> <file_name>, where file_name contains a CREATE TEMPORARY TABLE
> some_table script that mimics the permanent table. Then it would load
> against the temporary table, write out any errors and then drop the
> table at the end. This would not put data into the permanent table on
> complete success though. That would require some magic in AFTER LOAD
> EXECUTE that I have not come up with yet:)
>
> <Thinking out loud, not tested>
>>
>> -Jon
>>
>
>







Re: Using the database to validate data

From
Adrian Klaver
Date:
On 07/23/2015 08:20 PM, 林士博 wrote:
> If I am following correctly, you can do it in your application as follows.
> 1.begin transaction
> 2.insert each data. Catch db exception,
> and save exception message and other information you need to array.
> 3.in the end ,you can get all the information about the
> wrong data in array if there is any.
>     and then you can decide whether it is need to rollback or to commit.
> By the way, this is about programming but not postgresql.

Not sure I am following; the OP was asking the best way to catch any
errors Postgres might throw, using built-in database features as much as
possible. That seems to be Postgres related.

>
> 2015-07-24 5:58 GMT+09:00 Tim Clarke <tim.clarke@manifest.co.uk
> <mailto:tim.clarke@manifest.co.uk>>:
>
>     Shouldn't be too difficult to import those new rows into one table,
>     write a procedure that inserts them into the real table one by one and
>     logs the validation failure if any - committing good rows and rolling
>     back bad. In fact if you could then write the failures to a third table
>     with a completely relaxed (or no) validation?
>
>     Tim Clarke
>
>     On 23/07/15 21:48, Adrian Klaver wrote:
>      > On 07/23/2015 12:04 PM, Jon Lapham wrote:
>      >> On 07/23/2015 03:02 PM, Adrian Klaver wrote:
>      >>> http://pgloader.io/
>      >>
>      >> Ok, thanks, I'll look into pgloader's data validation abilities.
>      >>
>      >> However, my naive understanding of pgloader is that it is used to
>      >> quickly load data into a database, which is not what I am
>     looking to do.
>      >> I want to validate data integrity *before* putting it into the
>     database.
>      >> If there is a problem with any part of the data, I don't want
>     any of it
>      >> in the database.
>      >
>      > I misunderstood, I thought you just wanted  information on the rows
>      > that did not get in. pgloader does this by including the rejected
>     data
>      > in *.dat and the Postgres log of why it was rejected in *.log.
>      >
>      > <Thinking out loud, not tested>
>      >
>      > I could still see making use of this by using the --before
>      > <file_name>, where file_name contains a CREATE TEMPORARY TABLE
>      > some_table script that mimics the permanent table. Then it would load
>      > against the temporary table, write out any errors and then drop the
>      > table at the end. This would not put data into the permanent table on
>      > complete success though. That would require some magic in AFTER LOAD
>      > EXECUTE that I have not come up with yet:)
>      >
>      > <Thinking out loud, not tested>
>      >>
>      >> -Jon
>      >>
>      >
>      >
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Using the database to validate data

From
林士博
Date:
Yes, you are right.
Doing it in a PostgreSQL procedure is faster than doing it in the application.

Re: Using the database to validate data

From
Zdeněk Bělehrádek
Date:
What about creating a SAVEPOINT before each INSERT, and if the INSERT returns
an error, then ROLLBACK TO SAVEPOINT? This way you will have all the
insertable data in your table, and you can still ROLLBACK the whole
transaction, or COMMIT it if there were no errors.

It will probably be quite slow, but if you have only thousands of lines, it
should be fast enough for your use case IMHO.
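In plain SQL the per-row pattern looks something like this (a sketch only; the table and values are hypothetical, and the client has to check the status of each INSERT to decide between RELEASE and ROLLBACK TO):

BEGIN;

SAVEPOINT row_1;
INSERT INTO target (id, name) VALUES (1, 'good row');   -- succeeds
RELEASE SAVEPOINT row_1;

SAVEPOINT row_2;
INSERT INTO target (id, name) VALUES (1, 'duplicate');  -- fails on UNIQUE (id)
ROLLBACK TO SAVEPOINT row_2;   -- clears the error, transaction stays usable
RELEASE SAVEPOINT row_2;

-- ...one savepoint per input line, remembering which lines failed...

ROLLBACK;    -- if anything failed
-- COMMIT;   -- only if the whole file went in cleanly

When the statements are replayed through psql, \set ON_ERROR_ROLLBACK on makes psql do this savepoint-per-statement dance automatically.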

-- Zdeněk Bělehrádek

> Hello,
>
> I have an application that occasionally performs large batch inserts of user
> hand-generated data. Input is a tab delimited file with typically hundreds
> to a thousand lines of data.
>
> Because the data is generated by hand, there are always many
> transaction-stopping errors in a typical input run. For example, missing
> datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
> mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
> we chose PostgreSQL exactly because of these problems, because of the robust
> transactional control, rollback on errors, etc.
>
> My question is the following. I would like to *test* the data input for
> integrity in such a way that I can create a report to the user informing
> them of exactly where in their input file to correct the problems.
>
> IDEA 1: My first attempt at this was to simply slurp the data into the
> database, collect the errors, and then rollback. Of course (as I now know),
> this doesn't work because after the first problem, the database reports,
> "current transaction is aborted, commands ignored until end of transaction
> block". This means that I can only report to the user the location of the
> first problem, and then they run the data again, and keep looping through
> the process until the data is good, a huge waste of time.
>
> IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
> This allows me to check for things like NOT NULL and data type issues, but
> not violations of UNIQUE within the new data.
>
> IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
> like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc
> is not trivial. It seems ridiculous to do this when I have the *actual*
> database available to test against!
>
> Has anyone dealt with this kind of issue before? What are your opinions on
> best practice for this? Of course I do not want to actually COMMIT until the
> data is perfect!
>
> Thanks for your time!
> -Jon
>
>
>



Re: Using the database to validate data

From
Jeff Janes
Date:
On Thu, Jul 23, 2015 at 5:55 AM, JPLapham <lapham@jandr.org> wrote:
Hello,

I have an application that occasionally performs large batch inserts of user
hand-generated data. Input is a tab delimited file with typically hundreds
to a thousand lines of data.

Because the data is generated by hand, there are always many
transaction-stopping errors in a typical input run. For example, missing
datum in a NOT NULL column, a duplicate value in a UNIQUE column, data type
mismatch, FOREIGN KEY reference to something non-existing, etc. Of course,
we chose PostgreSQL exactly because of these problems, because of the robust
transactional control, rollback on errors, etc.

My question is the following. I would like to *test* the data input for
integrity in such a way that I can create a report to the user informing
them of exactly where in their input file to correct the problems.

IDEA 1: My first attempt at this was to simply slurp the data into the
database, collect the errors, and then rollback. Of course (as I now know),
this doesn't work because after the first problem, the database reports,
"current transaction is aborted, commands ignored until end of transaction
block". This means that I can only report to the user the location of the
first problem, and then they run the data again, and keep looping through
the process until the data is good, a huge waste of time.

This is how I usually do it, until it becomes unbearable or an order comes down from on high to do it differently.  If the errors are due to people being sloppy, then it ought to be annoying for them to be sloppy.  Why make it convenient for them?  If the errors are more excusable than just sloppiness, or if the annoyance is more to you than to the people creating the errors, then you have to go on to other methods.
 

IDEA 2: My second idea on how to do this was to ROLLBACK after each INSERT.
This allows me to check for things like NOT NULL and data type issues, but
not violations of UNIQUE within the new data.

Issue a savepoint before each insert, and then issue a "release savepoint" if the insert succeeds or a "rollback to savepoint" if it does not.  If you had to roll back to any savepoint, remember that fact so that at the end you roll back the entire transaction instead of committing it.  I rarely actually resort to this.  It might miss some errors in which one failed row failed for multiple reasons, or where one row would have failed had another row not already failed for a different reason.
 

IDEA 3: Write my own data pre-conditioner. Ugh, what a nightmare, I feel
like I'm writing my own database! Checking for FKEY constraints, UNIQUE, etc
is not trivial. It seems ridiculous to do this when I have the *actual*
database available to test against!

I do this one a lot when I can't get away with method 1, and I don't see why it is a nightmare.  Writing queries against the existing database to see if the new proposed keys exist *is* trivial.  You write them in SQL, not a low-level language.

Testing for internal duplication within the new dataset is a bit harder; I usually do that in Perl with a hash.  There might be cases where Perl and PostgreSQL disagree about when two values are equal, but I've almost never run into them in practice.

Check constraints or character encoding issues or typing issues can be harder to deal with.  If those are likely to be a problem, create a temp or unlogged table with the same check constraints as the real table but without the unique or foreign key constraints and see if each row inserts.
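The pre-checks against the live database are ordinary set-based queries; for example, against a hypothetical staging table and target tables (names made up purely to show the shape of the checks):

-- rows whose foreign key has no match in the referenced table
SELECT s.line_no, s.customer_code
FROM   staging s
LEFT   JOIN customer c ON c.code = s.customer_code
WHERE  c.code IS NULL;

-- duplicates within the new data itself (would violate UNIQUE on load)
SELECT s.order_number, count(*)
FROM   staging s
GROUP  BY s.order_number
HAVING count(*) > 1;

-- values already present in the target table (would also violate UNIQUE)
SELECT s.line_no, s.order_number
FROM   staging s
JOIN   orders o ON o.order_number = s.order_number;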

Cheers,

Jeff

Re: Using the database to validate data

From
Jeff Janes
Date:
On Fri, Jul 24, 2015 at 5:17 AM, Zdeněk Bělehrádek <zdenek.belehradek@superhosting.cz> wrote:
What about creating a SAVEPOINT before each INSERT, and if the INSERT returns
an error, then ROLLBACK TO SAVEPOINT?

Make sure you release the savepoint if there was no error.  Otherwise you will quickly run out of memory.

Also, if there was an error, then after rolling back to the savepoint either release it, or refrain from starting a new one at the beginning of the next insert.

Cheers,

Jeff 

Re: Using the database to validate data

From
Jon Lapham
Date:
On 07/23/2015 03:02 PM, Adrian Klaver wrote:
> http://pgloader.io/

Ok, thanks, I'll look into pgloader's data validation abilities.

However, my naive understanding of pgloader is that it is used to
quickly load data into a database, which is not what I am looking to do.
I want to validate data integrity *before* putting it into the database.
If there is a problem with any part of the data, I don't want any of it
in the database.

-Jon

--
-**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*---
  Jon Lapham  <lapham@jandr.org>                Rio de Janeiro, Brasil
  Personal: http://www.jandr.org/
***-*--*----*-------*------------*--------------------*---------------



Re: Using the database to validate data

From
JPLapham
Date:
Tim Clarke wrote
> Shouldn't be too difficult to import those new rows into one table,
> write a procedure that inserts them into the real table one by one and
> logs the validation failure if any - committing good rows and rolling
> back bad. In fact if you could then write the failures to a third table
> with a completely relaxed (or no) validation?

Tim-

Thanks for your response. Yes, you are right, it shouldn't be too difficult,
and in fact I have already implemented basically what you suggest, see "Idea
2" from my original message.

The problem with this approach is that it fails to find violations such as
UNIQUE (there are probably others) that occur *within* the input data: the
violation only surfaces when the later duplicate row is inserted, yet the
mistake may actually be in the earlier row, not the later one.

I want my users to fix all the problems with their data and then load it in
an "all or none" fashion.

-Jon







Re: Using the database to validate data

From
rob stone
Date:
On Thu, 2015-07-23 at 15:34 -0700, JPLapham wrote:
> Tim Clarke wrote
> > Shouldn't be too difficult to import those new rows into one table,
> > write a procedure that inserts them into the real table one by one
> > and
> > logs the validation failure if any - committing good rows and
> > rolling
> > back bad. In fact if you could then write the failures to a third
> > table
> > with a completely relaxed (or no) validation?
>
> Tim-
>
> Thanks for your response. Yes, you are right, it shouldn't be too
> difficult,
> and in fact I have already implemented basically what you suggest,
> see "Idea
> 2" from my original message.
>
> The problem with this approach is that it fails to find violations
> such as
> UNIQUE (there are probably others) from within the input data until
> after
> the first has been committed to the database. But, the error may have
> been
> with that earlier row, not the later.
>
> I want my users to fix all the problems with their data and then load
> it in
> an "all or none" fashion.
>
> -Jon
>
If you have multiple users loading (possibly) multiple files and
(possibly) concurrently, then the only solution is to write some code
to process the data.
You also need to consider load sequence. If user A creates a file that
contains data that will end up creating a new primary key and a file
from user B refers to that, then user B needs to wait until user A's
file has been processed successfully.
Without knowing all the details I can envisage a scenario where data
being loaded could reference "good" data already in the DB as well as
referencing data that exists within that file load, possibly giving a
rejection or an update of the "good" data.

My 2 cents worth!

Cheers,
Rob (in Floripa)



Re: Using the database to validate data

From
Allan Kamau
Date:
To add to what others have said: I would use a bash script (and awk) to prefix each record of the raw CSV file with a dataset name, the name of the file, a timestamp and a serial number, and write the newly generated data to a new file. In this bash script, the values of the dataset name, file name and timestamp would come from bash variables declared and initialized at the start, ensuring that all the records of a given file carry the same values for these three fields.
The dataset name identifies the dataset to which these data belong once loaded into the database. The file name indicates the file from which the data were obtained. The timestamp is mainly used for versioning purposes. The sequential numbers (one unique number per row) are for auditing purposes.

I would then create a staging schema inside the PostgreSQL database into which you intend to load the data. In this staging schema I would create a table containing all the fields present in the newly generated CSV file, using the TEXT data type (with no NOT NULL constraints) for all these fields, with the exception of the field holding the serial number, which would be INTEGER.

Now load the data using COPY staging_schema.sometable(field1, field2, ...) FROM STDIN WITH (FORMAT csv, ...)



Now that we have the data loaded into the "staging_schema.sometable" table, we can create another table ("staging_schema.sometable_good"), also in the staging schema, with the same structure as the previous table but this time with all the constraints as well as the "correct" data types you plan to use. Create yet another table identical to the first one and call it something like "staging_schema.sometable_bad"; to this table we shall write the records that fail insertion into the "staging_schema.sometable_good" table.

Create a PL/pgSQL script (with exception handling) to read the staging table one record at a time (perhaps using a CURSOR) and populate the "staging_schema.sometable_good" table with the data you have just read from "staging_schema.sometable". On insert failure, write the record into the "staging_schema.sometable_bad" table.

After processing these data, check whether the "staging_schema.sometable_bad" table has any records matching the specific dataset name, file name and timestamp (the first three fields). If any records are found, have a look at them to determine why the insertions into the "staging_schema.sometable_good" table failed.

If no matching records are found in the "staging_schema.sometable_bad" table, write to the final table ("schema.sometable") in the main schema of the same database. This table should have the same structure as "staging_schema.sometable_good". Simply insert into it the records returned by a query on "staging_schema.sometable_good" filtered on the given dataset name, file name and timestamp (the first three fields).


Allan.



On Sat, Jul 25, 2015 at 4:19 PM, rob stone <floriparob@gmail.com> wrote:
On Thu, 2015-07-23 at 15:34 -0700, JPLapham wrote:
> Tim Clarke wrote
> > Shouldn't be too difficult to import those new rows into one table,
> > write a procedure that inserts them into the real table one by one
> > and
> > logs the validation failure if any - committing good rows and
> > rolling
> > back bad. In fact if you could then write the failures to a third
> > table
> > with a completely relaxed (or no) validation?
>
> Tim-
>
> Thanks for your response. Yes, you are right, it shouldn't be too
> difficult,
> and in fact I have already implemented basically what you suggest,
> see "Idea
> 2" from my original message.
>
> The problem with this approach is that it fails to find violations
> such as
> UNIQUE (there are probably others) from within the input data until
> after
> the first has been committed to the database. But, the error may have
> been
> with that earlier row, not the later.
>
> I want my users to fix all the problems with their data and then load
> it in
> an "all or none" fashion.
>
> -Jon
>
If you have multiple users loading (possibly) multiple files and
(possibly) concurrently, then the only solution is to write some code
to process the data.
You also need to consider load sequence. If user A creates a file that
contains data that will end up creating a new primary key and a file
from user B refers to that, then user B needs to wait until user A's
file has been processed successfully.
Without knowing all the details I can envisage a scenario where data
being loaded could reference "good" data already in the DB as well as
referencing data that exists within that file load, possibly giving a
rejection or an update of the "good" data.

My 2 cents worth!

Cheers,
Rob (in Floripa)




Re: Using the database to validate data

From
JPLapham
Date:
林士博 wrote
> If I am following correctly, you can do it in your application as follows.
> 1.begin transaction
> 2.insert each data. Catch db exception,
> and save exception message and other information you need to array.
> 3.in the end ,you can get all the information about the wrong data in
> array
> if there is any.
>    and then you can decide whether it is need to rollback or to commit.

Yes, I agree that I could do that, which I believe is my "IDEA 1" from my
original message. This method will naturally work, but it is a very slow
iterative process because you can only catch the *first* error, after which
new INSERTs are not allowed. If you have a data input with, say, 1000 records
and there are 50 errors, it would require 50 iterations of fixing the input
data and running it again to find them all.


林士博 wrote
> By the way, this is about programming but not postgresql.

I was hoping that there would be a way to have PostgreSQL run in a mode
where it allows INSERTs within a transaction even after an error. Naturally,
once an error has occurred, COMMIT would not be allowed at the end of
the transaction block.

But this way you could collect all the error information in one pass.
Seemed PostgreSQL related to me.  :)





Re: Using the database to validate data

From
JPLapham
Date:
Zdeněk Bělehrádek wrote
> What about creating a SAVEPOINT before each INSERT, and if the INSERT
> returns
> an error, then ROLLBACK TO SAVEPOINT? This way you will have all the
> insertable data in your table, and you can still ROLLBACK the whole
> transaction, or COMMIT it if there were no errors.
>
> It will probably be quite slow, but if you have only thousands of lines,
> it
> should be fast enough for your usecase IMHO.
>
> -- Zdeněk Bělehrádek

Hmmm, interesting. Thanks, if that works, it would be exactly what I'm
looking for!

You are right, speed is not an issue.

-Jon





Re: Using the database to validate data

From
Adam Brusselback
Date:
A little late to the party, but I'll share how I do my data imports / validation for anyone interested.

I have a bunch of data that comes in from various sources, and it isn't always guaranteed to be in the correct format, have the right foreign keys, or even the right data types.

I have a staging table that is in the format of the feed I have coming in, with all columns text and no constraints at all on the data columns. Example:
CREATE TABLE import_sale
(
  client_id uuid NOT NULL,
  row_id uuid NOT NULL DEFAULT gen_random_uuid(),
  row_date timestamp with time zone NOT NULL DEFAULT now(),
  file_id uuid NOT NULL,
  sale_number character varying,
  company_number character varying,
  invoice_number character varying,
  invoice_date character varying,
  order_date character varying,
  ship_date character varying,
  sale_date character varying,
  product_number character varying,
  quantity character varying,
  quantity_uom character varying,
  price character varying,
  reduction character varying,
  direct_indicator character varying,
  redistributor_company_number character varying,
  freight numeric,
  processed_ind boolean DEFAULT false,
  CONSTRAINT import_sales_client_id_fkey FOREIGN KEY (client_id)
      REFERENCES client (client_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT import_sales_file_id_fkey FOREIGN KEY (file_id)
      REFERENCES import_file (file_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT import_sales_row_id_unique UNIQUE (row_id)
);

I use a Talend package, or COPY, to get the data into this table. However you want to do that is up to you.

I have a final table that I want all this data to eventually get to once there are no issues with it.  Example:
CREATE TABLE sale
(
  sale_id uuid NOT NULL DEFAULT gen_random_uuid(),
  client_id uuid NOT NULL,
  source_row_id uuid NOT NULL,
  sale_number character varying NOT NULL,
  company_id uuid NOT NULL,
  invoice_number character varying NOT NULL,
  invoice_date date,
  order_date date,
  ship_date date,
  sale_date date NOT NULL,
  product_id uuid NOT NULL,
  quantity numeric NOT NULL,
  uom_type_id uuid NOT NULL,
  price numeric NOT NULL,
  reduction numeric NOT NULL,
  redistributor_company_id uuid,
  freight numeric,
  active_range tstzrange DEFAULT tstzrange(now(), NULL::timestamp with time zone),
  CONSTRAINT sale_pkey PRIMARY KEY (sale_id),
  CONSTRAINT sale_client_id_fkey FOREIGN KEY (client_id)
      REFERENCES client (client_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT sale_company_id_fkey FOREIGN KEY (company_id)
      REFERENCES company (company_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT sale_product_id_fkey FOREIGN KEY (product_id)
      REFERENCES product (product_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT sale_redistributor_company_id_fkey FOREIGN KEY (redistributor_company_id)
      REFERENCES company (company_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT sale_source_row_id_fkey FOREIGN KEY (source_row_id)
      REFERENCES import_sale (row_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT sale_uom_type_id_fkey FOREIGN KEY (uom_type_id)
      REFERENCES uom_type (uom_type_id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT sale_sale_number_active_range_excl EXCLUDE 
  USING gist (sale_number WITH =, (client_id::character varying) WITH =, active_range WITH &&),
  CONSTRAINT sale_unique UNIQUE (sale_number, client_id, active_range)
);

I then have a couple of functions which run over the data and do the validations / inserts / updates where necessary.

This one validates that the data can be mapped to all the foreign keys, that the data types can be converted properly, and that NOT NULL requirements are met.

CREATE OR REPLACE FUNCTION import_validate_sale()
  RETURNS void AS
$BODY$
/*
Remove any prior exceptions
*/
DELETE FROM import_sale_error
WHERE EXISTS (
SELECT 1
FROM import_sale
WHERE import_sale_error.row_id = import_sale.row_id);
/*
Null checks for required fields
*/
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'sale_number is null, but required.'
FROM import_sale s
WHERE s.sale_number IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'distributor company_number is null, but required.'
FROM import_sale s
WHERE s.company_number IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'invoice_number is null, but required.'
FROM import_sale s
WHERE s.invoice_number IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'sale_date is null, but required.'
FROM import_sale s
WHERE s.sale_date IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'product_number is null, but required.'
FROM import_sale s
WHERE s.product_number IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'quantity is null, but required.'
FROM import_sale s
WHERE s.quantity IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'quantity_uom is null, but required.'
FROM import_sale s
WHERE s.quantity_uom IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'price is null, but required.'
FROM import_sale s
WHERE s.price IS NULL;

/*
Check type conversions
*/
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'invoice_date cannot be converted to date.'
FROM import_sale s
WHERE can_convert_date(s.invoice_date) = FALSE; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'order_date cannot be converted to date.'
FROM import_sale s
WHERE can_convert_date(s.order_date) = FALSE; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'ship_date cannot be converted to date.'
FROM import_sale s
WHERE can_convert_date(s.ship_date) = FALSE; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'sale_date cannot be converted to date.'
FROM import_sale s
WHERE can_convert_date(s.sale_date) = FALSE;
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'quantity cannot be converted to numeric.'
FROM import_sale s
WHERE can_convert_numeric(s.quantity) = FALSE; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'price cannot be converted to numeric.'
FROM import_sale s
WHERE can_convert_numeric(s.price) = FALSE; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'reduction cannot be converted to numeric.'
FROM import_sale s
WHERE can_convert_numeric(s.reduction) = FALSE; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'freight cannot be converted to numeric.'
FROM import_sale s
WHERE can_convert_numeric(s.freight) = FALSE;
 
/*
Check item resolutions
*/
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'Distributor company record could not be mapped.'
FROM import_sale s
LEFT JOIN company c
ON s.company_number = c.company_number
WHERE c.company_id IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'Re-Distributor company record could not be mapped.'
FROM import_sale s
LEFT JOIN company c
ON s.redistributor_company_number = c.company_number
WHERE c.company_id IS NULL
AND s.redistributor_company_number IS NOT NULL;
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'Product record could not be mapped.'
FROM import_sale s
LEFT JOIN product p
ON s.product_number = p.product_number
WHERE p.product_id IS NULL;

INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'Unit Of Measure record could not be mapped.'
FROM import_sale s
LEFT JOIN uom_type u
ON u.uom_type_cd = s.quantity_uom
WHERE u.uom_type_id IS NULL; 
 
INSERT INTO import_sale_error(row_id, error_message)
SELECT s.row_id, 'Product does not have a conversion rate for this Unit Of Measure.'
FROM import_sale s
INNER JOIN product p
ON s.product_number = p.product_number
INNER JOIN uom_type u
ON u.uom_type_cd = s.quantity_uom
LEFT JOIN product_uom_conversion puc
ON p.product_id = puc.product_id
AND u.uom_type_id = puc.uom_type_id
WHERE puc.rate IS NULL;

$BODY$
  LANGUAGE sql VOLATILE
  COST 100;
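The can_convert_date / can_convert_numeric calls above are small helper functions that just try the cast and trap any error; a minimal sketch of what one can look like (simplified, not the exact production code):

CREATE OR REPLACE FUNCTION can_convert_date(p_value text)
  RETURNS boolean AS
$BODY$
BEGIN
  -- NULLs are handled by the separate null checks, so treat them as convertible
  IF p_value IS NULL THEN
    RETURN true;
  END IF;
  PERFORM p_value::date;   -- throws if the text does not parse as a date
  RETURN true;
EXCEPTION WHEN OTHERS THEN
  RETURN false;
END;
$BODY$
  LANGUAGE plpgsql STABLE;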

The other function is then in charge of the insert into the final table:
CREATE OR REPLACE FUNCTION import_sale()
  RETURNS void AS
$BODY$
/**
Insert into the production sale table where no errors exist.
**/

INSERT INTO sale(
  client_id
, source_row_id
, sale_number
, company_id
, invoice_number
, invoice_date
, order_date
, ship_date
, sale_date
, product_id
, quantity
, uom_type_id
, price
, reduction
, redistributor_company_id
, freight)
SELECT 
  s.client_id 
, s.row_id
, s.sale_number
, cmpd.company_id
, s.invoice_number
, s.invoice_date::date
, s.order_date::date
, s.ship_date::date
, s.sale_date::date
, p.product_id
, s.quantity::numeric
, uom.uom_type_id
, s.price::numeric
, s.reduction::numeric
, cmpr.company_id
, s.freight
FROM import_sale s
INNER JOIN company cmpd
ON cmpd.company_number = s.company_number
LEFT JOIN company cmpr
ON cmpr.company_number = s.redistributor_company_number
INNER JOIN product p
ON s.product_number = p.product_number
INNER JOIN uom_type uom
ON uom.uom_type_cd = s.quantity_uom
WHERE NOT EXISTS (
SELECT 1
FROM import_sale_error se
WHERE se.row_id = s.row_id)
----new sale_number
AND (NOT EXISTS (
SELECT 1
FROM sale s2
WHERE s.row_id = s2.source_row_id
AND s.client_id = s2.client_id)
----existing sale_number with changes
OR EXISTS (
SELECT 1
FROM sale s2
WHERE s.processed_ind = false
AND s.sale_number = s2.sale_number
AND s.client_id = s2.client_id
AND (cmpd.company_id != s2.company_id
OR s.invoice_number != s2.invoice_number
OR s.invoice_date::date != s2.invoice_date
OR s.order_date::date != s2.order_date
OR s.ship_date::date != s2.ship_date
OR s.sale_date::date != s2.sale_date
OR p.product_id != s2.product_id
OR s.quantity::numeric != s2.quantity
OR uom.uom_type_id != s2.uom_type_id
OR  s.price::numeric != s2.price
OR s.reduction::numeric != s2.reduction
OR cmpr.company_id != s2.redistributor_company_id)
);


-------------------------------------------------------------------------
--Update processed_ind
UPDATE import_sale AS s
SET processed_ind = true
WHERE NOT EXISTS (
SELECT 1
FROM import_sale_error se
WHERE se.row_id = s.row_id);

$BODY$
  LANGUAGE sql VOLATILE
  COST 1000; 

So all of that together is a pretty solid system for importing data and showing a user what went wrong with any bad rows in the data they sent.

The other good part of this is that it's all set-based. This process will run through 20,000 lines on a single-core server in around 5-10 seconds.

At one of my old jobs, we had something that did the same type of validations / inserts, but did it row by row in a cursor (not written by me), and that took a good 5 min (if I remember correctly) to process 20,000 lines. This was also on a server running SQL Server on a 32-core machine.

Anyways, good luck!
-Adam

On Fri, Jul 24, 2015 at 9:55 AM, JPLapham <lapham@jandr.org> wrote:
Zdeněk Bělehrádek wrote
> What about creating a SAVEPOINT before each INSERT, and if the INSERT
> returns
> an error, then ROLLBACK TO SAVEPOINT? This way you will have all the
> insertable data in your table, and you can still ROLLBACK the whole
> transaction, or COMMIT it if there were no errors.
>
> It will probably be quite slow, but if you have only thousands of lines,
> it
> should be fast enough for your usecase IMHO.
>
> -- Zdeněk Bělehrádek

Hmmm, interesting. Thanks, if that works, it would be exactly what I'm
looking for!

You are right, speed is not an issue.

-Jon






Re: Using the database to validate data

From
林士博
Date:
No: when you catch an insert exception, you can save the information you need to an array or something like that, and continue with the next insert.
In your application, you can write code along these lines (pseudocode):

begin transaction
loop
    try
        do insert
    catch exception
        save error info
end loop
if there were any errors
    print errors
    rollback transaction
else
    commit transaction

Or you can do the same thing in a PostgreSQL procedure.
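Server-side, the same loop might look roughly like this (an untested sketch with made-up table names; in PL/pgSQL the EXCEPTION clause sets up an implicit savepoint per row, which is what keeps the transaction usable after a failure):

DO $$
DECLARE
    r      record;
    errors text[] := '{}';
BEGIN
    FOR r IN SELECT * FROM staging_rows ORDER BY line_no LOOP
        BEGIN
            INSERT INTO target_table (id, name)
            VALUES (r.id::integer, r.name);
        EXCEPTION WHEN OTHERS THEN
            errors := errors || format('line %s: %s', r.line_no, SQLERRM);
        END;
    END LOOP;

    IF array_length(errors, 1) > 0 THEN
        RAISE NOTICE E'problems found:\n%', array_to_string(errors, E'\n');
        RAISE EXCEPTION 'load aborted, nothing committed';  -- forces a rollback
    END IF;
END;
$$;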




2015-07-24 22:51 GMT+09:00 JPLapham <lapham@jandr.org>:
林士博 wrote
> If I am following correctly, you can do it in your application as follows.
> 1.begin transaction
> 2.insert each data. Catch db exception,
> and save exception message and other information you need to array.
> 3.in the end ,you can get all the information about the wrong data in
> array
> if there is any.
>    and then you can decide whether it is need to rollback or to commit.

Yes, I agree that I could do that, which I believe is my "IDEA 1" from my
original message. This method will naturally work, but it is a very slow
iterative process because you can only catch the *first* error, after which
new INSERTS are not allowed. If you have a data input with say 1000 record,
and there are 50 errors, it would require 50 iterations of fixing the input
data, running it again, to find them all.


林士博 wrote
> By the way, this is about programming but not postgresql.

I was hoping that there would be a way to have Postgresql run in a mode
where it allows INSERTS within a transaction even after an error. Naturally
when the error condition occurs, COMMIT would not be allowed at the end of
the transaction block.

But, this way, you could collect all the error information in one pass.
Seemed postgresql related to me.  :)







