Re: Using the database to validate data - Mailing list pgsql-general
From | Allan Kamau
---|---
Subject | Re: Using the database to validate data
Date |
Msg-id | CAF3N6oSGOdnccgjsxJ-ys4GPR4bvpjiYo=XGUug36r+++79OOw@mail.gmail.com
In response to | Re: Using the database to validate data (rob stone <floriparob@gmail.com>)
List | pgsql-general
To add to what others have said: I would use a bash script (and awk) to prefix each record of the raw CSV file with a dataset name, the name of the file, a timestamp, and a serial number, and place the newly generated data into a new file. In this bash script, the values for the dataset name, file name, and timestamp come from bash variables declared and initialized at the start, ensuring that all the records of a given file have the same values for these three fields.
The dataset name helps identify the dataset to which the data belong once loaded into the database. The file name indicates the file from which the data were obtained. The timestamp is mainly for versioning purposes. The sequential numbers (one unique number per row) are for auditing purposes.
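A minimal sketch of such a script (the dataset name here is an illustrative value, and the awk one-liner assumes the raw CSV fields contain no embedded commas or quotes):

#!/usr/bin/env bash
# Prefix every record of a raw CSV file with a dataset name, the source
# file name, a load timestamp, and a per-row serial number, writing the
# result to a new file.
set -euo pipefail

dataset_name="customer_2015_q3"    # illustrative value
input_file="$1"                    # e.g. raw_data.csv
load_ts="$(date '+%Y-%m-%d %H:%M:%S')"
output_file="${input_file%.csv}.prepared.csv"

# NR supplies the per-row serial number; OFS="," joins the new fields.
# Note: naive CSV handling, no embedded commas/quotes expected.
awk -v ds="$dataset_name" -v fn="$input_file" -v ts="$load_ts" \
    'BEGIN { OFS = "," } { print ds, fn, ts, NR, $0 }' \
    "$input_file" > "$output_file"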
On Sat, Jul 25, 2015 at 4:19 PM, rob stone <floriparob@gmail.com> wrote:

On Thu, 2015-07-23 at 15:34 -0700, JPLapham wrote:
> Tim Clarke wrote
> > Shouldn't be too difficult to import those new rows into one table,
> > write a procedure that inserts them into the real table one by one
> > and
> > logs the validation failure if any - committing good rows and
> > rolling
> > back bad. In fact if you could then write the failures to a third
> > table
> > with a completely relaxed (or no) validation?
>
> Tim-
>
> Thanks for your response. Yes, you are right, it shouldn't be too
> difficult,
> and in fact I have already implemented basically what you suggest,
> see "Idea
> 2" from my original message.
>
> The problem with this approach is that it fails to find violations
> such as
> UNIQUE (there are probably others) from within the input data until
> after
> the first has been committed to the database. But, the error may have
> been
> with that earlier row, not the later.
>
> I want my users to fix all the problems with their data and then load
> it in
> an "all or none" fashion.
>
> -Jon
>
If you have multiple users loading (possibly) multiple files and (possibly) concurrently, then the only solution is to write some code to process the data.
You also need to consider load sequence. If user A creates a file that
contains data that will end up creating a new primary key and a file
from user B refers to that, then user B needs to wait until user A's
file has been processed successfully.
Without knowing all the details I can envisage a scenario where data
being loaded could reference "good" data already in the DB as well as
referencing data that exists within that file load, possibly giving a
rejection or an update of the "good" data.
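A check for that case might look something like the query below, slotted into the script above before the final INSERT (parent_table, parent_key, and an fk_value column on the staging table are illustrative names). It flags staged rows whose reference resolves neither to "good" data already in the DB nor to a row arriving in the same file load:

-- A reference is acceptable if it is satisfied either by existing data
-- or by another row staged in this same load; report everything else.
SELECT s.row_serial, s.fk_value
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM parent_table p
                  WHERE p.parent_key = s.fk_value)
  AND NOT EXISTS (SELECT 1 FROM staging s2
                  WHERE s2.natural_key = s.fk_value);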
My 2 cents worth!
Cheers,
Rob (in Floripa)
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general