> My experience with that type of load process is that doing this
> row-by-row is a very expensive approach and your results bear that out.
I expected this, and had warned the client before the project started that
this is exactly where SQL underperforms.
> It is often better to write each step as an SQL statement that operates
> on a set of rows at one time.
The problem with this approach is that every row of data depends on the
previous row's data already having been validated and imported. For example:
Import Row 1:
John Q Smith
Foobar Corp
123 Main St,
Bigtown, MD 12345-6789
Import Row 2:
John Quincy Smith
FuzzyLoginc Inc
123 Main St, Suite 301
Bigtown, MD 12345-6789
Import Row 3:
Bobby Jones
Foobar Corp
123 Main Strett Suite 300,
Bigtown, MD 12345
Every row must be imported into the table so that the next row can see that
data and take it into account when assigning IDs to the name, company and
address (all data must be normalised). How can this be done using set logic?
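The closest I can get to a set-wise phrasing is something like the sketch
below, where each entity is deduplicated against its target table in one
pass and the IDs are then resolved with a join. The table and column names
(staging, company, person, address, contact) are invented for illustration:

    -- 1. Add any companies in the batch that are not already known.
    INSERT INTO company (company_name)
    SELECT DISTINCT s.company_name
    FROM   staging s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   company c
                       WHERE  c.company_name = s.company_name);

    -- 2. People and addresses get the same treatment (omitted here).

    -- 3. Resolve all three IDs for every staged row in a single join.
    INSERT INTO contact (person_id, company_id, address_id)
    SELECT p.person_id, c.company_id, a.address_id
    FROM   staging s
    JOIN   person  p ON p.person_name  = s.person_name
    JOIN   company c ON c.company_name = s.company_name
    JOIN   address a ON a.street = s.street
                    AND a.city  = s.city
                    AND a.zip   = s.zip;

But that only works when the match is exact; it has no answer for
"John Q Smith" vs "John Quincy Smith" or "Main St" vs "Main Strett",
which is precisely where the row-by-row logic earns its keep.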
> You can also improve performance by ordering your checks so that the
> ones most likely to fail happen first.
Already done - I believe the problem is definitely in the navigational
access model. What I am doing now makes perfect sense as far as the logic of
the process goes - any other developer could read it and understand what is
going on. At 3000 lines of code that will be tedious, but understandable.
But SQL hates it.
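For concreteness, each navigational step in the load program boils down to
something of this shape (embedded-SQL style, host variables and names
simplified for illustration):

    -- Look up the company for the current row; the load program applies
    -- its matching rules to decide whether this is "the same" company as
    -- one already imported.
    SELECT company_id INTO :company_id
    FROM   company
    WHERE  company_name = :company;

    -- If nothing matched, the program inserts the company and re-reads the
    -- new ID, then repeats the lookup-or-insert dance for the address and
    -- the person before finally writing the contact row:
    INSERT INTO contact (person_id, company_id, address_id)
    VALUES (:person_id, :company_id, :address_id);

Every imported row therefore costs a handful of separate statements, and
that per-row chatter is the part SQL hates.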
> Trying to achieve a high level of data quality in one large project is
> not often possible. Focus on the most critical areas of checking and get
> that working first with acceptable performance, then layer on additional
> checks while tuning. The complexity of the load programs you have also
> means they are susceptible to introducing data quality problems rather
> than removing them, so an incremental approach will also aid debugging
> of the load suite.
I couldn't agree more.
Carlo