Thread: Double checking my logic?

Double checking my logic?

From
"Ted Byers"
Date:
I have to deal with the following situation.
 
I get a daily feed of data for several tables.  This data is retrieved from the supplier by FTP in the form of CSV text files.  I know I can load the files using "load data".  The relational structure between the files and tables is simple: one file contains data for a lookup table.  But the data in the lookup table is dynamic, potentially changing from one day to the next (though in practice, the rate of change is likely to be very slow).  I will also need to maintain a history in such a way that any and all changes in the data can be fully audited.  This means that, for auditing purposes, I will need to be able to extract the valid time for each item in the lookup table.  The only likely change in the data is that occasionally items will be added to or removed from the lookup table.  None of the data already present in the database will be editable (data validation happens before I get the feed).  The only analysis of which I have been told is done daily, as the feed is received.  At present, no one looks at earlier data (although I expect that will change as I create a better structure for the database to support audits).
 
I am trying to create a flexible design so that refactoring will be simple if and when the assumptions or practices need to be changed.
 
I know how to handle all this using a brute force approach, but I expect that approach will be painfully slow.  So here is what I am planning.
 
1) create a suite of tables corresponding to the files in the feed, with the addition of a date/time stamp containing the date and time on which the data being processed was received.
2) create a suite of temporary tables corresponding to the tables created in step 1 (but without the date/time stamp)
3) load the data into the temporary tables
4) analyse the data while it is in the temporary tables, storing the result of the analysis in new tables
5) copy the data into the permanent tables, and add the date and time stamp for the data (this date/time stamp is not present in the files retrieved).
6) free the temporary tables
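
The six steps above can be sketched as follows.  This is only an illustration, using Python's built-in sqlite3 as a stand-in for whatever database and loader ("load data", COPY, etc.) is actually in use; the table and column names are invented, and the inline rows stand in for the CSV load.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step 1: permanent table with a received-at timestamp (not present in the feed).
cur.execute("""CREATE TABLE product_feed (
    product_id TEXT, product_name TEXT, received_at TEXT)""")

# Step 2: a temporary staging table mirroring the feed file, without the timestamp.
cur.execute("""CREATE TEMP TABLE product_feed_stage (
    product_id TEXT, product_name TEXT)""")

# Step 3: load today's CSV rows (inline rows here stand in for the bulk loader).
rows = [("P1", "Widget"), ("P2", "Gadget")]
cur.executemany("INSERT INTO product_feed_stage VALUES (?, ?)", rows)

# Step 4: the daily analysis would run against the staging table at this point.

# Step 5: copy into the permanent table, stamping each row with the receipt time.
received_at = datetime.now(timezone.utc).isoformat()
cur.execute("""INSERT INTO product_feed (product_id, product_name, received_at)
               SELECT product_id, product_name, ? FROM product_feed_stage""",
            (received_at,))

# Step 6: free the staging table for tomorrow's run.
cur.execute("DELETE FROM product_feed_stage")
conn.commit()
```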
 
Now, the second-to-last step is brute force, adequate for all but one of the tables: the lookup table.  If I stick with the brute force approach, the lookup table will waste a significant amount of space.  This won't be much initially, but it is guaranteed to get worse as time passes, and I'd expect lookup performance to degrade as the amount of data in the lookup table increases.
 
Each record in the lookup table represents a product, and both the name and the product will have valid time intervals that may not be related.  The name may change because the product has been renamed for whatever reason (and we don't care why), or the name may disappear altogether because the product has been discontinued.  We can distinguish the two cases because each product has an ID that remains valid while the product exists, and the ID won't be in the data at all if the product is discontinued. 
 
I am considering creating an additional table just to look up product names, but with two date/time stamps.  The first would represent the first time the product name appears in the data and the last would represent the last time the product name is present in the data.  The first of these would be written only once, on the first day for which we have data.  I am torn between updating the last of these every day until the name disappears, or leaving it null until the name disappears.  Leaving it null would save space, but updating it with the current date and time should save time, since it would not be necessary to execute a complex conditional on every product ever included in the database.  If we update it only for those items in today's data, those that were terminated before today will not have their date/time stamp updated, so the two fields will always represent the time interval for which the name is valid.  A similar logic applies to the product ID.
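
The two-stamp scheme can be sketched like this, again with sqlite3 purely as a stand-in and invented names.  Only pairs present in today's feed get their last-seen stamp refreshed, so absent (discontinued or renamed) pairs keep their final stamp and the pair of columns bounds the validity interval:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE product_name_history (
    product_id   TEXT,
    product_name TEXT,
    first_seen   TEXT,   -- written once, on the first day the pair appears
    last_seen    TEXT,   -- refreshed every day the pair is still in the feed
    PRIMARY KEY (product_id, product_name))""")

def record_feed(day, feed):
    # Touch only the (id, name) pairs present in today's feed; anything
    # absent today keeps its old last_seen, so (first_seen, last_seen)
    # is the interval over which the name was valid.
    for product_id, name in feed:
        cur.execute("""UPDATE product_name_history SET last_seen = ?
                       WHERE product_id = ? AND product_name = ?""",
                    (day, product_id, name))
        if cur.rowcount == 0:   # first appearance of this (id, name) pair
            cur.execute("INSERT INTO product_name_history VALUES (?, ?, ?, ?)",
                        (product_id, name, day, day))
    conn.commit()

record_feed("2006-03-20", [("P1", "Widget"), ("P2", "Gadget")])
record_feed("2006-03-21", [("P1", "Widget")])   # P2 absent: discontinued
```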
 
Have I overlooked anything that is obvious to you?  Any gotchas I should be aware of?
 
What opportunities for improving performance do you see?
 
The ultimate objective is to have this fully automated from a shell script that is configured by our sysop to execute at a specific time every day, Monday through Friday.  Can all of the steps I describe above be implemented using ANSI-standard SQL, and thus put into a stored procedure, or should I look at doing some of it in a Perl script or Java application?  I suppose I will need at least a basic Java application or Perl script, if only to connect to the database and invoke any functions I have created to do this.
 
Any suggestions would be appreciated.
 
Thanks,
 
Ted
 
R.E. (Ted) Byers, Ph.D., Ed.D.
R & D Decision Support Solutions
http://www.randddecisionsupportsolutions.com/

Re: Double checking my logic?

From
Adrian Klaver
Date:
On Monday 20 March 2006 08:25 am, Ted Byers wrote:

I faced a similar problem where I was trying to keep track of changes to a
FoxPro database that I only had indirect access to.  My solution followed
your proposal to a degree.  I imported the new data on a daily basis into
holding tables.  I then ran a series of functions to compare the data in the
holding tables to the data in my 'real' tables.  The differences
(added, deleted, changed) were written to audit tables with a timestamp.  The
'real' tables were TRUNCATEd, the new data was transferred into them, and then
the holding tables were cleaned out.  This way my 'real' tables only
contained the minimum data necessary.  The audit tables grew, but were not
queried as much as the 'real' tables, so the apparent speed of the lookup
process stayed relatively stable.
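
A minimal sketch of that diff-and-replace cycle, using Python's sqlite3 as a stand-in (table and column names invented; DELETE stands in for TRUNCATE, which SQLite lacks):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE real_products    (product_id TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE holding_products (product_id TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE audit_products   (product_id TEXT, product_name TEXT,
                               change TEXT, changed_at TEXT);
""")

def apply_feed(feed):
    now = datetime.now(timezone.utc).isoformat()
    cur.execute("DELETE FROM holding_products")
    cur.executemany("INSERT INTO holding_products VALUES (?, ?)", feed)
    # Ids in holding but not real: added.  Ids in real but not holding:
    # deleted.  Same id with a different name: changed.
    cur.execute("""INSERT INTO audit_products
        SELECT h.product_id, h.product_name, 'added', ? FROM holding_products h
        WHERE h.product_id NOT IN (SELECT product_id FROM real_products)""", (now,))
    cur.execute("""INSERT INTO audit_products
        SELECT r.product_id, r.product_name, 'deleted', ? FROM real_products r
        WHERE r.product_id NOT IN (SELECT product_id FROM holding_products)""", (now,))
    cur.execute("""INSERT INTO audit_products
        SELECT h.product_id, h.product_name, 'changed', ?
        FROM holding_products h JOIN real_products r USING (product_id)
        WHERE h.product_name <> r.product_name""", (now,))
    # Replace the 'real' table's contents and clean out the holding table.
    cur.execute("DELETE FROM real_products")
    cur.execute("INSERT INTO real_products SELECT * FROM holding_products")
    cur.execute("DELETE FROM holding_products")
    conn.commit()

apply_feed([("P1", "Widget"), ("P2", "Gadget")])
apply_feed([("P1", "Widget Mk II"), ("P3", "Gizmo")])   # P1 renamed, P2 gone
```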
--
Adrian Klaver
aklaver@comcast.net

Re: Double checking my logic?

From
"John D. Burger"
Date:
Adrian Klaver wrote:


I do something similar, but because I do not require precise update
timestamps on each row, my setup has the following wrinkle:  Instead of
the audit tables having a timestamp column, I have a separate
updateSessions table, with start and end timestamps.  My audit tables
then just have a foreign key into this sessions table.  A minor
advantage of this is that the session ID (possibly) takes up less space
than a full timestamp.  A more important advantage, from my point of
view, is that the session table has columns for who is running the
update, the filename on which the update is based, an MD5 digest of the
update, etc.  My update scripts fill these in, as well as a general
comment string that they take as a parameter.
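
A sketch of that sessions wrinkle, again with sqlite3 as a stand-in and invented names: audit rows carry only a small session ID, and the provenance (who, which file, MD5 digest, comment) lives once per run in the sessions table.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE update_sessions (
    session_id INTEGER PRIMARY KEY,
    started_at TEXT, ended_at TEXT,
    run_by TEXT, filename TEXT, md5_digest TEXT, comment TEXT);
CREATE TABLE audit_products (
    product_id TEXT, product_name TEXT, change TEXT,
    session_id INTEGER REFERENCES update_sessions(session_id));
""")

def begin_session(run_by, filename, contents, comment):
    # Record who is running the update, the source file, and its MD5 digest.
    digest = hashlib.md5(contents).hexdigest()
    cur.execute("""INSERT INTO update_sessions
                   (started_at, run_by, filename, md5_digest, comment)
                   VALUES (?, ?, ?, ?, ?)""",
                (datetime.now(timezone.utc).isoformat(),
                 run_by, filename, digest, comment))
    return cur.lastrowid

def end_session(session_id):
    cur.execute("UPDATE update_sessions SET ended_at = ? WHERE session_id = ?",
                (datetime.now(timezone.utc).isoformat(), session_id))

sid = begin_session("cron", "feed-2006-03-20.csv", b"P1,Widget\n", "daily load")
cur.execute("INSERT INTO audit_products VALUES (?, ?, ?, ?)",
            ("P1", "Widget", "added", sid))   # audit row points at the session
end_session(sid)
conn.commit()
```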

- John D. Burger
   MITRE