A database I am currently using is built and updated periodically from a
flat CSV file (the situation is rather unfortunate, but that's all I
have right now). The schema I use is more complex than the flat file,
so I follow a process to populate its tables with the data from the
file. First I slurp the whole file into one temporary table whose
columns correspond to the columns in the file. Then I DELETE all the
existing rows from the tables in the schema and run a series of
queries against the temporary table to INSERT and UPDATE rows in the
schema's tables. Finally I DELETE the data from the temporary table. I
do it this way, rather than trying to synchronize the existing data
with the file, because of the inconsistencies and redundancies in the
flat file.
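
For concreteness, the sequence looks roughly like this (PostgreSQL-style
syntax; "staging", "item", and "category" are placeholder names, not my
real schema):

-- Load the whole file into the staging table, whose columns mirror the file.
COPY staging FROM '/path/to/feed.csv' WITH (FORMAT csv, HEADER true);

-- Wipe the real tables.
DELETE FROM item;
DELETE FROM category;

-- Rebuild them with a series of queries that work around the file's
-- inconsistencies and redundancies, e.g.:
INSERT INTO category (name)
SELECT DISTINCT category_name FROM staging;

INSERT INTO item (name, category_id)
SELECT DISTINCT s.item_name, c.id
FROM staging s
JOIN category c ON c.name = s.category_name;

-- Clear the staging table for the next run.
DELETE FROM staging;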
There is more than one problem with this, but the largest is that I
would like to perform this whole database rebuild within one
transaction, so that other processes that need to access the database
can do so without noticing the disturbance. However, performing this
set of steps (excluding the initial population of the temporary table)
within a single transaction takes a long time, over an hour in some
cases.
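
What I want, in other words, is to wrap everything after the initial
load in one transaction (again PostgreSQL-style, with the same
placeholder names as above):

-- The staging table is already populated outside the transaction.
BEGIN;

DELETE FROM item;
DELETE FROM category;

-- ... the full series of INSERT and UPDATE rebuild queries ...

DELETE FROM staging;
COMMIT;  -- with MVCC, concurrent readers see the old rows until this point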
What are some suggestions for improving performance when replacing one
set of data in a schema with another?
Casey