Thread: Building a database from a flat file

Building a database from a flat file

From

"Casey T. Deccio"

Date:

03 March 2005, 00:17:22

A database I am currently using is built and updated periodically from a
flat csv file (The situation is rather unfortunate, but that's all I
have right now).  The schema I use is more complex than the flat file,
so I follow a process to populate the tables with the data from the
file.  First I slurp the whole file into one temporary table, whose
columns correspond to the columns in the file.  Then I DELETE all the
existing rows from the tables in the schema and perform a series of
queries on that table to INSERT and UPDATE rows in the tables that are
in the schema.  Then I DELETE the data from the temporary table.  I do
it this way, rather than trying to synchronize it, because of the
inconsistencies and redundancies in the flat file.

There is more than one problem with this, but the largest is that I
would like to perform this whole database rebuild within one
transaction, so other processes that need to access the database can do
so without noticing the disturbance.  However, performing this set of
events (besides populating the temporary table) within a single
transaction takes a long time--over an hour in some cases.

What are some suggestions to help improve performance with replacing one
set of data in a schema with another?

Casey

Re: Building a database from a flat file

From

Sean Davis

Date:

03 March 2005, 13:23:57

On Mar 2, 2005, at 7:15 PM, Casey T. Deccio wrote:

> A database I am currently using is built and updated periodically from 
> a
> flat csv file (The situation is rather unfortunate, but that's all I
> have right now).  The schema I use is more complex than the flat file,
> so I follow a process to populate the tables with the data from the
> file.  First I slurp the whole file into one temporary table, whose
> columns correspond to the columns in the file.  Then I DELETE all the
> existing rows from the tables in the schema and perform a series of
> queries on that table to INSERT and UPDATE rows in the tables that are
> in the schema.  Then I DELETE the data from the temporary table.  I do
> it this way, rather than trying to synchronize it, because of the
> inconsistencies and redundancies in the flat file.
>
> There is more than one problem with this, but the largest is that I
> would like to perform this whole database rebuild within one
> transaction, so other processes that need to access the database can do
> so without noticing the disturbance.  However, performing this set of
> events (besides populating the temporary table) within a single
> transaction takes a long time--over an hour in some cases.
>
> What are some suggestions to help improve performance with replacing 
> one
> set of data in a schema with another?

Why not rebuild the entire thing in a separate "build" schema then do 
only the stuff like copying tables inside the transaction block:

BEGIN;

truncate table1;
truncate table2;

--reset any sequences you feel you need to have reset;

select * into table1 from build.table1;
select * into table2 from build.table2;

vacuum analyze table1;
vacuum analyze table2;

COMMIT;

I haven't tried this method exactly, but building in a separate schema 
(expensive) and then doing cheap operations like copying the table into 
the working schema should minimize the amount of time you spend inside 
the transaction block.

**I don't know what effect this will have on performance of the whole 
process, though**.

Sean

Re: Building a database from a flat file

From

Markus Schaber

Date:

03 March 2005, 15:28:45

Hi, Casey,

Casey T. Deccio schrieb:

> There is more than one problem with this, but the largest is that I
> would like to perform this whole database rebuild within one
> transaction, so other processes that need to access the database can do
> so without noticing the disturbance.  However, performing this set of
> events (besides populating the temporary table) within a single
> transaction takes a long time--over an hour in some cases.
> 
> What are some suggestions to help improve performance with replacing one
> set of data in a schema with another?

- Create the new date in another schema, and then simply rename those
two schemas for "switch over"

- Create the new data in differently named tables, and then simply
rename all the old and new tables for "switch over".

- Have two different set of tables (maybe two identical schemas), and
let your application work on a set of views. Then you can change the
views via "create or replace view" for switch over.

Markus

-- 
Markus Schaber | Dipl. Informatiker | Software Development GIS

Fight against software patents in EU! http://ffii.org/
http://nosoftwarepatents.org/

Re: Building a database from a flat file

From

"Casey T. Deccio"

Date:

03 March 2005, 18:31:07

On Thu, 2005-03-03 at 06:23 -0700, Sean Davis wrote:
> Why not rebuild the entire thing in a separate "build" schema then do
> only the stuff like copying tables inside the transaction block:
> 

Building everything in the separate "build" schema works great, but it
is the DELETE (TRUNCATE won't work when foreign keys point to the table)
that took the most time in the transaction.  However, renaming the
schemas, as suggested in other posts, worked great.

Casey

Re: Building a database from a flat file

From

"Casey T. Deccio"

Date:

03 March 2005, 18:38:55

On Thu, 2005-03-03 at 08:28 -0700, Markus Schaber wrote:
> - Create the new date in another schema, and then simply rename those
> two schemas for "switch over"
> 

This worked very well.  I created another schema ("build") and populated
the tables within build.  Then:

BEGIN;
ALTER SCHEMA public RENAME TO public_old;
ALTER SCHEMA build RENAME TO public;
COMMIT;
/* remove data from tables in build schema... */

It was very quick and seamless.

Question: is there an "easy" way to duplicate an existing schema
(tables, functions, sequences, etc.)--not the data; only the schema?
This way, I would only need to modify one schema (public) to make
changes, and the build schema could be created each time as a duplicate
of the public schema.  Maintenance would be much simpler.

> - Create the new data in differently named tables, and then simply
> rename all the old and new tables for "switch over".
> 

This probably would work too, but there may be problems with foreign
keys in renaming the tables one at a time (unless deferrable is used).
To avoid any mess, the previous one works well.

> - Have two different set of tables (maybe two identical schemas), and
> let your application work on a set of views. Then you can change the
> views via "create or replace view" for switch over.
> 
> Markus

Casey

Re: Building a database from a flat file

From

Markus Schaber

Date:

03 March 2005, 18:44:53

Hi, Casey,

Casey T. Deccio schrieb:

> Question: is there an "easy" way to duplicate an existing schema
> (tables, functions, sequences, etc.)--not the data; only the schema?
> This way, I would only need to modify one schema (public) to make
> changes, and the build schema could be created each time as a duplicate
> of the public schema.  Maintenance would be much simpler.

I do not know about schemas, but for tables you can "CREATE TABLE alpha
(LIKE beta)".

>>- Create the new data in differently named tables, and then simply
>>rename all the old and new tables for "switch over".
> This probably would work too, but there may be problems with foreign
> keys in renaming the tables one at a time (unless deferrable is used).
> To avoid any mess, the previous one works well.

AFAIK, the foreign key relations are adopted when a table is renamed,
they stick to the same table disregarding name changes.


Markus
-- 
Markus Schaber | Dipl. Informatiker | Software Development GIS

Fight against software patents in EU! http://ffii.org/
http://nosoftwarepatents.org/

Re: Building a database from a flat file

From

Bricklen Anderson

Date:

03 March 2005, 18:47:51

Casey T. Deccio wrote:
> Question: is there an "easy" way to duplicate an existing schema
> (tables, functions, sequences, etc.)--not the data; only the schema?
> This way, I would only need to modify one schema (public) to make
> changes, and the build schema could be created each time as a duplicate
> of the public schema.  Maintenance would be much simpler.
> 
check the docs for pg_dump (-s) for doing structural dumps of your schema.

-- 
_______________________________

This e-mail may be privileged and/or confidential, and the sender does
not waive any related rights and obligations. Any distribution, use or
copying of this e-mail or the information it contains by other than an
intended recipient is unauthorized. If you received this e-mail in
error, please advise me (by return e-mail or otherwise) immediately.
_______________________________

Re: Building a database from a flat file

From

Sean Davis

Date:

03 March 2005, 19:30:34

On Mar 3, 2005, at 1:37 PM, Casey T. Deccio wrote:

> On Thu, 2005-03-03 at 08:28 -0700, Markus Schaber wrote:
>> - Create the new date in another schema, and then simply rename those
>> two schemas for "switch over"
>>
>
> This worked very well.  I created another schema ("build") and 
> populated
> the tables within build.  Then:
>
> BEGIN;
> ALTER SCHEMA public RENAME TO public_old;
> ALTER SCHEMA build RENAME TO public;
> COMMIT;
> /* remove data from tables in build schema... */
>
> It was very quick and seamless.
>
> Question: is there an "easy" way to duplicate an existing schema
> (tables, functions, sequences, etc.)--not the data; only the schema?
> This way, I would only need to modify one schema (public) to make
> changes, and the build schema could be created each time as a duplicate
> of the public schema.  Maintenance would be much simpler.

Why not just create a dump of your schema (without data), drop the 
schema, rebuild it from the dump and then populate as normal.  Then do 
the name switch.

Sean

Re: Building a database from a flat file

From

Markus Schaber

Date:

03 March 2005, 20:22:07

Hi, Casey,

Casey T. Deccio schrieb:

> Building everything in the separate "build" schema works great, but it
> is the DELETE (TRUNCATE won't work when foreign keys point to the table)

Just as an additional info: You could temporarily drop the foreing key
constraints, run TRUNCATE, and recreate the constraints.

Markus

-- 
Markus Schaber | Dipl. Informatiker | Software Development GIS

Fight against software patents in EU! http://ffii.org/
http://nosoftwarepatents.org/

Re: Building a database from a flat file

From

Andrew - Supernews

Date:

06 March 2005, 19:58:38

On 2005-03-03, Markus Schaber <schabi@logix-tt.com> wrote:
> - Create the new date in another schema, and then simply rename those
> two schemas for "switch over"

This kind of thing looks superficially attractive but it has some serious
problems if you try and do it while there is other database activity.
Specifically, cached plans for functions and prepared statements will not
be invalidated by the rename, and will still refer to the old tables, so
you may have inconsistent results.

> - Have two different set of tables (maybe two identical schemas), and
> let your application work on a set of views. Then you can change the
> views via "create or replace view" for switch over.

Same problem applies here.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services