[NOVICE] Normalizing Unnormalized Input - Mailing list pgsql-novice

From Stephen Froehlich
Subject [NOVICE] Normalizing Unnormalized Input
Date
Msg-id DM5PR06MB28919D0E5AB84F8D18CD207EE5C50@DM5PR06MB2891.namprd06.prod.outlook.com
Whole thread Raw
Responses Re: [NOVICE] Normalizing Unnormalized Input  ("David G. Johnston" <david.g.johnston@gmail.com>)
List pgsql-novice

I have been building a PostgreSQL database for about a year and (shame) when I first built it, I built it unnormalized.  I am in the process of normalizing it at the moment (thankfully it isn’t live for anyone but me, so I can freeze it until I get it all normalized).

 

The part of the problem that I haven’t solved conceptually yet is how to normalize the incoming data.  Historically, I have been pulling a csv into R and perform whatever cleaning is necessary and then writing that unnormalized table straight out to PostgreSQL using the cumbersome and slow “caroline::dbWriteTable2” function.

 

I have seen:

 

https://stackoverflow.com/questions/23708278/how-to-normalize-data-efficently-while-inserting-into-sql-table-postgres

 

… but given the batch nature of my process, I was thinking of something a little different than this line-by-line function.

 

Instead, I was thinking about the following algorithm:

  • Have an empty unnormalized table without indexes that is written into in batches by R.
  • Have a function triggered by the update of that table that will:
    • Rename the table to a temp table name
    • Create a fresh blank table
    • “UPSERT” INSERT INTO … ON CONFLICT DO NOTHING each of the foreign key tables to capture any new values
    • Then build indexes on each of the text fields of the unnormalized table and then do a INSERT INTO the main normalized table with a SELECT with all of the proper joins ..
    • Delete the temp table

 

Does this sound like a good method?  How will it behave if I am doing two simultaneous writes to the unnormalized input table?

 

One thing I’m not sure of yet is how in a function to retain a variable name of the new temp table name, but I haven’t Googled that yet.

 

I am running PostgreSQL 9.6 and given that I both benefit from parallelization and also don’t run anything “live”, I’ll likely be upgrading to PostgreSQL 10 in August when it is final. I do share the server with a couple of colleagues, but that’s all the coordination I need.

 

 

Stephen Froehlich
Sr. Strategist, CableLabs®


s.froehlich@cablelabs.com

Tel: +1 (303) 661-3708

 

pgsql-novice by date:

Previous
From: Mick
Date:
Subject: Re: [NOVICE] Question on installation of postgresql
Next
From: "David G. Johnston"
Date:
Subject: Re: [NOVICE] Normalizing Unnormalized Input