I have been building a PostgreSQL database for about a year and (shame) when I first built it, I built it unnormalized. I am in the process of normalizing it at the moment (thankfully it isn’t live for anyone but me, so I can freeze it until I get it all normalized).
The part of the problem that I haven’t solved conceptually yet is how to normalize the incoming data. Historically, I have been pulling a csv into R and perform whatever cleaning is necessary and then writing that unnormalized table straight out to PostgreSQL using the cumbersome and slow “caroline::dbWriteTable2” function.
I have seen:
https://stackoverflow.com/questions/23708278/how-to-normalize-data-efficently-while-inserting-into-sql-table-postgres
… but given the batch nature of my process, I was thinking of something a little different than this line-by-line function.
Instead, I was thinking about the following algorithm:
- Have an empty unnormalized table without indexes that is written into in batches by R.
- Have a function triggered by the update of that table that will:
- Rename the table to a temp table name
- Create a fresh blank table
- “UPSERT” INSERT INTO … ON CONFLICT DO NOTHING each of the foreign key tables to capture any new values
- Then build indexes on each of the text fields of the unnormalized table and then do a INSERT INTO the main normalized table with a SELECT with all of the proper joins ..
- Delete the temp table
Does this sound like a good method? How will it behave if I am doing two simultaneous writes to the unnormalized input table?
One thing I’m not sure of yet is how in a function to retain a variable name of the new temp table name, but I haven’t Googled that yet.
I am running PostgreSQL 9.6 and given that I both benefit from parallelization and also don’t run anything “live”, I’ll likely be upgrading to PostgreSQL 10 in August when it is final. I do share the server with a couple of colleagues, but that’s all the coordination I need.
Stephen Froehlich
Sr. Strategist, CableLabs®
s.froehlich@cablelabs.com
Tel: +1 (303) 661-3708