Re: Bulk inserts into two (related) tables - Mailing list pgsql-general

From Jeremy Finzel
Subject Re: Bulk inserts into two (related) tables
Date
Msg-id CAMa1XUhNmzNOzmFrY8N83Jii_H53AzrhT57BMTSiNG-sQSi6=g@mail.gmail.com
Whole thread Raw
In response to Re: Bulk inserts into two (related) tables  (Rich Shepard <rshepard@appl-ecosys.com>)
Responses Re: Bulk inserts into two (related) tables
List pgsql-general
On Tue, May 21, 2019 at 12:24 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Tue, 21 May 2019, Francisco Olarte wrote:

> From how you say it, I assume you have some data in your original
> dumps which can relate boths, lets assume it's org_name, but may be an
> org-code. If you do not have it it means you cannot match people to
> orgs in your data, all is lost.

Francisco,

Not yet with these new data.

I'll manually insert the org_id numbers from the organizations table into
the people table.

To me, this is the key to your problem what will either make this a time saver or time waster.  Somehow you are accounting for what uniquely identifies organizations, right?

Say there are 5 fields that correspond to an organization.  I assume then you are creating only one new org_id for each unique combination of these fields?

Then take Francisco's suggestion, only use an md5 of the organization fields to create yourself a unique identifier.  Then you can use ctid (unique internal identifier for each row) to join back.  You use SQL like this:

SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM table;

Assume in example below that your unique "org" rows are the first 3 fields:

1. Load your data file into a loading table like so:
CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text, person_name text);
\copy loader from 'my_data.csv' with csv header
ALTER TABLE loader ADD COLUMN org_id INT;

Example data:
INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');

2. Load org table:
test=# CREATE TABLE organizations (org_id serial primary key, org_name text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4

3. Build mapping directly and update:
-- build hash of org fields in loader table, take ctid in order to map back later
WITH map_source AS (
SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM loader)

-- build hash of org fields in organizations table to join back to loader and bring in org_id of course
, map_org AS (
SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM organizations)

-- map by joining together on hash_identifier
, final_map AS (
SELECT org_id, ctid
FROM map_source l
INNER JOIN map_org o USING (hash_identifier)
)

-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
WHERE fm.ctid = l.ctid;

Final data ready for the person table to be populated:
test=# table organizations;
 org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
      1 | a        |             | b
      2 | a        |             |
      3 | a        |             | c
      4 | b        |             | c
(4 rows)

test=# table loader;
 org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
 a        |             |             | Jerry       |      2 |
 a        |             | b           | Bob         |      1 |
 a        |             | b           | Janice      |      1 |
 a        |             | c           | Chris       |      3 |
 b        |             | c           | Jason       |      4 |
 a        |             |             | Alice       |      2 |
(6 rows)


Hope this helps!
Thanks,
Jeremy

pgsql-general by date:

Previous
From: Rich Shepard
Date:
Subject: Re: Bulk inserts into two (related) tables
Next
From: Will Hartung
Date:
Subject: Re: Loading table with indexed jsonb field is stalling