Re: Bulk inserts into two (related) tables - Mailing list pgsql-general
From | Jeremy Finzel |
---|---|
Subject | Re: Bulk inserts into two (related) tables |
Date | |
Msg-id | CAMa1XUhNmzNOzmFrY8N83Jii_H53AzrhT57BMTSiNG-sQSi6=g@mail.gmail.com Whole thread Raw |
In response to | Re: Bulk inserts into two (related) tables (Rich Shepard <rshepard@appl-ecosys.com>) |
Responses |
Re: Bulk inserts into two (related) tables
|
List | pgsql-general |
On Tue, May 21, 2019 at 12:24 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Tue, 21 May 2019, Francisco Olarte wrote:
> From how you say it, I assume you have some data in your original
> dumps which can relate boths, lets assume it's org_name, but may be an
> org-code. If you do not have it it means you cannot match people to
> orgs in your data, all is lost.
Francisco,
Not yet with these new data.
I'll manually insert the org_id numbers from the organizations table into
the people table.
To me, this is the key to your problem what will either make this a time saver or time waster. Somehow you are accounting for what uniquely identifies organizations, right?
Say there are 5 fields that correspond to an organization. I assume then you are creating only one new org_id for each unique combination of these fields?
Then take Francisco's suggestion, only use an md5 of the organization fields to create yourself a unique identifier. Then you can use ctid (unique internal identifier for each row) to join back. You use SQL like this:
SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM table;
Assume in example below that your unique "org" rows are the first 3 fields:
1. Load your data file into a loading table like so:
CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text, person_name text);
\copy loader from 'my_data.csv' with csv header
ALTER TABLE loader ADD COLUMN org_id INT;
Example data:
INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');
2. Load org table:
test=# CREATE TABLE organizations (org_id serial primary key, org_name text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4
3. Build mapping directly and update:
-- build hash of org fields in loader table, take ctid in order to map back later
WITH map_source AS (
SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM loader)
SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM loader)
-- build hash of org fields in organizations table to join back to loader and bring in org_id of course
, map_org AS (
SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM organizations)
, map_org AS (
SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM organizations)
-- map by joining together on hash_identifier
, final_map AS (
SELECT org_id, ctid
FROM map_source l
INNER JOIN map_org o USING (hash_identifier)
)
, final_map AS (
SELECT org_id, ctid
FROM map_source l
INNER JOIN map_org o USING (hash_identifier)
)
-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
WHERE fm.ctid = l.ctid;
Final data ready for the person table to be populated:
test=# table organizations;
org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
1 | a | | b
2 | a | |
3 | a | | c
4 | b | | c
(4 rows)
org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
1 | a | | b
2 | a | |
3 | a | | c
4 | b | | c
(4 rows)
test=# table loader;
org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
a | | | Jerry | 2 |
a | | b | Bob | 1 |
a | | b | Janice | 1 |
a | | c | Chris | 3 |
b | | c | Jason | 4 |
a | | | Alice | 2 |
(6 rows)
org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
a | | | Jerry | 2 |
a | | b | Bob | 1 |
a | | b | Janice | 1 |
a | | c | Chris | 3 |
b | | c | Jason | 4 |
a | | | Alice | 2 |
(6 rows)
Hope this helps!
Thanks,
Jeremy
pgsql-general by date: