Home > mailing lists

Re: Bulk inserts into two (related) tables - Mailing list pgsql-general

From	Jeremy Finzel
Subject	Re: Bulk inserts into two (related) tables
Date	May 21, 2019 21:57:47
Msg-id	CAMa1XUhNmzNOzmFrY8N83Jii_H53AzrhT57BMTSiNG-sQSi6=g@mail.gmail.com Whole thread Raw
In response to	Re: Bulk inserts into two (related) tables (Rich Shepard <rshepard@appl-ecosys.com>)
Responses	Re: Bulk inserts into two (related) tables
List	pgsql-general

Tree view

On Tue, May 21, 2019 at 12:24 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:

On Tue, 21 May 2019, Francisco Olarte wrote:

> From how you say it, I assume you have some data in your original
> dumps which can relate boths, lets assume it's org_name, but may be an
> org-code. If you do not have it it means you cannot match people to
> orgs in your data, all is lost.

Francisco,

Not yet with these new data.

I'll manually insert the org_id numbers from the organizations table into
the people table.

To me, this is the key to your problem what will either make this a time saver or time waster. Somehow you are accounting for what uniquely identifies organizations, right?

Say there are 5 fields that correspond to an organization. I assume then you are creating only one new org_id for each unique combination of these fields?

Then take Francisco's suggestion, only use an md5 of the organization fields to create yourself a unique identifier. Then you can use ctid (unique internal identifier for each row) to join back. You use SQL like this:

SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier

FROM table;

Assume in example below that your unique "org" rows are the first 3 fields:

1. Load your data file into a loading table like so:

CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text, person_name text);

\copy loader from 'my_data.csv' with csv header

ALTER TABLE loader ADD COLUMN org_id INT;

Example data:

INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');

INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');

2. Load org table:

test=# CREATE TABLE organizations (org_id serial primary key, org_name text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4

3. Build mapping directly and update:

-- build hash of org fields in loader table, take ctid in order to map back later

WITH map_source AS (
SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM loader)

-- build hash of org fields in organizations table to join back to loader and bring in org_id of course
, map_org AS (
SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM organizations)

-- map by joining together on hash_identifier
, final_map AS (
SELECT org_id, ctid
FROM map_source l
INNER JOIN map_org o USING (hash_identifier)
)

-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm

WHERE fm.ctid = l.ctid;

Final data ready for the person table to be populated:

test=# table organizations;
org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
1 | a | | b
2 | a | |
3 | a | | c
4 | b | | c
(4 rows)

test=# table loader;
org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
a | | | Jerry | 2 |
a | | b | Bob | 1 |
a | | b | Janice | 1 |
a | | c | Chris | 3 |
b | | c | Jason | 4 |
a | | | Alice | 2 |
(6 rows)

Hope this helps!

Thanks,

Jeremy

pgsql-general by date:

From: Rich Shepard
Date: 21 May 2019, 18:28:58
Subject: Re: Bulk inserts into two (related) tables

From: Will Hartung
Date: 21 May 2019, 23:26:20
Subject: Re: Loading table with indexed jsonb field is stalling

Re: Bulk inserts into two (related) tables - Mailing list pgsql-general

Previous

Next