Thread: Bulk inserts into two (related) tables

Bulk inserts into two (related) tables

From
Rich Shepard
Date:
I'm cleaning and formatting an 800-line data file to be inserted into a
database. Some of the input file fields will be inserted into an
'organizations' table letting postgres assign sequential org_id numbers.
Other fields will be inserted into a separate 'people' table associated with
each organization. The people table insert allows postgres to assign the
person_id number and this table's foreign key, people.org_id ==
organizations.org_id.

I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table. That is time and effort intensive.

Is there a way to insert the two tables sequentially without manually adding
the organizations.org_id to the appropriate foreign key column (people.org_id)
of the people table?

Rich



Re: Bulk inserts into two (related) tables

From
Adrian Klaver
Date:
On 5/21/19 9:56 AM, Rich Shepard wrote:
> I'm cleaning and formatting an 800-line data file to be inserted into a
> database. Some of the input file fields will be inserted into an
> 'organizations' table letting postgres assign sequential org_id numbers.
> Other fields will be inserted into a separate 'people' table associated 
> with
> each organization. The people table insert allows postgres to assign the
> person_id number and this table's foreign key, people.org_id ==
> organizations.org_id.
> 
> I could insert all new rows into the organizations table, then produce a
> list of the org_id and org_name to manually insert the foreign key in the
> related people table. That is time and effort intensive.
> 
> Is there a way to insert the two tables sequentially without manually 
> adding
> the organizations.org_id to the appropriate foreign key column 
> (people.org_id)
> of the people table?

Well you are not going to know the org_id until the organization table 
is loaded, which means something like:

1) First run through file load the organizations table.

2) Build a mapping of org_id to organization.

3) Run through data file again and load people data using the mapping in 
2) to provide the people.org_id.
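
In SQL, a minimal sketch of those steps might look like this ( assuming,
for illustration only, that org_name uniquely identifies an organization
in the file; all table, column and file names here are made up ):

-- 1) stage the raw file and load the organizations table
CREATE TEMP TABLE staging (org_name text, person_name text);
\copy staging FROM 'data.csv' WITH (FORMAT csv)

INSERT INTO organizations (org_name)
SELECT DISTINCT org_name FROM staging;

-- 2) and 3) the join supplies the org_id-to-organization mapping
INSERT INTO people (person_name, org_id)
SELECT s.person_name, o.org_id
FROM staging s
JOIN organizations o USING (org_name);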

> 
> Rich
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Bulk inserts into two (related) tables

From
Francisco Olarte
Date:
Rich:

On Tue, May 21, 2019 at 6:56 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
> I'm cleaning and formatting an 800-line data file to be inserted into a
> database. Some of the input file fields will be inserted into an
> 'organizations' table letting postgres assign sequential org_id numbers.
> Other fields will be inserted into a separate 'people' table associated with
> each organization. The people table insert allows postgres to assign the
> person_id number and this table's foreign key, people.org_id ==
> organizations.org_id.
> I could insert all new rows into the organizations table, then produce a
> list of the org_id and org_name to manually insert the foreign key in the
> related people table. That is time and effort intensive.
> Is there a way to insert the two tables sequentially without manually adding
> the organizations.org_id to the appropriate foreign key column (people.org_id)
> of the people table?

From how you say it, I assume you have some data in your original
dumps which can relate both, let's assume it's org_name, but it may be an
org code. If you do not have it, it means you cannot match people to
orgs in your data, and all is lost.

You can do it in a couple of steps: first do a copy (org_id, org_name,
and the other org columns) into the organizations table,

then create a temporary table and copy (people.id, people.org_name,
people.*) into it,

and then insert into people:

insert into people
select tmp_people.id, org.org_id as people_org_id, tmp_people.*
from tmp_people, organizations org
where tmp_people.org_name = org.org_name;

and drop the temp table.

If the matching data is some kind of code you do not want in the final
organization table, create a temporary organizations table, copy into
it, join it with the temporary people table, insert into the final
organizations table slicing with a select, and drop it.

If you are using a sequence or similar thing for generating
organization ids and use the second approach, remember to use the same
sequence to generate the default values into the temporary table, or
chaos will ensue ( not really, but you'll need to advance it manually
).

This is a classic problem, and the classic way to solve it is just that:
prepare a script which loads some temporary tables and then inserts
joins into the final ones. On small data sets like yours you can just
edit everything into a single sql script.

Francisco Olarte.



Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Tue, 21 May 2019, Adrian Klaver wrote:

> Well you are not going to know the org_id until the organization table is 
> loaded, which means something like:
>
> 1) First run through file load the organizations table.
>
> 2) Build a mapping of org_id to organization.
>
> 3) Run through data file again and load people data using the mapping in 2) 
> to provide the people.org_id.

Adrian,

That's what I planned to do. I wondered if there was a more direct way known
to those with more experience than I have. The short answer is 'no.'

Thanks,

Rich



Re: Bulk inserts into two (related) tables

From
Michael Lewis
Date:
For each row-
Insert into organizations table if the record does not exist, returning ID.
Insert into people using that ID.
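
A sketch of that first variant as a single statement, assuming a unique
constraint on org_name and PostgreSQL 9.5+ for ON CONFLICT ( the names
are made up ):

WITH org AS (
    INSERT INTO organizations (org_name)
    VALUES ('Main Office')
    ON CONFLICT (org_name) DO NOTHING
    RETURNING org_id
)
INSERT INTO people (person_name, org_id)
SELECT 'Bob',
       COALESCE((SELECT org_id FROM org),
                (SELECT org_id FROM organizations
                 WHERE org_name = 'Main Office'));
-- the COALESCE falls back to a lookup when the organization already
-- existed and the ON CONFLICT branch returned no row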

Else, load all the data with an empty ID column on the person table, then just update the person table afterward and drop the org name column.

Perhaps I am missing something.

Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Tue, 21 May 2019, Francisco Olarte wrote:

> From how you say it, I assume you have some data in your original
> dumps which can relate boths, lets assume it's org_name, but may be an
> org-code. If you do not have it it means you cannot match people to
> orgs in your data, all is lost.

Francisco,

Not yet with these new data.

I'll manually insert the org_id numbers from the organizations table into
the people table.

Thanks,

Rich



Re: Bulk inserts into two (related) tables

From
Adrian Klaver
Date:
On 5/21/19 10:22 AM, Rich Shepard wrote:
> On Tue, 21 May 2019, Adrian Klaver wrote:
> 
>> Well you are not going to know the org_id until the organization table 
>> is loaded, which means something like:
>>
>> 1) First run through file load the organizations table.
>>
>> 2) Build a mapping of org_id to organization.
>>
>> 3) Run through data file again and load people data using the mapping 
>> in 2) to provide the people.org_id.
> 
> Adrian,
> 
> That's what I planned to do. I wondered if there was a more direct way 
> known
> to those with more experience than I have. The short answer is 'no.'

The other way is to create the org_id for each organization ahead of
time and put it into the data file. Either way you have to create the
org_id for the FK relationship, it is just a matter of where and when.
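
If the ids are created ahead of time like that, the table's sequence
then needs to be advanced past them so later inserts do not collide. A
minimal sketch ( pg_get_serial_sequence resolves the sequence name,
assuming org_id is a serial column ):

SELECT setval(pg_get_serial_sequence('organizations', 'org_id'),
              (SELECT max(org_id) FROM organizations));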

> 
> Thanks,
> 
> Rich
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Tue, 21 May 2019, Michael Lewis wrote:

> For each row-
> Insert into organizations table if the record does not exist, returning ID.
> Insert into people using that ID.

Michael,

The org_id will not exist until I run the insert script.

> Else, load all the data with empty ID column on person table,then just
> update the person table afterward and drop the org name column.

Then I'll copy the org_id numbers to the appropriate row(s) in the people
table.

Thanks,

Rich



Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Tue, 21 May 2019, Adrian Klaver wrote:

> The other way is to create the org_id for each organization ahead of time
> and put it into the data file. Either way you have to create the org_id for
> the FK relationship, it is just a matter of where and when.

Adrian,

I had thought of that but overlooked it. The max(org_id) number is 338 so I
can assign sequential numbers above that as I format the data for the insert
scripts.

Thanks for the reminder,

Rich



Re: Bulk inserts into two (related) tables

From
Ron
Date:
On 5/21/19 12:27 PM, Rich Shepard wrote:
> On Tue, 21 May 2019, Michael Lewis wrote:
>
>> For each row-
>> Insert into organizations table if the record does not exist, returning ID.
>> Insert into people using that ID.
>
> Michael,
>
> The org_id will not exist until I run the insert script.
>
>> Else, load all the data with empty ID column on person table,then just
>> update the person table afterward and drop the org name column.
>
> Then I'll copy the org_id numbers to the appropriate row(s) in the people
> table.

A Perl/Python/etc script would solve your problem.

-- 
Angular momentum makes the world go 'round.



Re: Bulk inserts into two (related) tables

From
Will Hartung
Date:

On May 21, 2019, at 9:56 AM, Rich Shepard <rshepard@appl-ecosys.com> wrote:

I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table. That is time and effort intensive.

You can either use a script for the entire process, or convert the people data to INSERT statements that have a SELECT for the foreign key as part of the insert.

INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM org WHERE org_name='Main Office'))
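
For a bulk load, the same subselect pattern extends to a multi-row VALUES
list, one parenthesized row per person ( a sketch; the names are made up ):

INSERT INTO PEOPLE (id, name, org_id) VALUES
    (1, 'Bob',   (SELECT org_id FROM org WHERE org_name='Main Office')),
    (2, 'Alice', (SELECT org_id FROM org WHERE org_name='Branch 12'));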


Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Tue, 21 May 2019, Will Hartung wrote:

> You can either use a script for the entire process, or, convert the people
> table to INSERT statements that have a SELECT for the foreign key as part
> of the insert.
>
> INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id
> FROM org WHERE org_name='Main Office'))

Thanks, Will. That's a good alternative.

Regards,

Rich



Re: Bulk inserts into two (related) tables

From
Jeremy Finzel
Date:
On Tue, May 21, 2019 at 12:24 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Tue, 21 May 2019, Francisco Olarte wrote:

> From how you say it, I assume you have some data in your original
> dumps which can relate boths, lets assume it's org_name, but may be an
> org-code. If you do not have it it means you cannot match people to
> orgs in your data, all is lost.

Francisco,

Not yet with these new data.

I'll manually insert the org_id numbers from the organizations table into
the people table.

To me, this is the key to your problem, and what will either make this a time saver or a time waster.  Somehow you are accounting for what uniquely identifies organizations, right?

Say there are 5 fields that correspond to an organization.  I assume then you are creating only one new org_id for each unique combination of these fields?

Then take Francisco's suggestion, but use an md5 of the organization fields to create yourself a unique identifier.  Then you can use ctid (a unique internal identifier for each row) to join back.  Use SQL like this:

SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM your_table;

Assume in the example below that your unique "org" rows are identified by the first 3 fields:

1. Load your data file into a loading table like so:
CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text, person_name text);
\copy loader from 'my_data.csv' with csv header
ALTER TABLE loader ADD COLUMN org_id INT;

Example data:
INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');

2. Load org table:
test=# CREATE TABLE organizations (org_id serial primary key, org_name text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4

3. Build mapping directly and update:
-- build hash of org fields in loader table, take ctid in order to map back later
WITH map_source AS (
SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM loader)

-- build hash of org fields in organizations table to join back to loader and bring in org_id of course
, map_org AS (
SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM organizations)

-- map by joining together on hash_identifier
, final_map AS (
SELECT org_id, ctid
FROM map_source l
INNER JOIN map_org o USING (hash_identifier)
)

-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
WHERE fm.ctid = l.ctid;

Final data ready for the person table to be populated:
test=# table organizations;
 org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
      1 | a        |             | b
      2 | a        |             |
      3 | a        |             | c
      4 | b        |             | c
(4 rows)

test=# table loader;
 org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
 a        |             |             | Jerry       |      2
 a        |             | b           | Bob         |      1
 a        |             | b           | Janice      |      1
 a        |             | c           | Chris       |      3
 b        |             | c           | Jason       |      4
 a        |             |             | Alice       |      2
(6 rows)


Hope this helps!
Thanks,
Jeremy

Re: Bulk inserts into two (related) tables

From
Francisco Olarte
Date:
On Tue, May 21, 2019 at 7:24 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
> > From how you say it, I assume you have some data in your original
> > dumps which can relate boths, lets assume it's org_name, but may be an
> > org-code. If you do not have it it means you cannot match people to
> > orgs in your data, all is lost.
> Not yet with these new data.
> I'll manually insert the org_id numbers from the organizations table into
> the people table.

I'm curious, what org_id do you put (manually) to the people? You must
have some way to match it ( like, say, "I have an organization line,
followed by lines for people in this organization" ).

Because if you cannot match them by hand, the DB is not going to be of use.

Knowing how you know the org for a person is crucial for a solution (
i.e., for the previous example I'll just add a synthetic org_code line
to each one and do the two-temp-tables trick for a long file, or just
advance the sequence ( if needed manually, to reserve org_ids ) and
then do a perl one-liner to augment the data ( but I've been perling
since the 4.019 days, so this may not be as easy for others ).

Francisco Olarte.



Re: Bulk inserts into two (related) tables

From
Francisco Olarte
Date:
Jeremy:

On Tue, May 21, 2019 at 11:58 PM Jeremy Finzel <finzelj@gmail.com> wrote:

> Then take Francisco's suggestion, but use an md5 of the organization fields to create yourself a unique identifier.
> Then you can use ctid (a unique internal identifier for each row) to join back.  Use SQL like this:

Sadly my suggestion only works if you can ( manually ) assign an
organization line to a person line; md5, field concatenation, and
everything else is just optimization.

From what the OP has already told us, you have a heap of people, a heap of
organizations and a magic device to assign one to the others; the
org_id assignment ( using an unknown algorithm, we do not know if he
wants sequences, texts or cat gifs as IDs ) is easy, the pairing part
is unsolvable with the data we have.

At this moment I think the only useful link for this is
http://xyproblem.info/ ( for the OP, not for you ).

Francisco Olarte.



Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Wed, 22 May 2019, Francisco Olarte wrote:

> I'm curious, what org_id do you put (manually) to the people? You must
> have some way to match it ( like, say, "I have an organization line,
> followed by lines for people in this organization" ).

Francisco,

The first data entered was in small chunks so I'd add a few rows to the
organization table, output a file of org_id and org_name, then use the
source data to associate that org_id to the people associated with it.

With this new data source I want to populate the two tables more quickly. I
think the time-consuming part is associating people with their organization.
I'll do more thinking about this.

I don't do perl, though.

Thanks,

Rich



Re: Bulk inserts into two (related) tables

From
Jeremy Finzel
Date:
On Wed, May 22, 2019 at 7:40 AM Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Wed, 22 May 2019, Francisco Olarte wrote:

> I'm curious, what org_id do you put (manually) to the people? You must
> have some way to match it ( like, say, "I have an organization line,
> followed by lines for people in this organization" ).

Francisco,

The first data entered was in small chunks so I'd add a few rows to the
organization table, output a file of org_id and org_name, then use the
source data to associate that org_id to the people associated with it.

With this new data source I want to populate the two tables more quickly. I
think the time-consuming part is associating people with their organization.
I'll do more thinking about this.

I don't do perl, though.

There's absolutely no need to use anything beyond SQL here, though you could if you want to.

I really wonder how much we are just talking past each other simply because we don't know what your data looks like, so we can't show you how our examples apply to your use case.  If you provided a sample scrubbed data file, this whole thread probably would have been much shorter :).  Can you do that?

You said here again the most time consuming part is associating people with their organization.  Well, that's the whole question I was trying to optimize on.  You told Francisco that the data file does not have a unique org name that could be used as a unique organization identifier.  However you seem to have contradicted that by responding favorably to this solution:

INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM org WHERE org_name='Main Office'))

How can this solution work if you said org_name cannot be used to link a person in the data file?

So, the question again becomes: does your data file have a combination of org fields that allows you to assign a person to a unique organization?
  • If the answer is still no, how could you ever assign people to a unique organization?  In that case you need to massage your data file first before trying to build a database schema, as Francisco noted.  This is basically a non-starter for your database schema.
  • If the answer is yes, that means you can use the combination of those fields to uniquely identify an organization, and thus link people to it later.  That's the reason for the md5: to easily use many fields in combination as a unique id.
Thanks,
Jeremy

Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Wed, 22 May 2019, Jeremy Finzel wrote:

> There's absolutely no need to use anything beyond SQL here, though you
> could if you want to.

Jeremy,

This is a new experience for me so I didn't think of a SQL solution.

> I really wonder how much we are just talking past each other simply because
> we don't know what your data looks like, so we can't show you how our
> examples apply to your use case. If you provided a sample scrubbed data
> file, this whole thread probably would have been much shorter :).  Can you
> do that?

Not necessary; see below. Also, these data come from a regulator and are
provided as an Excel spreadsheet. If they were extracted from a database
then that database was very poorly designed, because there's no consistency
in how fields/columns are formatted. This requires manual cleaning.

Each row in the source file (exported from the spreadsheet as .csv and
renamed to .txt for processing in emacs and awk) is a mixture of attributes
that belong in either or both of the organization and people tables in my
database. An awk script will extract the appropriate fields for each table.

> You told Francisco that the data file does not have a unique org name that
> could be used as a unique organization identifier. However you seem to
> have contradicted that by responding favorably to this solution:

The org_name is not the PK; the org_id is. This was assigned by postgres
when the original rows were inserted. Now, I can add the org_id in the
values to be inserted as I know the maximum org_id number in that table.

> INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM
> org WHERE org_name='Main Office'))

Question: do I use this same syntax for each row to be inserted or can I
make it one long insert statement by separating the parenthesized values
with commas as I do when I update multiple rows in a table?

Thanks very much,

Rich



Re: Bulk inserts into two (related) tables

From
Rob Sargent
Date:
> 
> Each row in the source file (exported from the spreadsheet as .csv and
> renamed to .txt for processing in emacs and awk) is a mixture of attributes
Absolutely no need to rename the .csv for those tools.
> that belong in either or both of the organization and people tables in my
> database. An awk script will extract the appropriate fields for each table.




Re: Bulk inserts into two (related) tables

From
Adrian Klaver
Date:
On 5/22/19 7:38 AM, Rich Shepard wrote:
> On Wed, 22 May 2019, Jeremy Finzel wrote:
> 
>> There's absolutely no need to use anything beyond SQL here, though you
>> could if you want to.
> 
> Jeremy,
> 
> This is a new experience for me so I didn't think of a SQL solution.
> 
>> I really wonder how much we are just talking past each other simply 
>> because
>> we don't know what your data looks like, so we can't show you how our
>> examples apply to your use case. If you provided a sample scrubbed data
>> file, this whole thread probably would have been much shorter :).  Can 
>> you
>> do that?
> 
> Not necessary; see below. Also, these data come from a regulator and are
> provided as an Excel spreadsheet. If they were extracted from a database
> then that database was very poorly designed, because there's no consistency
> in how fields/columns are formatted. This requires manual cleaning.
> 
> Each row in the source file (exported from the spreadsheet as .csv and
> renamed to .txt for processing in emacs and awk) is a mixture of attributes
> that belong in either or both of the organization and people tables in my
> database. An awk script will extract the appropriate fields for each table.

So does the people data have an organization attribute?

If so why not just assign the org_id while cleaning up the data?

> 
>> You told Francisco that the data file does not have a unique org name 
>> that
>> could be used as a unique organization identifier. However you seem to
>> have contradicted that by responding favorably to this solution:
> 
> The org_name is not the PK; the org_id is. This was assigned by postgres
> when the original rows were inserted. Now, I can add the org_id in the
> values to be inserted as I know the maximum org_id number in that table.
> 
>> INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id
>> FROM
>> org WHERE org_name='Main Office'))
> 
> Question: do I use this same syntax for each row to be inserted or can I
> make it one long insert statement by separating the parenthesized values
> with commas as I do when I update multiple rows in a table?
> 
> Thanks very much,
> 
> Rich
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Wed, 22 May 2019, Adrian Klaver wrote:

> So does the people data have an organization attribute?

Adrian,

Yes. It's the FK to the organization table.

> If so why not just assign the org_id while cleaning up the data?

That's what I thought to do based on your suggestion yesterday. It would
make life simpler.

Regards,

Rich



Re: Bulk inserts into two (related) tables

From
Francisco Olarte
Date:
Rich:

On Wed, May 22, 2019 at 2:40 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
> > I'm curious, what org_id do you put (manually) to the people? You must
> > have some way to match it ( like, say, "I have an organization line,
> > followed by lines for people in this organization" ).

> The first data entered was in small chunks so I'd add a few rows to the
> organization table, output a file of org_id and org_name, then use the
> source data to associate that org_id to the people associated with it.

This is my point: if you explain how you "use the source data to
associate that org_id to the people", I ( the list in general ) MAY be
able to suggest something to help you.

But, I do not know how you do that association step. You may have
them in an original directory, may have it committed to memory, may
have people typed on green cards attached with red strings to
organizations carved in wooden blocks. And this information is
crucial.

> With this new data source I want to populate the two tables more quickly. I
> think the time-consuming part is associating people with their organization.
> I'll do more thinking about this.

Some sample lines may be useful, even with made up data, structure is
what is important. I.e., something like
>>
O "clavelitos locos" 24 37 82
O "bits cuadrados" 36 0 27
P "Oscar Ortega", 46, "manager", "clavelitos locos"
P "Abel Alonso", 37, "boss", "bits cuadrados"
P "Pepe Perez", 52, null, "clavelitos locos"
<<
From this ( and a description of the type of ids in your tables, i.e.,
are they sequences? texts? hashes? uuids? ) I could suggest something
like "put them into two temp tables, add org_id to Os, join in the
appropriate column to put org_id in P, slice them with a select and
send data to final table".

Or, from something like this:
>>
clavelitos_locos 24 37 82
 Pepe_Perez 52 \N
 Oscar_Ortega 46 manager
bits_cuadrados 36 0 27
  Abel_Alonso 37 boss
<<
( I've used single spaces instead of the usual tabs for separators in
the example )
I could suggest "pipe it through this, use your start id instead of
1200": perl -ne 'BEGIN{$o=1200}; if (/^\s/) { print STDERR "$o$_"; }
else { ++$o; print "$o $_"; }' > orgs 2>people
( I would add a little explaining, in case it sounds like line noise )
Which will give you:
>>
0:~$ fgrep '' orgs people
orgs:1201 clavelitos_locos 24 37 82
orgs:1202 bits_cuadrados 36 0 27
people:1201 Pepe_Perez 52 \N
people:1201 Oscar_Ortega 46 manager
people:1202  Abel_Alonso 37 boss
<<
Which could easily be copied in, perhaps with an intermediate table
for a little clean up.



That's why I pointed to xyproblem, I think you are not stating your
problem correctly. I've done thousands of things like this over the
years, and they are normally not that hard.

Regards.
   Francisco Olarte.



Re: Bulk inserts into two (related) tables

From
Francisco Olarte
Date:
Rich:

On Wed, May 22, 2019 at 4:38 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:

> Not necessary; see below. Also, these data come from a regulator and are
> provided as an Excel spreadsheet. If they were extracted from a database
> then that database was very poorly designed, because there's no consistency
> in how fields/columns are formatted. This requires manual cleaning.

Welcome to my world! I routinely ( daily / weekly ) receive things
which I know are in a database but they decide to pass through excel (
a program well known in the database community for eating data, as it
does not always store what it gets; I've managed to make excel export
a sheet to csv, import it, reexport it and get a (noticeably)
different csv ). Or to denormalize. Or they send me files with a
mixture of utf-8 and latin1 ( in the same file ). Some of my importing
programs could qualify as AI.

> Each row in the source file (exported from the spreadsheet as .csv and
> renamed to .txt for processing in emacs and awk) is a mixture of attributes
> that belong in either or both of the organization and people tables in my
> database. An awk script will extract the appropriate fields for each table.

Now we know a bit.

> > You told Francisco that the data file does not have a unique org name that
> > could be used as a unique organization identifier. However you seem to
> > have contradicted that by responding favorably to this solution:
> The org_name is not the PK; the org_id is. This was assigned by postgres
> when the original rows were inserted. Now, I can add the org_id in the
> values to be inserted as I know the maximum org_id number in that table.

You are not reading what we write to you. Note YOU AND ONLY YOU are
the one speaking of PK. We are speaking of "unique identifier" ( that
would be, IIRC, "candidate keys"; you can pick any as your PK, or even
introduce a new synthetic one with a sequence, or a femtosecond exact
timestamp or whatever ).


> > INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM
> > org WHERE org_name='Main Office'))
> Question: do I use this same syntax for each row to be inserted or can I
> make it one long insert statement by separating the parenthesized values
> with commas as I do when I update multiple rows in a table?

If this works you do neither. You insert 'Main Office' as an extra
column in a TEMPORARY table and then you do something like:

INSERT INTO PEOPLE (id, name, org_id) (SELECT tp.id, tp.name,
o.org_id FROM temp_people tp, organizations o WHERE o.org_name =
tp.extra_column_for_org_name);

and drop the temporary table after it.

When you are fluent in SQL you do not try to play with files: you
import every column of your data into temporary tables, clean them up,
and join ( if needed ) them until you have a select that gives you
what you want, and then insert this. Normally you insert several
SELECTs into temporary tables ( especially when you only have thousands
of records ) so you can do the clean up in steps.
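
For example, each cleanup step can materialize a new temporary table ( a
sketch reusing the names above; the trim() calls stand in for whatever
cleanup the data actually needs ):

CREATE TEMP TABLE temp_people_clean AS
SELECT id,
       trim(name) AS name,
       trim(extra_column_for_org_name) AS extra_column_for_org_name
FROM temp_people;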

Francisco Olarte.



Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Wed, 22 May 2019, Francisco Olarte wrote:

> You are not reading what we write to you. Note YOU AND ONLY YOU are the
> one speaking of PK. We are speaking of "unique identifier" ( that would
> be, IIRC, "candidate keys"; you can pick any as your PK, or even introduce
> a new synthetic one with a sequence, or a femtosecond exact timestamp or
> whatever ).

Francisco,

Let me clarify.

The organizations table has org_id (an integer) as PK.

The people table has person_id (an integer) as PK and org_id as the
reference to organization.org_id.

Does this help?

> When you are fluent in SQL you do not try to play with files, you import
> every column of your data into temporary tables, clean them up, and join (
> if needed ) them until you have a select that gives you what you want and
> then insert this. Normally you insert several SELECTS into temporary
> tables ( specially when you only have thousands of records ) so you can do
> the clean up in steps.

Most of my time is spent writing using LaTeX/LyX. Depending on the project's
needs I'll also use SQL, R, GRASS, and other tools. I'm a generalist, like
your PCP, not a specialist. But, I also rely on emacs, grep, sed, and awk
for data munging and am more fluent with these tools than I am with SQL or
Python.

For me, the quickest and simplest approach is to add the PKs to each table,
and the org_id into the people table, when I separate the cleaned text file
into the columns for each table.

Regards,

Rich



Re: Bulk inserts into two (related) tables

From
Francisco Olarte
Date:
Rich:

On Wed, May 22, 2019 at 6:07 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
> On Wed, 22 May 2019, Francisco Olarte wrote:
> > You are not reading what we write to you. Note YOU AND ONLY YOU are the
> > one speaking of PK. We are speaking of "unique identifier" ( that would
> > be, IIRC, "candidate keys"; you can pick any as your PK, or even introduce
> > a new synthetic one with a sequence, or a femtosecond exact timestamp or
> > whatever ).

> The organizations table has org_id (an integer) as PK.
> The people table has person_id (an integer) as PK and org_id as the
> reference to organization.org_id.
> Does this help?

It does. But if, as I assume, you are referring to the FINAL tables
where you have the data inserted, to suggest a good method we would
also need to know HOW the org_id / person_id are generated ( i.e.
sequences, by hand, externally assigned ).

Also, when I speak of "unique identifier" I'm not speaking of the one
in your FINAL tables, I assume you would have at least the *_id field
as PKEY, so nothing else needed, but the one in your SOURCE data set (
it can be anything, like the row number in the original excel ). If
you read the examples I sent you, I have "unique identifiers" for
organizations in both of them: in the first one it was the
organization name (clavelitos/bits cuadrados), in the other one it was
"the previously non-indented line". Both can be processed with a
little awk/perl + sql or with a lot of sql ( i.e., the indented one
can be loaded in a big temp table with a lot of text fields, or a
single one, and then split with sql functions, but perl/awk is much
easier ).

> > When you are fluent in SQL you do not try to play with files, you import
> > every column of your data into temporary tables, clean them up, and join (
> > if needed ) them until you have a select that gives you what you want and
> > then insert this. Normally you insert several SELECTS into temporary
> > tables ( specially when you only have thousands of records ) so you can do
> > the clean up in steps.
> Most of my time is spent writing using LaTeX/LyX. Depending on the project's
> needs I'll also use SQL, R, GRASS, and other tools. I'm a generalist, like
> your PCP, not a specialist. But, I also rely on emacs, grep, sed, and awk
> for data munging and am more fluent with these tools than I am with SQL or
> Python.

MMM, apart from angel dust I do not know what PCP could stand for. But
anyway, for simple one-liners and data filtering, sed/awk are more
than enough ( I substitute them with perl because it can do the same
in a little more verbose way and I had already learned it when
confronted with awk/sed ) ( I do not find Python particularly
palatable for one-shot text processing, YMMV ). If you are not too
fluent in SQL you normally do the pre-cleaning step in emacs ( marking
lines by their type, pasting long ones; there is no substitute for
eyeballing for that ), then use a couple of scripts to separate data
types ( orgs and persons ) into distinct files and clean extraneous
quotes, commas, etc. ( sed, pure shell, awk, any language can do this ).
At this stage you have something like a "clean CSV dump" ( although I
would recommend the old format of postgres dumps: newline
terminated, tab delimited, backslash escaped; it's much, much easier to
work with ). Then you import this into a couple of temporary tables and
use sql to do what it is good at ( adding a sequential column and
populating it with unique ids, populating foreign key columns by
joining on a collection of fields, and slicing the useful columns to
their final destination ).

And, if you hit a snag, you ask, but ask with a good description of
your problem. You have extra knowledge which you are trickling down to
us, which leads to huge frustration. We want to help, but you are not
letting us, probably not on purpose, by posting an incomplete
description of your problem and refusing to answer what we think are
simple and basic questions ( I assume this is due to "impedance
mismatch", not bad faith, otherwise I would just have blacklisted you
several messages ago ). From what I know of your problem so far, a
thousand lines, in csv, two tables, with some way to match one with
the other, my money says it could have been solved, given the right
info in the first place, in much less time than this message is taking.


> For me, the quickest and simplest appoach is to add the PKs to each table,
> and the org_id into the people table, when I separate the cleaned text file
> into the columns for each table.

That I do a lot, for small files. Being very fluent in perl, which
excels at text file crunching, I normally go to one-liners as soon as
I hit a couple of screens in size.

The problem is you stated a problem, we tried to help you, but it
seemed like you did not want to be helped. We still know zilch about
your problem. I.e., are there more than one person per organization?
( I assume so, as you are talking of FK in persons to org. ) So, given
two persons belonging to the same organization, how do you know that
info ( when seeing the excel/csv )? The answer to this question is
crucial. It may be something as "the organization data is duplicated
in every row", in which case the solution to your problem is trivial.
Or "the first person has organization data in its row, and the ones
below them do not have it", which is trivially mapped to your previous
case using a window query. Or "organizations with one person have all
the data in one row, organizations with more have one row with only
the organization and one more row per person below, without
organization data", which can be done too with a little more effort,
especially if you can do a couple of grep/awk + some emacs editing +
some simple sql.
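
For instance, the second case ( organization data only on the first
person's row ) can be filled down with a window query along these lines
( a sketch; staging, line_no and the other columns are invented, with
line_no preserving the file order ):

SELECT person_name,
       max(org_name) OVER (PARTITION BY grp) AS org_name
FROM (
    -- the running count increases on each row that carries an org_name,
    -- grouping it with the rows below that left it blank
    SELECT person_name, org_name,
           count(org_name) OVER (ORDER BY line_no) AS grp
    FROM staging
) g;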

Anyway, my final recommendation after all these mails: do it in emacs.

Regards
   Francisco Olarte.



Re: Bulk inserts into two (related) tables

From
Jeremy Finzel
Date:


On Wed, May 22, 2019 at 11:07 AM Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Wed, 22 May 2019, Francisco Olarte wrote:

> You are not reading what we write to you. Note YOU AND ONLY YOU are the
> one speaking of PK. We are speaking of "unique identifier" ( that would
> be, IIRC, "candidate keys"; you can pick any as your PK, or even introduce
> a new synthetic one with a sequence, or a femtosecond exact timestamp or
> whatever ).

Francisco,

Let me clarify.

The organizations table has org_id (an integer) as PK.

The people table has person_id (an integer) as PK and org_id as the
reference to organization.org_id.

Does this help?

Francisco hit just the point I was making.  I never said org_name should be the primary key.  We all acknowledge org_id will be the primary key.  But the whole question that would solve your issue is what other field or fields identify an organization in your data file.

You need two unique identifiers:
  1. A "natural" one from your data file that identifies unique information about an organization.  This is NOT the primary key on organizations table
  2. A "synthetic" one - org_id, generated in the organizations table.
The whole point of the (1) natural unique identifier is this is how you can *easily* link org_id back to the person record in your data file.
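
In schema terms the two identifiers sit side by side on the table; a
sketch ( the column names are made up, and the natural key may well span
several fields ):

CREATE TABLE organizations (
    org_id   serial PRIMARY KEY,   -- (2) the synthetic identifier
    org_name text NOT NULL,
    org_city text NOT NULL,
    UNIQUE (org_name, org_city)    -- (1) the natural identifier from the data file
);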

You say we don't need to see your data file to answer the question.  To me, it's still clear that if you were to share a sample of your data file (obviously, with fake data), all of this back and forth would quickly end.  This is all way too abstract.

Thanks,
Jeremy
 

Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Wed, 22 May 2019, Francisco Olarte wrote:

> Also, when I speak of "unique identifier" I'm not speaking of the one in
> your FINAL tables, I assume you would have at least the *_id field as
> PKEY, so nothing else needed, but the one in your SOURCE data set (it can
> be anything, like the row number in the original excel).

Francisco/Jeremy,

I'm grateful for your patient help. The 'unique identifier' in the source
file has been provided (just now) using nl <https://ss64.com/bash/nl.html>.
The syntax I used is:

nl -b a -n ln -s , -v 339 source.txt > out.txt

because the organizations table has 338 as the maximum org_id number.

I believe this fulfills the need for a known unique ID in the source file,
and when I parse each row using gawk to create the two files for table input
I can use it in both the organizations table (as the PK) and the people
table (as the FK referring to the organizations table). I can let postgres
assign the unique ID for the new rows in the people table.

Am I still missing something critical?

> MMM, apart from angel dust I do not know what PCP could stand for.

Primary Care Physician.

Best regards,

Rich




Re: Bulk inserts into two (related) tables

From
Jeremy Finzel
Date:
On Wed, May 22, 2019 at 12:53 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Wed, 22 May 2019, Francisco Olarte wrote:

> Also, when I speak of "unique identifier" I'm not speaking of the one in
> your FINAL tables, I assume you would have at least the *_id field as
> PKEY, so nothing else needed, but the one in your SOURCE data set (it can
> be anything, like the row number in the original excel).

Francisco/Jeremy,

I'm grateful for your patient help. The 'unique identifier' in the source
file has been provided (just now) using nl <https://ss64.com/bash/nl.html>.
The syntax I used is:

nl -b a -n ln -s , -v 339 source.txt > out.txt

because the organizations table has 338 as the maximum org_id number.

I believe this fulfills the need for a known unique ID in the source file,
and when I parse each row using gawk to create the two files for table input
I can use it in both the organizations table (as the PK) and the people
table (as the FK referring to the organizations table). I can let postgres
assign the unique ID for the new rows in the people table.

Am I still missing something critical?

Are you saying your database already has an organizations table, and this data file is appending to it with all-brand-new organizations and people?

Say you have 800 people.  Are you honestly saying that there are also 800 organizations?  If so, I guess your solution would work.  However, I thought that it was the case that multiple people belong to the same organization.  Hence, the entire problem of mapping we have been discussing.

The data model doesn't make much sense if that's true.

Also, you said previously that some fields from your data file would be populated in *both tables*.  That is a fundamental violation of DB design.  Why are you duplicating that data in two places?

Thanks,
Jeremy

Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Wed, 22 May 2019, Jeremy Finzel wrote:

> Are you saying your database already has an organizations table, and this
> data file is appending to it with all-brand-new organizations and people?

Jeremy,

The database has both organizations and people tables (among others) which
are already populated. The source file appends rows to both tables.

> Say you have 800 people. Are you honestly saying that there are also 800
> organizations? If so, I guess your solution would work. However, I thought
> that it was the case that multiple people belong to the same organization.
> Hence, the entire problem of mapping we have been discussing.

No. But, each organization may have several people, each one at a different
facility. Think of a bank with one headquarters but different managers in
each of their branches.

> Also, you said previously that some fields from your data file would be
> populated in *both tables*. That is a fundamental violation of DB design.
> Why are you duplicating that data in two places?

Not if the duplicated data field is the primary key in the organizations
table and the referenced foreign key in the people table.

Regards,

Rich



Re: Bulk inserts into two (related) tables

From
Adrian Klaver
Date:
On 5/22/19 10:53 AM, Rich Shepard wrote:
> On Wed, 22 May 2019, Francisco Olarte wrote:
> 
>> Also, when I speak of "unique identifier" I'm not speaking of the one in
>> your FINAL tables, I assume you would have at least the *_id field as
>> PKEY, so nothing else needed, but the one in your SOURCE data set (it can
>> be anything, like the row number in the original excel).
> 
> Francisco/Jeremy,
> 
> I'm grateful for your patient help. The 'unique identifier' in the source
> file has been provided (just now) using nl <https://ss64.com/bash/nl.html>.
> The syntax I used is:
> 
> nl -b a -n ln -s , -v 339 source.txt > out.txt
> 
> because the organizations table has 338 as the maximum org_id number.
> 
> I believe this fulfills the need for a known unique ID in the source file,
> and when I parse each row using gawk to create the two files for table 
> input
> I can use it in both the organizations table (as the PK) and the people
> table (as the FK referring to the organizations table). I can let postgres
> assign the unique ID for the new rows in the people table.
> 
> Am I still missing something critical?

A sample of the data you are cleaning up.

I think what people are trying to wrap their heads around is how 800
lines in the file are being split into two subsets: the organization data
and the people data. In particular, how is that being done so as to preserve
the relationship between organizations and people? This is before it
ever gets to the database.


> 
>> MMM, apart from angel dust I do not know what PCP could stand for.
> 
> Primary Care Physician.
> 
> Best regards,
> 
> Rich
> 
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Bulk inserts into two (related) tables

From
Jeremy Finzel
Date:
A sample of the data you are cleaning up.

I think what people are trying to wrap their heads around is how 800
lines in the file are being split into two subsets: the organization data
and the people data. In particular, how is that being done so as to preserve
the relationship between organizations and people? This is before it
ever gets to the database.

I agree.  We keep going in circles, and until we have a sample of the data it's not a good use of time for us to keep guessing at what none of us but you is clear about: what the data actually looks like.

Thanks,
Jeremy 

Re: Bulk inserts into two (related) tables

From
Rich Shepard
Date:
On Wed, 22 May 2019, Adrian Klaver wrote:

> A sample of the data you are cleaning up.

Adrian, et al.:

I have it working properly now. Both org_id and person_id numbers are
prepended to each row in the appropriate table and they are unique because
each series begins one greater than the max(*_id) in each table.

> I think what people are trying to wrap their heads around is how 800 lines
> in the file are being split into two subsets: the organization data and the
> people data. In particular, how is that being done so as to preserve the
> relationship between organizations and people? This is before it ever gets
> to the database.

After cleaning there are 655 lines rather than 800, but the short answer is
that splitting the data preserves the relationship between an organization
and its people:

#!/usr/bin/gawk -f

# Read input file, write fields to both organizations and people
# input files.

BEGIN { FS=OFS="," }
# for organizations table input:
{ print $1, $2, $4, "," $8, $9, $10, "'US'," $11, ",,," "'Opportunity','');" > "z-orgs.sql" }
# for people table input:
{ print $6, $5, "," $1, $3, $4, $5, $7, $8, $9, $10, "'US'," $11, "," $12, "," $13, "'true','');" > "z-people.sql" }

You can see that the org_id field ($1) is used in both files, but in
different columns in the two tables.

Regards,

Rich