Re: Bulk inserts into two (related) tables - Mailing list pgsql-general

From Francisco Olarte
Subject Re: Bulk inserts into two (related) tables
Date
Msg-id CA+bJJbxkuX0=t8zyd2TV6Gf6xL1qXhLXstBm_koq63a7aK=i3w@mail.gmail.com
Whole thread Raw
In response to Re: Bulk inserts into two (related) tables  (Rich Shepard <rshepard@appl-ecosys.com>)
Responses Re: Bulk inserts into two (related) tables
List pgsql-general
Rich:

On Wed, May 22, 2019 at 4:38 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:

> Not necessary; see below. Also, these data come from a regulator and
> provided as an Excel spreadsheet. If they were extracted from a database
> then that was very poorly designed because there's no consistency in how
> fields/columns are formatted. This requires manual cleaning.

Welcome to my world! I routinely ( dayly / weekly ) receive things
which I know are in a database but they decide to pass through excel (
a program well know in the data base comunity for eating data, as it
does not always store what it gets, I've managed to make excel export
a sheet to csv, import it, reexport it and get a (noticeabley)
different csv ). Or to denormalize. Or they send me files with a
mixture of utf-8 and latin1 ( in the same file ). Some of my importing
programs could qualify as IA.

> Each row in the source file (exported from the spreadsheet as .csv and
> renamed to .txt for processing in emacs and awk) is a mixture of attributes
> that belong in either or both of the organization and people tables in my
> database. An awk script will extract the appropriate fields for each table.

Now we know a bit.

> > You told Francisco that the data file does not have a unique org name that
> > could be used as a unique organization identifier. However you seem to
> > have contradicted that by responding favorably to this solution:
> The org_name is not the PK; the org_id is. This was assigned by postgres
> when the original rows were inserted. Now, I can add the org_id in the
> values to be inserted as I know the maximum org_id number in that table.

You are not reading what we write to you. Note YOU AND ONLY YOU are
the one speaking of PK. We are speaking of "unique identifier" ( that
would be, IIRC, "candidate keys", you can peek any as your PK, or even
introduce a new synthetic one with a sequence, or a femto second exact
timestamp or whatever ).


> > INSERT INTO PEOPLE (id, name, org_id) VALUES (1, ‘Bob’, (SELECT org_id FROM
> > org WHERE org_name=‘Main Office’))
> Question: do I use this same syntax for each row to be inserted or can I
> make it one long insert statement by separating the parenthesized values
> with commas as I do when I update multiple rows in a table?

If this works you do neither. You insert 'Main Office' as an extra
column in a TEMPORARY table and the n you do some thing like:

INSERT INTO PEOPLE (id, name, org_id ) (SELECT tp.id, tp.name,
o.org_id FROM temp_people tp , organizations o WHERE o.org_name =
tp.extra_column_for_org_name).

and drop the temporaty table after it.

When you are fluent in SQL you do not try to play with files, you
import every column of your data into temporary tables, clean them up,
and join ( if needed ) them until you have a select that gives you
what you want and then insert this. Normally you insert several
SELECTS into temporary tables ( specially when you only have thousands
of records ) so you can do the clean up in steps.

Francisco Olarte.



pgsql-general by date:

Previous
From: Francisco Olarte
Date:
Subject: Re: Bulk inserts into two (related) tables
Next
From: Rich Shepard
Date:
Subject: Re: Bulk inserts into two (related) tables