Re: Creating composite keys from csv - Mailing list pgsql-general

From John McKown
Subject Re: Creating composite keys from csv
Date
Msg-id CAAJSdjh5m81JYR4U3kr_mfU3Mm0=Q_esGmVVAgPA030ZPacn5A@mail.gmail.com
Whole thread Raw
In response to Creating composite keys from csv  (Eli Murray <ejmurra2@illinimedia.com>)
List pgsql-general
On Sun, Mar 8, 2015 at 10:49 PM, Eli Murray <ejmurra2@illinimedia.com> wrote:
> Hi all,
>
> I'm a student journalist working on a project for our student paper which
> lists salaries and positions for every staff member at the university. We
> received the data from an FOI request but the university is refusing to give
> us primary keys for the data.
>
> The issue we've run into is that if there are two staff members with the
> same name (and there are) our current web app adds their salaries together
> and considers them one person. Now, luckily, we can create a composite key
> if we combine their name column with their salary column. Unfortunately, the
> format of the data we have makes it more difficult than that (of course!)
> because some employees can hold multiple paying positions.
>
> Here's some example data:
>
> Name, Position, Salary,Total Salary, ...
> Jane Doe, Dean, 100.000, 148.000, ...
> John Locke, Custodian, 30.000, 30.000, ...
> Jane Doe, Academic Adviser, 48.000, 148.000, ...
> Jane Doe, Trainer, 46.000, 46.000, ...

Looking at the above, The "Jane Doe" in relative lines #2 and #4 are
the same person, but are different from Jane Doe in relative line #5.
You would like another column, of type integer, which could be a
unique key. My approach would be something like the following:

-- create an intermediate table for the PersonID.
create table AssignID (PersonID serial primary key,
Name text, TotalSalary money,
constraint NameTotSal unique(Name, TotalSalary));
--
-- Generate the PersonID from the DISTINCT Name, TotalSalary columns
in CollegeData
insert into AssignID(Name, TotalSalary) select DISTINCT Name,
TotalSalary from CollegeData;
--
-- Add the PersonID column to the CollegeData table
ALTER TABLE CollegeData ADD COLUMN PersonID int;
--
-- Update the PersonID in CollegeData from the corresponding row in AssignID.
UPDATE CollegeData SET PersonID=SELECT PersonID from AssignID where
CollegeData.PersonID = AssignID.PersonID;
--
--

Note that the above is designed as a "one shot" to generate the
PersonID from the data in an existing CollegeData table. It is not
meant to handle any case where you do updates of the CollegeData
table, such as adding new employees or updating the salaries. Doing
that is much more difficult. And, of course, you'd better be
absolutely sure that there does not exist more than one "Jane Doe"
with equal "Total Salary" values.

>
> Basically, what we'd like to do is create a serial primary key but instead
> of having it increment every row, it needs to check the name and total
> salary columns and only increment if that person doesn't already exist. If
> they do exist, it should just assign the previously created number to the
> column. However, our team is small and between us we have very little
> experience working with databases and we haven't found a way to accomplish
> this goal yet. In fact, we may be trying to solve this in the wrong way
> entirely.
>
> So, to put it succinctly, how would you approach this problem? What are our
> options? Do we need to write a script to clean the data into separate csv
> tables before we import it to postgres, or is this something we can do in
> postgres? We'd really appreciate any help you all may be able to offer.
>
> Best!
> Eli Murray
>
>



--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown


pgsql-general by date:

Previous
From: pinker
Date:
Subject: Re: VACUUM FULL doesn't reduce table size
Next
From: Adrian Klaver
Date:
Subject: Re: VACUUM FULL doesn't reduce table size