Thread: Creating composite keys from csv

Creating composite keys from csv

From

Eli Murray

Date:

09 March 2015, 03:49:41

Hi all,

I'm a student journalist working on a project for our student paper which lists salaries and positions for every staff member at the university. We received the data from an FOI request but the university is refusing to give us primary keys for the data.

The issue we've run into is that if there are two staff members with the same name (and there are) our current web app adds their salaries together and considers them one person. Now, luckily, we can create a composite key if we combine their name column with their salary column. Unfortunately, the format of the data we have makes it more difficult than that (of course!) because some employees can hold multiple paying positions.

Here's some example data:

Name, Position, Salary,Total Salary, ...

Jane Doe, Dean, 100.000, 148.000, ...

John Locke, Custodian, 30.000, 30.000, ...

Jane Doe, Academic Adviser, 48.000, 148.000, ...

Jane Doe, Trainer, 46.000, 46.000, ...

Basically, what we'd like to do is create a serial primary key but instead of having it increment every row, it needs to check the name and total salary columns and only increment if that person doesn't already exist. If they do exist, it should just assign the previously created number to the column. However, our team is small and between us we have very little experience working with databases and we haven't found a way to accomplish this goal yet. In fact, we may be trying to solve this in the wrong way entirely.

So, to put it succinctly, how would you approach this problem? What are our options? Do we need to write a script to clean the data into separate csv tables before we import it to postgres, or is this something we can do in postgres? We'd really appreciate any help you all may be able to offer.

Best!

Eli Murray

Re: Creating composite keys from csv

From

Roxanne Reid-Bennett

Date:

09 March 2015, 04:51:48

On 3/8/2015 11:49 PM, Eli Murray wrote:
> Hi all,
>
> I'm a student journalist working on a project for our student paper
> which lists salaries and positions for every staff member at the
> university. We received the data from an FOI request but the
> university is refusing to give us primary keys for the data.
>
> The issue we've run into is that if there are two staff members with
> the same name (and there are) our current web app adds their salaries
> together and considers them one person. Now, luckily, we can create a
> composite key if we combine their name column with their salary
> column. Unfortunately, the format of the data we have makes it more
> difficult than that (of course!) because some employees can hold
> multiple paying positions.

Take a look at the windowing functions:

http://www.postgresql.org/docs/9.4/static/functions-window.html

Roxanne
>
> Here's some example data:
>
> Name, Position, Salary,Total Salary, ...
> Jane Doe, Dean, 100.000, 148.000, ...
> John Locke, Custodian, 30.000, 30.000, ...
> Jane Doe, Academic Adviser, 48.000, 148.000, ...
> Jane Doe, Trainer, 46.000, 46.000, ...
>
> Basically, what we'd like to do is create a serial primary key but
> instead of having it increment every row, it needs to check the name
> and total salary columns and only increment if that person doesn't
> already exist. If they do exist, it should just assign the previously
> created number to the column. However, our team is small and between
> us we have very little experience working with databases and we
> haven't found a way to accomplish this goal yet. In fact, we may be
> trying to solve this in the wrong way entirely.
>
> So, to put it succinctly, how would you approach this problem? What
> are our options? Do we need to write a script to clean the data into
> separate csv tables before we import it to postgres, or is this
> something we can do in postgres? We'd really appreciate any help you
> all may be able to offer.
>
> Best!
> Eli Murray
>
>


--
[At other schools] I think the most common fault in general is to teach students how to pass exams instead of teaching
themthe science. 
Donald Knuth

Re: Creating composite keys from csv

From

David G Johnston

Date:

09 March 2015, 05:32:41

Eli Murray wrote
> Basically, what we'd like to do is create a serial primary key but instead
> of having it increment every row, it needs to check the name and total
> salary columns and only increment if that person doesn't already exist.

So you already have a PK, (Name, Total Salary), but the source data doesn't
provide a usable surrogate key to operate with.

I would create Person and Person-Position tables and after importing the CSV
data to a Staging area write a query to insert any unknown (Name, Total
Salary) records into Person with a serial PK field.

You can then join Person back onto Staging using (Name, Total Salary) but
now include the PK and select just the PK, Position, and Salary fields which
you can then add to the Person-Role table.

You now have a Person table with (PK, Name, Total Salary) and
Person-Position with (PK, Position, Role Salary) and you can discard the
imported CSV data.

This solves the explicit problem given the assumption that (Name, Total
Salary) is indeed a uniquely identifying constraint.

David J.




--
View this message in context: http://postgresql.nabble.com/Creating-composite-keys-from-csv-tp5841038p5841043.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Creating composite keys from csv

From

John R Pierce

Date:

09 March 2015, 05:39:39

On 3/8/2015 10:32 PM, David G Johnston wrote:
> This solves the explicit problem given the assumption that (Name, Total
> Salary) is indeed a uniquely identifying constraint.

that constraint seems flawed to me.


--
john r pierce                                      37N 122W
somewhere on the middle of the left coast

Re: Creating composite keys from csv

From

John McKown

Date:

09 March 2015, 12:11:14

On Sun, Mar 8, 2015 at 10:49 PM, Eli Murray <ejmurra2@illinimedia.com> wrote:
> Hi all,
>
> I'm a student journalist working on a project for our student paper which
> lists salaries and positions for every staff member at the university. We
> received the data from an FOI request but the university is refusing to give
> us primary keys for the data.
>
> The issue we've run into is that if there are two staff members with the
> same name (and there are) our current web app adds their salaries together
> and considers them one person. Now, luckily, we can create a composite key
> if we combine their name column with their salary column. Unfortunately, the
> format of the data we have makes it more difficult than that (of course!)
> because some employees can hold multiple paying positions.
>
> Here's some example data:
>
> Name, Position, Salary,Total Salary, ...
> Jane Doe, Dean, 100.000, 148.000, ...
> John Locke, Custodian, 30.000, 30.000, ...
> Jane Doe, Academic Adviser, 48.000, 148.000, ...
> Jane Doe, Trainer, 46.000, 46.000, ...

Looking at the above, The "Jane Doe" in relative lines #2 and #4 are
the same person, but are different from Jane Doe in relative line #5.
You would like another column, of type integer, which could be a
unique key. My approach would be something like the following:

-- create an intermediate table for the PersonID.
create table AssignID (PersonID serial primary key,
Name text, TotalSalary money,
constraint NameTotSal unique(Name, TotalSalary));
--
-- Generate the PersonID from the DISTINCT Name, TotalSalary columns
in CollegeData
insert into AssignID(Name, TotalSalary) select DISTINCT Name,
TotalSalary from CollegeData;
--
-- Add the PersonID column to the CollegeData table
ALTER TABLE CollegeData ADD COLUMN PersonID int;
--
-- Update the PersonID in CollegeData from the corresponding row in AssignID.
UPDATE CollegeData SET PersonID=SELECT PersonID from AssignID where
CollegeData.PersonID = AssignID.PersonID;
--
--

Note that the above is designed as a "one shot" to generate the
PersonID from the data in an existing CollegeData table. It is not
meant to handle any case where you do updates of the CollegeData
table, such as adding new employees or updating the salaries. Doing
that is much more difficult. And, of course, you'd better be
absolutely sure that there does not exist more than one "Jane Doe"
with equal "Total Salary" values.

>
> Basically, what we'd like to do is create a serial primary key but instead
> of having it increment every row, it needs to check the name and total
> salary columns and only increment if that person doesn't already exist. If
> they do exist, it should just assign the previously created number to the
> column. However, our team is small and between us we have very little
> experience working with databases and we haven't found a way to accomplish
> this goal yet. In fact, we may be trying to solve this in the wrong way
> entirely.
>
> So, to put it succinctly, how would you approach this problem? What are our
> options? Do we need to write a script to clean the data into separate csv
> tables before we import it to postgres, or is this something we can do in
> postgres? We'd really appreciate any help you all may be able to offer.
>
> Best!
> Eli Murray
>
>



--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown

Re: Creating composite keys from csv

From

Adrian Klaver

Date:

09 March 2015, 13:36:53

On 03/08/2015 08:49 PM, Eli Murray wrote:
> Hi all,
>
> I'm a student journalist working on a project for our student paper
> which lists salaries and positions for every staff member at the
> university. We received the data from an FOI request but the university
> is refusing to give us primary keys for the data.
>
> The issue we've run into is that if there are two staff members with the
> same name (and there are) our current web app adds their salaries
> together and considers them one person. Now, luckily, we can create a
> composite key if we combine their name column with their salary column.
> Unfortunately, the format of the data we have makes it more difficult
> than that (of course!) because some employees can hold multiple paying
> positions.
>
> Here's some example data:
>
> Name, Position, Salary,Total Salary, ...
> Jane Doe, Dean, 100.000, 148.000, ...
> John Locke, Custodian, 30.000, 30.000, ...
> Jane Doe, Academic Adviser, 48.000, 148.000, ...
> Jane Doe, Trainer, 46.000, 46.000, ...
>
> Basically, what we'd like to do is create a serial primary key but
> instead of having it increment every row, it needs to check the name and
> total salary columns and only increment if that person doesn't already
> exist. If they do exist, it should just assign the previously created
> number to the column.

Well the above is not going to work, because the id would not be unique
across rows and therefore could not be a primary key. If I am following
what you want is a staff id that identifies a particular staff member
across rows and is derived from the (Name, Total Salary) combination, is
that correct? If so you could use a serial column to generate a
surrogate primary key for each row without worrying about the names and
total salary. Then it becomes an issue of generating the staff id for
unique staff members. For that I would see John McKowns answer.

  However, our team is small and between us we have
> very little experience working with databases and we haven't found a way
> to accomplish this goal yet. In fact, we may be trying to solve this in
> the wrong way entirely.
>
> So, to put it succinctly, how would you approach this problem? What are
> our options? Do we need to write a script to clean the data into
> separate csv tables before we import it to postgres, or is this
> something we can do in postgres? We'd really appreciate any help you all
> may be able to offer.
>
> Best!
> Eli Murray
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Creating composite keys from csv

From

Eli Murray

Date:

09 March 2015, 15:12:48

Thank you all for your help. I'm following along with John McKown's suggestion but when I run the update query I get "UPDATE 32956" but the personid column in my rawdata table has null values for every record.

Here's the exact query I ran:

UPDATE rawdata SET personid = (SELECT personid FROM assignid WHERE rawdata.personid = assignid.personid);

I think the issue might be that it's only selecting records where personid is the same in both tables and right now there are only null values in rawdata.personid. What query should I write to SET rawdata.personid = assignid.personid WHERE rawdata.employeename && rawdata.totalsalary = assignid.name && assignid.totalsalary?

On Mon, Mar 9, 2015 at 8:36 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 03/08/2015 08:49 PM, Eli Murray wrote:
Hi all,

I'm a student journalist working on a project for our student paper
which lists salaries and positions for every staff member at the
university. We received the data from an FOI request but the university
is refusing to give us primary keys for the data.

The issue we've run into is that if there are two staff members with the
same name (and there are) our current web app adds their salaries
together and considers them one person. Now, luckily, we can create a
composite key if we combine their name column with their salary column.
Unfortunately, the format of the data we have makes it more difficult
than that (of course!) because some employees can hold multiple paying
positions.

Here's some example data:

Name, Position, Salary,Total Salary, ...
Jane Doe, Dean, 100.000, 148.000, ...
John Locke, Custodian, 30.000, 30.000, ...
Jane Doe, Academic Adviser, 48.000, 148.000, ...
Jane Doe, Trainer, 46.000, 46.000, ...

Basically, what we'd like to do is create a serial primary key but
instead of having it increment every row, it needs to check the name and
total salary columns and only increment if that person doesn't already
exist. If they do exist, it should just assign the previously created
number to the column.

Well the above is not going to work, because the id would not be unique across rows and therefore could not be a primary key. If I am following what you want is a staff id that identifies a particular staff member across rows and is derived from the (Name, Total Salary) combination, is that correct? If so you could use a serial column to generate a surrogate primary key for each row without worrying about the names and total salary. Then it becomes an issue of generating the staff id for unique staff members. For that I would see John McKowns answer.

However, our team is small and between us we have
very little experience working with databases and we haven't found a way
to accomplish this goal yet. In fact, we may be trying to solve this in
the wrong way entirely.

So, to put it succinctly, how would you approach this problem? What are
our options? Do we need to write a script to clean the data into
separate csv tables before we import it to postgres, or is this
something we can do in postgres? We'd really appreciate any help you all
may be able to offer.

Best!
Eli Murray

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reporter at The Daily Illini

ejmurra2@illinimedia.com

(815) 985-8760

Re: Creating composite keys from csv

From

John McKown

Date:

09 March 2015, 16:49:55

On Mon, Mar 9, 2015 at 10:12 AM, Eli Murray <ejmurra2@illinimedia.com> wrote:
> Thank you all for your help. I'm following along with John McKown's
> suggestion but when I run the update query I get "UPDATE 32956" but the
> personid column in my rawdata table has null values for every record.
>
> Here's the exact query I ran:
>
> UPDATE rawdata SET personid = (SELECT personid FROM assignid WHERE
> rawdata.personid = assignid.personid);

My apologies. I really messed that one up. And your analysis is
correct. I will plead "early morning brain failure". The UPDATE should
look like:

UPDATE CollegeData SET PersonID=SELECT PersonID from AssignID where
CollegeData.Name = AssignID.Name & CollegeData.TotalSalary =
AssignID.TotalSalary;

>
> I think the issue might be that it's only selecting records where personid
> is the same in both tables and right now there are only null values in
> rawdata.personid. What query should I write to SET rawdata.personid =
> assignid.personid WHERE rawdata.employeename && rawdata.totalsalary =
> assignid.name && assignid.totalsalary?
>

--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown