Thread: Primary keys for companies and people

Primary keys for companies and people

From

Michael Glaesemann

Date:

02 February 2006, 04:05:14

Hello, all!

Recently there was quite a bit of discussion regarding surrogate keys
and natural keys. I'm not interested in discussing the pros and cons
of surrogate keys. What I'd like to find out are the different
methods people actually use to uniquely identify companies and people
*besides* surrogate keys.

I'm currently working on an application that will include contact
information, so being able to uniquely identify these two entities is
of interest to me. Right now I'm thinking of uniquely identifying
companies by telephone number. For example:

create table companies (
    company_id integer primary key -- telephone number, not serial
    , company_name text not null
);

Of course, the company may have more than one telephone number
associated with it, so there will also be a table associating
telephone numbers and companies.

create table companies__telephone_numbers (
    company_id integer not null
        references companies (company_id)
        on update cascade on delete cascade
    , telephone_number integer not null
    , unique (company_id, telephone_number)
);

There should also be a trigger that will check that the company_id
matches an existing telephone number associated with the company,
something like:

create function assert_company_id_telephone_number_exists
returns trigger
language plpgsql as $$
begin
if exists (
    select company_id
    from companies
    except
    select company_id
    from companies
    join companies__telephone_numbers on (company_id = telephone_number)
    )
then raise exception 'company_id must match existing company
telephone number';
end if;
return null;
end;
$$;

For people I'm more or less stumped. I can't think of a combination
of things that I know I'll be able to get from people that I'll want
to be able to add to the database. Starting off we'll have at least
7,000 individuals in the database, and I don't think that just family
and given names are going to be enough. I don't think we'll be able
to get telephone numbers for all of them, and definitely aren't going
to be getting birthdays for all.

I'm very interested to hear what other use in their applications for
holding people and companies.

Michael Glaesemann
grzm myrealbox com

Re: Primary keys for companies and people

From

"Leif B. Kristensen"

Date:

02 February 2006, 05:08:03

On Thursday 02 February 2006 09:05, Michael Glaesemann wrote:

>For people I'm more or less stumped. I can't think of a combination
>of things that I know I'll be able to get from people that I'll want
>to be able to add to the database. Starting off we'll have at least
>7,000 individuals in the database, and I don't think that just family
>and given names are going to be enough. I don't think we'll be able
>to get telephone numbers for all of them, and definitely aren't going
>to be getting birthdays for all.
>
>I'm very interested to hear what other use in their applications for
>holding people and companies.

I've been thinking long and hard about the same thing myself, in
developing my genealogy database. For identification of people, there
seems to be no realistic alternative to an arbitrary ID number.

Still, I'm struggling with the basic concept of /identity/, eg. is the
William Smith born to John Smith and Jane Doe in 1733, the same William
Smith who marries Mary Jones in the same parish in 1758? You may never
really know. Still, collecting such disparate "facts" under the same ID
number, thus taking the identity more or less for granted, is the modus
operandi of computer genealogy. Thus, one of the major objectives of
genealogy research, the assertion of identity, becomes totally hidden
the moment that you decide to cluster disparate evidence about what may
actually have been totally different persons, under a single ID number.

The alternative is of course to collect each cluster of evidence under a
separate ID, but then the handling of a "person" becomes a programmer's
nightmare.

I have been writing about my genealogy data model here:
<url:http://solumslekt.org/forays/blue.php> The model has been slightly
modified since I wrote this; due to what I perceive as 'gotchas' in the
PostgreSQL implementation of table inheritance, I have dropped the
'citations' table. Besides, I've dropped some of the surrogate keys,
and more will follow. I really should update this article soon.

I should perhaps be posting this under another subject, but I feel that
beneath the surface, Michael's problem and my own are strongly related.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE

Re: Primary keys for companies and people

From

David Goodenough

Date:

02 February 2006, 07:09:29

On Thursday 02 February 2006 09:07, Leif B. Kristensen wrote:
> On Thursday 02 February 2006 09:05, Michael Glaesemann wrote:
> >For people I'm more or less stumped. I can't think of a combination
> >of things that I know I'll be able to get from people that I'll want
> >to be able to add to the database. Starting off we'll have at least
> >7,000 individuals in the database, and I don't think that just family
> >and given names are going to be enough. I don't think we'll be able
> >to get telephone numbers for all of them, and definitely aren't going
> >to be getting birthdays for all.
> >
> >I'm very interested to hear what other use in their applications for
> >holding people and companies.
>
> I've been thinking long and hard about the same thing myself, in
> developing my genealogy database. For identification of people, there
> seems to be no realistic alternative to an arbitrary ID number.
>
> Still, I'm struggling with the basic concept of /identity/, eg. is the
> William Smith born to John Smith and Jane Doe in 1733, the same William
> Smith who marries Mary Jones in the same parish in 1758? You may never
> really know. Still, collecting such disparate "facts" under the same ID
> number, thus taking the identity more or less for granted, is the modus
> operandi of computer genealogy. Thus, one of the major objectives of
> genealogy research, the assertion of identity, becomes totally hidden
> the moment that you decide to cluster disparate evidence about what may
> actually have been totally different persons, under a single ID number.
>
> The alternative is of course to collect each cluster of evidence under a
> separate ID, but then the handling of a "person" becomes a programmer's
> nightmare.
>
> I have been writing about my genealogy data model here:
> <url:http://solumslekt.org/forays/blue.php> The model has been slightly
> modified since I wrote this; due to what I perceive as 'gotchas' in the
> PostgreSQL implementation of table inheritance, I have dropped the
> 'citations' table. Besides, I've dropped some of the surrogate keys,
> and more will follow. I really should update this article soon.
>
> I should perhaps be posting this under another subject, but I feel that
> beneath the surface, Michael's problem and my own are strongly related.
There is also the problem that a name can change.  People change names
by deed-poll, and also women can adopt a married name or keep their old
one.  All in all an ID is about the only answer.

David

Re: Primary keys for companies and people

From

Martijn van Oosterhout

Date:

02 February 2006, 16:09:13

On Thu, Feb 02, 2006 at 10:36:54AM +0000, David Goodenough wrote:
> > Still, I'm struggling with the basic concept of /identity/, eg. is the
> > William Smith born to John Smith and Jane Doe in 1733, the same William
> > Smith who marries Mary Jones in the same parish in 1758? You may never
> > really know. Still, collecting such disparate "facts" under the same ID
> > number, thus taking the identity more or less for granted, is the modus
> > operandi of computer genealogy. Thus, one of the major objectives of
> > genealogy research, the assertion of identity, becomes totally hidden
> > the moment that you decide to cluster disparate evidence about what may
> > actually have been totally different persons, under a single ID number.
> >
> > The alternative is of course to collect each cluster of evidence under a
> > separate ID, but then the handling of a "person" becomes a programmer's
> > nightmare.

> There is also the problem that a name can change.  People change names
> by deed-poll, and also women can adopt a married name or keep their old
> one.  All in all an ID is about the only answer.

True, the issue being ofcourse that changing a name doesn't change
their identity.

To the GP, your page is an interesting one and raises several
interesting points. In particular the one about the "person" being the
conclusion of the rest of the database. You essentially have a set of
facts "A married B in C on date D" and you're trying to correlate
these. In the end it's just a certain amount of guess work, especially
since back then they wern't that particular about spelling as they are
today.

My naive view is that you're basically assigning trust values to each
fact and the chance that two citations refer to the same person. In
principle you'd be able to cross-reference all these citations and
build the structure quasi-automatically. I suppose in practice this is
done by hand.

As for your question, I think you're stuck with having a person ID.
Basically because you need to identify a person somehow. Given you
still have the original citiations, you can split a person into
multiple if the situation appears to not work out.

One thing I find odd though, your "person" objects have no birthdate or
deathdate. Or birth place either. I would have thought these elements
would be fundamental in determining if two people are the same, given
that they can't change and people are unlikely to forget them.

Put another way, two people with the same birthday in the same place
with similar names are very likely to be the same. If you can
demostrate this is not the case that's another fact. In the end you're
dealing with probabilities, you can never know for sure.

Anyway, hope this helps. It's a subject I've been vaguely interested in
but never really had the time to look into.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

signature.asc

Re: Primary keys for companies and people

From

"Ted Byers"

Date:

02 February 2006, 18:08:17

----- Original Message -----
From: "Leif B. Kristensen" <leif@solumslekt.org>
To: <pgsql-general@postgresql.org>
Sent: Thursday, February 02, 2006 4:07 AM
Subject: Re: [GENERAL] Primary keys for companies and people

>> [snip]
>>I'm very interested to hear what other use in their applications for
>>holding people and companies.
>
> I've been thinking long and hard about the same thing myself, in
> developing my genealogy database. For identification of people, there
> seems to be no realistic alternative to an arbitrary ID number.
>
> Still, I'm struggling with the basic concept of /identity/, eg. is the
> William Smith born to John Smith and Jane Doe in 1733, the same William
[snip]

I have long been interested in this issue, and it is one that transcends the
problem of IDs in IT.  For my second doctorate, I examined this in the
context of historical investigation, applying numerical classification
techniques to biographical information that can be extracted from historical
documents.  It is, I fear, a problem for which only a probabilistic answer
can be obtained in most historical cases.  For example, there was an
eleventh century viking king Harold who as a teenager was part of his
cousin's court, and then found it necessary to flee to Kiev when his cousin
found hiimself on the losing side of a rebellion.  He then made his way into
the Byzantine empire and served the emperor as a mercenary through much of
the mediterranean, finally returning in fame and glory to Norway where he
found another relative (a nephew IIRC) on the throne, which he inherited
about a year after his return.  Impagine yourself as a historian trying to
write his biography.  You'd find various documents all over the western
world (as known in the viking age) written in a variety of languages, and
using different names to refer to him.  It isn't an easy task to determine
which documents refer specifically to him.  And to make things even more
interesting, many documents refer to a given person only by his official
title, and in other cases, the eldest son of each generation was given the
same name as his father.

In my own case, in the time I was at the University of Toronto, I know of
four other men who had precisely the same name I have.  I know this from
strange phone calls from faculty I never studied with about assignments and
examinations for courses I had never taken.  In each case, the professor
checked again with the university's records department and found the correct
student.  The last case was particularly disturbing since in that case,
things were a bit different in that I had taken a graduate course with the
professor in question, and he stopped me on campus and asked about an
assignment for a given advanced undergraduate course that I had not taken,
but my namesake had.  What made this disturbing is that not only did the
other student carry my name, but he also looked enough like me that our
professor could mistake me for him on campus!  I can only hope that he is a
well behaved, law abiding citizen!  The total time period in question was 18
years.  In general, the problem only gets more challenging as the length,
and as the age, of the historical period considered increases.

The point is, not only are the combinations of family and given names not
reliably unique, even certain biological data, such as photographs of the
human face, not adequately unique.  Even DNA fingerprints, putatively the
best available biometric technology, are not entirely reliable since even
that can not distinguish between identical twins, and at the same time,
there can be, admittedly extremely rare as far as we know, developmental
anomalies resulting in a person being his own twin (this results from twin
fetuses merging, with the consequence that the resulting person has some
organ systems from one of the original fetuses and some from the other).
For historical questions, I don't believe one can get any better than
inference based on a balance of probabilities.  A geneologist has no option
but to become an applied statistician!  For purposes of modern investigation
or for the purpose of modern business, one may do better through an
appropriate use of a combination of technologies.  This is a hard problem,
even with the use of the best available technologies and especially given
the current problems associated with identity theft.

For software developers in general, and database developers in particular,
there are several distinct questions to consider.:

1) How does one reliably determine identity to begin with, and then use that
identity with whatever technology one might use to represent it?

2) How good does this technology, and identification process need to be?  In
other words, how does the cost of a mistake (esp. in identification) relate
to the increased cost of using better technology?  In this analysis, one
needs to consider both the cost of such a mistake to the person identified
or misidentified, and the cost to the owner or user of the application or
database.  Who will suffer if a mistake is made?  Will, or can, bad things
happen if a given person ends up with more than one ID?  What is the cost,
and who bears this cost, if more than one person can use the same ID?

3) Can we construct a suite of best practices from which we can select given
specific functional or non-functional constraints as developed for our
application?  Included with this question is consideration of protection of
sensitive data in general, and protection of data that might conceivably be
used by cyber-criminals in activity related to identity theft, or to use
sensitive data to the harm of the person so identified.

4) How is biometric data best stored and searched for use in authentication
processes within an arbitrary application?  I guess this question assumes
that biometric data needs to be used in an authentication request, and it
occurs to me that for some applications, it may be sufficient to use
biometric data in creation of a unique user id, and subsequently may be
needed only for certain sensitive processes or resources.

My own feeling is that some options are very easy, and some of these are
adequate for some situations, but that there are others that may be needed
depending on the sentivity of the data in question or on the potential cost
to one or more parties to a given business process.  I expect to be
considering these issues extensively over the next few years since they are
relevant to some of the web applications I am designing.  Any insights you,
or others, may have on these questions would be greatly appreciated.

Cheers,

Ted

R.E. (Ted) Byers, Ph.D., Ed.D.
R & D Decision Support Solutions
http://www.randddecisionsupportsolutions.com/

Re: Primary keys for companies and people

From

"Leif B. Kristensen"

Date:

02 February 2006, 18:12:01

On Thursday 02 February 2006 21:09, Martijn van Oosterhout wrote:
>To the GP, your page is an interesting one and raises several
>interesting points. In particular the one about the "person" being the
>conclusion of the rest of the database. You essentially have a set of
>facts "A married B in C on date D" and you're trying to correlate
>these. In the end it's just a certain amount of guess work, especially
>since back then they wern't that particular about spelling as they are
>today.
>
>My naive view is that you're basically assigning trust values to each
>fact and the chance that two citations refer to the same person. In
>principle you'd be able to cross-reference all these citations and
>build the structure quasi-automatically. I suppose in practice this is
>done by hand.

Yes it is. As I stated in the article, I'd like to quantify a
'participant' of an 'event' as a "vector in genealogy space", but I
haven't really figured out a sensible entry mode for that evidence yet.
For now, I'm trying to enter as much information as possible into the
source citations.

>As for your question, I think you're stuck with having a person ID.
>Basically because you need to identify a person somehow. Given you
>still have the original citiations, you can split a person into
>multiple if the situation appears to not work out.
>
>One thing I find odd though, your "person" objects have no birthdate
> or deathdate. Or birth place either.

I've appropriated the model from my previous program, The Master
Genealogist. I like the approach that the "person" entity should
contain the least possible number of assertions. I've got views and
functions that retrieves a primary birth and death date from the
database automatically.

> I would have thought these
> elements would be fundamental in determining if two people are the
> same, given that they can't change and people are unlikely to forget
> them.

Yes. But in 18th century genealogy, at least in Norway, you're unlikely
to find a birth date and place in other records than the christening.
As a matter of fact, the birth date wasn't usually recorded either, but
as the christening usually took place within a week after birth, you've
got a pretty good approximation.

>Put another way, two people with the same birthday in the same place
>with similar names are very likely to be the same. If you can
>demostrate this is not the case that's another fact. In the end you're
>dealing with probabilities, you can never know for sure.

18th century genealogy has a lot in common with crime investigation.
You've basically got a few clues, and try to figure out a picture from
the sparse evidence that may be found. I love that challenge, but it
may be quite taxing sometimes.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE

Re: Primary keys for companies and people

From

Merlin Moncure

Date:

02 February 2006, 18:25:47

> > I should perhaps be posting this under another subject, but I feel that
> > beneath the surface, Michael's problem and my own are strongly related.
> There is also the problem that a name can change.  People change names
> by deed-poll, and also women can adopt a married name or keep their old
> one.  All in all an ID is about the only answer.

I'll take the other side of this issue.  The fact that a primary key
is mutable does not make it any less primary.  As long as we can can
count on it to be unique, how often identiying info changes has no
bearing on its selection as a p-key from a relational standpoint.  SQL
gives us RI to help deal with this but in this is not always practical
if for example you you have a changing p-key that cascades to a
million records.  The performance issue has zero meaning in a
conceptual sense however and I think you are trying to grapple things
in conceptual terms.

By assigning a surrogate key to a person, you are simply moving the
guess work from one place to another.  If you can't logically
determine who 'John Smith' is, how can you possibly expect to relate
information to him? (which john smith? why, etc)...you are just hiding
a guess behind a number.  Put into other words, *a record must have
unique identifiying criteria or the table containing it cannot be
expected to give correct results*.  This is, more or less, a
mathematical truth.  The non key attributes of the tuple are now
undefined because they can give two or more different answers to the
same question.  Surrogates do not change this principle, they just
hide it but it still has to be dealt with.

Merlin

Re: Primary keys for companies and people

From

Michael Glaesemann

Date:

02 February 2006, 18:41:02

On Feb 3, 2006, at 7:25 , Merlin Moncure wrote:

>> There is also the problem that a name can change.  People change
>> names
>> by deed-poll, and also women can adopt a married name or keep
>> their old
>> one.  All in all an ID is about the only answer.
>
> I'll take the other side of this issue.  The fact that a primary key
> is mutable does not make it any less primary.  As long as we can can
> count on it to be unique, how often identiying info changes has no
> bearing on its selection as a p-key from a relational standpoint.


> The performance issue has zero meaning in a
> conceptual sense however and I think you are trying to grapple things
> in conceptual terms.

I definitely agree with you here, Merlin. Mutability is not the issue
at hand. May I ask what strategies you use for determining uniqueness
for people?

Michael Glaesemann
grzm myrealbox com

Re: Primary keys for companies and people

From

Mark Dilger

Date:

02 February 2006, 23:36:24

Michael Glaesemann wrote:
> Hello, all!
>
> Recently there was quite a bit of discussion regarding surrogate keys
> and natural keys. I'm not interested in discussing the pros and cons  of
> surrogate keys. What I'd like to find out are the different  methods
> people actually use to uniquely identify companies and people  *besides*
> surrogate keys.
>
> I'm currently working on an application that will include contact
> information, so being able to uniquely identify these two entities is
> of interest to me. Right now I'm thinking of uniquely identifying
> companies by telephone number. For example:
>
> create table companies (
>     company_id integer primary key -- telephone number, not serial
>     , company_name text not null
> );
>
> Of course, the company may have more than one telephone number
> associated with it, so there will also be a table associating  telephone
> numbers and companies.
>
> create table companies__telephone_numbers (
>     company_id integer not null
>         references companies (company_id)
>         on update cascade on delete cascade
>     , telephone_number integer not null
>     , unique (company_id, telephone_number)
> );
>
> There should also be a trigger that will check that the company_id
> matches an existing telephone number associated with the company,
> something like:
>
> create function assert_company_id_telephone_number_exists
> returns trigger
> language plpgsql as $$
> begin
> if exists (
>     select company_id
>     from companies
>     except
>     select company_id
>     from companies
>     join companies__telephone_numbers on (company_id = telephone_number)
>     )
> then raise exception 'company_id must match existing company  telephone
> number';
> end if;
> return null;
> end;
> $$;
>
> For people I'm more or less stumped. I can't think of a combination  of
> things that I know I'll be able to get from people that I'll want  to be
> able to add to the database. Starting off we'll have at least  7,000
> individuals in the database, and I don't think that just family  and
> given names are going to be enough. I don't think we'll be able  to get
> telephone numbers for all of them, and definitely aren't going  to be
> getting birthdays for all.
>
> I'm very interested to hear what other use in their applications for
> holding people and companies.
>
> Michael Glaesemann
> grzm myrealbox com
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>

Telephone numbers make bad primary keys because they get recycled.  A phone
number that belongs to me this year may belong to somebody else next year.

Re: Primary keys for companies and people

From

Merlin Moncure

Date:

03 February 2006, 13:23:36

> I definitely agree with you here, Merlin. Mutability is not the issue
> at hand. May I ask what strategies you use for determining uniqueness
> for people?

Well, that depends on the particular problem at hand.  If you had two
john smiths in your system, how would you distinguish them? If you
assinged an account number in your system and gave it to the person to
refer back to you, this ok...this is not a surrogate key per se, but a
meaningful alias.

However, that technique can't always be applied, take the case of the
'contacts' table for an account.  Since you don't give each contact of
each accunt a number and you don't print a number representing them on
various reports (there would be no reason to), adding a surrogate to
the table adds nothing to your database, it's just meaningless
infromation with no semantic value.  There *must* be a semantic
difference between the two John Smiths or you should be storing one
record in the database, not two.

If you kind determine an easy natural differentiator, invent one:
create table contact
(
  account text, name text, memo text,
  primary key(account, name, memo)
);

The memo field is blank in most cases unlees it's needed.  Suppose you
were filling contact information in your databse and Taking your
second John Smith from an account...your operator says, 'we already
have a john smith for your account, can you give us something to
identify him?'  Put that in the memo field and there you go.

Now your operator is taking information which has value pertaining to
the scope of the problem domain your application exists in.  This is
just one example of how to approach the problem  Now there is no
ambiguiity about which john smith you are dealing with. This may not
be a perfect solution for every application but there is basically has
to be a method of finding semantic unquenes to your data or you have a
hole in your data model.  Glossing over that hole with artificial
information solves nothing.

There are pracitcal reasons to use surrogates but the uniqueness
argument generally holds no water.  By 'generating' uniqueness you are
breaking your data on mathematical terms.  Until you truly understand
the ramifcations of that statement you can't really understand when
the practical cases apply.  Many of the arguments for surrugates based
on mutability and uniqueness are simply illogical.

The performance issue is more complex and leads to the issue of
practicality.  I wouldn't find any fault with a modeler who
benchmarked his app with a surrogate vs. a 5 part key  and chose the
former as long as he truly understood the downside to doing that
(extra joins).

Merlin

Re: Primary keys for companies and people

From

Michael Glaesemann

Date:

06 February 2006, 23:29:13

On Feb 4, 2006, at 2:23 , Merlin Moncure wrote:

> If you kind determine an easy natural differentiator, invent one:
> create table contact
> (
>   account text, name text, memo text,
>   primary key(account, name, memo)
> );
>
> The memo field is blank in most cases unlees it's needed.  Suppose you
> were filling contact information in your databse and Taking your
> second John Smith from an account...your operator says, 'we already
> have a john smith for your account, can you give us something to
> identify him?'  Put that in the memo field and there you go.

Merlin,

Thanks for the blissfully simple solution! I think this can work well
for me.

Michael Glaesemann
grzm myrealbox com

Re: Primary keys for companies and people

From

"John D. Burger"

Date:

07 February 2006, 17:26:49

Leif B. Kristensen wrote:

> Still, I'm struggling with the basic concept of /identity/, eg. is the
> William Smith born to John Smith and Jane Doe in 1733, the same William
> Smith who marries Mary Jones in the same parish in 1758? You may never
> really know. Still, collecting such disparate "facts" under the same ID
> number, thus taking the identity more or less for granted, is the modus
> operandi of computer genealogy. Thus, one of the major objectives of
> genealogy research, the assertion of identity, becomes totally hidden
> the moment that you decide to cluster disparate evidence about what may
> actually have been totally different persons, under a single ID number.

We have a similar issue in a database in which we are integrating
multiple geographic gazetteers, e.g., USGS, NGA, Wordnet.  We cannot be
sure that source A's Foobar City is the same as source B's.  Our
approach is to =always= import them as separate entities, and use a
table of equivalences that gets filled out using various heuristics.
For example, if each source indicates that its Foobar City is in Baz
County, and we decide to equate the counties, we may equate the cities.

> The alternative is of course to collect each cluster of evidence under
> a
> separate ID, but then the handling of a "person" becomes a programmer's
> nightmare.

Our intent is to have views and/or special versions of the database
that collapse equivalent entities, but I must confess that we have not
done much along these lines - I hope it is not too nightmarish.

- John D. Burger
   MITRE