Thread: Primary keys for companies and people
Hello, all! Recently there was quite a bit of discussion regarding surrogate keys and natural keys. I'm not interested in discussing the pros and cons of surrogate keys. What I'd like to find out are the different methods people actually use to uniquely identify companies and people *besides* surrogate keys. I'm currently working on an application that will include contact information, so being able to uniquely identify these two entities is of interest to me. Right now I'm thinking of uniquely identifying companies by telephone number. For example: create table companies ( company_id integer primary key -- telephone number, not serial , company_name text not null ); Of course, the company may have more than one telephone number associated with it, so there will also be a table associating telephone numbers and companies. create table companies__telephone_numbers ( company_id integer not null references companies (company_id) on update cascade on delete cascade , telephone_number integer not null , unique (company_id, telephone_number) ); There should also be a trigger that will check that the company_id matches an existing telephone number associated with the company, something like: create function assert_company_id_telephone_number_exists returns trigger language plpgsql as $$ begin if exists ( select company_id from companies except select company_id from companies join companies__telephone_numbers on (company_id = telephone_number) ) then raise exception 'company_id must match existing company telephone number'; end if; return null; end; $$; For people I'm more or less stumped. I can't think of a combination of things that I know I'll be able to get from people that I'll want to be able to add to the database. Starting off we'll have at least 7,000 individuals in the database, and I don't think that just family and given names are going to be enough. I don't think we'll be able to get telephone numbers for all of them, and definitely aren't going to be getting birthdays for all. I'm very interested to hear what other use in their applications for holding people and companies. Michael Glaesemann grzm myrealbox com
On Thursday 02 February 2006 09:05, Michael Glaesemann wrote: >For people I'm more or less stumped. I can't think of a combination >of things that I know I'll be able to get from people that I'll want >to be able to add to the database. Starting off we'll have at least >7,000 individuals in the database, and I don't think that just family >and given names are going to be enough. I don't think we'll be able >to get telephone numbers for all of them, and definitely aren't going >to be getting birthdays for all. > >I'm very interested to hear what other use in their applications for >holding people and companies. I've been thinking long and hard about the same thing myself, in developing my genealogy database. For identification of people, there seems to be no realistic alternative to an arbitrary ID number. Still, I'm struggling with the basic concept of /identity/, eg. is the William Smith born to John Smith and Jane Doe in 1733, the same William Smith who marries Mary Jones in the same parish in 1758? You may never really know. Still, collecting such disparate "facts" under the same ID number, thus taking the identity more or less for granted, is the modus operandi of computer genealogy. Thus, one of the major objectives of genealogy research, the assertion of identity, becomes totally hidden the moment that you decide to cluster disparate evidence about what may actually have been totally different persons, under a single ID number. The alternative is of course to collect each cluster of evidence under a separate ID, but then the handling of a "person" becomes a programmer's nightmare. I have been writing about my genealogy data model here: <url:http://solumslekt.org/forays/blue.php> The model has been slightly modified since I wrote this; due to what I perceive as 'gotchas' in the PostgreSQL implementation of table inheritance, I have dropped the 'citations' table. Besides, I've dropped some of the surrogate keys, and more will follow. I really should update this article soon. I should perhaps be posting this under another subject, but I feel that beneath the surface, Michael's problem and my own are strongly related. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE
On Thursday 02 February 2006 09:07, Leif B. Kristensen wrote: > On Thursday 02 February 2006 09:05, Michael Glaesemann wrote: > >For people I'm more or less stumped. I can't think of a combination > >of things that I know I'll be able to get from people that I'll want > >to be able to add to the database. Starting off we'll have at least > >7,000 individuals in the database, and I don't think that just family > >and given names are going to be enough. I don't think we'll be able > >to get telephone numbers for all of them, and definitely aren't going > >to be getting birthdays for all. > > > >I'm very interested to hear what other use in their applications for > >holding people and companies. > > I've been thinking long and hard about the same thing myself, in > developing my genealogy database. For identification of people, there > seems to be no realistic alternative to an arbitrary ID number. > > Still, I'm struggling with the basic concept of /identity/, eg. is the > William Smith born to John Smith and Jane Doe in 1733, the same William > Smith who marries Mary Jones in the same parish in 1758? You may never > really know. Still, collecting such disparate "facts" under the same ID > number, thus taking the identity more or less for granted, is the modus > operandi of computer genealogy. Thus, one of the major objectives of > genealogy research, the assertion of identity, becomes totally hidden > the moment that you decide to cluster disparate evidence about what may > actually have been totally different persons, under a single ID number. > > The alternative is of course to collect each cluster of evidence under a > separate ID, but then the handling of a "person" becomes a programmer's > nightmare. > > I have been writing about my genealogy data model here: > <url:http://solumslekt.org/forays/blue.php> The model has been slightly > modified since I wrote this; due to what I perceive as 'gotchas' in the > PostgreSQL implementation of table inheritance, I have dropped the > 'citations' table. Besides, I've dropped some of the surrogate keys, > and more will follow. I really should update this article soon. > > I should perhaps be posting this under another subject, but I feel that > beneath the surface, Michael's problem and my own are strongly related. There is also the problem that a name can change. People change names by deed-poll, and also women can adopt a married name or keep their old one. All in all an ID is about the only answer. David
On Thu, Feb 02, 2006 at 10:36:54AM +0000, David Goodenough wrote: > > Still, I'm struggling with the basic concept of /identity/, eg. is the > > William Smith born to John Smith and Jane Doe in 1733, the same William > > Smith who marries Mary Jones in the same parish in 1758? You may never > > really know. Still, collecting such disparate "facts" under the same ID > > number, thus taking the identity more or less for granted, is the modus > > operandi of computer genealogy. Thus, one of the major objectives of > > genealogy research, the assertion of identity, becomes totally hidden > > the moment that you decide to cluster disparate evidence about what may > > actually have been totally different persons, under a single ID number. > > > > The alternative is of course to collect each cluster of evidence under a > > separate ID, but then the handling of a "person" becomes a programmer's > > nightmare. > There is also the problem that a name can change. People change names > by deed-poll, and also women can adopt a married name or keep their old > one. All in all an ID is about the only answer. True, the issue being ofcourse that changing a name doesn't change their identity. To the GP, your page is an interesting one and raises several interesting points. In particular the one about the "person" being the conclusion of the rest of the database. You essentially have a set of facts "A married B in C on date D" and you're trying to correlate these. In the end it's just a certain amount of guess work, especially since back then they wern't that particular about spelling as they are today. My naive view is that you're basically assigning trust values to each fact and the chance that two citations refer to the same person. In principle you'd be able to cross-reference all these citations and build the structure quasi-automatically. I suppose in practice this is done by hand. As for your question, I think you're stuck with having a person ID. Basically because you need to identify a person somehow. Given you still have the original citiations, you can split a person into multiple if the situation appears to not work out. One thing I find odd though, your "person" objects have no birthdate or deathdate. Or birth place either. I would have thought these elements would be fundamental in determining if two people are the same, given that they can't change and people are unlikely to forget them. Put another way, two people with the same birthday in the same place with similar names are very likely to be the same. If you can demostrate this is not the case that's another fact. In the end you're dealing with probabilities, you can never know for sure. Anyway, hope this helps. It's a subject I've been vaguely interested in but never really had the time to look into. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
----- Original Message ----- From: "Leif B. Kristensen" <leif@solumslekt.org> To: <pgsql-general@postgresql.org> Sent: Thursday, February 02, 2006 4:07 AM Subject: Re: [GENERAL] Primary keys for companies and people >> [snip] >>I'm very interested to hear what other use in their applications for >>holding people and companies. > > I've been thinking long and hard about the same thing myself, in > developing my genealogy database. For identification of people, there > seems to be no realistic alternative to an arbitrary ID number. > > Still, I'm struggling with the basic concept of /identity/, eg. is the > William Smith born to John Smith and Jane Doe in 1733, the same William [snip] I have long been interested in this issue, and it is one that transcends the problem of IDs in IT. For my second doctorate, I examined this in the context of historical investigation, applying numerical classification techniques to biographical information that can be extracted from historical documents. It is, I fear, a problem for which only a probabilistic answer can be obtained in most historical cases. For example, there was an eleventh century viking king Harold who as a teenager was part of his cousin's court, and then found it necessary to flee to Kiev when his cousin found hiimself on the losing side of a rebellion. He then made his way into the Byzantine empire and served the emperor as a mercenary through much of the mediterranean, finally returning in fame and glory to Norway where he found another relative (a nephew IIRC) on the throne, which he inherited about a year after his return. Impagine yourself as a historian trying to write his biography. You'd find various documents all over the western world (as known in the viking age) written in a variety of languages, and using different names to refer to him. It isn't an easy task to determine which documents refer specifically to him. And to make things even more interesting, many documents refer to a given person only by his official title, and in other cases, the eldest son of each generation was given the same name as his father. In my own case, in the time I was at the University of Toronto, I know of four other men who had precisely the same name I have. I know this from strange phone calls from faculty I never studied with about assignments and examinations for courses I had never taken. In each case, the professor checked again with the university's records department and found the correct student. The last case was particularly disturbing since in that case, things were a bit different in that I had taken a graduate course with the professor in question, and he stopped me on campus and asked about an assignment for a given advanced undergraduate course that I had not taken, but my namesake had. What made this disturbing is that not only did the other student carry my name, but he also looked enough like me that our professor could mistake me for him on campus! I can only hope that he is a well behaved, law abiding citizen! The total time period in question was 18 years. In general, the problem only gets more challenging as the length, and as the age, of the historical period considered increases. The point is, not only are the combinations of family and given names not reliably unique, even certain biological data, such as photographs of the human face, not adequately unique. Even DNA fingerprints, putatively the best available biometric technology, are not entirely reliable since even that can not distinguish between identical twins, and at the same time, there can be, admittedly extremely rare as far as we know, developmental anomalies resulting in a person being his own twin (this results from twin fetuses merging, with the consequence that the resulting person has some organ systems from one of the original fetuses and some from the other). For historical questions, I don't believe one can get any better than inference based on a balance of probabilities. A geneologist has no option but to become an applied statistician! For purposes of modern investigation or for the purpose of modern business, one may do better through an appropriate use of a combination of technologies. This is a hard problem, even with the use of the best available technologies and especially given the current problems associated with identity theft. For software developers in general, and database developers in particular, there are several distinct questions to consider.: 1) How does one reliably determine identity to begin with, and then use that identity with whatever technology one might use to represent it? 2) How good does this technology, and identification process need to be? In other words, how does the cost of a mistake (esp. in identification) relate to the increased cost of using better technology? In this analysis, one needs to consider both the cost of such a mistake to the person identified or misidentified, and the cost to the owner or user of the application or database. Who will suffer if a mistake is made? Will, or can, bad things happen if a given person ends up with more than one ID? What is the cost, and who bears this cost, if more than one person can use the same ID? 3) Can we construct a suite of best practices from which we can select given specific functional or non-functional constraints as developed for our application? Included with this question is consideration of protection of sensitive data in general, and protection of data that might conceivably be used by cyber-criminals in activity related to identity theft, or to use sensitive data to the harm of the person so identified. 4) How is biometric data best stored and searched for use in authentication processes within an arbitrary application? I guess this question assumes that biometric data needs to be used in an authentication request, and it occurs to me that for some applications, it may be sufficient to use biometric data in creation of a unique user id, and subsequently may be needed only for certain sensitive processes or resources. My own feeling is that some options are very easy, and some of these are adequate for some situations, but that there are others that may be needed depending on the sentivity of the data in question or on the potential cost to one or more parties to a given business process. I expect to be considering these issues extensively over the next few years since they are relevant to some of the web applications I am designing. Any insights you, or others, may have on these questions would be greatly appreciated. Cheers, Ted R.E. (Ted) Byers, Ph.D., Ed.D. R & D Decision Support Solutions http://www.randddecisionsupportsolutions.com/
On Thursday 02 February 2006 21:09, Martijn van Oosterhout wrote: >To the GP, your page is an interesting one and raises several >interesting points. In particular the one about the "person" being the >conclusion of the rest of the database. You essentially have a set of >facts "A married B in C on date D" and you're trying to correlate >these. In the end it's just a certain amount of guess work, especially >since back then they wern't that particular about spelling as they are >today. > >My naive view is that you're basically assigning trust values to each >fact and the chance that two citations refer to the same person. In >principle you'd be able to cross-reference all these citations and >build the structure quasi-automatically. I suppose in practice this is >done by hand. Yes it is. As I stated in the article, I'd like to quantify a 'participant' of an 'event' as a "vector in genealogy space", but I haven't really figured out a sensible entry mode for that evidence yet. For now, I'm trying to enter as much information as possible into the source citations. >As for your question, I think you're stuck with having a person ID. >Basically because you need to identify a person somehow. Given you >still have the original citiations, you can split a person into >multiple if the situation appears to not work out. > >One thing I find odd though, your "person" objects have no birthdate > or deathdate. Or birth place either. I've appropriated the model from my previous program, The Master Genealogist. I like the approach that the "person" entity should contain the least possible number of assertions. I've got views and functions that retrieves a primary birth and death date from the database automatically. > I would have thought these > elements would be fundamental in determining if two people are the > same, given that they can't change and people are unlikely to forget > them. Yes. But in 18th century genealogy, at least in Norway, you're unlikely to find a birth date and place in other records than the christening. As a matter of fact, the birth date wasn't usually recorded either, but as the christening usually took place within a week after birth, you've got a pretty good approximation. >Put another way, two people with the same birthday in the same place >with similar names are very likely to be the same. If you can >demostrate this is not the case that's another fact. In the end you're >dealing with probabilities, you can never know for sure. 18th century genealogy has a lot in common with crime investigation. You've basically got a few clues, and try to figure out a picture from the sparse evidence that may be found. I love that challenge, but it may be quite taxing sometimes. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE
> > I should perhaps be posting this under another subject, but I feel that > > beneath the surface, Michael's problem and my own are strongly related. > There is also the problem that a name can change. People change names > by deed-poll, and also women can adopt a married name or keep their old > one. All in all an ID is about the only answer. I'll take the other side of this issue. The fact that a primary key is mutable does not make it any less primary. As long as we can can count on it to be unique, how often identiying info changes has no bearing on its selection as a p-key from a relational standpoint. SQL gives us RI to help deal with this but in this is not always practical if for example you you have a changing p-key that cascades to a million records. The performance issue has zero meaning in a conceptual sense however and I think you are trying to grapple things in conceptual terms. By assigning a surrogate key to a person, you are simply moving the guess work from one place to another. If you can't logically determine who 'John Smith' is, how can you possibly expect to relate information to him? (which john smith? why, etc)...you are just hiding a guess behind a number. Put into other words, *a record must have unique identifiying criteria or the table containing it cannot be expected to give correct results*. This is, more or less, a mathematical truth. The non key attributes of the tuple are now undefined because they can give two or more different answers to the same question. Surrogates do not change this principle, they just hide it but it still has to be dealt with. Merlin
On Feb 3, 2006, at 7:25 , Merlin Moncure wrote: >> There is also the problem that a name can change. People change >> names >> by deed-poll, and also women can adopt a married name or keep >> their old >> one. All in all an ID is about the only answer. > > I'll take the other side of this issue. The fact that a primary key > is mutable does not make it any less primary. As long as we can can > count on it to be unique, how often identiying info changes has no > bearing on its selection as a p-key from a relational standpoint. > The performance issue has zero meaning in a > conceptual sense however and I think you are trying to grapple things > in conceptual terms. I definitely agree with you here, Merlin. Mutability is not the issue at hand. May I ask what strategies you use for determining uniqueness for people? Michael Glaesemann grzm myrealbox com
Michael Glaesemann wrote: > Hello, all! > > Recently there was quite a bit of discussion regarding surrogate keys > and natural keys. I'm not interested in discussing the pros and cons of > surrogate keys. What I'd like to find out are the different methods > people actually use to uniquely identify companies and people *besides* > surrogate keys. > > I'm currently working on an application that will include contact > information, so being able to uniquely identify these two entities is > of interest to me. Right now I'm thinking of uniquely identifying > companies by telephone number. For example: > > create table companies ( > company_id integer primary key -- telephone number, not serial > , company_name text not null > ); > > Of course, the company may have more than one telephone number > associated with it, so there will also be a table associating telephone > numbers and companies. > > create table companies__telephone_numbers ( > company_id integer not null > references companies (company_id) > on update cascade on delete cascade > , telephone_number integer not null > , unique (company_id, telephone_number) > ); > > There should also be a trigger that will check that the company_id > matches an existing telephone number associated with the company, > something like: > > create function assert_company_id_telephone_number_exists > returns trigger > language plpgsql as $$ > begin > if exists ( > select company_id > from companies > except > select company_id > from companies > join companies__telephone_numbers on (company_id = telephone_number) > ) > then raise exception 'company_id must match existing company telephone > number'; > end if; > return null; > end; > $$; > > For people I'm more or less stumped. I can't think of a combination of > things that I know I'll be able to get from people that I'll want to be > able to add to the database. Starting off we'll have at least 7,000 > individuals in the database, and I don't think that just family and > given names are going to be enough. I don't think we'll be able to get > telephone numbers for all of them, and definitely aren't going to be > getting birthdays for all. > > I'm very interested to hear what other use in their applications for > holding people and companies. > > Michael Glaesemann > grzm myrealbox com > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > Telephone numbers make bad primary keys because they get recycled. A phone number that belongs to me this year may belong to somebody else next year.
> I definitely agree with you here, Merlin. Mutability is not the issue > at hand. May I ask what strategies you use for determining uniqueness > for people? Well, that depends on the particular problem at hand. If you had two john smiths in your system, how would you distinguish them? If you assinged an account number in your system and gave it to the person to refer back to you, this ok...this is not a surrogate key per se, but a meaningful alias. However, that technique can't always be applied, take the case of the 'contacts' table for an account. Since you don't give each contact of each accunt a number and you don't print a number representing them on various reports (there would be no reason to), adding a surrogate to the table adds nothing to your database, it's just meaningless infromation with no semantic value. There *must* be a semantic difference between the two John Smiths or you should be storing one record in the database, not two. If you kind determine an easy natural differentiator, invent one: create table contact ( account text, name text, memo text, primary key(account, name, memo) ); The memo field is blank in most cases unlees it's needed. Suppose you were filling contact information in your databse and Taking your second John Smith from an account...your operator says, 'we already have a john smith for your account, can you give us something to identify him?' Put that in the memo field and there you go. Now your operator is taking information which has value pertaining to the scope of the problem domain your application exists in. This is just one example of how to approach the problem Now there is no ambiguiity about which john smith you are dealing with. This may not be a perfect solution for every application but there is basically has to be a method of finding semantic unquenes to your data or you have a hole in your data model. Glossing over that hole with artificial information solves nothing. There are pracitcal reasons to use surrogates but the uniqueness argument generally holds no water. By 'generating' uniqueness you are breaking your data on mathematical terms. Until you truly understand the ramifcations of that statement you can't really understand when the practical cases apply. Many of the arguments for surrugates based on mutability and uniqueness are simply illogical. The performance issue is more complex and leads to the issue of practicality. I wouldn't find any fault with a modeler who benchmarked his app with a surrogate vs. a 5 part key and chose the former as long as he truly understood the downside to doing that (extra joins). Merlin
On Feb 4, 2006, at 2:23 , Merlin Moncure wrote: > If you kind determine an easy natural differentiator, invent one: > create table contact > ( > account text, name text, memo text, > primary key(account, name, memo) > ); > > The memo field is blank in most cases unlees it's needed. Suppose you > were filling contact information in your databse and Taking your > second John Smith from an account...your operator says, 'we already > have a john smith for your account, can you give us something to > identify him?' Put that in the memo field and there you go. Merlin, Thanks for the blissfully simple solution! I think this can work well for me. Michael Glaesemann grzm myrealbox com
Leif B. Kristensen wrote: > Still, I'm struggling with the basic concept of /identity/, eg. is the > William Smith born to John Smith and Jane Doe in 1733, the same William > Smith who marries Mary Jones in the same parish in 1758? You may never > really know. Still, collecting such disparate "facts" under the same ID > number, thus taking the identity more or less for granted, is the modus > operandi of computer genealogy. Thus, one of the major objectives of > genealogy research, the assertion of identity, becomes totally hidden > the moment that you decide to cluster disparate evidence about what may > actually have been totally different persons, under a single ID number. We have a similar issue in a database in which we are integrating multiple geographic gazetteers, e.g., USGS, NGA, Wordnet. We cannot be sure that source A's Foobar City is the same as source B's. Our approach is to =always= import them as separate entities, and use a table of equivalences that gets filled out using various heuristics. For example, if each source indicates that its Foobar City is in Baz County, and we decide to equate the counties, we may equate the cities. > The alternative is of course to collect each cluster of evidence under > a > separate ID, but then the handling of a "person" becomes a programmer's > nightmare. Our intent is to have views and/or special versions of the database that collapse equivalent entities, but I must confess that we have not done much along these lines - I hope it is not too nightmarish. - John D. Burger MITRE