Thread: surrogate vs natural primary keys

surrogate vs natural primary keys

From

Seb

Date:

15 September 2008, 19:15:07

Hi,

I've been reading several articles on this hotly debated issue and still
can't find proper criteria to select one or the other approach for the
database I'm currently designing.  I'd appreciate any pointers.  Thanks.


Cheers,

-- 
Seb

Re: surrogate vs natural primary keys

From

"Scott Marlowe"

Date:

15 September 2008, 19:45:12

On Mon, Sep 15, 2008 at 4:02 PM, Seb <spluque@gmail.com> wrote:
> Hi,
>
> I've been reading several articles on this hotly debated issue and still
> can't find proper criteria to select one or the other approach for the
> database I'm currently designing.  I'd appreciate any pointers.  Thanks.

You'll find lots of arguments from both sides, some more strident than
others.  In most big transactional systems you'll find surrogate keys
used for performance reasons, as well as design choices.  for
instance, when you book a flight with an airline, you'll get a locator
code like A89JK3 that is unique to any other locator code in the
system.  Sure, you could make a natural key of first name, last name,
address, phone number, flight number, departure / arrival and day and
time, but there's no way that's going to perform as well as a single
char(6).

The problem with natural keys is that you can never be sure they won't
change on you.  I like using them, but have been caught out on many
occasions where things changed halfway through development and
required a lot of redesign.

I think this question is a lot like "how large should I set
shared_buffers?"  There's lots of different answers based on how you
are using your data.

Re: surrogate vs natural primary keys

From

Seb

Date:

15 September 2008, 19:56:56

On Mon, 15 Sep 2008 16:45:08 -0600,
"Scott Marlowe" <scott.marlowe@gmail.com> wrote:

[...]

> I think this question is a lot like "how large should I set
> shared_buffers?"  There's lots of different answers based on how you
> are using your data.

Yes, this is precisely what I'm after: *criteria* to help me decide
which approach to take for different scenarios.  Such guidance is what
seems to be lacking from most of the discussions I've seen on the
subject.  It's hard to distill this information when most of the
discussion is centered on advocating one or the other approach.

Thanks,

-- 
Seb

Re: surrogate vs natural primary keys

From

"Richard Broersma"

Date:

16 September 2008, 11:59:26

On Mon, Sep 15, 2008 at 3:45 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> for
> instance, when you book a flight with an airline, you'll get a locator
> code like A89JK3 that is unique to any other locator code in the
> system.  Sure, you could make a natural key of first name, last name,
> address, phone number, flight number, departure / arrival and day and
> time, but there's no way that's going to perform as well as a single
> char(6).

Scott,

My understanding is that as soon as any code becomes visible outside
of the database or application it can no longer be called a surrogate
key.  From my reading of some of the Celko books, he strongly ascribes
to codes as primary keys.  His suggestion is to use internationally
recognized codes (if they exist) for identify items.  If none exist
then nationally recognized codes, then industry recognized codes, and
then finally if non of these exists then he recommends developing a
company specific unique code (for these he recommends codes that have
a built in check-sum  (I think check-sum is the correct word) for data
entry validation).

Any thoughts?

My DB experience has been limited to small < 2 GB OLTP databases, so I
understand that for very large databases there may be cases where
natural keys are a clear looser for performace and storage reasons.
In my case, I've tried to evaluate the trade-offs between using
natural vs surrogate keys for every relation.  Most of the time I use
natural keys, however there are some occations when surrogate keys
(for me) are only way to go.
-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug

Re: surrogate vs natural primary keys

From

Andrew Sullivan

Date:

16 September 2008, 12:18:01

On Tue, Sep 16, 2008 at 07:59:20AM -0700, Richard Broersma wrote:
> key.  From my reading of some of the Celko books, he strongly ascribes
> to codes as primary keys.  His suggestion is to use internationally
> recognized codes (if they exist) for identify items. 

The problem with that strategy is that any identifier has to be NOT
NULL and UNIQUE.  So if the international identifier assigner makes a
mistake (fails to assign or assigns duplicates), you're hosed.  

In addition, you need to make sure that the purpose for which you are
using the identifier is also the purpose for which the identifier is
assigned.  ISO, for instance, is willing to re-use country codes (even
though the specification never suggested they were).  So if you expect
to use the ISO 2-letter codes over time, you may get a nasty surprise.
(For an example, in 2003 "CS" became historically ambiguous.)

A

-- 
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/

Re: surrogate vs natural primary keys

From

Seb

Date:

16 September 2008, 21:10:42

Hi,

After some more reading and considering your feedback, I'm still
somewhat confused about this issue.

1. Should the choice of surrogate/natural primary keys be done across an
entire database, or does it make more sense to do it on a per-table
basis?  I reckon one could do it on a per-table basis, but its direct
relationships would influence the choice.

2. If we do find a suitable natural primary key for a table, but it
turns out to be a composite one, how can such a key be referred to in
another table?  Say we have:

CREATE TABLE t1 (   c1 varchar(200),   c2 int8,   c3 varchar(500),   PRIMARY KEY (c1, c2)
);


and I want to create a table t2 which needs to refer to the composite
primary key of t1.  Should one create 2 columns in t2 that REFERENCE c1
and c2?  If so, this seems very cumbersome and I'm tempted to create a
surrogate key in t1 just to be able to refer to it more efficiently.  Is
this something we should be considering when choosing natural
vs. surrogate keys?  Thanks again.


-- 
Seb

Re: surrogate vs natural primary keys

From

"Scott Marlowe"

Date:

16 September 2008, 23:34:57

On Tue, Sep 16, 2008 at 6:10 PM, Seb <spluque@gmail.com> wrote:
> Hi,
>
> After some more reading and considering your feedback, I'm still
> somewhat confused about this issue.
>
> 1. Should the choice of surrogate/natural primary keys be done across an
> entire database, or does it make more sense to do it on a per-table
> basis?  I reckon one could do it on a per-table basis, but its direct
> relationships would influence the choice.

Definitely on a per table basis.  for instance, if you create a lookup
table to use as a target for an FK, there's usually little need for an
artificial key.

> 2. If we do find a suitable natural primary key for a table, but it
> turns out to be a composite one, how can such a key be referred to in
> another table?  Say we have:
>
> CREATE TABLE t1 (
>    c1 varchar(200),
>    c2 int8,
>    c3 varchar(500),
>    PRIMARY KEY (c1, c2)
> );

create table t2 (   d1 varchar(200),   d2 int8,   d3 varchar(1000),   foreign key t2_fk references t1(c1,c2) );

or something like that.

>
>
> and I want to create a table t2 which needs to refer to the composite
> primary key of t1.  Should one create 2 columns in t2 that REFERENCE c1
> and c2?  If so, this seems very cumbersome and I'm tempted to create a
> surrogate key in t1 just to be able to refer to it more efficiently.  Is
> this something we should be considering when choosing natural
> vs. surrogate keys?  Thanks again.
>
>
> --
> Seb
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>

Re: surrogate vs natural primary keys

From

Seb

Date:

17 September 2008, 10:21:26

On Tue, 16 Sep 2008 20:34:51 -0600,
"Scott Marlowe" <scott.marlowe@gmail.com> wrote:

[...]

> create table t2 (
>     d1 varchar(200),
>     d2 int8,
>     d3 varchar(1000),
>     foreign key t2_fk references t1(c1,c2) );

Thanks Scott, I guess you meant:

CREATE TABLE t2 (   d1 varchar(200),   d2 int8,   d3 varchar(1000),   PRIMARY KEY (d1, d2)   FOREIGN KEY (d1, d2)
REFERENCESt1(c1, c2) );

But this makes it difficult to work with t2 because it has 2 fields that
are the same as in t1.  Isn't it better to just use a surrogate key and
use a single field in t2, thereby avoiding repeating multiple pieces of
information?

Thanks,

-- 
Seb

Re: surrogate vs natural primary keys

From

"Scott Marlowe"

Date:

17 September 2008, 11:45:39

On Wed, Sep 17, 2008 at 7:20 AM, Seb <spluque@gmail.com> wrote:
> On Tue, 16 Sep 2008 20:34:51 -0600,
> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>
> [...]
>
>> create table t2 (
>>     d1 varchar(200),
>>     d2 int8,
>>     d3 varchar(1000),
>>     foreign key t2_fk references t1(c1,c2) );
>
> Thanks Scott, I guess you meant:
>
> CREATE TABLE t2 (
>    d1 varchar(200),
>    d2 int8,
>    d3 varchar(1000),
>    PRIMARY KEY (d1, d2)
>    FOREIGN KEY (d1, d2) REFERENCES t1(c1, c2) );
>
> But this makes it difficult to work with t2 because it has 2 fields that
> are the same as in t1.  Isn't it better to just use a surrogate key and
> use a single field in t2, thereby avoiding repeating multiple pieces of
> information?

Yes and no.  If you're gonna hit table t2 a lot by itself, then it's
more efficient to have the data there in t2 and not have to join to t1
to get it.  There are always use cases that go either way in this kind
of situation.

Re: surrogate vs natural primary keys

From

"Richard Broersma"

Date:

17 September 2008, 13:20:51

On Wed, Sep 17, 2008 at 7:45 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

>> CREATE TABLE t2 (
>>    d1 varchar(200),
>>    d2 int8,
>>    d3 varchar(1000),
>>    PRIMARY KEY (d1, d2)
>>    FOREIGN KEY (d1, d2) REFERENCES t1(c1, c2) );
>>
>> thereby avoiding repeating multiple pieces of
>> information?
>
> Yes and no.  If you're gonna hit table t2 a lot by itself, then it's
> more efficient to have the data there in t2 and not have to join to t1
> to get it.  There are always use cases that go either way in this kind
> of situation.

Also, doing this allows you to apply more constraints to T2 (if you
ever wanted to add them).  For example, what if you only wanted to
allow a sub-set of T1(c1,c2) in T2(d1,d2), you could use a check
constraint to enforce this more restrictive relationship.  This
wouldn't be possible (without adding custom triggers) if you only used
a surrogate key.  My experience is that many more validation
constraints are possible if use natural keys are used.  Whether this
feature is a good thing or not is up to you.

My opinion is that the database constraints are the last line of
defense to ensure business rules and data integrity are not violated.
Since I highly value the ability to enforce business rules using
ordinary table DDL, I try to use natural keys as often as I can.

-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug

Re: surrogate vs natural primary keys

From

Steve Midgley

Date:

18 September 2008, 02:01:24

>To: pgsql-sql@postgresql.org
>From:  Seb <spluque@gmail.com>
>Subject: Re: surrogate vs natural primary keys
>Date:  Mon, 15 Sep 2008 17:56:31 -0500
>Organization:  Church of Emacs
>Lines: 20
>Message-ID:  <87hc8h2f34.fsf@patagonia.sebmags.homelinux.org>
>References:  <87y71t2hlu.fsf@patagonia.sebmags.homelinux.org>
>         <dcc563d10809151545r70d7d2a7j25931d7557dd88f3@mail.gmail.com>
>X-Archive-Number: 200809/101
>X-Sequence-Number: 31553
>
>On Mon, 15 Sep 2008 16:45:08 -0600,
>"Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>
>[...]
>
> > I think this question is a lot like "how large should I set
> > shared_buffers?"  There's lots of different answers based on how 
> you
> > are using your data.
>
>Yes, this is precisely what I'm after: *criteria* to help me decide
>which approach to take for different scenarios.  Such guidance is what
>seems to be lacking from most of the discussions I've seen on the
>subject.  It's hard to distill this information when most of the
>discussion is centered on advocating one or the other approach.

I think Scott and others have laid out the main ideas in a very 
cool-headed way already, but here's my follow-on input:

I agree with Andrew Sullivan that using industry standard id's as your 
primary key can be problematic. But I do sometimes apply unique indices 
to such "industry standard" columns to ensure they are in fact unique 
and can be a surrogate for the "real" integer/serial primary key.

As a rule, I have decided to stay away from "meaningful" (natural) 
primary keys for these reasons:

1) They sometimes change b/c of business rule changes, forcing 
technical changes to the relationship model, when only internal table 
schema changes should be required to support the new business 
requirements.

2) Generating arbitrary/surrogate keys is easier b/c you can use 
sequence generators. (When creating a new record, I have to figure out 
the value of a meaningful column before saving the record which 
sometimes I don't want to do!)

3) Surrogate keys are guaranteed unique regardless of semantic content 
of the table.

4) All tables can all join to each other in the same ways: property.id 
holds the same data type as contact.id. All id fields are the same in 
type/format.

I think there's even a reasonable argument for "globally unique" 
surrogate keys: all keys for any table use the same sequence of id's. I 
implemented a system in the 90's that used globally unique id's and it 
opened up some interesting solutions that I wouldn't have thought of 
when I started the project (self joins were the same as foreign joins 
since the id's in both entities were guaranteed unique).

I've heard some people argue the use of GUID's for id's but I've been 
too scared to try that in a real system.

Sequential, arbitrary primary keys (as surrogate keys) are predictable 
though. So if you share those keys with the public (via URL's for 
example), then competitors can learn information about your business 
(how fast keys are generated for a certain table for example).

That's an argument for random, arbitrary primary keys though, not for 
compound/meaningful keys.

I think natural or compound keys make more sense to DBA's and let you 
implement some kinds of database solutions more quickly.

All in all, I don't really understand the merits of natural keys 
outside of data warehouse applications. In data warehouses, in my 
experience, compound natural keys just end up turning into fact tables! 
:)

In summary: I've never heard someone say they've been bitten by using 
an arbitrary surrogate key system, but I myself have been bitten and 
have heard lots of stories of problems when using natural keys.

I hope this helps some,

Steve

Re: surrogate vs natural primary keys

From

Steve Midgley

Date:

19 September 2008, 03:57:10

At 08:20 AM 9/18/2008, pgsql-sql-owner@postgresql.org wrote:
>Message-ID: 
><396486430809170920o12782b5dy3837b4aa8611a0c1@mail.gmail.com>
>Date: Wed, 17 Sep 2008 09:20:44 -0700
>From: "Richard Broersma" <richard.broersma@gmail.com>
>To: "Scott Marlowe" <scott.marlowe@gmail.com>
>Subject: Re: surrogate vs natural primary keys
>In-Reply-To: 
><dcc563d10809170745n24f5ae73i7e6ba3ec73743dbc@mail.gmail.com>
>References: <87y71t2hlu.fsf@patagonia.sebmags.homelinux.org>
>         <dcc563d10809151545r70d7d2a7j25931d7557dd88f3@mail.gmail.com>
>         <396486430809160759n15ec2134ke0df317f36f74d5a@mail.gmail.com>
>         <20080916151743.GG201@commandprompt.com>
>         <87ljxrvdhw.fsf@patagonia.sebmags.homelinux.org>
>         <dcc563d10809161934q205e4e3bu6b9cd43339ff89b3@mail.gmail.com>
>         <87abe6vrgl.fsf@patagonia.sebmags.homelinux.org>
>         <dcc563d10809170745n24f5ae73i7e6ba3ec73743dbc@mail.gmail.com>
>X-Archive-Number: 200809/124
>X-Sequence-Number: 31576
>
>
>My opinion is that the database constraints are the last line of
>defense to ensure business rules and data integrity are not violated.
>Since I highly value the ability to enforce business rules using
>ordinary table DDL, I try to use natural keys as often as I can.
>

Hi Richard,

I often find your comments insightful and right on the money. This is 
another one of those cases. Your comments above are a great example of 
when natural keys make sense: I hadn't looked at it from this 
perspective!

I'm a middleware developer (the bane of DBA's!) -- and so I generally 
"solve" these sorts business rules constraints in the middleware code, 
which of course is prone to all kinds of different problems (like lazy 
developers who code around the OO validation checkers!).

Thanks for giving such a great explanation as to the value of natural 
keys! You haven't won me over, but you did teach me something - which I 
appreciate.

Best,

Steve