Thread: surrogate vs natural primary keys
Hi, I've been reading several articles on this hotly debated issue and still can't find proper criteria to select one or the other approach for the database I'm currently designing. I'd appreciate any pointers. Thanks. Cheers, -- Seb
On Mon, Sep 15, 2008 at 4:02 PM, Seb <spluque@gmail.com> wrote: > Hi, > > I've been reading several articles on this hotly debated issue and still > can't find proper criteria to select one or the other approach for the > database I'm currently designing. I'd appreciate any pointers. Thanks. You'll find lots of arguments from both sides, some more strident than others. In most big transactional systems you'll find surrogate keys used for performance reasons, as well as design choices. for instance, when you book a flight with an airline, you'll get a locator code like A89JK3 that is unique to any other locator code in the system. Sure, you could make a natural key of first name, last name, address, phone number, flight number, departure / arrival and day and time, but there's no way that's going to perform as well as a single char(6). The problem with natural keys is that you can never be sure they won't change on you. I like using them, but have been caught out on many occasions where things changed halfway through development and required a lot of redesign. I think this question is a lot like "how large should I set shared_buffers?" There's lots of different answers based on how you are using your data.
On Mon, 15 Sep 2008 16:45:08 -0600, "Scott Marlowe" <scott.marlowe@gmail.com> wrote: [...] > I think this question is a lot like "how large should I set > shared_buffers?" There's lots of different answers based on how you > are using your data. Yes, this is precisely what I'm after: *criteria* to help me decide which approach to take for different scenarios. Such guidance is what seems to be lacking from most of the discussions I've seen on the subject. It's hard to distill this information when most of the discussion is centered on advocating one or the other approach. Thanks, -- Seb
On Mon, Sep 15, 2008 at 3:45 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > for > instance, when you book a flight with an airline, you'll get a locator > code like A89JK3 that is unique to any other locator code in the > system. Sure, you could make a natural key of first name, last name, > address, phone number, flight number, departure / arrival and day and > time, but there's no way that's going to perform as well as a single > char(6). Scott, My understanding is that as soon as any code becomes visible outside of the database or application it can no longer be called a surrogate key. From my reading of some of the Celko books, he strongly ascribes to codes as primary keys. His suggestion is to use internationally recognized codes (if they exist) for identify items. If none exist then nationally recognized codes, then industry recognized codes, and then finally if non of these exists then he recommends developing a company specific unique code (for these he recommends codes that have a built in check-sum (I think check-sum is the correct word) for data entry validation). Any thoughts? My DB experience has been limited to small < 2 GB OLTP databases, so I understand that for very large databases there may be cases where natural keys are a clear looser for performace and storage reasons. In my case, I've tried to evaluate the trade-offs between using natural vs surrogate keys for every relation. Most of the time I use natural keys, however there are some occations when surrogate keys (for me) are only way to go. -- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
On Tue, Sep 16, 2008 at 07:59:20AM -0700, Richard Broersma wrote: > key. From my reading of some of the Celko books, he strongly ascribes > to codes as primary keys. His suggestion is to use internationally > recognized codes (if they exist) for identify items. The problem with that strategy is that any identifier has to be NOT NULL and UNIQUE. So if the international identifier assigner makes a mistake (fails to assign or assigns duplicates), you're hosed. In addition, you need to make sure that the purpose for which you are using the identifier is also the purpose for which the identifier is assigned. ISO, for instance, is willing to re-use country codes (even though the specification never suggested they were). So if you expect to use the ISO 2-letter codes over time, you may get a nasty surprise. (For an example, in 2003 "CS" became historically ambiguous.) A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
Hi, After some more reading and considering your feedback, I'm still somewhat confused about this issue. 1. Should the choice of surrogate/natural primary keys be done across an entire database, or does it make more sense to do it on a per-table basis? I reckon one could do it on a per-table basis, but its direct relationships would influence the choice. 2. If we do find a suitable natural primary key for a table, but it turns out to be a composite one, how can such a key be referred to in another table? Say we have: CREATE TABLE t1 ( c1 varchar(200), c2 int8, c3 varchar(500), PRIMARY KEY (c1, c2) ); and I want to create a table t2 which needs to refer to the composite primary key of t1. Should one create 2 columns in t2 that REFERENCE c1 and c2? If so, this seems very cumbersome and I'm tempted to create a surrogate key in t1 just to be able to refer to it more efficiently. Is this something we should be considering when choosing natural vs. surrogate keys? Thanks again. -- Seb
On Tue, Sep 16, 2008 at 6:10 PM, Seb <spluque@gmail.com> wrote: > Hi, > > After some more reading and considering your feedback, I'm still > somewhat confused about this issue. > > 1. Should the choice of surrogate/natural primary keys be done across an > entire database, or does it make more sense to do it on a per-table > basis? I reckon one could do it on a per-table basis, but its direct > relationships would influence the choice. Definitely on a per table basis. for instance, if you create a lookup table to use as a target for an FK, there's usually little need for an artificial key. > 2. If we do find a suitable natural primary key for a table, but it > turns out to be a composite one, how can such a key be referred to in > another table? Say we have: > > CREATE TABLE t1 ( > c1 varchar(200), > c2 int8, > c3 varchar(500), > PRIMARY KEY (c1, c2) > ); create table t2 ( d1 varchar(200), d2 int8, d3 varchar(1000), foreign key t2_fk references t1(c1,c2) ); or something like that. > > > and I want to create a table t2 which needs to refer to the composite > primary key of t1. Should one create 2 columns in t2 that REFERENCE c1 > and c2? If so, this seems very cumbersome and I'm tempted to create a > surrogate key in t1 just to be able to refer to it more efficiently. Is > this something we should be considering when choosing natural > vs. surrogate keys? Thanks again. > > > -- > Seb > > > -- > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-sql >
On Tue, 16 Sep 2008 20:34:51 -0600, "Scott Marlowe" <scott.marlowe@gmail.com> wrote: [...] > create table t2 ( > d1 varchar(200), > d2 int8, > d3 varchar(1000), > foreign key t2_fk references t1(c1,c2) ); Thanks Scott, I guess you meant: CREATE TABLE t2 ( d1 varchar(200), d2 int8, d3 varchar(1000), PRIMARY KEY (d1, d2) FOREIGN KEY (d1, d2) REFERENCESt1(c1, c2) ); But this makes it difficult to work with t2 because it has 2 fields that are the same as in t1. Isn't it better to just use a surrogate key and use a single field in t2, thereby avoiding repeating multiple pieces of information? Thanks, -- Seb
On Wed, Sep 17, 2008 at 7:20 AM, Seb <spluque@gmail.com> wrote: > On Tue, 16 Sep 2008 20:34:51 -0600, > "Scott Marlowe" <scott.marlowe@gmail.com> wrote: > > [...] > >> create table t2 ( >> d1 varchar(200), >> d2 int8, >> d3 varchar(1000), >> foreign key t2_fk references t1(c1,c2) ); > > Thanks Scott, I guess you meant: > > CREATE TABLE t2 ( > d1 varchar(200), > d2 int8, > d3 varchar(1000), > PRIMARY KEY (d1, d2) > FOREIGN KEY (d1, d2) REFERENCES t1(c1, c2) ); > > But this makes it difficult to work with t2 because it has 2 fields that > are the same as in t1. Isn't it better to just use a surrogate key and > use a single field in t2, thereby avoiding repeating multiple pieces of > information? Yes and no. If you're gonna hit table t2 a lot by itself, then it's more efficient to have the data there in t2 and not have to join to t1 to get it. There are always use cases that go either way in this kind of situation.
On Wed, Sep 17, 2008 at 7:45 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: >> CREATE TABLE t2 ( >> d1 varchar(200), >> d2 int8, >> d3 varchar(1000), >> PRIMARY KEY (d1, d2) >> FOREIGN KEY (d1, d2) REFERENCES t1(c1, c2) ); >> >> thereby avoiding repeating multiple pieces of >> information? > > Yes and no. If you're gonna hit table t2 a lot by itself, then it's > more efficient to have the data there in t2 and not have to join to t1 > to get it. There are always use cases that go either way in this kind > of situation. Also, doing this allows you to apply more constraints to T2 (if you ever wanted to add them). For example, what if you only wanted to allow a sub-set of T1(c1,c2) in T2(d1,d2), you could use a check constraint to enforce this more restrictive relationship. This wouldn't be possible (without adding custom triggers) if you only used a surrogate key. My experience is that many more validation constraints are possible if use natural keys are used. Whether this feature is a good thing or not is up to you. My opinion is that the database constraints are the last line of defense to ensure business rules and data integrity are not violated. Since I highly value the ability to enforce business rules using ordinary table DDL, I try to use natural keys as often as I can. -- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
>To: pgsql-sql@postgresql.org >From: Seb <spluque@gmail.com> >Subject: Re: surrogate vs natural primary keys >Date: Mon, 15 Sep 2008 17:56:31 -0500 >Organization: Church of Emacs >Lines: 20 >Message-ID: <87hc8h2f34.fsf@patagonia.sebmags.homelinux.org> >References: <87y71t2hlu.fsf@patagonia.sebmags.homelinux.org> > <dcc563d10809151545r70d7d2a7j25931d7557dd88f3@mail.gmail.com> >X-Archive-Number: 200809/101 >X-Sequence-Number: 31553 > >On Mon, 15 Sep 2008 16:45:08 -0600, >"Scott Marlowe" <scott.marlowe@gmail.com> wrote: > >[...] > > > I think this question is a lot like "how large should I set > > shared_buffers?" There's lots of different answers based on how > you > > are using your data. > >Yes, this is precisely what I'm after: *criteria* to help me decide >which approach to take for different scenarios. Such guidance is what >seems to be lacking from most of the discussions I've seen on the >subject. It's hard to distill this information when most of the >discussion is centered on advocating one or the other approach. I think Scott and others have laid out the main ideas in a very cool-headed way already, but here's my follow-on input: I agree with Andrew Sullivan that using industry standard id's as your primary key can be problematic. But I do sometimes apply unique indices to such "industry standard" columns to ensure they are in fact unique and can be a surrogate for the "real" integer/serial primary key. As a rule, I have decided to stay away from "meaningful" (natural) primary keys for these reasons: 1) They sometimes change b/c of business rule changes, forcing technical changes to the relationship model, when only internal table schema changes should be required to support the new business requirements. 2) Generating arbitrary/surrogate keys is easier b/c you can use sequence generators. (When creating a new record, I have to figure out the value of a meaningful column before saving the record which sometimes I don't want to do!) 3) Surrogate keys are guaranteed unique regardless of semantic content of the table. 4) All tables can all join to each other in the same ways: property.id holds the same data type as contact.id. All id fields are the same in type/format. I think there's even a reasonable argument for "globally unique" surrogate keys: all keys for any table use the same sequence of id's. I implemented a system in the 90's that used globally unique id's and it opened up some interesting solutions that I wouldn't have thought of when I started the project (self joins were the same as foreign joins since the id's in both entities were guaranteed unique). I've heard some people argue the use of GUID's for id's but I've been too scared to try that in a real system. Sequential, arbitrary primary keys (as surrogate keys) are predictable though. So if you share those keys with the public (via URL's for example), then competitors can learn information about your business (how fast keys are generated for a certain table for example). That's an argument for random, arbitrary primary keys though, not for compound/meaningful keys. I think natural or compound keys make more sense to DBA's and let you implement some kinds of database solutions more quickly. All in all, I don't really understand the merits of natural keys outside of data warehouse applications. In data warehouses, in my experience, compound natural keys just end up turning into fact tables! :) In summary: I've never heard someone say they've been bitten by using an arbitrary surrogate key system, but I myself have been bitten and have heard lots of stories of problems when using natural keys. I hope this helps some, Steve
At 08:20 AM 9/18/2008, pgsql-sql-owner@postgresql.org wrote: >Message-ID: ><396486430809170920o12782b5dy3837b4aa8611a0c1@mail.gmail.com> >Date: Wed, 17 Sep 2008 09:20:44 -0700 >From: "Richard Broersma" <richard.broersma@gmail.com> >To: "Scott Marlowe" <scott.marlowe@gmail.com> >Subject: Re: surrogate vs natural primary keys >In-Reply-To: ><dcc563d10809170745n24f5ae73i7e6ba3ec73743dbc@mail.gmail.com> >References: <87y71t2hlu.fsf@patagonia.sebmags.homelinux.org> > <dcc563d10809151545r70d7d2a7j25931d7557dd88f3@mail.gmail.com> > <396486430809160759n15ec2134ke0df317f36f74d5a@mail.gmail.com> > <20080916151743.GG201@commandprompt.com> > <87ljxrvdhw.fsf@patagonia.sebmags.homelinux.org> > <dcc563d10809161934q205e4e3bu6b9cd43339ff89b3@mail.gmail.com> > <87abe6vrgl.fsf@patagonia.sebmags.homelinux.org> > <dcc563d10809170745n24f5ae73i7e6ba3ec73743dbc@mail.gmail.com> >X-Archive-Number: 200809/124 >X-Sequence-Number: 31576 > > >My opinion is that the database constraints are the last line of >defense to ensure business rules and data integrity are not violated. >Since I highly value the ability to enforce business rules using >ordinary table DDL, I try to use natural keys as often as I can. > Hi Richard, I often find your comments insightful and right on the money. This is another one of those cases. Your comments above are a great example of when natural keys make sense: I hadn't looked at it from this perspective! I'm a middleware developer (the bane of DBA's!) -- and so I generally "solve" these sorts business rules constraints in the middleware code, which of course is prone to all kinds of different problems (like lazy developers who code around the OO validation checkers!). Thanks for giving such a great explanation as to the value of natural keys! You haven't won me over, but you did teach me something - which I appreciate. Best, Steve