Thread: Avoiding surrogate keys
I think I know what I plan to do, but want to throw this out there to see if there are differing points of view.
I have a mailing list table, and 2 of the columns contain values which have to be from a list. These are country and status. There are 237 possible countries and 3 possible statuses. Now I know some people would assign a sequence ID (surrogate key) to the country and status values, and have them looked up in separate tables to get the textual value, but I think I'll still have those tables, just without an ID column, so 1 column for both the countries and statuses tables. This means storing the proper value in the main table.
So instead of
name, email, country, status
'mr smith', 'emailaddress@example.com', 44, 2
'mrs jones', 'me@emailcompany.com', 21, 1
I'd have
name, email, country, status
'mr smith', 'emailaddress@example.com', 'China', 'Registered'
'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
The values of course would be constrained by foreign key lookup to their associated tables.
Are there any serious downsides to this? If so, what would you recommend?
Thanks
Thom
In response to Thom Brown <thombrown@gmail.com>: > I think I know what I plan to do, but want to throw this out there to see if > there are differing points of view. > > I have a mailing list table, and 2 of the columns contain values which have > to be from a list. These are country and status. There are 237 possible > countries and 3 possible statuses. Now I know some people would assign a > sequence ID (surrogate key) to the country and status values, and have them > looked up in separate tables to get the textual value, but I think I'll > still have those tables, just without an ID column, so 1 column for both the > countries and statuses tables. This means storing the proper value in the > main table. > > So instead of > > name, email, country, status > 'mr smith', 'emailaddress@example.com', 44, 2 > 'mrs jones', 'me@emailcompany.com', 21, 1 > > I'd have > > name, email, country, status > 'mr smith', 'emailaddress@example.com', 'China', 'Registered' > 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed' > > The values of course would be constrained by foreign key lookup to their > associated tables. > > Are there any serious downsides to this? If so, what would you recommend? I'd use an ENUM for the status, as that's not liable to change. The only problem I see with avoiding the surrogate key for the country is that the table might require more disk space if a lot of the country names end up being very long. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/
On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote: > I'd use an ENUM for the status, as that's not liable to change. > > The only problem I see with avoiding the surrogate key for the country > is that the table might require more disk space if a lot of the country > names end up being very long. > Yeah but that isn't generally a problem now and the gain he gets from the lack of join performance is more than worth it. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
On Wed, 21 Apr 2010, Thom Brown wrote: > I have a mailing list table, and 2 of the columns contain values which > have to be from a list. Thom, From 2 lists? > These are country and status. And each is from a separate list, correct? > There are 237 possible countries and 3 possible statuses. > Now I know some people would assign a sequence ID (surrogate key) to the > country and status values, Why? These two fields do not define a unique row, do they? If not, then neither is a candidate key and should be treated as a regular attribute. > ... and have them looked up in separate tables to get the textual value, > but I think I'll still have those tables, just without an ID column, so 1 > column for both the countries and statuses tables. This means storing the > proper value in the main table. You could have a table with two columns: abbreviation and name. Then you could copy that table from the Internet to get the data. > So instead of > > name, email, country, status > 'mr smith', 'emailaddress@example.com', 44, 2 > 'mrs jones', 'me@emailcompany.com', 21, 1 Not only is more work, but it's confusing and unnecessary. > I'd have > > name, email, country, status > 'mr smith', 'emailaddress@example.com', 'China', 'Registered' > 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed' Sure. > Are there any serious downsides to this? If so, what would you recommend? Nope. Not even flippant downsides. Rich
On Wed, Apr 21, 2010 at 3:01 PM, Thom Brown <thombrown@gmail.com> wrote: > I think I know what I plan to do, but want to throw this out there to see if > there are differing points of view. > I have a mailing list table, and 2 of the columns contain values which have > to be from a list. These are country and status. There are 237 possible > countries and 3 possible statuses. Now I know some people would assign a > sequence ID (surrogate key) to the country and status values, and have them > looked up in separate tables to get the textual value, but I think I'll > still have those tables, just without an ID column, so 1 column for both the > countries and statuses tables. This means storing the proper value in the > main table. > So instead of > name, email, country, status > 'mr smith', 'emailaddress@example.com', 44, 2 > 'mrs jones', 'me@emailcompany.com', 21, 1 > I'd have > name, email, country, status > 'mr smith', 'emailaddress@example.com', 'China', 'Registered' > 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed' > The values of course would be constrained by foreign key lookup to their > associated tables. > Are there any serious downsides to this? If so, what would you recommend? Natural keys: *) force formal relationships into your key design (this is good) *) Make your database MUCH easier to follow, browse, and understand *) in particular cases allow you to skip joins *) will make your indexes fatter (this is not good) *) can be a pain if your keys are updated frequently *) can be a major pain if your key changes in structure (adds a field, or changes in type) Surrogate keys: *) Give you faster joins, but more of them (this is a win/loss depending on circumstances) *) Tend to encourage lazy/poor designs, since you hide relationships behind a value *) Make the tables more difficult to browse and understand *) Make updates to keys/key structure trivial I personally use natural keys when I can and surrogates when I have to. When I do use a surrogate, I tend to still define the natural key as primary and simply make a alternate 'unique' constraint for the surrogate. merlin
On Wed, 2010-04-21 at 15:18 -0400, Merlin Moncure wrote: > Natural keys: > *) force formal relationships into your key design (this is good) > *) Make your database MUCH easier to follow, browse, and understand > *) in particular cases allow you to skip joins > *) will make your indexes fatter (this is not good) > *) can be a pain if your keys are updated frequently > *) can be a major pain if your key changes in structure (adds a field, > or changes in type) > > Surrogate keys: > *) Give you faster joins, but more of them (this is a win/loss > depending on circumstances) Kind of... natural keys can remove the need for the join in the first place, which is certainly faster than joining :P Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
On 21 April 2010 20:18, Merlin Moncure <mmoncure@gmail.com> wrote:
Natural keys:On Wed, Apr 21, 2010 at 3:01 PM, Thom Brown <thombrown@gmail.com> wrote:
> I think I know what I plan to do, but want to throw this out there to see if
> there are differing points of view.
> I have a mailing list table, and 2 of the columns contain values which have
> to be from a list. These are country and status. There are 237 possible
> countries and 3 possible statuses. Now I know some people would assign a
> sequence ID (surrogate key) to the country and status values, and have them
> looked up in separate tables to get the textual value, but I think I'll
> still have those tables, just without an ID column, so 1 column for both the
> countries and statuses tables. This means storing the proper value in the
> main table.
> So instead of
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 44, 2
> 'mrs jones', 'me@emailcompany.com', 21, 1
> I'd have
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
> The values of course would be constrained by foreign key lookup to their
> associated tables.
> Are there any serious downsides to this? If so, what would you recommend?
*) force formal relationships into your key design (this is good)
*) Make your database MUCH easier to follow, browse, and understand
*) in particular cases allow you to skip joins
*) will make your indexes fatter (this is not good)
*) can be a pain if your keys are updated frequently
*) can be a major pain if your key changes in structure (adds a field,
or changes in type)
Surrogate keys:
*) Give you faster joins, but more of them (this is a win/loss
depending on circumstances)
*) Tend to encourage lazy/poor designs, since you hide relationships
behind a value
*) Make the tables more difficult to browse and understand
*) Make updates to keys/key structure trivial
I personally use natural keys when I can and surrogates when I have
to. When I do use a surrogate, I tend to still define the natural key
as primary and simply make a alternate 'unique' constraint for the
surrogate.
merlin
Thanks for the comments guys. I'm now pretty sure using natural keys is the right approach. Neither list will ever change type, and if they change, it'll be quite infrequent. Index size shouldn't really be a problem since we're realistically talking thousands of rows.
What I hate about surrogate keys is having to keep looking stuff up:
INSERT INTO stuff (col_a, col_b)
SELECT 'my_value', table_b.id
FROM table_b
WHERE table_b.real_value = 'PostgreSQL';
As opposed to:
INSERT INTO stuff (col_a, col_b) VALUES ('my_value', 'PostgreSQL');
Just making sure I don't fall into the surrogate key abuse trap. :)
Thom
In response to "Joshua D. Drake" <jd@commandprompt.com>: > On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote: > > I'd use an ENUM for the status, as that's not liable to change. > > > > The only problem I see with avoiding the surrogate key for the country > > is that the table might require more disk space if a lot of the country > > names end up being very long. > > > > Yeah but that isn't generally a problem now and the gain he gets from > the lack of join performance is more than worth it. Agreed. I guess I didn't explain properly: the disk space _may_ be an issue if you have LOTS of REALLY LONG names. Like, if the shortest name in your country table is 'Federated States of Micronesia (Esquire)', then the disk space used by the table and index might become an issue. Also, if this is an embedded application where disk is at a premium ... As with many things, "big" and "small" are relative, ambiguous and copy-written by someone who thinks they can charge me every time I use them. One thing that a lot of people seem to get confused about is that they subconsciously think that ints or bigints take up less space when the numbers are small. I.e.: I want to use an int for my state identifier instead of the 2-digit code, because it will use less space -- wrong, an int is 4 bytes, but a 2 byte char column is 1/2 that ... even if the number never gets higher than 50. Personally, I think the only good reason to use a surrogate key is when there's a high likelihood that your primary text identifier might change. Although ON UPDATE CASCADE can even handle that, it just might take a while. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/
Bill Moran escribió: > One thing that a lot of people seem to get confused about is that they > subconsciously think that ints or bigints take up less space when the > numbers are small. I.e.: I want to use an int for my state identifier > instead of the 2-digit code, because it will use less space -- wrong, > an int is 4 bytes, but a 2 byte char column is 1/2 that ... even if the > number never gets higher than 50. Eh, a 2 byte char column uses 3 bytes -- there's one byte of overhead. (Unless one of the chars is multibyte in which case it can be longer). Earlier versions of Postgres use 6 bytes to store the 2 chars (4 bytes of overhead), so it would be larger than the int. Not that this invalidates the argument -- just nitpicking here. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Wed, Apr 21, 2010 at 3:32 PM, Bill Moran <wmoran@potentialtech.com> wrote: > In response to "Joshua D. Drake" <jd@commandprompt.com>: > >> On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote: >> > I'd use an ENUM for the status, as that's not liable to change. >> > >> > The only problem I see with avoiding the surrogate key for the country >> > is that the table might require more disk space if a lot of the country >> > names end up being very long. >> > >> >> Yeah but that isn't generally a problem now and the gain he gets from >> the lack of join performance is more than worth it. > > Agreed. I guess I didn't explain properly: the disk space _may_ be an > issue if you have LOTS of REALLY LONG names. > > Like, if the shortest name in your country table is 'Federated States > of Micronesia (Esquire)', then the disk space used by the table and > index might become an issue. > > Also, if this is an embedded application where disk is at a premium ... > > As with many things, "big" and "small" are relative, ambiguous and > copy-written by someone who thinks they can charge me every time I > use them. > > One thing that a lot of people seem to get confused about is that they > subconsciously think that ints or bigints take up less space when the > numbers are small. I.e.: I want to use an int for my state identifier > instead of the 2-digit code, because it will use less space -- wrong, > an int is 4 bytes, but a 2 byte char column is 1/2 that ... even if the > number never gets higher than 50. > > Personally, I think the only good reason to use a surrogate key is when > there's a high likelihood that your primary text identifier might change. > Although ON UPDATE CASCADE can even handle that, it just might take a > while. Another semi-related case is when the keys are large which in turn gives you larger indexes. Large indexes pressure your cache which in turn gives you more disk faults which can really nail you if your database is large. Both of these reasons fall away if/when storage catches up with the rest of hardware IMO. Another penalty of surrogate keys I neglected to mention is more sorts. It's typical to want to pull data in natural order which is obfuscated behind the surrogate key. So besides giving you a 'free join', you get a free sort as well if you pull data off the index. Interesting aside: I would estimate that a fairly large percentage of -performance problems (say, 25%) can be directly traced to poor indexing strategy and loose relationships between tables. Natural keys tend to deflect both of those problems. merlin
bill-
static information such as country names *should be* de-normalised into non-indexed columns of the driving table as you have already done
if on the other hand the column information country names were changing were dynamic then you would want to keep them in their respective table
as country code 001 will always be US (and the remaining countries and country code will never change)
i would suggest keeping the full name in the driving table (same goes with state/province data btw)
it also saves the database the I/O and CPU from having to do a lookup on another table
my 2 cents
martin-
______________________________________________
Verzicht und Vertraulichkeitanmerkung/Note de déni et de confidentialité
Diese Nachricht ist vertraulich. Sollten Sie nicht der vorgesehene Empfaenger sein, so bitten wir hoeflich um eine Mitteilung. Jede unbefugte Weiterleitung oder Fertigung einer Kopie ist unzulaessig. Diese Nachricht dient lediglich dem Austausch von Informationen und entfaltet keine rechtliche Bindungswirkung. Aufgrund der leichten Manipulierbarkeit von E-Mails koennen wir keine Haftung fuer den Inhalt uebernehmen.
> Date: Wed, 21 Apr 2010 15:07:35 -0400
> From: wmoran@potentialtech.com
> To: thombrown@gmail.com
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Avoiding surrogate keys
>
> In response to Thom Brown <thombrown@gmail.com>:
>
> > I think I know what I plan to do, but want to throw this out there to see if
> > there are differing points of view.
> >
> > I have a mailing list table, and 2 of the columns contain values which have
> > to be from a list. These are country and status. There are 237 possible
> > countries and 3 possible statuses. Now I know some people would assign a
> > sequence ID (surrogate key) to the country and status values, and have them
> > looked up in separate tables to get the textual value, but I think I'll
> > still have those tables, just without an ID column, so 1 column for both the
> > countries and statuses tables. This means storing the proper value in the
> > main table.
> >
> > So instead of
> >
> > name, email, country, status
> > 'mr smith', 'emailaddress@example.com', 44, 2
> > 'mrs jones', 'me@emailcompany.com', 21, 1
> >
> > I'd have
> >
> > name, email, country, status
> > 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> > 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
> >
> > The values of course would be constrained by foreign key lookup to their
> > associated tables.
> >
> > Are there any serious downsides to this? If so, what would you recommend?
>
> I'd use an ENUM for the status, as that's not liable to change.
>
> The only problem I see with avoiding the surrogate key for the country
> is that the table might require more disk space if a lot of the country
> names end up being very long.
>
> --
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
Hotmail has tools for the New Busy. Search, chat and e-mail from your inbox. Learn more.
static information such as country names *should be* de-normalised into non-indexed columns of the driving table as you have already done
if on the other hand the column information country names were changing were dynamic then you would want to keep them in their respective table
as country code 001 will always be US (and the remaining countries and country code will never change)
i would suggest keeping the full name in the driving table (same goes with state/province data btw)
it also saves the database the I/O and CPU from having to do a lookup on another table
my 2 cents
martin-
______________________________________________
Verzicht und Vertraulichkeitanmerkung/Note de déni et de confidentialité
Diese Nachricht ist vertraulich. Sollten Sie nicht der vorgesehene Empfaenger sein, so bitten wir hoeflich um eine Mitteilung. Jede unbefugte Weiterleitung oder Fertigung einer Kopie ist unzulaessig. Diese Nachricht dient lediglich dem Austausch von Informationen und entfaltet keine rechtliche Bindungswirkung. Aufgrund der leichten Manipulierbarkeit von E-Mails koennen wir keine Haftung fuer den Inhalt uebernehmen.
Ce message est confidentiel et peut être privilégié. Si vous n'êtes pas le destinataire prévu, nous te demandons avec bonté que pour satisfaire informez l'expéditeur. N'importe quelle diffusion non autorisée ou la copie de ceci est interdite. Ce message sert à l'information seulement et n'aura pas n'importe quel effet légalement obligatoire. Étant donné que les email peuvent facilement être sujets à la manipulation, nous ne pouvons accepter aucune responsabilité pour le contenu fourni.
> Date: Wed, 21 Apr 2010 15:07:35 -0400
> From: wmoran@potentialtech.com
> To: thombrown@gmail.com
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Avoiding surrogate keys
>
> In response to Thom Brown <thombrown@gmail.com>:
>
> > I think I know what I plan to do, but want to throw this out there to see if
> > there are differing points of view.
> >
> > I have a mailing list table, and 2 of the columns contain values which have
> > to be from a list. These are country and status. There are 237 possible
> > countries and 3 possible statuses. Now I know some people would assign a
> > sequence ID (surrogate key) to the country and status values, and have them
> > looked up in separate tables to get the textual value, but I think I'll
> > still have those tables, just without an ID column, so 1 column for both the
> > countries and statuses tables. This means storing the proper value in the
> > main table.
> >
> > So instead of
> >
> > name, email, country, status
> > 'mr smith', 'emailaddress@example.com', 44, 2
> > 'mrs jones', 'me@emailcompany.com', 21, 1
> >
> > I'd have
> >
> > name, email, country, status
> > 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> > 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
> >
> > The values of course would be constrained by foreign key lookup to their
> > associated tables.
> >
> > Are there any serious downsides to this? If so, what would you recommend?
>
> I'd use an ENUM for the status, as that's not liable to change.
>
> The only problem I see with avoiding the surrogate key for the country
> is that the table might require more disk space if a lot of the country
> names end up being very long.
>
> --
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
Hotmail has tools for the New Busy. Search, chat and e-mail from your inbox. Learn more.
I agree, as long as one needs the country names in only one language. -Mark (Sorry Martin- forgot to "Reply to all" the last time) ________________________________________ De : pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] De la part de Martin Gainty Envoyé : 21 avril 2010 16:38 À : wmoran@potentialtech.com Cc : pgsql-general@postgresql.org Objet : Re: [GENERAL] Avoiding surrogate keys ... static information such as country names *should be* de-normalised into non-indexed columns of the driving table as you have already done if on the other hand the column information country names were changing were dynamic then you would want to keep them in their respective table as country code 001 will always be US (and the remaining countries and country code will never change) i would suggest keeping the full name in the driving table (same goes with state/province data btw) ...
On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote: > I'd use an ENUM for the status, as that's not liable to change. > > The only problem I see with avoiding the surrogate key for the country > is that the table might require more disk space if a lot of the country > names end up being very long. > Yeah but that isn't generally a problem now and the gain he gets from the lack of join performance is more than worth it. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
On Wed, 2010-04-21 at 15:18 -0400, Merlin Moncure wrote: > Natural keys: > *) force formal relationships into your key design (this is good) > *) Make your database MUCH easier to follow, browse, and understand > *) in particular cases allow you to skip joins > *) will make your indexes fatter (this is not good) > *) can be a pain if your keys are updated frequently > *) can be a major pain if your key changes in structure (adds a field, > or changes in type) > > Surrogate keys: > *) Give you faster joins, but more of them (this is a win/loss > depending on circumstances) Kind of... natural keys can remove the need for the join in the first place, which is certainly faster than joining :P Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
Hi, I think nobody mentioned Object-Relational mappers. If you intend to used one (or think you may be using one in the future),using surrogate keys is more straightforward, if not necessary. Best regards, ------------------------------------------------------------- Attik System web : http://www.attiksystem.ch Philippe Lang phone: +41 26 422 13 75 rte de la Fonderie 2 gsm : +41 79 351 49 94 1700 Fribourg pgp : http://keyserver.pgp.com > -----Message d'origine----- > De : pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] De la part de Thom Brown > Envoyé : mercredi 21 avril 2010 21:01 > À : PGSQL Mailing List > Objet : [GENERAL] Avoiding surrogate keys > > I think I know what I plan to do, but want to throw this out there to > see if there are differing points of view. > > I have a mailing list table, and 2 of the columns contain values which > have to be from a list. These are country and status. There are 237 > possible countries and 3 possible statuses. Now I know some people > would assign a sequence ID (surrogate key) to the country and status > values, and have them looked up in separate tables to get the textual > value, but I think I'll still have those tables, just without an ID > column, so 1 column for both the countries and statuses tables. This > means storing the proper value in the main table. > > So instead of > > name, email, country, status > 'mr smith', 'emailaddress@example.com', 44, 2 > 'mrs jones', 'me@emailcompany.com', 21, 1 > > I'd have > > name, email, country, status > 'mr smith', 'emailaddress@example.com', 'China', 'Registered' > 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed' > > The values of course would be constrained by foreign key lookup to > their associated tables. > > Are there any serious downsides to this? If so, what would you > recommend? > > Thanks > > Thom
Attachment
Philippe Lang wrote: > I think nobody mentioned Object-Relational mappers. > If you intend to used one (or think you may be using one in the future), > using surrogate keys is more straightforward, if not necessary. Neither of those claims is even slightly true. Using Hibernate, EclipseLink or OpenJPA (for Java applications), natural keys are sufficient and far more straightforward than surrogate keys. -- Lew
On Sat, May 1, 2010 at 12:09 PM, Lew <noone@lwsc.ehost-services.com> wrote: > Philippe Lang wrote: >> >> I think nobody mentioned Object-Relational mappers. >> If you intend to used one (or think you may be using one in the future), >> using surrogate keys is more straightforward, if not necessary. > > Neither of those claims is even slightly true. Using Hibernate, EclipseLink > or OpenJPA (for Java applications), natural keys are sufficient and far more > straightforward than surrogate keys. right -- to be fair though is quite a bit of (generally bad) software out there that assumes or at least heavily encourages surrogate keys. merlin
If your 'natural key' is a large text field, I'd have to assume there's some point at which a surrogate index would be more efficient. Would this be above a few dozen characters, or a few 100 characters? I wouldn't want a PK based on a multi-K byte text field for a table that has many 10s or 100s of 1000s of rows, for sure.
On Sat, May 1, 2010 at 4:14 PM, John R Pierce <pierce@hogranch.com> wrote: > > If your 'natural key' is a large text field, I'd have to assume there's some > point at which a surrogate index would be more efficient. Would this be > above a few dozen characters, or a few 100 characters? I wouldn't want a > PK based on a multi-K byte text field for a table that has many 10s or 100s > of 1000s of rows, for sure. Well, yes, but: *) most natural keys are small, between 4-10 bytes *) regardless of the data type of the key, the number of btree lookups is going to remain approximately the same *) you have to (or at least should) put the index on anyway as unique constraint. you do pay the price in dependent tables however. so the natural _primary_ key is free. it's the foreign keys where you pay. the two main performance issues with natural keys are this (you kinda touched on one): *) the index is fatter, pressuring cache It's not so much the comparison function but the fact that the larger index(es) require more memory. If drives were faster than they were this wouldn't matter as much -- I expect this to become less of a factor as SSD technology improves. This can somewhat modulated by clustering the index...you get a better chance of reading multiple relevant records on a single page. *) cascading updates If your key is in a lot of places and has to be updated it can cause a mess. Lots of locks, dead space, vacuuming, etc. Most of the time primary keys don't change very much but if they do you had better give it fair consideration. Natural keys have a lot of performance advantages as mentioned upthread. merlin
On Tue, May 4, 2010 at 9:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Sat, May 1, 2010 at 4:14 PM, John R Pierce <pierce@hogranch.com> wrote: >> >> If your 'natural key' is a large text field, I'd have to assume there's some >> point at which a surrogate index would be more efficient. Would this be >> above a few dozen characters, or a few 100 characters? I wouldn't want a >> PK based on a multi-K byte text field for a table that has many 10s or 100s >> of 1000s of rows, for sure. one more note about this. if you truly have a situation where a multi kilobyte chunk of data is the key, you can always digest it and use that. you lose the natural ordering -- but in these type of cases it usually doesn't matter. merlin
On Tue, May 4, 2010 at 3:16 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, May 4, 2010 at 9:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Sat, May 1, 2010 at 4:14 PM, John R Pierce <pierce@hogranch.com> wrote: >>> >>> If your 'natural key' is a large text field, I'd have to assume there's some >>> point at which a surrogate index would be more efficient. Would this be >>> above a few dozen characters, or a few 100 characters? I wouldn't want a >>> PK based on a multi-K byte text field for a table that has many 10s or 100s >>> of 1000s of rows, for sure. > > one more note about this. if you truly have a situation where a multi > kilobyte chunk of data is the key, you can always digest it and use > that. you lose the natural ordering -- but in these type of cases it > usually doesn't matter. > 99% of all cases, where I introduced bigint surrogate key, in tables that hold account, customers, and that sort of bollocks, were straight wins, sometimes measuring 3-4x. Another thing, If your PK changes, it is no longer a PK, you can't rely on it. -- GJ
2010/5/4 Grzegorz Jaśkiewicz <gryzman@gmail.com>: > > Another thing, If your PK changes, it is no longer a PK, you can't rely on it. Depending upon what you mean by changes this could be true or it could be a matter of opinion. If your referring to a candidate key's value changes, this key still provides a useful way to identify a tuple. Also, just because a surrogate key is a useful way to identify a row, by itself it does nothing to uniquely identify an entity that a row represents. On some DBMS forums that I frequent, it is a daily occurrence to see questions on how to eliminate duplicate rows. So, for some people, there are times when even surrogate keys cannot be relied upon as a PK. -- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
the rule of thumb for me is: - if you have more than one column as PK - and are variable length, or more than 2 columns, fixed length, no bigger than 8 bytes - go for surrogate - always. - if PK is variable length, on average longer than 8 bytes, or can change - go surrogate. - Otherwise leave it as it is. At the end of a day, joining tables, using that have more than 1 variable length key, kills performance. In essence it makes index search a PITA, and should be avoided. This should always be however decision based on query performance, usage, and data types. Never a textbook A or B type of thing.