Home > mailing lists

Thread: Avoiding surrogate keys

Avoiding surrogate keys

From

Thom Brown

Date:

21 April 2010, 16:01:57

I think I know what I plan to do, but want to throw this out there to see if there are differing points of view.

I have a mailing list table, and 2 of the columns contain values which have to be from a list. These are country and status. There are 237 possible countries and 3 possible statuses. Now I know some people would assign a sequence ID (surrogate key) to the country and status values, and have them looked up in separate tables to get the textual value, but I think I'll still have those tables, just without an ID column, so 1 column for both the countries and statuses tables. This means storing the proper value in the main table.

So instead of

name, email, country, status

'mr smith', 'emailaddress@example.com', 44, 2

'mrs jones', 'me@emailcompany.com', 21, 1

I'd have

name, email, country, status

'mr smith', 'emailaddress@example.com', 'China', 'Registered'

'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'

The values of course would be constrained by foreign key lookup to their associated tables.

Are there any serious downsides to this? If so, what would you recommend?

Thanks

Thom

Re: Avoiding surrogate keys

From

Bill Moran

Date:

21 April 2010, 16:07:49

In response to Thom Brown <thombrown@gmail.com>:

> I think I know what I plan to do, but want to throw this out there to see if
> there are differing points of view.
>
> I have a mailing list table, and 2 of the columns contain values which have
> to be from a list.  These are country and status.  There are 237 possible
> countries and 3 possible statuses.  Now I know some people would assign a
> sequence ID (surrogate key) to the country and status values, and have them
> looked up in separate tables to get the textual value, but I think I'll
> still have those tables, just without an ID column, so 1 column for both the
> countries and statuses tables.  This means storing the proper value in the
> main table.
>
> So instead of
>
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 44, 2
> 'mrs jones', 'me@emailcompany.com', 21, 1
>
> I'd have
>
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
>
> The values of course would be constrained by foreign key lookup to their
> associated tables.
>
> Are there any serious downsides to this?  If so, what would you recommend?

I'd use an ENUM for the status, as that's not liable to change.

The only problem I see with avoiding the surrogate key for the country
is that the table might require more disk space if a lot of the country
names end up being very long.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Avoiding surrogate keys

From

"Joshua D. Drake"

Date:

21 April 2010, 16:14:35

On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote:
> I'd use an ENUM for the status, as that's not liable to change.
>
> The only problem I see with avoiding the surrogate key for the country
> is that the table might require more disk space if a lot of the country
> names end up being very long.
>

Yeah but that isn't generally a problem now and the gain he gets from
the lack of join performance is more than worth it.

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering

Re: Avoiding surrogate keys

From

Rich Shepard

Date:

21 April 2010, 16:17:38

On Wed, 21 Apr 2010, Thom Brown wrote:

> I have a mailing list table, and 2 of the columns contain values which
> have to be from a list.

Thom,

   From 2 lists?

> These are country and status.

   And each is from a separate list, correct?

> There are 237 possible countries and 3 possible statuses.

> Now I know some people would assign a sequence ID (surrogate key) to the
> country and status values,

   Why? These two fields do not define a unique row, do they? If not, then
neither is a candidate key and should be treated as a regular attribute.

> ... and have them looked up in separate tables to get the textual value,
> but I think I'll still have those tables, just without an ID column, so 1
> column for both the countries and statuses tables.  This means storing the
> proper value in the main table.

   You could have a table with two columns: abbreviation and name. Then you
could copy that table from the Internet to get the data.

> So instead of
>
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 44, 2
> 'mrs jones', 'me@emailcompany.com', 21, 1

   Not only is more work, but it's confusing and unnecessary.

> I'd have
>
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'

   Sure.

> Are there any serious downsides to this?  If so, what would you recommend?

   Nope. Not even flippant downsides.

Rich

Re: Avoiding surrogate keys

From

Merlin Moncure

Date:

21 April 2010, 16:18:59

On Wed, Apr 21, 2010 at 3:01 PM, Thom Brown <thombrown@gmail.com> wrote:
> I think I know what I plan to do, but want to throw this out there to see if
> there are differing points of view.
> I have a mailing list table, and 2 of the columns contain values which have
> to be from a list.  These are country and status.  There are 237 possible
> countries and 3 possible statuses.  Now I know some people would assign a
> sequence ID (surrogate key) to the country and status values, and have them
> looked up in separate tables to get the textual value, but I think I'll
> still have those tables, just without an ID column, so 1 column for both the
> countries and statuses tables.  This means storing the proper value in the
> main table.
> So instead of
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 44, 2
> 'mrs jones', 'me@emailcompany.com', 21, 1
> I'd have
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
> The values of course would be constrained by foreign key lookup to their
> associated tables.
> Are there any serious downsides to this?  If so, what would you recommend?

Natural keys:
*) force formal relationships into your key design (this is good)
*) Make your database MUCH easier to follow, browse, and understand
*) in particular cases allow you to skip joins
*) will make your indexes fatter (this is not good)
*) can be a pain if your keys are updated frequently
*) can be a major pain if your key changes in structure (adds a field,
or changes in type)

Surrogate keys:
*) Give you faster joins, but more of them (this is a win/loss
depending on circumstances)
*) Tend to encourage lazy/poor designs, since you hide relationships
behind a value
*) Make the tables more difficult to browse and understand
*) Make updates to keys/key structure trivial

I personally use natural keys when I can and surrogates when I have
to.  When I do use a surrogate, I tend to still define the natural key
as primary and simply make a alternate 'unique' constraint for the
surrogate.

merlin

Re: Avoiding surrogate keys

From

"Joshua D. Drake"

Date:

21 April 2010, 16:25:30

On Wed, 2010-04-21 at 15:18 -0400, Merlin Moncure wrote:

> Natural keys:
> *) force formal relationships into your key design (this is good)
> *) Make your database MUCH easier to follow, browse, and understand
> *) in particular cases allow you to skip joins
> *) will make your indexes fatter (this is not good)
> *) can be a pain if your keys are updated frequently
> *) can be a major pain if your key changes in structure (adds a field,
> or changes in type)
>
> Surrogate keys:
> *) Give you faster joins, but more of them (this is a win/loss
> depending on circumstances)

Kind of... natural keys can remove the need for the join in the first
place, which is certainly faster than joining :P

Joshua D. Drake



--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering

Re: Avoiding surrogate keys

From

Thom Brown

Date:

21 April 2010, 16:31:17

On 21 April 2010 20:18, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Apr 21, 2010 at 3:01 PM, Thom Brown <thombrown@gmail.com> wrote:
> I think I know what I plan to do, but want to throw this out there to see if
> there are differing points of view.
> I have a mailing list table, and 2 of the columns contain values which have
> to be from a list. These are country and status. There are 237 possible
> countries and 3 possible statuses. Now I know some people would assign a
> sequence ID (surrogate key) to the country and status values, and have them
> looked up in separate tables to get the textual value, but I think I'll
> still have those tables, just without an ID column, so 1 column for both the
> countries and statuses tables. This means storing the proper value in the
> main table.
> So instead of
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 44, 2
> 'mrs jones', 'me@emailcompany.com', 21, 1
> I'd have
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
> The values of course would be constrained by foreign key lookup to their
> associated tables.
> Are there any serious downsides to this? If so, what would you recommend?

Natural keys:
*) force formal relationships into your key design (this is good)
*) Make your database MUCH easier to follow, browse, and understand
*) in particular cases allow you to skip joins
*) will make your indexes fatter (this is not good)
*) can be a pain if your keys are updated frequently
*) can be a major pain if your key changes in structure (adds a field,
or changes in type)

Surrogate keys:
*) Give you faster joins, but more of them (this is a win/loss
depending on circumstances)
*) Tend to encourage lazy/poor designs, since you hide relationships
behind a value
*) Make the tables more difficult to browse and understand
*) Make updates to keys/key structure trivial

I personally use natural keys when I can and surrogates when I have
to. When I do use a surrogate, I tend to still define the natural key
as primary and simply make a alternate 'unique' constraint for the
surrogate.

merlin

Thanks for the comments guys. I'm now pretty sure using natural keys is the right approach. Neither list will ever change type, and if they change, it'll be quite infrequent. Index size shouldn't really be a problem since we're realistically talking thousands of rows.

What I hate about surrogate keys is having to keep looking stuff up:

INSERT INTO stuff (col_a, col_b)

SELECT 'my_value', table_b.id

FROM table_b

WHERE table_b.real_value = 'PostgreSQL';

As opposed to:

INSERT INTO stuff (col_a, col_b) VALUES ('my_value', 'PostgreSQL');

Just making sure I don't fall into the surrogate key abuse trap. :)

Thom

Re: Avoiding surrogate keys

From

Bill Moran

Date:

21 April 2010, 16:33:17

In response to "Joshua D. Drake" <jd@commandprompt.com>:

> On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote:
> > I'd use an ENUM for the status, as that's not liable to change.
> >
> > The only problem I see with avoiding the surrogate key for the country
> > is that the table might require more disk space if a lot of the country
> > names end up being very long.
> >
>
> Yeah but that isn't generally a problem now and the gain he gets from
> the lack of join performance is more than worth it.

Agreed.  I guess I didn't explain properly: the disk space _may_ be an
issue if you have LOTS of REALLY LONG names.

Like, if the shortest name in your country table is 'Federated States
of Micronesia (Esquire)', then the disk space used by the table and
index might become an issue.

Also, if this is an embedded application where disk is at a premium ...

As with many things, "big" and "small" are relative, ambiguous and
copy-written by someone who thinks they can charge me every time I
use them.

One thing that a lot of people seem to get confused about is that they
subconsciously think that ints or bigints take up less space when the
numbers are small.  I.e.: I want to use an int for my state identifier
instead of the 2-digit code, because it will use less space -- wrong,
an int is 4 bytes, but a 2 byte char column is 1/2 that ... even if the
number never gets higher than 50.

Personally, I think the only good reason to use a surrogate key is when
there's a high likelihood that your primary text identifier might change.
Although ON UPDATE CASCADE can even handle that, it just might take a
while.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Avoiding surrogate keys

From

Alvaro Herrera

Date:

21 April 2010, 16:43:39

Bill Moran escribió:

> One thing that a lot of people seem to get confused about is that they
> subconsciously think that ints or bigints take up less space when the
> numbers are small.  I.e.: I want to use an int for my state identifier
> instead of the 2-digit code, because it will use less space -- wrong,
> an int is 4 bytes, but a 2 byte char column is 1/2 that ... even if the
> number never gets higher than 50.

Eh, a 2 byte char column uses 3 bytes -- there's one byte of overhead.
(Unless one of the chars is multibyte in which case it can be longer).
Earlier versions of Postgres use 6 bytes to store the 2 chars (4 bytes
of overhead), so it would be larger than the int.

Not that this invalidates the argument -- just nitpicking here.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Avoiding surrogate keys

From

Merlin Moncure

Date:

21 April 2010, 16:56:23

On Wed, Apr 21, 2010 at 3:32 PM, Bill Moran <wmoran@potentialtech.com> wrote:
> In response to "Joshua D. Drake" <jd@commandprompt.com>:
>
>> On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote:
>> > I'd use an ENUM for the status, as that's not liable to change.
>> >
>> > The only problem I see with avoiding the surrogate key for the country
>> > is that the table might require more disk space if a lot of the country
>> > names end up being very long.
>> >
>>
>> Yeah but that isn't generally a problem now and the gain he gets from
>> the lack of join performance is more than worth it.
>
> Agreed.  I guess I didn't explain properly: the disk space _may_ be an
> issue if you have LOTS of REALLY LONG names.
>
> Like, if the shortest name in your country table is 'Federated States
> of Micronesia (Esquire)', then the disk space used by the table and
> index might become an issue.
>
> Also, if this is an embedded application where disk is at a premium ...
>
> As with many things, "big" and "small" are relative, ambiguous and
> copy-written by someone who thinks they can charge me every time I
> use them.
>
> One thing that a lot of people seem to get confused about is that they
> subconsciously think that ints or bigints take up less space when the
> numbers are small.  I.e.: I want to use an int for my state identifier
> instead of the 2-digit code, because it will use less space -- wrong,
> an int is 4 bytes, but a 2 byte char column is 1/2 that ... even if the
> number never gets higher than 50.
>
> Personally, I think the only good reason to use a surrogate key is when
> there's a high likelihood that your primary text identifier might change.
> Although ON UPDATE CASCADE can even handle that, it just might take a
> while.

Another semi-related case is when the keys are large which in turn
gives you larger indexes.  Large indexes pressure your cache which in
turn gives you more disk faults which can really nail you if your
database is large.  Both of these reasons fall away if/when storage
catches up with the rest of hardware IMO.

Another penalty of surrogate keys I neglected to mention is more
sorts.  It's typical to want to pull data in natural order which is
obfuscated behind the surrogate key.  So besides giving you a 'free
join', you get a free sort as well if you pull data off the index.

Interesting aside: I would estimate that a fairly large percentage of
-performance problems (say, 25%) can be directly traced to poor
indexing strategy and loose relationships between tables.  Natural
keys tend to deflect both of those problems.

merlin

Re: Avoiding surrogate keys

From

Martin Gainty

Date:

21 April 2010, 17:37:57

bill-

static information such as country names *should be* de-normalised into non-indexed columns of the driving table as you have already done

if on the other hand the column information country names were changing were dynamic then you would want to keep them in their respective table

as country code 001 will always be US (and the remaining countries and country code will never change)
i would suggest keeping the full name in the driving table (same goes with state/province data btw)

it also saves the database the I/O and CPU from having to do a lookup on another table

my 2 cents
martin-
______________________________________________
Verzicht und Vertraulichkeitanmerkung/Note de déni et de confidentialité

Diese Nachricht ist vertraulich. Sollten Sie nicht der vorgesehene Empfaenger sein, so bitten wir hoeflich um eine Mitteilung. Jede unbefugte Weiterleitung oder Fertigung einer Kopie ist unzulaessig. Diese Nachricht dient lediglich dem Austausch von Informationen und entfaltet keine rechtliche Bindungswirkung. Aufgrund der leichten Manipulierbarkeit von E-Mails koennen wir keine Haftung fuer den Inhalt uebernehmen.

Ce message est confidentiel et peut être privilégié. Si vous n'êtes pas le destinataire prévu, nous te demandons avec bonté que pour satisfaire informez l'expéditeur. N'importe quelle diffusion non autorisée ou la copie de ceci est interdite. Ce message sert à l'information seulement et n'aura pas n'importe quel effet légalement obligatoire. Étant donné que les email peuvent facilement être sujets à la manipulation, nous ne pouvons accepter aucune responsabilité pour le contenu fourni.

> Date: Wed, 21 Apr 2010 15:07:35 -0400
> From: wmoran@potentialtech.com
> To: thombrown@gmail.com
> CC: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Avoiding surrogate keys
>
> In response to Thom Brown <thombrown@gmail.com>:
>
> > I think I know what I plan to do, but want to throw this out there to see if
> > there are differing points of view.
> >
> > I have a mailing list table, and 2 of the columns contain values which have
> > to be from a list. These are country and status. There are 237 possible
> > countries and 3 possible statuses. Now I know some people would assign a
> > sequence ID (surrogate key) to the country and status values, and have them
> > looked up in separate tables to get the textual value, but I think I'll
> > still have those tables, just without an ID column, so 1 column for both the
> > countries and statuses tables. This means storing the proper value in the
> > main table.
> >
> > So instead of
> >
> > name, email, country, status
> > 'mr smith', 'emailaddress@example.com', 44, 2
> > 'mrs jones', 'me@emailcompany.com', 21, 1
> >
> > I'd have
> >
> > name, email, country, status
> > 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> > 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
> >
> > The values of course would be constrained by foreign key lookup to their
> > associated tables.
> >
> > Are there any serious downsides to this? If so, what would you recommend?
>
> I'd use an ENUM for the status, as that's not liable to change.
>
> The only problem I see with avoiding the surrogate key for the country
> is that the table might require more disk space if a lot of the country
> names end up being very long.
>
> --
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Hotmail has tools for the New Busy. Search, chat and e-mail from your inbox. Learn more.

Re: Avoiding surrogate keys

From

"Mark Watson"

Date:

21 April 2010, 18:06:11

I agree, as long as one needs the country names in only one language.
-Mark
(Sorry Martin- forgot to "Reply to all" the last time)
________________________________________
De : pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] De la part de Martin Gainty
Envoyé : 21 avril 2010 16:38 À : wmoran@potentialtech.com Cc :
pgsql-general@postgresql.org Objet : Re: [GENERAL] Avoiding surrogate keys

...

static information such as country names *should be* de-normalised into
non-indexed columns of the driving table as you have already done

if on the other hand the column information country names were changing were
dynamic then you would want to keep them in their respective table

as country code 001 will always be US (and the remaining countries and
country code will never change) i would suggest keeping the full name in the
driving table (same goes with state/province data btw)

...

Re: Avoiding surrogate keys

From

"Joshua D. Drake"

Date:

21 April 2010, 23:14:21

On Wed, 2010-04-21 at 15:07 -0400, Bill Moran wrote:
> I'd use an ENUM for the status, as that's not liable to change.
>
> The only problem I see with avoiding the surrogate key for the country
> is that the table might require more disk space if a lot of the country
> names end up being very long.
>

Yeah but that isn't generally a problem now and the gain he gets from
the lack of join performance is more than worth it.

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering

Re: Avoiding surrogate keys

From

"Joshua D. Drake"

Date:

21 April 2010, 23:25:16

On Wed, 2010-04-21 at 15:18 -0400, Merlin Moncure wrote:

> Natural keys:
> *) force formal relationships into your key design (this is good)
> *) Make your database MUCH easier to follow, browse, and understand
> *) in particular cases allow you to skip joins
> *) will make your indexes fatter (this is not good)
> *) can be a pain if your keys are updated frequently
> *) can be a major pain if your key changes in structure (adds a field,
> or changes in type)
>
> Surrogate keys:
> *) Give you faster joins, but more of them (this is a win/loss
> depending on circumstances)

Kind of... natural keys can remove the need for the join in the first
place, which is certainly faster than joining :P

Joshua D. Drake



--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering

Re: Avoiding surrogate keys

From

Philippe Lang

Date:

22 April 2010, 03:24:33

Hi,

I think nobody mentioned Object-Relational mappers. If you intend to used one (or think you may be using one in the
future),using surrogate keys is more straightforward, if not necessary. 

Best regards,

-------------------------------------------------------------
Attik System              web  : http://www.attiksystem.ch
Philippe Lang             phone: +41 26 422 13 75
rte de la Fonderie 2      gsm  : +41 79 351 49 94
1700 Fribourg             pgp  : http://keyserver.pgp.com


> -----Message d'origine-----
> De : pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] De la part de Thom Brown
> Envoyé : mercredi 21 avril 2010 21:01
> À : PGSQL Mailing List
> Objet : [GENERAL] Avoiding surrogate keys
>
> I think I know what I plan to do, but want to throw this out there to
> see if there are differing points of view.
>
> I have a mailing list table, and 2 of the columns contain values which
> have to be from a list.  These are country and status.  There are 237
> possible countries and 3 possible statuses.  Now I know some people
> would assign a sequence ID (surrogate key) to the country and status
> values, and have them looked up in separate tables to get the textual
> value, but I think I'll still have those tables, just without an ID
> column, so 1 column for both the countries and statuses tables.  This
> means storing the proper value in the main table.
>
> So instead of
>
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 44, 2
> 'mrs jones', 'me@emailcompany.com', 21, 1
>
> I'd have
>
> name, email, country, status
> 'mr smith', 'emailaddress@example.com', 'China', 'Registered'
> 'mrs jones', 'me@emailcompany.com', 'Belgium', 'Unconfirmed'
>
> The values of course would be constrained by foreign key lookup to
> their associated tables.
>
> Are there any serious downsides to this?  If so, what would you
> recommend?
>
> Thanks
>
> Thom

Attachment

PGP.sig

Re: Avoiding surrogate keys

From

Lew

Date:

01 May 2010, 13:52:55

Philippe Lang wrote:
> I think nobody mentioned Object-Relational mappers.
> If you intend to used one (or think you may be using one in the future),
> using surrogate keys is more straightforward, if not necessary.

Neither of those claims is even slightly true.  Using Hibernate, EclipseLink
or OpenJPA (for Java applications), natural keys are sufficient and far more
straightforward than surrogate keys.

--
Lew

Re: Avoiding surrogate keys

From

Merlin Moncure

Date:

01 May 2010, 15:50:52

On Sat, May 1, 2010 at 12:09 PM, Lew <noone@lwsc.ehost-services.com> wrote:
> Philippe Lang wrote:
>>
>> I think nobody mentioned Object-Relational mappers.
>> If you intend to used one (or think you may be using one in the future),
>> using surrogate keys is more straightforward, if not necessary.
>
> Neither of those claims is even slightly true.  Using Hibernate, EclipseLink
> or OpenJPA (for Java applications), natural keys are sufficient and far more
> straightforward than surrogate keys.

right -- to be fair though is quite a bit of (generally bad) software
out there that assumes or at least heavily encourages surrogate keys.

merlin

Re: Avoiding surrogate keys

From

John R Pierce

Date:

01 May 2010, 17:14:51

If your 'natural key' is a large text field, I'd have to assume there's
some point at which a surrogate index would be more efficient.  Would
this be above a few dozen characters, or a few 100 characters?   I
wouldn't want a PK based on a multi-K byte text field for a table that
has many 10s or 100s of 1000s of rows, for sure.

Re: Avoiding surrogate keys

From

Merlin Moncure

Date:

04 May 2010, 10:40:51

On Sat, May 1, 2010 at 4:14 PM, John R Pierce <pierce@hogranch.com> wrote:
>
> If your 'natural key' is a large text field, I'd have to assume there's some
> point at which a surrogate index would be more efficient.  Would this be
> above a few dozen characters, or a few 100 characters?   I wouldn't want a
> PK based on a multi-K byte text field for a table that has many 10s or 100s
> of 1000s of rows, for sure.

Well, yes, but:
*) most natural keys are small, between 4-10 bytes
*) regardless of the data type of the key, the number of btree lookups
is going to remain approximately the same
*) you have to (or at least should) put the index on anyway as unique
constraint. you do pay the price in dependent tables however.  so the
natural _primary_ key is free.  it's the foreign keys where you pay.

the two main performance issues with natural keys are this (you kinda
touched on one):
*) the index is fatter, pressuring cache
It's not so much the comparison function but the fact that the larger
index(es) require more memory. If drives were faster than they were
this wouldn't matter as much -- I expect this to become less of a
factor as SSD technology improves.  This can somewhat modulated by
clustering the index...you get  a better chance of reading multiple
relevant records on a single page.
*) cascading updates
If your key is in a lot of places and has to be updated it can cause a
mess. Lots of locks, dead space, vacuuming, etc. Most of the time
primary keys don't change very much but if they do you had better give
it fair consideration.

Natural keys have a lot of performance advantages as mentioned upthread.

merlin

Re: Avoiding surrogate keys

From

Merlin Moncure

Date:

04 May 2010, 11:16:53

On Tue, May 4, 2010 at 9:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, May 1, 2010 at 4:14 PM, John R Pierce <pierce@hogranch.com> wrote:
>>
>> If your 'natural key' is a large text field, I'd have to assume there's some
>> point at which a surrogate index would be more efficient.  Would this be
>> above a few dozen characters, or a few 100 characters?   I wouldn't want a
>> PK based on a multi-K byte text field for a table that has many 10s or 100s
>> of 1000s of rows, for sure.

one more note about this.  if you truly have a situation where a multi
kilobyte chunk of data is the key, you can always digest it and use
that.  you lose the natural ordering -- but in these type of cases it
usually doesn't matter.

merlin

Re: Avoiding surrogate keys

From

Grzegorz Jaśkiewicz

Date:

04 May 2010, 11:19:51

On Tue, May 4, 2010 at 3:16 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, May 4, 2010 at 9:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Sat, May 1, 2010 at 4:14 PM, John R Pierce <pierce@hogranch.com> wrote:
>>>
>>> If your 'natural key' is a large text field, I'd have to assume there's some
>>> point at which a surrogate index would be more efficient.  Would this be
>>> above a few dozen characters, or a few 100 characters?   I wouldn't want a
>>> PK based on a multi-K byte text field for a table that has many 10s or 100s
>>> of 1000s of rows, for sure.
>
> one more note about this.  if you truly have a situation where a multi
> kilobyte chunk of data is the key, you can always digest it and use
> that.  you lose the natural ordering -- but in these type of cases it
> usually doesn't matter.
>

99% of all cases, where I introduced bigint surrogate key, in tables
that hold account, customers, and that sort of bollocks, were straight
wins, sometimes measuring 3-4x.

Another thing, If your PK changes, it is no longer a PK, you can't rely on it.



--
GJ

Re: Avoiding surrogate keys

From

Richard Broersma

Date:

04 May 2010, 11:38:53

2010/5/4 Grzegorz Jaśkiewicz <gryzman@gmail.com>:
>
> Another thing, If your PK changes, it is no longer a PK, you can't rely on it.

Depending upon what you mean by changes this could be true or it could
be a matter of opinion.  If your referring to a candidate key's value
changes, this key still provides a useful way to identify a tuple.

Also, just because a surrogate key is a useful way to identify a row,
by itself it does nothing to uniquely identify an entity that a row
represents.  On some DBMS forums that I frequent, it is a daily
occurrence to see questions on how to eliminate duplicate rows.  So,
for some people, there are times when even surrogate keys cannot be
relied upon as a PK.
--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug

Re: Avoiding surrogate keys

From

Grzegorz Jaśkiewicz

Date:

04 May 2010, 11:50:05

the rule of thumb for me is:
 - if you have more than one column as PK - and are variable length,
or more than 2 columns, fixed length, no bigger than 8 bytes -  go for
surrogate - always.
 - if PK is variable length, on average longer than 8 bytes, or can
change - go surrogate.
 - Otherwise leave it as it is.

At the end of a day, joining tables, using that have more than 1
variable length key, kills performance. In essence it makes index
search a PITA, and should be avoided.


This should always be however decision based on query performance,
usage, and data types. Never a textbook A or B type of thing.