Thread: Alternative to serial primary key

Alternative to serial primary key

From
"David Clarke"
Date:
I posted a couple of weeks back a question regarding the use of a 100
char column as a primary key and the responses uniformily advised the
use of a serial column. My concern is that the key is effectively
abstract and I want to use the column as a foreign key in other
tables. It occurred to me that if I used a hash function on insert to
generate another column and used that column as the primary key then I
have a value that meets a lot of the requirements for a good key,
including that I can regenerate the exact value from my data,
something that is impossible with a serial id. I also don't have to
index the 100 char column in order to search on the table, I just need
to calculate the hash value and check that against the calculated
column. It does violate the rule that a table shouldn't contain a
column that is calculated from another column in the table but I think
it would still be more effective than a serial id.

Is this a reasonable/normal thing to do? I know postgres contains an
md5() hash function, is this likely to be fast enough to make this an
effective choice? Are there other options? Am I just a noob barking up
the wrong tree? It is getting kind of late and my brain is starting to
hurt.


Re: Alternative to serial primary key

From
Andrew Sullivan
Date:
On Thu, Jul 06, 2006 at 10:16:42PM +1200, David Clarke wrote:
> column that is calculated from another column in the table but I think
> it would still be more effective than a serial id.

There is the problem that the hash is not proved unique (in fact,
someone has generated collisions on md5).  Primary keys have to be
unique, of course.

I _think_ with a hundred columns, you could probably prove (using
brute force, if need be) that the hashes are going to be unique,
assuming the list of possible values in each column is bounded.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
A certain description of men are for getting out of debt, yet are
against all taxes for raising money to pay it off.    --Alexander Hamilton


Re: Alternative to serial primary key

From
Markus Schaber
Date:
Hi, David,

David Clarke wrote:
> It occurred to me that if I used a hash function on insert to
> generate another column and used that column as the primary key then I
> have a value that meets a lot of the requirements for a good key,
> including that I can regenerate the exact value from my data,
> something that is impossible with a serial id.

This is a good idea if you want to have taller indices, but you still
need to re-check the "real" key due to hash collisions.

It's some kind of lossy index, pretty the same as the GIST index type
PostGIS uses.

If you've plenty of time to spend, you could also bring the hash index
type back to life, which is currently deprecated according to the
PostgreSQL docs...

HTH,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


Re: Alternative to serial primary key

From
"David Clarke"
Date:
On 7/6/06, Markus Schaber <schabi@logix-tt.com> wrote:
> This is a good idea if you want to have taller indices, but you still
> need to re-check the "real" key due to hash collisions.

I am aware there are collisions with md5 but without any actual proof
I believe the risk to be very low with the data I'm storing which is a
kind of scrubbed free form residential address.

>
> If you've plenty of time to spend, you could also bring the hash index
> type back to life, which is currently deprecated according to the
> PostgreSQL docs...

Unfortunately with the meagre time I have available, the only dent I'm
likely to make is in the wall with my head.


Re: Alternative to serial primary key

From
Markus Schaber
Date:
Hi, David,

David Clarke wrote:
> On 7/6/06, Markus Schaber <schabi@logix-tt.com> wrote:
>> This is a good idea if you want to have taller indices, but you still
>> need to re-check the "real" key due to hash collisions.
> 
> I am aware there are collisions with md5 but without any actual proof
> I believe the risk to be very low with the data I'm storing which is a
> kind of scrubbed free form residential address.

Then you'll have to use the re-check approach, like:

SELECT * FROM foo WHERE hashed_key = hash(mykey) && plain_key = mykey;

And then have an non-unique index on hashed_key, and probably no index n
my_key (to give the planner a hint which index to use).

HTH,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


Re: Alternative to serial primary key

From
Andrew Sullivan
Date:
On Thu, Jul 06, 2006 at 01:10:55PM +0200, Markus Schaber wrote:
> Then you'll have to use the re-check approach, like:

That sort of undermines the value of the calculated primary key,
though, doesn't it?  He'd need the unique index for FK references,
which was the point, I thought.

A


-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The whole tendency of modern prose is away from concreteness.    --George Orwell


Re: Alternative to serial primary key

From
Markus Schaber
Date:
Hi, Andrew,

Andrew Sullivan wrote:
> On Thu, Jul 06, 2006 at 01:10:55PM +0200, Markus Schaber wrote:
>> Then you'll have to use the re-check approach, like:
> 
> That sort of undermines the value of the calculated primary key,
> though, doesn't it?  He'd need the unique index for FK references,
> which was the point, I thought.

Yes, risking collisions. It will work for some time, and then create a
maintainance nightmare for his successors. :-)


Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


Re: Alternative to serial primary key

From
"David Clarke"
Date:
On 7/6/06, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> That sort of undermines the value of the calculated primary key,
> though, doesn't it?  He'd need the unique index for FK references,
> which was the point, I thought.
>

Yes, that occurred to me as well. Frankly I believe the md5 collision
generation is more of a practical issue for crypto where for my
purposes the potential for two residential street addresses to
generate the same md5 hash value is effectively zero. And the md5
function is a builtin which I would hope is faster than anything I
could write in pgsql. Could be wrong, I have been before.


Re: Alternative to serial primary key

From
Andrew Sullivan
Date:
On Thu, Jul 06, 2006 at 01:28:58PM +0200, Markus Schaber wrote:
> 
> Yes, risking collisions. It will work for some time, and then create a
> maintainance nightmare for his successors. :-)

Well, that plus you can't actually point a foreign key contraint at
anything that isn't a unique contraint.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying
November.    --H.W. Fowler


Re: Alternative to serial primary key

From
Andrew Sullivan
Date:
On Thu, Jul 06, 2006 at 11:32:36PM +1200, David Clarke wrote:
> 
> Yes, that occurred to me as well. Frankly I believe the md5 collision
> generation is more of a practical issue for crypto where for my
> purposes the potential for two residential street addresses to
> generate the same md5 hash value is effectively zero. And the md5
> function is a builtin which I would hope is faster than anything I
> could write in pgsql. Could be wrong, I have been before.

You could, of course, just put a unique index on it and accept that,
in the event of collision, you'll have to cope with the error.  It's
probably an acceptable cheat, as you're right that the collision risk
is pretty small.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
If they don't do anything, we don't need their acronym.    --Josh Hamilton, on the US FEMA


Re: Alternative to serial primary key

From
Chris Browne
Date:
ajs@crankycanuck.ca (Andrew Sullivan) writes:
> On Thu, Jul 06, 2006 at 10:16:42PM +1200, David Clarke wrote:
>> column that is calculated from another column in the table but I think
>> it would still be more effective than a serial id.
>
> There is the problem that the hash is not proved unique (in fact,
> someone has generated collisions on md5).  Primary keys have to be
> unique, of course.
>
> I _think_ with a hundred columns, you could probably prove (using
> brute force, if need be) that the hashes are going to be unique,
> assuming the list of possible values in each column is bounded.

Seems to me that would only happen if those hundred columns consisted
of values that could be fully enumerated before designing the hash.

In effect, if you know that each column selects from a predetermined
set of values that will never change, then you may assign a number to
each value, and then generate a function which amounts to...
 hash := 0 for column from 1 to 100 do    multiple := degree[column]    hash := (hash + enum[column,value[column]]) *
multipledone
 

where degree[column] is the number of possible values for the column,
value[column] is the value found in the column, and enum[column,VALUE]
selects the integer associated with that column.

Unfortunately, this approach to hashing breaks down if there is ever a
reason to add to the list of values that can be stored in a column.
For instance, country codes per ISO 3166-1 are no good as one of these
enumerated columns because the set of countries in the world changes
every so often.

You could allow for that change by, say, allowing for an extra 100
entries that would likely allow the country codes to change for 50
years.

But that's the "good news" part.

If some of the 100 fields contain peoples' names, that's not
particularly usefully enumerable :-(.  New names come up all the time,
even if only because immigration officials transliterate names into
English in inconsistent ways...  Back in the days when I was an
accountant, I did tax returns for three Lebanese brothers who came to
Canada and who all have slightly different surnames in English even
though they were the same in (I presume) Arabic.

It would be nice to have a suitable hash, but I'm not sure it's
attainable...
-- 
output = ("cbbrowne" "@" "ntlug.org")
http://cbbrowne.com/info/spreadsheets.html
Rules of the Evil Overlord #114. "I will never accept a challenge from
the hero." <http://www.eviloverlord.com/>


Re: Alternative to serial primary key

From
Scott Marlowe
Date:
On Thu, 2006-07-06 at 05:16, David Clarke wrote:
> I posted a couple of weeks back a question regarding the use of a 100
> char column as a primary key and the responses uniformily advised the
> use of a serial column. My concern is that the key is effectively
> abstract and I want to use the column as a foreign key in other
> tables. It occurred to me that if I used a hash function on insert to
> generate another column and used that column as the primary key then I
> have a value that meets a lot of the requirements for a good key,
> including that I can regenerate the exact value from my data,
> something that is impossible with a serial id. I also don't have to
> index the 100 char column in order to search on the table, I just need
> to calculate the hash value and check that against the calculated
> column. It does violate the rule that a table shouldn't contain a
> column that is calculated from another column in the table but I think
> it would still be more effective than a serial id.
> 
> Is this a reasonable/normal thing to do? I know postgres contains an
> md5() hash function, is this likely to be fast enough to make this an
> effective choice? Are there other options? Am I just a noob barking up
> the wrong tree? It is getting kind of late and my brain is starting to
> hurt.

I've read the responses, I'm starting a new answer thread because I
think that this is "premature optimization".

Please note that there seemed to be a misunderstanding in a few
responses that this gentleman had 100 columns to key.  According to this
post it is one column, with 100 characters in it.

My guess is that any test you come up with will find no gain in md5ing a
100 char column.  1,000 or 10,000 maybe.  But not 100.

And since it was intimated this is an address, I would assume it's a
varchar(100) not a char(100) since there's no reason for it to be
padded, so it will not always even be 100 characters long.

I would imagine initing your database for the C (Ascii) locale would be
a much bigger performance gain than any md5 hashing would.


Re: Alternative to serial primary key

From
"Aaron Bono"
Date:
On 7/6/06, David Clarke <pigwin32@gmail.com> wrote:
I posted a couple of weeks back a question regarding the use of a 100
char column as a primary key and the responses uniformily advised the
use of a serial column. My concern is that the key is effectively
abstract and I want to use the column as a foreign key in other
tables.

I have a simple question... why do you want to use the column as a foreign key in other tables?  If you use the serial column then all you need is a simple join to get the 100 char column out in your query.  If you need to make things simpler, just create a view that does the join for you.

Either there is some requirement here that I am not aware of or it sounds like you may be trying to use a sledge hammer on a nail.

-Aaron

Re: Alternative to serial primary key

From
Date:
> On 7/6/06, David Clarke <pigwin32@gmail.com> wrote:
> >
> > I posted a couple of weeks back a question
> regarding the use of a 100
> > char column as a primary key and the responses
> uniformily advised the
> > use of a serial column. My concern is that the key
> is effectively
> > abstract and I want to use the column as a foreign
> key in other
> > tables.
> 
> 
> I have a simple question... why do you want to use
> the column as a foreign
> key in other tables?  If you use the serial column
> then all you need is a
> simple join to get the 100 char column out in your
> query.  If you need to
> make things simpler, just create a view that does
> the join for you.
> 
> Either there is some requirement here that I am not
> aware of or it sounds
> like you may be trying to use a sledge hammer on a
> nail.
> 
> -Aaron

i agree.  all my primary keys are abstract - even
though some don't have to be.  iow, i'm comfortable
using serials as my primary key even when i don't
absolutely need to.

in any case, the primary key is typically used as a
unique identifer *and that's it*.  the uniqueness
makes it ideal for another table to use it to link
related data.

it sounds like you want a unique identifier *plus
something else*.  i'd argue that the "plus something
else" belongs in a separate column.

imho, your database life will be much easier and more
consistent.

good luck.

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 


Re: Alternative to serial primary key

From
"David Clarke"
Date:
On 7/7/06, operationsengineer1@yahoo.com <operationsengineer1@yahoo.com> wrote:
> i agree.  all my primary keys are abstract - even
> though some don't have to be.  iow, i'm comfortable
> using serials as my primary key even when i don't
> absolutely need to.

Yes I had in fact already created my table using a serial as the
primary key but I've been reading Celko's SQL Programming Style and
the use of a hash on the address column as the primary key (and for
use in FK's) meets a number of the requirements for a good key. The
address column itself is the natural primary key but it doesn't make
for a good FK. Plus I feel I would be remiss in not exploring an
alternative to the serial key.

To recap, yes there is only a single column, yes it is varchar. I need
to do a lookup on the address column which is unique and use it as a
foreign key in other tables. Using a serial id would obviously work
and has been recommended. But having a hash function over the address
column as the primary key means I can always regenerate my primary key
from the data which is impossible with a serial key. I believe the
risk of collision using md5 is effectively zero on this data and I can
put a unique index over it.

I'm kind of new to sql so apologies if this is a naive approach.
Thanks to all for responses.

Dave


Re: Alternative to serial primary key

From
"D'Arcy J.M. Cain"
Date:
On Fri, 7 Jul 2006 08:30:57 +1200
"David Clarke" <pigwin32@gmail.com> wrote:
> Yes I had in fact already created my table using a serial as the
> primary key but I've been reading Celko's SQL Programming Style and
> the use of a hash on the address column as the primary key (and for
> use in FK's) meets a number of the requirements for a good key. The
> address column itself is the natural primary key but it doesn't make

Are you sure?  I have a hard time imagining a situation where that
would be true.  The only thing I can think of is some sort of
municipality database tracking properties regardless of who currently
owns/resides there in a situation where the address can never be
changed, e.g. "521 Main" splitting into "521A Main" and "521B Main."
Is that the situation here?

Also, you need to get into a lot more coding to handle the fact that
"521 Main Avenue" is the same address as "521 Main Av." and "521 Main
Ave" and even "521 Main."

And even given all of that, I would probably still use serial.
> and has been recommended. But having a hash function over the address
> column as the primary key means I can always regenerate my primary key

Danger, Will Robinson.  The phrase "regenerate my primary key"
immediately raises the hairs on the back of my neck.  If the primary
key can ever change, you have a broken schema.

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.


Re: Alternative to serial primary key

From
"Aaron Bono"
Date:
On 7/6/06, David Clarke <pigwin32@gmail.com> wrote:

To recap, yes there is only a single column, yes it is varchar. I need
to do a lookup on the address column which is unique and use it as a
foreign key in other tables. Using a serial id would obviously work
and has been recommended. But having a hash function over the address
column as the primary key means I can always regenerate my primary key
from the data which is impossible with a serial key. I believe the
risk of collision using md5 is effectively zero on this data and I can
put a unique index over it.

So if you have:

addresses
    address_id  bigserial (pk),
    address

person
    person_id bigserial (pk),
    first_name,
    last_name,
    address_id

you can do something like

INSERT INTO person (
    address_id
)
SELECT
   'Joe',
   'Blow',
   address_id
FROM addresses
WHERE addresses.address = ?;

No regeneration of PK necessary.  If you index addresses.address the insert should run quickly, right?

-Aaron Bono

Re: Alternative to serial primary key

From
"Sander Steffann"
Date:
Hi,

> But having a hash function over the address
> column as the primary key means I can always regenerate my primary key

Warning: don't attach a meaning to a primary key, as it might change

- Sander




Re: Alternative to serial primary key

From
Scott Marlowe
Date:
On Thu, 2006-07-06 at 16:43, Aaron Bono wrote:
> On 7/6/06, David Clarke <pigwin32@gmail.com> wrote:
>         To recap, yes there is only a single column, yes it is
>         varchar. I need
>         to do a lookup on the address column which is unique and use
>         it as a
>         foreign key in other tables. Using a serial id would obviously
>         work
>         and has been recommended. But having a hash function over the
>         address
>         column as the primary key means I can always regenerate my
>         primary key
>         from the data which is impossible with a serial key. I believe
>         the
>         risk of collision using md5 is effectively zero on this data
>         and I can 
>         put a unique index over it.

I'll repeat my previous statement that this is premature optimization,
and the hash is kind the wrong direction.

If you store an int and the 1 to 100 characters in a varchar, you'll
have about 4 to 8 bytes for the int (6 I think, but it's been a while)
plus 1 to 200 or possibly more for the characters in the address.

If you use C local with ASCII encoding, you can get single byte.

If you switch to an md5 hash, you'll need ~50 bytes (average address
about 1/2 max length, just a guess) plus 32 bytes, plus the extra bytes
to keep track of the length of the fields.

The table now becomes wider itself, and the md5 is generally about as
big as the address, or fairly close to it.

And you've got the possibility of md5 collisions to deal with.

I'd say just FK off of the address field.  It's a natural key, fairly
small (100 bytes ain't really that big) and your primary key never needs
any kind of regenerating or anything, because it's already there.

Just set it up with cascading updates and deletes in case you need to
edit it in the future.

The first rule of optimization:  Don't


Re: Alternative to serial primary key

From
Scott Marlowe
Date:
On Thu, 2006-07-06 at 16:45, Sander Steffann wrote:
> Hi,
> 
> > But having a hash function over the address
> > column as the primary key means I can always regenerate my primary key
> 
> Warning: don't attach a meaning to a primary key, as it might change

And as long as it has cascading updates and deletes it doesn't matter if
it changes.  As long as it doesn't change into something that collides.

This is a religious topic, and there's advantages to both ways.  But if
the field is and always must be unique and non-null, then there's no
reason to not use it as a primary key.


Re: Alternative to serial primary key

From
Date:
> Plus I feel I would be remiss in not
> exploring an
> alternative to the serial key.

why?  it is a tried and true method.

> I can always
> regenerate my primary key
> from the data which is impossible with a serial key.

why on earth would you need to "regenerate" the
primary key?  it is used to link related data in
different tables, not as some sort of meaningful piece
of data all by itself. 

even if you wanted to "regenerate" the primary key,
using a serial is still easier.  under the hash
scheme, you have know the *exact* address... 
including abbreviation, periods, etc...

if you use a serial, you could query the table using
SQL's LIKE clause and get all the serial results that
match the string you entered.  if you enter the
*exact* address, you'll get a single result - unless
that address is entered twice.  you can avoid that
possibility by making address a unique column.

when you link tables together, you can visually one
big table with all the linked data...  and that's what
you can get when you join the tables.

if you know one column's information (or part of it
using LIKE), you can return any desired result in the
same row (LIKE might return more than one result).

> I believe the
> risk of collision using md5 is effectively zero on
> this data and I can
> put a unique index over it.

why risk it?  what if you are wrong?

> I'm kind of new to sql so apologies if this is a
> naive approach.
> Thanks to all for responses.

no worries.  i'm learning a ton as time goes on,
myself.  i do think you are over thinking this issue a
bit, though.

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 


Re: Alternative to serial primary key

From
"Aaron Bono"
Date:
On 7/7/06, operationsengineer1@yahoo.com <operationsengineer1@yahoo.com> wrote:
> Plus I feel I would be remiss in not
> exploring an
> alternative to the serial key.

why?  it is a tried and true method.

Actually, there are two reasons I think it good for someone new to SQL to explore this option.  First of all, it is very educational.  Secondly, it is the person who stands up and says, "I know everyone does it this way, but what if we did it that way?" who has a chance of discovering something new.

For this particular topic however, using the serial is very easy and using a hash is not only error prone but much more difficult.

It is good to see different philosophies about foreign keys though!

-Aaron Bono

Re: Alternative to serial primary key

From
"David Clarke"
Date:
On 7/7/06, D'Arcy J.M. Cain <darcy@druid.net> wrote:
> Are you sure?  I have a hard time imagining a situation where that
Absolutely.

> Also, you need to get into a lot more coding to handle the fact that
> "521 Main Avenue" is the same address as "521 Main Av." and "521 Main
> Ave" and even "521 Main."
Actually that is being done for me and you're correct, it is a lot of
effort but there are a variety of services out there and I'm not
trying to reinvent the wheel.

> And even given all of that, I would probably still use serial.
Because?

> Danger, Will Robinson.  The phrase "regenerate my primary key"
> immediately raises the hairs on the back of my neck.  If the primary
> key can ever change, you have a broken schema.

Perhaps my choice of words was somewhat hasty. A serial is totally
divorced from the data it represents whereas a md5 hash is (for my
purposes) unique, stable, verifiable, and simple.

Dave


Re: Alternative to serial primary key

From
"David Clarke"
Date:
On 7/7/06, Scott Marlowe <smarlowe@g2switchworks.com> wrote:
> On Thu, 2006-07-06 at 16:43, Aaron Bono wrote:
> I'll repeat my previous statement that this is premature optimization,
> and the hash is kind the wrong direction.
>
> If you store an int and the 1 to 100 characters in a varchar, you'll
> have about 4 to 8 bytes for the int (6 I think, but it's been a while)
> plus 1 to 200 or possibly more for the characters in the address.
>
> If you use C local with ASCII encoding, you can get single byte.
>
> If you switch to an md5 hash, you'll need ~50 bytes (average address
> about 1/2 max length, just a guess) plus 32 bytes, plus the extra bytes
> to keep track of the length of the fields.
>
> The table now becomes wider itself, and the md5 is generally about as
> big as the address, or fairly close to it.
>
> And you've got the possibility of md5 collisions to deal with.
>
> I'd say just FK off of the address field.  It's a natural key, fairly
> small (100 bytes ain't really that big) and your primary key never needs
> any kind of regenerating or anything, because it's already there.
>
> Just set it up with cascading updates and deletes in case you need to
> edit it in the future.
>
> The first rule of optimization:  Don't
>
Yep, this was pretty much where I started from and I totally agree
with you regarding premature optimisation. I would point out that md5
hash is 128 bits or 16 bytes and not 32 so the difference between the
hash value and the source data is somewhat larger than you've
suggested. My original post a few weeks back was really about the use
of a natural key of varchar(100) as a foreign key in other tables. The
response was to not do it and instead use a serial which is basically
how I was progressing. Celko's comments re the use of autonumbering
schemes have obviously been gnawing away at me. That is why I asked
the question about the hashing approach which I saw as a way to
fulfill a lot of the criteria he has suggested for a primary key and
still have a reasonable value to use as a foreign key. Please ignore
the regeneration comment, it was written in haste and not because I
really anticipate any need to regenerate my primary key at any stage.

The question remains regarding the use of a string value as a primary
key for the table and as a foreign key in other tables. If I use the
address column as a foreign key in a differrent table will postgres
physically duplicate the data or will it simply attach the appropriate
index magically behind the scenes? I agree that the address column is
fairly small and I've heard of others using the likes of GUIDs as key
values which are not a lot smaller than I would expect my average
address to be.

Thanks
Dave

Dave


Re: Alternative to serial primary key

From
"Aaron Bono"
Date:
On 7/7/06, David Clarke <pigwin32@gmail.com> wrote:

The question remains regarding the use of a string value as a primary
key for the table and as a foreign key in other tables. If I use the
address column as a foreign key in a differrent table will postgres
physically duplicate the data or will it simply attach the appropriate
index magically behind the scenes? I agree that the address column is
fairly small and I've heard of others using the likes of GUIDs as key
values which are not a lot smaller than I would expect my average
address to be.

Theoretically using the address as a foreign key WILL duplicate the data.  I don't know if there is some fancy way PostgreSQL optimizes the use of space for indexed fields or foreign keys though.  Might need to get feedback from someone who has looked at the source code for that and it may depend on the version you are using.  Unless you are looking at a huge number of records though, the size difference is of little concern - disk space is cheap.

As far as should you use varchar as a foreign key - as someone mentioned before, you may be treading on religious territory there.  It can be done and is done frequently.  I never do it but I really cannot come up with a compelling argument why not other than it is just one of the standards I have adopted for my team.

-Aaron

Re: Alternative to serial primary key

From
Andrew Sullivan
Date:
On Thu, Jul 06, 2006 at 09:41:52AM -0500, Scott Marlowe wrote:

> Please note that there seemed to be a misunderstanding in a few
> responses that this gentleman had 100 columns to key.  

Oh, yes, that wa certainly my understanding.  I totally agree that
this is premature optimisation then.

A


-- 
Andrew Sullivan  | ajs@crankycanuck.ca
Everything that happens in the world happens at some place.    --Jane Jacobs 


Re: Alternative to serial primary key

From
Andrew Sullivan
Date:
On Thu, Jul 06, 2006 at 05:18:16PM -0400, D'Arcy J.M. Cain wrote:
> owns/resides there in a situation where the address can never be
> changed, e.g. "521 Main" splitting into "521A Main" and "521B Main."

And anyone who has looked at an even moderately old city map will
tell you that even this "impossible case" is by no means impossible. 

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.     --Dennis Ritchie


Re: Alternative to serial primary key

From
"D'Arcy J.M. Cain"
Date:
On Fri, 7 Jul 2006 19:37:15 +1200
"David Clarke" <pigwin32@gmail.com> wrote:
> > And even given all of that, I would probably still use serial.
> Because?

Simplicity.  Cleanliness.

> > Danger, Will Robinson.  The phrase "regenerate my primary key"
> > immediately raises the hairs on the back of my neck.  If the primary
> > key can ever change, you have a broken schema.
> 
> Perhaps my choice of words was somewhat hasty. A serial is totally
> divorced from the data it represents whereas a md5 hash is (for my
> purposes) unique, stable, verifiable, and simple.

It's not that I think that the primary key should never have meaning in
the database (I use the two letter country code as the PK in my country
table for example) I just think that it's dangerous ground and should
be tread very carefully.

However, I join others in applauding you for your efforts to investigate
this so deeply.  You may wind up coming out of this with something
interesting, even if it isn't what you went in looking for.

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.


Re: Alternative to serial primary key

From
Achilleus Mantzios
Date:
O D'Arcy J.M. Cain έγραψε στις Jul 7, 2006 :

> On Fri, 7 Jul 2006 19:37:15 +1200
> "David Clarke" <pigwin32@gmail.com> wrote:
> > > And even given all of that, I would probably still use serial.
> > Because?
> 
> Simplicity.  Cleanliness.
> 
> > > Danger, Will Robinson.  The phrase "regenerate my primary key"
> > > immediately raises the hairs on the back of my neck.  If the primary
> > > key can ever change, you have a broken schema.
> > 
> > Perhaps my choice of words was somewhat hasty. A serial is totally
> > divorced from the data it represents whereas a md5 hash is (for my
> > purposes) unique, stable, verifiable, and simple.
> 
> It's not that I think that the primary key should never have meaning in
> the database (I use the two letter country code as the PK in my country
> table for example) I just think that it's dangerous ground and should
> be tread very carefully.

I think some similar  "artificial vs natural keys" discussion was active
some months ago in the -sql list.

I say, 99% of the cases use serials for Primary Keys.
They do that they are designed for, and they are robust,
and ultra reliable.
If someone whishes, then he/she can enforce some
Unique Keys as well, on the natural columns.

So as a rule of thumb, i do system work (PKs,FKs) with artificial columns,
and human work (UKs) (read: prone to change!!!) with natural columns.

In my experience, using natural columns for (PKs,FKs) has always resulted
in a lost weekend trying to convert to artificial keys.

> 
> However, I join others in applauding you for your efforts to investigate
> this so deeply.  You may wind up coming out of this with something
> interesting, even if it isn't what you went in looking for.
> 
> 

-- 
-Achilleus



Re: Alternative to serial primary key

From
Richard Huxton
Date:
Andrew Sullivan wrote:
> On Thu, Jul 06, 2006 at 05:18:16PM -0400, D'Arcy J.M. Cain wrote:
>> owns/resides there in a situation where the address can never be
>> changed, e.g. "521 Main" splitting into "521A Main" and "521B Main."
> 
> And anyone who has looked at an even moderately old city map will
> tell you that even this "impossible case" is by no means impossible. 

I seem to remember that there used to be a street in the UK that not 
only wasn't numbered in any apparent order, but had duplicate 
house-numbers too. All in a length of a few hundred yards too, so all in 
the same postal code.

--   Richard Huxton  Archonet Ltd


Re: Alternative to serial primary key

From
"Sander Steffann"
Date:
Hi D'Arcy,

> It's not that I think that the primary key should never have 
> meaning in the database (I use the two letter country code as 
> the PK in my country table for example) I just think that 
> it's dangerous ground and should be tread very carefully.

You are right. I now realize that I have done exactly the same thing in my
database designs :)  In 95+% of the cases I would choose a serial, but this
is a good counter-example.

Thanks for making me think :)
Sander



Re: Alternative to serial primary key

From
Chris Browne
Date:
darcy@druid.net ("D'Arcy J.M. Cain") writes:
> And even given all of that, I would probably still use serial.
>> and has been recommended. But having a hash function over the address
>> column as the primary key means I can always regenerate my primary key
>
> Danger, Will Robinson.  The phrase "regenerate my primary key"
> immediately raises the hairs on the back of my neck.  If the primary
> key can ever change, you have a broken schema.

Actually, I'd call it worse than that...

A schema is "merely" a technical detail.

The trouble is that if the primary key can ever change, then you have
a broken model for the data.

You can do "technical things" to fix problems with a schema; if the
data model is busted, then your schema is necessarily broken, in a way
that cannot be fixed.
-- 
(format nil "~S@~S" "cbbrowne" "ntlug.org")
http://cbbrowne.com/info/unix.html
MICROS~1 is to quality software what MacDonalds is to gourmet cooking


Re: Alternative to serial primary key

From
Scott Marlowe
Date:
On Fri, 2006-07-07 at 03:07, David Clarke wrote:

> Yep, this was pretty much where I started from and I totally agree
> with you regarding premature optimisation. I would point out that md5
> hash is 128 bits or 16 bytes and not 32 

Unless you're going to store them as a binary field, the standard ascii
rep of an md5 is 32 characters (not I didn't say bytes in my original
post, cause bytes and characters aren't always the same thing.)

If you were using multi-byte encoding, and handed it an md5sum and
stored it in UTF-8, it would take 64 bytes to store.  My point being
that paying close attention to locale and encoding can likely save you
more space and give you better performance than using md5 hashes can
here.  

>  My original post a few weeks back was really about the use
> of a natural key of varchar(100) as a foreign key in other tables. The
> response was to not do it and instead use a serial which is basically
> how I was progressing. Celko's comments re the use of autonumbering
> schemes have obviously been gnawing away at me. 

Yeah, like I said earlier (at least I think I did.  :) this is a
religious issue.  I tend towards using the natural keys with a serial
field to pull things out of the fire should they head south.

> The question remains regarding the use of a string value as a primary
> key for the table and as a foreign key in other tables. If I use the
> address column as a foreign key in a differrent table will postgres
> physically duplicate the data or will it simply attach the appropriate
> index magically behind the scenes?

PostgreSQL, and most other dbs as well, will duplicate the data.  but
keep in mind that for the most part, differences of <100 bytes aren't a
big performance issue.  Now, if you're trying to set records running TPC
tests, then yes, you'll want to optimize the heck out of your schema.

But mostly, I've found that there's plenty of low hanging fruit in the
application side before the database becomes the choke point in this
kind of thing.  

Good luck on your implementation.  Let us all know how it goes.


Re: Alternative to serial primary key

From
"David Clarke"
Date:
On 7/7/06, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> On Thu, Jul 06, 2006 at 05:18:16PM -0400, D'Arcy J.M. Cain wrote:
> > owns/resides there in a situation where the address can never be
> > changed, e.g. "521 Main" splitting into "521A Main" and "521B Main."
>
> And anyone who has looked at an even moderately old city map will
> tell you that even this "impossible case" is by no means impossible.
>

Actually for  my purposes it doesn't matter, I'm using a freeform
address string to index a lat/long. I fully expect to have multiple
unique addresses with a duplicate lat/long, i.e. the location doesn't
change - only the name of it. I'm personally hoping not to be around
long enough for geological events to change the physical location but
living in the shaky isles anything is possible.

Thanks again to all who have responded. I guess I'm probably labouring
the point but it's been instructive at least. Being fairly new to
database design I wanted to have enough info to be confident with my
approach.

Dave