Re: surrogate key or not? - Mailing list pgsql-sql

From Josh Berkus
Subject Re: surrogate key or not?
Date
Msg-id 200407211025.33615.josh@agliodbs.com
Whole thread Raw
In response to Re: surrogate key or not?  (sad <sad@bankir.ru>)
Responses Re: surrogate key or not?  (Karsten Hilbert <Karsten.Hilbert@gmx.net>)
List pgsql-sql
Sad,

First of all, please excuse me if I've misunderstood you below because of 
translation issues.   You'll find I'm rather strident, but it's because the 
reasons you're presenting, or seem to be, are excuses for bad database design 
I hear every day on the job, and end up having to fix when they go wrong.

Now, to reply to your comments:

> 4) Replication:  to identify an object of ANY type (record of any table
> regardless to datamodel), to store lists of deleted or modified objects
> (regardless to datamodel)

Only if the Replication product requires it.  Note that our new replication 
engine, Slony-I, does *not* require surrogate keys; it can handle 
multi-column primary keys.   I'm not sure about the other replication 
products, they may be able to as well.

I think what you're talking about is a GUID (Global Unique Identifier).  Once 
again, while most GUIDs are based on random number theory, the ideal GUID 
would be a unique combination of the Row Key, the version, and the server 
identification -- making it "real data" instead of a random number that tells 
you nothing about the row.   For performance, this combination might be 
combined in a hash.

In other words, I think your 4th point is actually part of Point (1), 
convenience.

> 5) Making a primary key: if there is no real key at all.

No, no, no, a thousand times NO.   

Data without a real key is NOT DATA.   It is just garbage.    

I manage or consult on more than twenty production databases for my clients.  
Every single one of those databases has a real unique key on every single 
table.   The ONLY exception is "holding tables" for incoming bulk data, some 
of which will be retained as data and some of which will discarded as garbage 
-- and key uniqueness is the primary test of what is and isn't data.

> the sentence (5) is debatable. in theory every relation has a real key, but
> in the practice we have historical datamodels without PK !!! it is
> impossible but it exists.

Only because you have bad historical databases.  This is not an excuse, it is 
a reason to fix them.

> For example:
> Here in Russia we have a lot of different but identically named streets
> within one city. They has absoluetly identical attributes. Historically
> only human not machines work on that datamodel and they all used to call
> such streets by name adding some spechial non-formal explainations, for
> example:
> "deliver this message please to the house 35 on the Green street, that is
> to the west of the center of the city."
> "deliver this message please to the house 12 on the Green street, that is
> shortest of the all Green streets in the town."

This is a PERFECT example of the evil of surrogate keys.  You had a data 
problem, and instead of solving it, you chose to lean on surrogate keys like 
a crutch.   Here's what you have in your database now:

Streets
ID    Street Name
345    Green Street
2019    Green Street
5781    Green Street
Key: ID???

How, exactly, do you expect to distinguish the above 3 "green streets" from 
each other?   How do you expect to your users to know that 345 is West Green 
Street and 5781 is the shortest in town?   How are they supposed to choose 
between the three?   How do you know there are actually 3 and maybe not just 
two or even one?   The ID 5781 isn't exactly informative; in fact, it's 
meaningless.  It's GARBAGE.

Look, real data, on a row-by-row basis, is a POSTULATE.   It is should 
describe adequately the portion of the world you are trying to represent. 
Automated surrogate keys are NOT part of these postulates since they convey 
no meaningful information.   This is what you currently have:

There is a street named GREEN STREET.
There is a street named GREEN STREET.
There is a street named GREEN STREET.

As you can see by rendering it in English, those three statements are damned 
confusing.   Are there three streets named Green Street, or is there only 
one?  Either could be true.  We don't know, the data doesn't tell us.

What you need is:

Streets
ID    Street Name    Location    
345    Green Street    West Side of City
2019    Green Street    In Front of Consulate
5781    Green Street    Shortest in Town
Key:  Street Name, Location

This gives you much more meaningful data:

There is a street named GREEN STREET which is on the WEST SIDE OF THE CITY.
There is a street named GREEN STREET which is IN FRONT OF THE CONSULATE.
There is a street named GREEN STREET which is the SHORTEST IN TOWN.

This tells the user which Green Street he wants to select.    It also gives us 
a fair indication that there are 3 Green Streets, and not one or two.  And if 
there is duplication -- for example, if the street in front of the consulate 
is also the shortest in town -- then we can determine this and correct it in 
a minute or less by looking at a map.

I'm not just picking on you.   I'm saying this because it's happened to me and 
was a costly error for both me and the client.

Several years ago, we were doing a scheduling database, and the client messed 
around with the data model a lot.   In the process, we mistakenly eliminated 
the Events table's (the largest and most important table in the database) 
real unique key.   But because we were "leaning" on a surrogate key, we 
didn't notice until the database was in production.  

Then, after 6 months of scheduling, the client began to suspect that duplicate 
Events were creeping into the database.   But without a real, unique key, we 
discovered that we could not figure out whether or not two events with 
similar information were duplicates or not!   The Event_ID told us nothing.  
This was a real, critical problem because each Event required the attendance 
of an attorney and there were never enough attorneys to go around.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


pgsql-sql by date:

Previous
From: Geoff Richards
Date:
Subject: Re: MySQL-style "create temporary table foo select ..."
Next
From: "Gregory S. Williamson"
Date:
Subject: Re: Sorry too many conecctions