Re: surrogate key or not? - Mailing list pgsql-sql
From | Josh Berkus |
---|---|
Subject | Re: surrogate key or not? |
Date | |
Msg-id | 200407211025.33615.josh@agliodbs.com Whole thread Raw |
In response to | Re: surrogate key or not? (sad <sad@bankir.ru>) |
Responses |
Re: surrogate key or not?
|
List | pgsql-sql |
Sad, First of all, please excuse me if I've misunderstood you below because of translation issues. You'll find I'm rather strident, but it's because the reasons you're presenting, or seem to be, are excuses for bad database design I hear every day on the job, and end up having to fix when they go wrong. Now, to reply to your comments: > 4) Replication: to identify an object of ANY type (record of any table > regardless to datamodel), to store lists of deleted or modified objects > (regardless to datamodel) Only if the Replication product requires it. Note that our new replication engine, Slony-I, does *not* require surrogate keys; it can handle multi-column primary keys. I'm not sure about the other replication products, they may be able to as well. I think what you're talking about is a GUID (Global Unique Identifier). Once again, while most GUIDs are based on random number theory, the ideal GUID would be a unique combination of the Row Key, the version, and the server identification -- making it "real data" instead of a random number that tells you nothing about the row. For performance, this combination might be combined in a hash. In other words, I think your 4th point is actually part of Point (1), convenience. > 5) Making a primary key: if there is no real key at all. No, no, no, a thousand times NO. Data without a real key is NOT DATA. It is just garbage. I manage or consult on more than twenty production databases for my clients. Every single one of those databases has a real unique key on every single table. The ONLY exception is "holding tables" for incoming bulk data, some of which will be retained as data and some of which will discarded as garbage -- and key uniqueness is the primary test of what is and isn't data. > the sentence (5) is debatable. in theory every relation has a real key, but > in the practice we have historical datamodels without PK !!! it is > impossible but it exists. Only because you have bad historical databases. This is not an excuse, it is a reason to fix them. > For example: > Here in Russia we have a lot of different but identically named streets > within one city. They has absoluetly identical attributes. Historically > only human not machines work on that datamodel and they all used to call > such streets by name adding some spechial non-formal explainations, for > example: > "deliver this message please to the house 35 on the Green street, that is > to the west of the center of the city." > "deliver this message please to the house 12 on the Green street, that is > shortest of the all Green streets in the town." This is a PERFECT example of the evil of surrogate keys. You had a data problem, and instead of solving it, you chose to lean on surrogate keys like a crutch. Here's what you have in your database now: Streets ID Street Name 345 Green Street 2019 Green Street 5781 Green Street Key: ID??? How, exactly, do you expect to distinguish the above 3 "green streets" from each other? How do you expect to your users to know that 345 is West Green Street and 5781 is the shortest in town? How are they supposed to choose between the three? How do you know there are actually 3 and maybe not just two or even one? The ID 5781 isn't exactly informative; in fact, it's meaningless. It's GARBAGE. Look, real data, on a row-by-row basis, is a POSTULATE. It is should describe adequately the portion of the world you are trying to represent. Automated surrogate keys are NOT part of these postulates since they convey no meaningful information. This is what you currently have: There is a street named GREEN STREET. There is a street named GREEN STREET. There is a street named GREEN STREET. As you can see by rendering it in English, those three statements are damned confusing. Are there three streets named Green Street, or is there only one? Either could be true. We don't know, the data doesn't tell us. What you need is: Streets ID Street Name Location 345 Green Street West Side of City 2019 Green Street In Front of Consulate 5781 Green Street Shortest in Town Key: Street Name, Location This gives you much more meaningful data: There is a street named GREEN STREET which is on the WEST SIDE OF THE CITY. There is a street named GREEN STREET which is IN FRONT OF THE CONSULATE. There is a street named GREEN STREET which is the SHORTEST IN TOWN. This tells the user which Green Street he wants to select. It also gives us a fair indication that there are 3 Green Streets, and not one or two. And if there is duplication -- for example, if the street in front of the consulate is also the shortest in town -- then we can determine this and correct it in a minute or less by looking at a map. I'm not just picking on you. I'm saying this because it's happened to me and was a costly error for both me and the client. Several years ago, we were doing a scheduling database, and the client messed around with the data model a lot. In the process, we mistakenly eliminated the Events table's (the largest and most important table in the database) real unique key. But because we were "leaning" on a surrogate key, we didn't notice until the database was in production. Then, after 6 months of scheduling, the client began to suspect that duplicate Events were creeping into the database. But without a real, unique key, we discovered that we could not figure out whether or not two events with similar information were duplicates or not! The Event_ID told us nothing. This was a real, critical problem because each Event required the attendance of an attorney and there were never enough attorneys to go around. -- Josh Berkus Aglio Database Solutions San Francisco