Thread: Data type to use for primary key
Good day, I'm asking myself if there is a performance issue in using an integer of varchar(24) PRIMARY KEY in a product table. I've read that there is no speed issue in the query, but that the only performance issue is the database size of copying the key in other tables that require it. My product_id is copied in orders, jobs, and other specific tables. What is the common approach? Should I use directly the product_code as my ID, or use a sequantial number for speed? (I did the same for the company_id, this is a 'serial' and not the shor name of the customer. I just don't know what is usually done. Right now I did the following: CREATE TABLE design.products ( product_id serial PRIMARY KEY, company_id integer NOT NULL REFERENCES sales.companies ON UPDATE CASCADE, product_code varchar(24) NOT NULL, ... CONSTRAINT product_code_already_used_for_this_company UNIQUE (company_id, product_code) ); CREATE TABLE sales.companies ( company_id integer PRIMARY KEY, company_name varchar(48) NOT NULL UNIQUE, ... ); The company_id is also copied in many tables like product, contacts, etc. Thank you very much for any good pointers on this 'already seen' issue. -- Alexandre Leclerc
> What is the common approach? Should I use directly the product_code as > my ID, or use a sequantial number for speed? (I did the same for the > company_id, this is a 'serial' and not the shor name of the customer. > I just don't know what is usually done. Use a serial : - you can change product_code for a product easily - you can pass around integers easier around, in web forms for instance, you don't have to ask 'should I escape this string ?' - it's faster - it uses less space - if one day you must manage products from another source whose product_code overlap yours, you won't have problems - you can generate them with a serial uniquely and easily
Mr Caillaud, Merci! Many points you bring were also my toughts. I was asking myself really this was the way to go. I'm happy to see that my view of the problem was good. Encore merci! (Thanks again!) On Tue, 23 Nov 2004 00:06:13 +0100, Pierre-Frédéric Caillaud <lists@boutiquenumerique.com> wrote: > > > What is the common approach? Should I use directly the product_code as > > my ID, or use a sequantial number for speed? (I did the same for the > > company_id, this is a 'serial' and not the shor name of the customer. > > I just don't know what is usually done. > > Use a serial : > - you can change product_code for a product easily > - you can pass around integers easier around, in web forms for instance, > you don't have to ask 'should I escape this string ?' > - it's faster > - it uses less space > - if one day you must manage products from another source whose > product_code overlap yours, you won't have problems > - you can generate them with a serial uniquely and easily > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match > -- Alexandre Leclerc
Alexandre, > What is the common approach? Should I use directly the product_code as > my ID, or use a sequantial number for speed? (I did the same for the > company_id, this is a 'serial' and not the shor name of the customer. > I just don't know what is usually done. Don't use SERIAL just because it's there. Ideally, you *want* to use the product_code if you can. It's your natural key and a natural key is always superior to a surrogate key all other things being equal. Unfortunately, all other things are NOT equal. Here's the reasons why you'd use a surrogate key (i.e. SERIAL): 1) because the product code is a large text string (i.e. > 10bytes) and you will have many millions of records, so having it as an FK in other tables will add significantly to the footprint of the database; 2) because product codes get blanket changes frequently, where thousands of them pet re-mapped to new codes, and the ON CASCADE UPDATE slow performance will kill your database; 3) Because every other table in the database has a SERIAL key and consistency reduces errors; 4) or because your interface programmers get annoyed with using different types of keys for different tables and multicolumn keys. If none of the above is true (and I've had it not be, in some tables and some databases) then you want to stick with your "natural key", the product_code. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Mon, 2004-11-22 at 16:54 -0800, Josh Berkus wrote: > Alexandre, > > > What is the common approach? Should I use directly the product_code as > > my ID, or use a sequantial number for speed? (I did the same for the > > company_id, this is a 'serial' and not the shor name of the customer. > > I just don't know what is usually done. > > Don't use SERIAL just because it's there. Ideally, you *want* to use the > product_code if you can. It's your natural key and a natural key is always > superior to a surrogate key all other things being equal. It would be nice if PostgreSQL had some form of transparent surrogate keying in the background which would automatically run around and replace your real data with SERIAL integers. It could use a lookup table for conversions between the surrogate and real values so the user never knows that it's done, a bit like ENUM. Then we could all use the real values with no performance issues for 1) because it's an integer in the background, and 2) because a cascade only touches a single tuple in the lookup table. --
Rod, > It would be nice if PostgreSQL had some form of transparent surrogate > keying in the background which would automatically run around and > replace your real data with SERIAL integers. It could use a lookup table > for conversions between the surrogate and real values so the user never > knows that it's done, a bit like ENUM. Then we could all use the real > values with no performance issues for 1) because it's an integer in the > background, and 2) because a cascade only touches a single tuple in the > lookup table. Sybase does this, and it's a feature I would dearly love to emulate. You can just refer to another table, without specifying the column, as an FK and it will create an invisible hashed key. This is the type of functionality Codd was advocating -- invisible, implementation-automated surrogate keys -- in the mid 90's (don't have a paper citation at the moment). So you'd just do: create table client_contacts ( fname text not null, lname text not null, client foriegn key clients, position text, notes text ); and the "client" column would create an invisible hashed key that would drag in the relevant row from the clients table; thus a: select * from client_contacts would actually show the whole record from clients as well. -- Josh Berkus Aglio Database Solutions San Francisco
>> It would be nice if PostgreSQL had some form of transparent surrogate >> keying in the background which would automatically run around and >> replace your real data with SERIAL integers. It could use a lookup table There is still table inheritance, but it's not really the same.
All, Well, you should still escape any strings you're getting from a web page so you can ensure you're not subject to a SQL insert attack, even if you're expecting integers. Thanks, Peter Darley -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Pierre-Frédéric Caillaud Sent: Monday, November 22, 2004 3:06 PM To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Data type to use for primary key > What is the common approach? Should I use directly the product_code as > my ID, or use a sequantial number for speed? (I did the same for the > company_id, this is a 'serial' and not the shor name of the customer. > I just don't know what is usually done. Use a serial : - you can change product_code for a product easily - you can pass around integers easier around, in web forms for instance, you don't have to ask 'should I escape this string ?' - it's faster - it uses less space - if one day you must manage products from another source whose product_code overlap yours, you won't have problems - you can generate them with a serial uniquely and easily ---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
On Mon, 22 Nov 2004 16:54:56 -0800, Josh Berkus <josh@agliodbs.com> wrote: > Alexandre, > > > What is the common approach? Should I use directly the product_code as > > my ID, or use a sequantial number for speed? (I did the same for the > > company_id, this is a 'serial' and not the shor name of the customer. > > I just don't know what is usually done. > > Don't use SERIAL just because it's there. Ideally, you *want* to use the > product_code if you can. It's your natural key and a natural key is always > superior to a surrogate key all other things being equal. > > Unfortunately, all other things are NOT equal. Here's the reasons why you'd > use a surrogate key (i.e. SERIAL): > > 1) because the product code is a large text string (i.e. > 10bytes) and you > will have many millions of records, so having it as an FK in other tables > will add significantly to the footprint of the database; Thanks for those tips. I'll print and keep them. So in my case, the product_code being varchar(24) is: 4 bytes + string size (so possibly up to 24) = possible 28 bytes. I did the good thing using a serial. For my shorter keys (4 bytes + up to 6 char) I will use the natural key. This is interesting, because this is what I did right now. The "transparent surrogate keying" proposal that is discussed bellow in the thread is a very good idea. It would be nice to see that. It would be easier for the DB admin and the coder; the moment this is not slowing the system. : ) Best regards. -- Alexandre Leclerc
> All, > Well, you should still escape any strings you're getting from a web > page so > you can ensure you're not subject to a SQL insert attack, even if you're > expecting integers. > Thanks, > Peter Darley Well, your framework should do this for you : "integer" specified in your database object class description "%d" appears in in your generated queries (or you put it in your hand written queries) => if the parameter is not an integer, an exception is thrown, then catched, then an error page is displayed... Or, just casting to int should throw an exception... Forms should be validated, but hidden parameters in links are OK imho to display an error page if they are incorrect, after all, if the user edits the get or post parameters, well...
Alexandre Leclerc <alexandre.leclerc@gmail.com> writes: > Thanks for those tips. I'll print and keep them. So in my case, the > product_code being varchar(24) is: > 4 bytes + string size (so possibly up to 24) = possible 28 bytes. I > did the good thing using a serial. For my shorter keys (4 bytes + up > to 6 char) I will use the natural key. Realize that space usage is really only part of the issue. If you ever have two records with the same natural key or a record whose natural key has changed you'll be in for a world of hurt if you use the natural key as the primary key in your database. Basically I never use natural keys except when they're arbitrarily chosen values defined by the application itself. Situations where I've used varchars instead of integer keys are things like: . Individual privileges grantable in a security system. (things like "VIEWUSER" "EDITUSER" privileges) . Reference tables for one letter codes used to indicate the type of object represented by the record. Actually I see one interesting exception to my policy in my current database schema. And I don't think I would do this one differently given the choice either. The primary key of the postal code table is the postal code. (postal codes are up here in the great white north like zip codes down there.) This could hurt if they ever reuse an old previously retired postal code, which isn't an entirely impossible case. As far as I know it hasn't happened yet though. And it's just so much more convenient having the postal code handy instead of having to join against another table to look it up. -- greg
Greg Stark <gsstark@mit.edu> writes: > This could hurt if they ever reuse an old previously retired postal code, > which isn't an entirely impossible case. As far as I know it hasn't happened > yet though. One would suppose that the guys who are in charge of this point at Canada Post consider the postal code to be their primary key, and are no more eager to reuse one than you are to see it reused. Basically this comes down to "I'm going to use some externally supplied primary key as my primary key. Do I trust the upstream DBA to know what a primary key is?" regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > This could hurt if they ever reuse an old previously retired postal code, > > which isn't an entirely impossible case. As far as I know it hasn't happened > > yet though. > > One would suppose that the guys who are in charge of this point at > Canada Post consider the postal code to be their primary key, and > are no more eager to reuse one than you are to see it reused. Well, they may eventually be forced to. For the same sort of hierarchic issue that causes the "shortage" of IPv4 address space even though there's far less than 4 billion hosts online. But as far as I can see today the only postal codes that are being retired are rural areas that are being developed and have blocks of codes assigned instead of having a single broad code. > Basically this comes down to "I'm going to use some externally supplied > primary key as my primary key. Do I trust the upstream DBA to know what > a primary key is?" Well there's another issue here I think. Often people see something that looks unique and is clearly intended to be a primary key and think "aha, nice primary key" but they miss a subtle distinction between what the external primary key represents and what their data model is tracking. The typical example is social security numbers. SSNs are a perfectly reasonable primary key -- as long as you're tracking Social Security accounts, not people. Most people in the US have exactly one SS account, so people often think it looks like a primary key for people. In fact not everyone has a Social Security account (aliens who have never worked in the US, or for that matter people who have never been in the US) and others have had multiple Social Security accounts (victims of identity theft). Another example that comes to mind is the local telephone company. When I changed my phone number they created a new account without telling me, because their billing system's primary key for accounts is... the phone number. So all my automated bill payments started disappearing into the black hole of the old account and my new account went negative. I wonder what they do for customers who buy services from them but don't have a telephone line. -- greg
On 24 Nov 2004 01:52:52 -0500, Greg Stark <gsstark@mit.edu> wrote: > Alexandre Leclerc <alexandre.leclerc@gmail.com> writes: > > > Thanks for those tips. I'll print and keep them. So in my case, the > > product_code being varchar(24) is: > > 4 bytes + string size (so possibly up to 24) = possible 28 bytes. I > > did the good thing using a serial. For my shorter keys (4 bytes + up > > to 6 char) I will use the natural key. > > Realize that space usage is really only part of the issue. Thank you for this additionnal information. This will help out in the futur. In my situation this is a good thing to have integer key where I decided to have them. Event if I was obliged to add UNIQUE constraints to some other columns. I think they call this "candidate key" and it's still 3NF (whatever; but only if my db is correctly organised)... I try to be logical and efficient for good performance. But in the end, the time (the db will get bigger) and good EXPLAIN ANALYSE commands will help fine tuning later! This will give me more experience at that point. > Actually I see one interesting exception to my policy in my current database > schema. And I don't think I would do this one differently given the choice > either. The primary key of the postal code table is the postal code. (postal > codes are up here in the great white north like zip codes down there.) (I do understand this one, living in the province of Quebec. ;) And the great white north is still not arrived; end november! - Still, not very exceptionnal.) Regards. -- Alexandre Leclerc