Thread: indexes
I notice a lot of places where people use the approach of creating an index and a unique key like: CREATE TABLE foo ( idx SERIAL PRIMARY KEY, name varchar(32) UNIQUE NOT NULL ) instead of CREATE TABLE foo ( name varchar(32) PRIMARY KEY ) If the name is NEVER going to change, is there any advantage to doing this? If there are many-to-many reference tables (like name-to-friends) is this any different? I've seen this a lot, but I've always assumed that with the condition that 'name' would NEVER change, there was no advantage.
I asked this question here awhile ago. It's a fairly common question, and it's known as the surrogate vs natural key debate. Using a natural key has the advantage of performance. With a surrogate key, most RDBMS systems will have to maintain two indexes. Natural keys can also make your database more readable, and can eliminate the need to do joins for foreign keys in some cases. Surrogate keys are useful because you can very easily change your data structure with a bit less SQL magick. A lot of Object Relational Mappers always create surrogate keys, too, although I suspect that's mostly a way to make the ORM more portable by guaranteeing that a primary key will exist. The only other time surrogate keys are very useful is when all your candidate keys have values that change fairly often, since the primary key ought to be as stable as possible. Some developers also feel more comfortable with an id field. Having that metadata feels like a safety net for some reason. -- Brandon Aiken CS/IT Systems Engineer -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Allison Sent: Friday, November 24, 2006 9:54 AM To: pgsql-general@postgresql.org Subject: [GENERAL] indexes I notice a lot of places where people use the approach of creating an index and a unique key like: CREATE TABLE foo ( idx SERIAL PRIMARY KEY, name varchar(32) UNIQUE NOT NULL ) instead of CREATE TABLE foo ( name varchar(32) PRIMARY KEY ) If the name is NEVER going to change, is there any advantage to doing this? If there are many-to-many reference tables (like name-to-friends) is this any different? I've seen this a lot, but I've always assumed that with the condition that 'name' would NEVER change, there was no advantage. ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
On Fri, 2006-11-24 at 09:54 -0500, Tom Allison wrote: > I notice a lot of places where people use the approach of creating an index and > a unique key like: > > CREATE TABLE foo ( > idx SERIAL PRIMARY KEY, > name varchar(32) UNIQUE NOT NULL > ) > > instead of > CREATE TABLE foo ( > name varchar(32) PRIMARY KEY > ) > > If the name is NEVER going to change, is there any advantage to doing this? > If there are many-to-many reference tables (like name-to-friends) is this any > different? THe point of the first table is to have a artificial key that allows easy access to the row. It is easier to say: select * from foo where id = 5; > > I've seen this a lot, but I've always assumed that with the condition that > 'name' would NEVER change, there was no advantage. Technically, it also violates normal form as your primary key should be on data that is representative. Although this: CREATE TABLE users ( id SERIAL PRIMARY KEY, name varchar(32) UNIQUE NOT NULL ) Would make more sense because id is representative of the users.id which is representative from an application stand point. Sincerely, Joshua D. Drake > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
It depends how it's going to be used. If you are going to reference this table in other tables a lot and/or rarely care about what the name actually is, then the two-column approach is going to be more efficient. Numbers are smaller and easier to compare than strings. On Nov 24, 2006, at 6:54 AM, Tom Allison wrote: > I notice a lot of places where people use the approach of creating > an index and a unique key like: > > CREATE TABLE foo ( > idx SERIAL PRIMARY KEY, > name varchar(32) UNIQUE NOT NULL > ) > > instead of > CREATE TABLE foo ( > name varchar(32) PRIMARY KEY > ) > > If the name is NEVER going to change, is there any advantage to > doing this? > If there are many-to-many reference tables (like name-to-friends) > is this any different? > > I've seen this a lot, but I've always assumed that with the > condition that 'name' would NEVER change, there was no advantage. > > > > > ---------------------------(end of > broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 But that requires that you haul an artificial construct around. On 11/24/06 12:38, Ben wrote: > It depends how it's going to be used. If you are going to reference this > table in other tables a lot and/or rarely care about what the name > actually is, then the two-column approach is going to be more efficient. > Numbers are smaller and easier to compare than strings. > > On Nov 24, 2006, at 6:54 AM, Tom Allison wrote: > >> I notice a lot of places where people use the approach of creating an >> index and a unique key like: >> >> CREATE TABLE foo ( >> idx SERIAL PRIMARY KEY, >> name varchar(32) UNIQUE NOT NULL >> ) >> >> instead of >> CREATE TABLE foo ( >> name varchar(32) PRIMARY KEY >> ) >> >> If the name is NEVER going to change, is there any advantage to doing >> this? >> If there are many-to-many reference tables (like name-to-friends) is >> this any different? >> >> I've seen this a lot, but I've always assumed that with the condition >> that 'name' would NEVER change, there was no advantage. - -- Ron Johnson, Jr. Jefferson LA USA Is "common sense" really valid? For example, it is "common sense" to white-power racists that whites are superior to blacks, and that those with brown skins are mud people. However, that "common sense" is obviously wrong. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQFFZ0SIS9HxQb37XmcRAppYAJ9i5PpJ021FyQYQSgTo9Alv8CDNHgCg1Q4p nMmJ64MHVNfE91EZIsJNwts= =piIg -----END PGP SIGNATURE-----
Yes, it does. So of course it depends on how you use it to know what's going to be more efficient. For instance, if the rows in this table contain strings of more than a few bytes, and more than a couple tables reference this table with a foreign key, then you will quickly start to save space by using a numeric primary key, even if it is an artificial construct. For the kind of work I find myself doing, it's rare that it would be more efficient to not have the artificial construct. But that doesn't mean one is always better than the other. On Nov 24, 2006, at 11:14 AM, Ron Johnson wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > But that requires that you haul an artificial construct around. > > On 11/24/06 12:38, Ben wrote: >> It depends how it's going to be used. If you are going to >> reference this >> table in other tables a lot and/or rarely care about what the name >> actually is, then the two-column approach is going to be more >> efficient. >> Numbers are smaller and easier to compare than strings. >> >> On Nov 24, 2006, at 6:54 AM, Tom Allison wrote: >> >>> I notice a lot of places where people use the approach of >>> creating an >>> index and a unique key like: >>> >>> CREATE TABLE foo ( >>> idx SERIAL PRIMARY KEY, >>> name varchar(32) UNIQUE NOT NULL >>> ) >>> >>> instead of >>> CREATE TABLE foo ( >>> name varchar(32) PRIMARY KEY >>> ) >>> >>> If the name is NEVER going to change, is there any advantage to >>> doing >>> this? >>> If there are many-to-many reference tables (like name-to-friends) is >>> this any different? >>> >>> I've seen this a lot, but I've always assumed that with the >>> condition >>> that 'name' would NEVER change, there was no advantage. > > > - -- > Ron Johnson, Jr. > Jefferson LA USA > > Is "common sense" really valid? > For example, it is "common sense" to white-power racists that > whites are superior to blacks, and that those with brown skins > are mud people. > However, that "common sense" is obviously wrong. > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.5 (GNU/Linux) > > iD8DBQFFZ0SIS9HxQb37XmcRAppYAJ9i5PpJ021FyQYQSgTo9Alv8CDNHgCg1Q4p > nMmJ64MHVNfE91EZIsJNwts= > =piIg > -----END PGP SIGNATURE----- > > ---------------------------(end of > broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that > your > message can get through to the mailing list cleanly
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 True. That doesn't mean I like it... Real-life example: We design and run customer service centers for many toll roads in the US Northeast. So, we have a set of tables like this: T_AGENCY ( AGENCY_ID INTEGER PRIMARY KEY, AGENCY_CODE CHAR(2), AGENCY_NAME CHAR(40), etc, etc ); T_PLAZA ( PLAZA_ID INTEGER PRIMARY KEY, EXTERNAL_PLAZA_ID CHAR(5), PLAZA_NAME CHAR(40), AGENCY_ID INTEGER FOREIGN KEY REFERENCES T_AGENCY.AGENCY_ID, etc, etc ); T_LANE ( LANE_ID INTEGER PRIMARY KEY, EXTERNAL_LANE_ID CHAR(5), LANE_NAME CHAR(40), PLAZA_ID INTEGER FOREIGN KEY REFERENCES T_PLAZA.PLAZA_ID, etc, etc); So, in our various transaction and summary tables, we have the LANE_ID column instead of natural key AGENCY_CODE, EXTERNAL_PLAZA_ID, EXTERNAL_LANE_ID. Since we have multiple projects, each with their own "database", and each database is federated (because they are so large), each actual database must keep a copy of T_AGENCY, T_PLAZA & T_LANE. The agencies pass transaction data around, since they all use the same brand of toll tag, and motorists can use their tag at any other agency's lanes and have it "work", so if the synthetic keys of these tables were to ever get out of sync, Bad Things would happen. On 11/24/06 18:42, Ben wrote: > Yes, it does. So of course it depends on how you use it to know what's > going to be more efficient. For instance, if the rows in this table > contain strings of more than a few bytes, and more than a couple tables > reference this table with a foreign key, then you will quickly start to > save space by using a numeric primary key, even if it is an artificial > construct. > > For the kind of work I find myself doing, it's rare that it would be > more efficient to not have the artificial construct. But that doesn't > mean one is always better than the other. > > On Nov 24, 2006, at 11:14 AM, Ron Johnson wrote: > > But that requires that you haul an artificial construct around. > > On 11/24/06 12:38, Ben wrote: >>>> It depends how it's going to be used. If you are going to reference this >>>> table in other tables a lot and/or rarely care about what the name >>>> actually is, then the two-column approach is going to be more efficient. >>>> Numbers are smaller and easier to compare than strings. >>>> >>>> On Nov 24, 2006, at 6:54 AM, Tom Allison wrote: >>>> >>>>> I notice a lot of places where people use the approach of creating an >>>>> index and a unique key like: >>>>> >>>>> CREATE TABLE foo ( >>>>> idx SERIAL PRIMARY KEY, >>>>> name varchar(32) UNIQUE NOT NULL >>>>> ) >>>>> >>>>> instead of >>>>> CREATE TABLE foo ( >>>>> name varchar(32) PRIMARY KEY >>>>> ) >>>>> >>>>> If the name is NEVER going to change, is there any advantage to doing >>>>> this? >>>>> If there are many-to-many reference tables (like name-to-friends) is >>>>> this any different? >>>>> >>>>> I've seen this a lot, but I've always assumed that with the condition >>>>> that 'name' would NEVER change, there was no advantage. - -- Ron Johnson, Jr. Jefferson LA USA Is "common sense" really valid? For example, it is "common sense" to white-power racists that whites are superior to blacks, and that those with brown skins are mud people. However, that "common sense" is obviously wrong. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD4DBQFFZ69TS9HxQb37XmcRAgAAAJUT1HnH/ocUKLxbow6GAg2Gz1wpAJwPQuJV ZodGoV0BF2NgKn2YTrEoSQ== =Y+x1 -----END PGP SIGNATURE-----
Ben wrote: > Yes, it does. So of course it depends on how you use it to know what's > going to be more efficient. For instance, if the rows in this table > contain strings of more than a few bytes, and more than a couple tables > reference this table with a foreign key, then you will quickly start to > save space by using a numeric primary key, even if it is an artificial > construct. > > For the kind of work I find myself doing, it's rare that it would be > more efficient to not have the artificial construct. But that doesn't > mean one is always better than the other. > So let me see if I understand this correctly. If the real-world primary key is large (say up to 100 characters in length) then the disadvantage is that you are duplicating this referenced key in several other tables, each element taking up 100 characters. Space is wasted when compared to int4 ID's. But not really sure if this is a performance problem for SELECT except for the space required (varchar(128) vs. int4). Having two keys, a primary_key of int4 and a unique key of varchar(128) would be very ugly on INSERT/DELETE/UPDATE queries because of the index overhead. One table may have rows on the order of 100's. Another table will be 1,000,000. The many-to-many join would be.. 100,000,000's And maybe there I would have need for smaller physical index variable types...
Tom,
What is worth a try is to check an md5 or sha hash of those 100chars and use that as a key.
with a proper hashing algorythm (proper= fitting to your data) collisions should not happen; and you have a shorter key to connect the tables.
best wishes
harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
Python: the only language with more web frameworks than keywords.
If the real-world primary key is large (say up to 100 characters in length) then
the disadvantage is that you are duplicating this referenced key in several
other tables, each element taking up 100 characters. Space is wasted when
compared to int4 ID's. But not really sure if this is a performance problem for
SELECT except for the space required (varchar(128) vs. int4).
What is worth a try is to check an md5 or sha hash of those 100chars and use that as a key.
with a proper hashing algorythm (proper= fitting to your data) collisions should not happen; and you have a shorter key to connect the tables.
best wishes
harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
Python: the only language with more web frameworks than keywords.
On Sat, Nov 25, 2006 at 12:14:11PM -0500, Tom Allison wrote: > If the real-world primary key is large (say up to 100 characters in length) > then the disadvantage is that you are duplicating this referenced key in > several other tables, each element taking up 100 characters. Space is > wasted when compared to int4 ID's. But not really sure if this is a > performance problem for SELECT except for the space required (varchar(128) > vs. int4). Well, it kinda sucks for joining because comparing strings may take tens to hundreds of times as long as compairng integers. It's not just byte-wise comparison but you have to be locale sensetive about it. My main problem with using any kind of string as "natural key" is that once you start passing it around you have worry about the encoding of said string and when it goes a round-trip to a client, will what you get back still be the same? If you can guarentee ASCII you might be ok, but otherwise... Integers have none of these issues. > Having two keys, a primary_key of int4 and a unique key of varchar(128) > would be very ugly on INSERT/DELETE/UPDATE queries because of the index > overhead. In general my tables are queried several orders of magnitude more often than they are updated, so index update cost isn't all that relevent. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
Martijn van Oosterhout wrote: > Well, it kinda sucks for joining because comparing strings may take > tens to hundreds of times as long as compairng integers. It's not just > byte-wise comparison but you have to be locale sensetive about it. > > My main problem with using any kind of string as "natural key" is that > once you start passing it around you have worry about the encoding of > said string and when it goes a round-trip to a client, will what you > get back still be the same? If you can guarentee ASCII you might be ok, > but otherwise... Integers have none of these issues. That sounds like there would be a purpose for a locale-less type of string, as ugly as that is. That's not a feature request, just something to digest ;) >> Having two keys, a primary_key of int4 and a unique key of varchar(128) >> would be very ugly on INSERT/DELETE/UPDATE queries because of the index >> overhead. > > In general my tables are queried several orders of magnitude more often > than they are updated, so index update cost isn't all that relevent. Indeed. Basically if expensive calculations can be pre-calculated I tend to move them into triggers that store the pre-calculated result. Usually there's only a handfull of people modifying data, while there are many people requesting it. Regards, -- Alban Hertroys alban@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //