Thread: indexes

indexes

From

Tom Allison

Date:

24 November 2006, 11:14:50

I notice a lot of places where people use the approach of creating an index and
a unique key like:

CREATE TABLE foo (
   idx SERIAL PRIMARY KEY,
   name varchar(32) UNIQUE NOT NULL
)

instead of
CREATE TABLE foo (
   name varchar(32) PRIMARY KEY
)

If the name is NEVER going to change, is there any advantage to doing this?
If there are many-to-many reference tables (like name-to-friends) is this any
different?

I've seen this a lot, but I've always assumed that with the condition that
'name' would NEVER change, there was no advantage.

Re: indexes

From

"Brandon Aiken"

Date:

24 November 2006, 12:14:31

I asked this question here awhile ago.  It's a fairly common question,
and it's known as the surrogate vs natural key debate.

Using a natural key has the advantage of performance.  With a surrogate
key, most RDBMS systems will have to maintain two indexes.  Natural keys
can also make your database more readable, and can eliminate the need to
do joins for foreign keys in some cases.

Surrogate keys are useful because you can very easily change your data
structure with a bit less SQL magick.  A lot of Object Relational
Mappers always create surrogate keys, too, although I suspect that's
mostly a way to make the ORM more portable by guaranteeing that a
primary key will exist.

The only other time surrogate keys are very useful is when all your
candidate keys have values that change fairly often, since the primary
key ought to be as stable as possible.

Some developers also feel more comfortable with an id field.  Having
that metadata feels like a safety net for some reason.

--
Brandon Aiken
CS/IT Systems Engineer

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Allison
Sent: Friday, November 24, 2006 9:54 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] indexes

I notice a lot of places where people use the approach of creating an
index and
a unique key like:

CREATE TABLE foo (
   idx SERIAL PRIMARY KEY,
   name varchar(32) UNIQUE NOT NULL
)

instead of
CREATE TABLE foo (
   name varchar(32) PRIMARY KEY
)

If the name is NEVER going to change, is there any advantage to doing
this?
If there are many-to-many reference tables (like name-to-friends) is
this any
different?

I've seen this a lot, but I've always assumed that with the condition
that
'name' would NEVER change, there was no advantage.




---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Re: indexes

From

"Joshua D. Drake"

Date:

24 November 2006, 12:21:14

On Fri, 2006-11-24 at 09:54 -0500, Tom Allison wrote:
> I notice a lot of places where people use the approach of creating an index and
> a unique key like:
>
> CREATE TABLE foo (
>    idx SERIAL PRIMARY KEY,
>    name varchar(32) UNIQUE NOT NULL
> )
>
> instead of
> CREATE TABLE foo (
>    name varchar(32) PRIMARY KEY
> )
>
> If the name is NEVER going to change, is there any advantage to doing this?
> If there are many-to-many reference tables (like name-to-friends) is this any
> different?

THe point of the first table is to have a artificial key that allows
easy access to the row.

It is easier to say: select * from foo where id = 5;

>
> I've seen this a lot, but I've always assumed that with the condition that
> 'name' would NEVER change, there was no advantage.

Technically, it also violates normal form as your primary key should be
on data that is representative. Although this:

CREATE TABLE users ( id SERIAL PRIMARY KEY,

                     name varchar(32) UNIQUE NOT NULL
                   )

Would make more sense because id is representative of the users.id which
is representative from an application stand point.

Sincerely,

Joshua D. Drake




>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

Re: indexes

From

Ben

Date:

24 November 2006, 14:38:19

It depends how it's going to be used. If you are going to reference
this table in other tables a lot and/or rarely care about what the
name actually is, then the two-column approach is going to be more
efficient. Numbers are smaller and easier to compare than strings.

On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:

> I notice a lot of places where people use the approach of creating
> an index and a unique key like:
>
> CREATE TABLE foo (
>   idx SERIAL PRIMARY KEY,
>   name varchar(32) UNIQUE NOT NULL
> )
>
> instead of
> CREATE TABLE foo (
>   name varchar(32) PRIMARY KEY
> )
>
> If the name is NEVER going to change, is there any advantage to
> doing this?
> If there are many-to-many reference tables (like name-to-friends)
> is this any different?
>
> I've seen this a lot, but I've always assumed that with the
> condition that 'name' would NEVER change, there was no advantage.
>
>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

Re: indexes

From

Ron Johnson

Date:

24 November 2006, 17:14:53

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

But that requires that you haul an artificial construct around.

On 11/24/06 12:38, Ben wrote:
> It depends how it's going to be used. If you are going to reference this
> table in other tables a lot and/or rarely care about what the name
> actually is, then the two-column approach is going to be more efficient.
> Numbers are smaller and easier to compare than strings.
>
> On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:
>
>> I notice a lot of places where people use the approach of creating an
>> index and a unique key like:
>>
>> CREATE TABLE foo (
>>   idx SERIAL PRIMARY KEY,
>>   name varchar(32) UNIQUE NOT NULL
>> )
>>
>> instead of
>> CREATE TABLE foo (
>>   name varchar(32) PRIMARY KEY
>> )
>>
>> If the name is NEVER going to change, is there any advantage to doing
>> this?
>> If there are many-to-many reference tables (like name-to-friends) is
>> this any different?
>>
>> I've seen this a lot, but I've always assumed that with the condition
>> that 'name' would NEVER change, there was no advantage.


- --
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFZ0SIS9HxQb37XmcRAppYAJ9i5PpJ021FyQYQSgTo9Alv8CDNHgCg1Q4p
nMmJ64MHVNfE91EZIsJNwts=
=piIg
-----END PGP SIGNATURE-----

Re: indexes

From

Ben

Date:

24 November 2006, 20:42:38

Yes, it does. So of course it depends on how you use it to know
what's going to be more efficient. For instance, if the rows in this
table contain strings of more than a few bytes, and more than a
couple tables reference this table with a foreign key, then you will
quickly start to save space by using a numeric primary key, even if
it is an artificial construct.

For the kind of work I find myself doing, it's rare that it would be
more efficient to not have the artificial construct. But that doesn't
mean one is always better than the other.

On Nov 24, 2006, at 11:14 AM, Ron Johnson wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> But that requires that you haul an artificial construct around.
>
> On 11/24/06 12:38, Ben wrote:
>> It depends how it's going to be used. If you are going to
>> reference this
>> table in other tables a lot and/or rarely care about what the name
>> actually is, then the two-column approach is going to be more
>> efficient.
>> Numbers are smaller and easier to compare than strings.
>>
>> On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:
>>
>>> I notice a lot of places where people use the approach of
>>> creating an
>>> index and a unique key like:
>>>
>>> CREATE TABLE foo (
>>>   idx SERIAL PRIMARY KEY,
>>>   name varchar(32) UNIQUE NOT NULL
>>> )
>>>
>>> instead of
>>> CREATE TABLE foo (
>>>   name varchar(32) PRIMARY KEY
>>> )
>>>
>>> If the name is NEVER going to change, is there any advantage to
>>> doing
>>> this?
>>> If there are many-to-many reference tables (like name-to-friends) is
>>> this any different?
>>>
>>> I've seen this a lot, but I've always assumed that with the
>>> condition
>>> that 'name' would NEVER change, there was no advantage.
>
>
> - --
> Ron Johnson, Jr.
> Jefferson LA  USA
>
> Is "common sense" really valid?
> For example, it is "common sense" to white-power racists that
> whites are superior to blacks, and that those with brown skins
> are mud people.
> However, that "common sense" is obviously wrong.
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.5 (GNU/Linux)
>
> iD8DBQFFZ0SIS9HxQb37XmcRAppYAJ9i5PpJ021FyQYQSgTo9Alv8CDNHgCg1Q4p
> nMmJ64MHVNfE91EZIsJNwts=
> =piIg
> -----END PGP SIGNATURE-----
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that
> your
>        message can get through to the mailing list cleanly

Re: indexes

From

Ron Johnson

Date:

24 November 2006, 22:50:10

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

True.  That doesn't mean I like it...

Real-life example: We design and run customer service centers for
many toll roads in the US Northeast.

So, we have a set of tables like this:

T_AGENCY (
AGENCY_ID    INTEGER  PRIMARY KEY,
AGENCY_CODE  CHAR(2),
AGENCY_NAME  CHAR(40),
etc,
etc );

T_PLAZA (
PLAZA_ID          INTEGER PRIMARY KEY,
EXTERNAL_PLAZA_ID CHAR(5),
PLAZA_NAME        CHAR(40),
AGENCY_ID         INTEGER FOREIGN KEY REFERENCES T_AGENCY.AGENCY_ID,
etc,
etc );

T_LANE (
LANE_ID            INTEGER PRIMARY KEY,
EXTERNAL_LANE_ID   CHAR(5),
LANE_NAME          CHAR(40),
PLAZA_ID           INTEGER FOREIGN KEY REFERENCES T_PLAZA.PLAZA_ID,
etc,
etc);

So, in our various transaction and summary tables, we have the
LANE_ID column instead of natural key AGENCY_CODE,
EXTERNAL_PLAZA_ID, EXTERNAL_LANE_ID.

Since we have multiple projects, each with their own "database", and
each database is federated (because they are so large), each actual
database must keep a copy of T_AGENCY, T_PLAZA & T_LANE.  The
agencies pass transaction data around, since they all use the same
brand of toll tag, and motorists can use their tag at any other
agency's lanes and have it "work", so if the synthetic keys of these
tables were to ever get out of sync, Bad Things would happen.


On 11/24/06 18:42, Ben wrote:
> Yes, it does. So of course it depends on how you use it to know what's
> going to be more efficient. For instance, if the rows in this table
> contain strings of more than a few bytes, and more than a couple tables
> reference this table with a foreign key, then you will quickly start to
> save space by using a numeric primary key, even if it is an artificial
> construct.
>
> For the kind of work I find myself doing, it's rare that it would be
> more efficient to not have the artificial construct. But that doesn't
> mean one is always better than the other.
>
> On Nov 24, 2006, at 11:14 AM, Ron Johnson wrote:
>
> But that requires that you haul an artificial construct around.
>
> On 11/24/06 12:38, Ben wrote:
>>>> It depends how it's going to be used. If you are going to reference this
>>>> table in other tables a lot and/or rarely care about what the name
>>>> actually is, then the two-column approach is going to be more efficient.
>>>> Numbers are smaller and easier to compare than strings.
>>>>
>>>> On Nov 24, 2006, at 6:54 AM, Tom Allison wrote:
>>>>
>>>>> I notice a lot of places where people use the approach of creating an
>>>>> index and a unique key like:
>>>>>
>>>>> CREATE TABLE foo (
>>>>>   idx SERIAL PRIMARY KEY,
>>>>>   name varchar(32) UNIQUE NOT NULL
>>>>> )
>>>>>
>>>>> instead of
>>>>> CREATE TABLE foo (
>>>>>   name varchar(32) PRIMARY KEY
>>>>> )
>>>>>
>>>>> If the name is NEVER going to change, is there any advantage to doing
>>>>> this?
>>>>> If there are many-to-many reference tables (like name-to-friends) is
>>>>> this any different?
>>>>>
>>>>> I've seen this a lot, but I've always assumed that with the condition
>>>>> that 'name' would NEVER change, there was no advantage.

- --
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD4DBQFFZ69TS9HxQb37XmcRAgAAAJUT1HnH/ocUKLxbow6GAg2Gz1wpAJwPQuJV
ZodGoV0BF2NgKn2YTrEoSQ==
=Y+x1
-----END PGP SIGNATURE-----

Re: indexes

From

Tom Allison

Date:

25 November 2006, 13:14:37

Ben wrote:
> Yes, it does. So of course it depends on how you use it to know what's
> going to be more efficient. For instance, if the rows in this table
> contain strings of more than a few bytes, and more than a couple tables
> reference this table with a foreign key, then you will quickly start to
> save space by using a numeric primary key, even if it is an artificial
> construct.
>
> For the kind of work I find myself doing, it's rare that it would be
> more efficient to not have the artificial construct. But that doesn't
> mean one is always better than the other.
>

So let me see if I understand this correctly.

If the real-world primary key is large (say up to 100 characters in length) then
the disadvantage is that you are duplicating this referenced key in several
other tables, each element taking up 100 characters.  Space is wasted when
compared to int4 ID's.  But not really sure if this is a performance problem for
SELECT except for the space required (varchar(128) vs. int4).

Having two keys, a primary_key of int4 and a unique key of varchar(128) would be
very ugly on INSERT/DELETE/UPDATE queries because of the index overhead.

One table may have rows on the order of 100's.
Another table will be 1,000,000.
The many-to-many join would be..  100,000,000's
And maybe there I would have need for smaller physical index variable types...

Re: indexes

From

"Harald Armin Massa"

Date:

25 November 2006, 13:33:59

Tom,

If the real-world primary key is large (say up to 100 characters in length) then
the disadvantage is that you are duplicating this referenced key in several
other tables, each element taking up 100 characters. Space is wasted when
compared to int4 ID's. But not really sure if this is a performance problem for
SELECT except for the space required (varchar(128) vs. int4).

What is worth a try is to check an md5 or sha hash of those 100chars and use that as a key.

with a proper hashing algorythm (proper= fitting to your data) collisions should not happen; and you have a shorter key to connect the tables.

best wishes

harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
Python: the only language with more web frameworks than keywords.

Re: indexes

From

Martijn van Oosterhout

Date:

25 November 2006, 15:44:31

On Sat, Nov 25, 2006 at 12:14:11PM -0500, Tom Allison wrote:
> If the real-world primary key is large (say up to 100 characters in length)
> then the disadvantage is that you are duplicating this referenced key in
> several other tables, each element taking up 100 characters.  Space is
> wasted when compared to int4 ID's.  But not really sure if this is a
> performance problem for SELECT except for the space required (varchar(128)
> vs. int4).

Well, it kinda sucks for joining because comparing strings may take
tens to hundreds of times as long as compairng integers. It's not just
byte-wise comparison but you have to be locale sensetive about it.

My main problem with using any kind of string as "natural key" is that
once you start passing it around you have worry about the encoding of
said string and when it goes a round-trip to a client, will what you
get back still be the same? If you can guarentee ASCII you might be ok,
but otherwise... Integers have none of these issues.

> Having two keys, a primary_key of int4 and a unique key of varchar(128)
> would be very ugly on INSERT/DELETE/UPDATE queries because of the index
> overhead.

In general my tables are queried several orders of magnitude more often
than they are updated, so index update cost isn't all that relevent.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

signature.asc

Re: indexes

From

Alban Hertroys

Date:

27 November 2006, 05:18:31

Martijn van Oosterhout wrote:
> Well, it kinda sucks for joining because comparing strings may take
> tens to hundreds of times as long as compairng integers. It's not just
> byte-wise comparison but you have to be locale sensetive about it.
>
> My main problem with using any kind of string as "natural key" is that
> once you start passing it around you have worry about the encoding of
> said string and when it goes a round-trip to a client, will what you
> get back still be the same? If you can guarentee ASCII you might be ok,
> but otherwise... Integers have none of these issues.

That sounds like there would be a purpose for a locale-less type of
string, as ugly as that is.
That's not a feature request, just something to digest ;)

>> Having two keys, a primary_key of int4 and a unique key of varchar(128)
>> would be very ugly on INSERT/DELETE/UPDATE queries because of the index
>> overhead.
>
> In general my tables are queried several orders of magnitude more often
> than they are updated, so index update cost isn't all that relevent.

Indeed. Basically if expensive calculations can be pre-calculated I tend
to move them into triggers that store the pre-calculated result. Usually
there's only a handfull of people modifying data, while there are many
people requesting it.

Regards,
--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //