Thread: Data type to use for primary key

Data type to use for primary key

From
Alexandre Leclerc
Date:
Good day,

I'm asking myself if there is a performance issue in using an integer
of varchar(24) PRIMARY KEY in a product table.

I've read that there is no speed issue in the query, but that the only
performance issue is the database size of copying the key in other
tables that require it.

My product_id is copied in orders, jobs, and other specific tables.

What is the common approach? Should I use directly the product_code as
my ID, or use a sequantial number for speed? (I did the same for the
company_id, this is a 'serial' and not the shor name of the customer.
I just don't know what is usually done.

Right now I did the following:
CREATE TABLE design.products (
product_id          serial      PRIMARY KEY,
company_id          integer     NOT NULL REFERENCES sales.companies ON
UPDATE CASCADE,
product_code        varchar(24) NOT NULL,
...
CONSTRAINT product_code_already_used_for_this_company UNIQUE
(company_id, product_code)
);

CREATE TABLE sales.companies (
company_id        integer      PRIMARY KEY,
company_name      varchar(48)  NOT NULL UNIQUE,
...
);

The company_id is also copied in many tables like product, contacts, etc.

Thank you very much for any good pointers on this 'already seen' issue.

--
Alexandre Leclerc

Re: Data type to use for primary key

From
Pierre-Frédéric Caillaud
Date:
> What is the common approach? Should I use directly the product_code as
> my ID, or use a sequantial number for speed? (I did the same for the
> company_id, this is a 'serial' and not the shor name of the customer.
> I just don't know what is usually done.

    Use a serial :
    - you can change product_code for a product easily
    - you can pass around integers easier around, in web forms for instance,
you don't have to ask 'should I escape this string ?'
    - it's faster
    - it uses less space
    - if one day you must manage products from another source whose
product_code overlap yours, you won't have problems
    - you can generate them with a serial uniquely and easily

Re: Data type to use for primary key

From
Alexandre Leclerc
Date:
Mr Caillaud,

Merci! Many points you bring were also my toughts. I was asking myself
really this was the way to go. I'm happy to see that my view of the
problem was good.

Encore merci! (Thanks again!)

On Tue, 23 Nov 2004 00:06:13 +0100, Pierre-Frédéric Caillaud
<lists@boutiquenumerique.com> wrote:
>
> > What is the common approach? Should I use directly the product_code as
> > my ID, or use a sequantial number for speed? (I did the same for the
> > company_id, this is a 'serial' and not the shor name of the customer.
> > I just don't know what is usually done.
>
>         Use a serial :
>         - you can change product_code for a product easily
>         - you can pass around integers easier around, in web forms for instance,
> you don't have to ask 'should I escape this string ?'
>         - it's faster
>         - it uses less space
>         - if one day you must manage products from another source whose
> product_code overlap yours, you won't have problems
>         - you can generate them with a serial uniquely and easily
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match
>


--
Alexandre Leclerc

Re: Data type to use for primary key

From
Josh Berkus
Date:
Alexandre,

> What is the common approach? Should I use directly the product_code as
> my ID, or use a sequantial number for speed? (I did the same for the
> company_id, this is a 'serial' and not the shor name of the customer.
> I just don't know what is usually done.

Don't use SERIAL just because it's there.    Ideally, you *want* to use the
product_code if you can.   It's your natural key and a natural key is always
superior to a surrogate key all other things being equal.

Unfortunately, all other things are NOT equal.    Here's the reasons why you'd
use a surrogate key (i.e. SERIAL):

1) because the product code is a large text string  (i.e. > 10bytes) and you
will have many millions of records, so having it as an FK in other tables
will add significantly to the footprint of the database;

2) because product codes get blanket changes frequently, where thousands of
them pet re-mapped to new codes, and the ON CASCADE UPDATE slow performance
will kill your database;

3) Because every other table in the database has a SERIAL key and consistency
reduces errors;

4) or because your interface programmers get annoyed with using different
types of keys for different tables and multicolumn keys.

If none of the above is true (and I've had it not be, in some tables and some
databases) then you want to stick with your "natural key", the product_code.


--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Data type to use for primary key

From
Rod Taylor
Date:
On Mon, 2004-11-22 at 16:54 -0800, Josh Berkus wrote:
> Alexandre,
>
> > What is the common approach? Should I use directly the product_code as
> > my ID, or use a sequantial number for speed? (I did the same for the
> > company_id, this is a 'serial' and not the shor name of the customer.
> > I just don't know what is usually done.
>
> Don't use SERIAL just because it's there.    Ideally, you *want* to use the
> product_code if you can.   It's your natural key and a natural key is always
> superior to a surrogate key all other things being equal.

It would be nice if PostgreSQL had some form of transparent surrogate
keying in the background which would automatically run around and
replace your real data with SERIAL integers. It could use a lookup table
for conversions between the surrogate and real values so the user never
knows that it's done, a bit like ENUM. Then we could all use the real
values with no performance issues for 1) because it's an integer in the
background, and 2) because a cascade only touches a single tuple in the
lookup table.


--


Re: Data type to use for primary key

From
Josh Berkus
Date:
Rod,

> It would be nice if PostgreSQL had some form of transparent surrogate
> keying in the background which would automatically run around and
> replace your real data with SERIAL integers. It could use a lookup table
> for conversions between the surrogate and real values so the user never
> knows that it's done, a bit like ENUM. Then we could all use the real
> values with no performance issues for 1) because it's an integer in the
> background, and 2) because a cascade only touches a single tuple in the
> lookup table.

Sybase does this, and it's a feature I would dearly love to emulate.  You can
just refer to another table, without specifying the column, as an FK and it
will create an invisible hashed key.   This is the type of functionality Codd
was advocating -- invisible, implementation-automated surrogate keys -- in
the mid 90's (don't have a paper citation at the moment).

So you'd just do:

create table client_contacts (
    fname text not null,
    lname text not null,
    client foriegn key clients,
    position text,
    notes text
);

and the "client" column would create an invisible hashed key that would drag
in the relevant row from the clients table; thus a:

select * from client_contacts

would actually show the whole record from clients as well.


--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Data type to use for primary key

From
Pierre-Frédéric Caillaud
Date:
>> It would be nice if PostgreSQL had some form of transparent surrogate
>> keying in the background which would automatically run around and
>> replace your real data with SERIAL integers. It could use a lookup table

    There is still table inheritance, but it's not really the same.

Re: Data type to use for primary key

From
"Peter Darley"
Date:
All,
    Well, you should still escape any strings you're getting from a web page so
you can ensure you're not subject to a SQL insert attack, even if you're
expecting integers.
Thanks,
Peter Darley

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of
Pierre-Frédéric Caillaud
Sent: Monday, November 22, 2004 3:06 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Data type to use for primary key



> What is the common approach? Should I use directly the product_code as
> my ID, or use a sequantial number for speed? (I did the same for the
> company_id, this is a 'serial' and not the shor name of the customer.
> I just don't know what is usually done.

    Use a serial :
    - you can change product_code for a product easily
    - you can pass around integers easier around, in web forms for instance,
you don't have to ask 'should I escape this string ?'
    - it's faster
    - it uses less space
    - if one day you must manage products from another source whose
product_code overlap yours, you won't have problems
    - you can generate them with a serial uniquely and easily

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match


Re: Data type to use for primary key

From
Alexandre Leclerc
Date:
On Mon, 22 Nov 2004 16:54:56 -0800, Josh Berkus <josh@agliodbs.com> wrote:
> Alexandre,
>
> > What is the common approach? Should I use directly the product_code as
> > my ID, or use a sequantial number for speed? (I did the same for the
> > company_id, this is a 'serial' and not the shor name of the customer.
> > I just don't know what is usually done.
>
> Don't use SERIAL just because it's there.    Ideally, you *want* to use the
> product_code if you can.   It's your natural key and a natural key is always
> superior to a surrogate key all other things being equal.
>
> Unfortunately, all other things are NOT equal.    Here's the reasons why you'd
> use a surrogate key (i.e. SERIAL):
>
> 1) because the product code is a large text string  (i.e. > 10bytes) and you
> will have many millions of records, so having it as an FK in other tables
> will add significantly to the footprint of the database;

Thanks for those tips. I'll print and keep them. So in my case, the
product_code being varchar(24) is:
4 bytes + string size (so possibly up to 24) = possible 28 bytes. I
did the good thing using a serial. For my shorter keys (4 bytes + up
to 6 char) I will use the natural key.

This is interesting, because this is what I did right now.

The "transparent surrogate keying" proposal that is discussed bellow
in the thread is a very good idea. It would be nice to see that. It
would be easier for the DB admin and the coder; the moment this is not
slowing the system. : )

Best regards.

--
Alexandre Leclerc

Re: Data type to use for primary key

From
Pierre-Frédéric Caillaud
Date:
> All,
>     Well, you should still escape any strings you're getting from a web
> page so
> you can ensure you're not subject to a SQL insert attack, even if you're
> expecting integers.
> Thanks,
> Peter Darley

    Well, your framework should do this for you :

    "integer" specified in your database object class description
    "%d" appears in in your generated queries (or you put it in your hand
written queries)
    => if the parameter is not an integer, an exception is thrown, then
catched, then an error page is displayed...

    Or, just casting to int should throw an exception...

    Forms should be validated, but hidden parameters in links are OK imho to
display an error page if they are incorrect, after all, if the user edits
the get or post parameters, well...

Re: Data type to use for primary key

From
Greg Stark
Date:
Alexandre Leclerc <alexandre.leclerc@gmail.com> writes:

> Thanks for those tips. I'll print and keep them. So in my case, the
> product_code being varchar(24) is:
> 4 bytes + string size (so possibly up to 24) = possible 28 bytes. I
> did the good thing using a serial. For my shorter keys (4 bytes + up
> to 6 char) I will use the natural key.

Realize that space usage is really only part of the issue.

If you ever have two records with the same natural key or a record whose
natural key has changed you'll be in for a world of hurt if you use the
natural key as the primary key in your database.

Basically I never use natural keys except when they're arbitrarily chosen
values defined by the application itself.

Situations where I've used varchars instead of integer keys are things like:

. Individual privileges grantable in a security system.
  (things like "VIEWUSER" "EDITUSER" privileges)

. Reference tables for one letter codes used to indicate the type of object
  represented by the record.

Actually I see one interesting exception to my policy in my current database
schema. And I don't think I would do this one differently given the choice
either. The primary key of the postal code table is the postal code. (postal
codes are up here in the great white north like zip codes down there.)

This could hurt if they ever reuse an old previously retired postal code,
which isn't an entirely impossible case. As far as I know it hasn't happened
yet though. And it's just so much more convenient having the postal code handy
instead of having to join against another table to look it up.

--
greg

Re: Data type to use for primary key

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> This could hurt if they ever reuse an old previously retired postal code,
> which isn't an entirely impossible case. As far as I know it hasn't happened
> yet though.

One would suppose that the guys who are in charge of this point at
Canada Post consider the postal code to be their primary key, and
are no more eager to reuse one than you are to see it reused.

Basically this comes down to "I'm going to use some externally supplied
primary key as my primary key.  Do I trust the upstream DBA to know what
a primary key is?"

            regards, tom lane

Re: Data type to use for primary key

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > This could hurt if they ever reuse an old previously retired postal code,
> > which isn't an entirely impossible case. As far as I know it hasn't happened
> > yet though.
>
> One would suppose that the guys who are in charge of this point at
> Canada Post consider the postal code to be their primary key, and
> are no more eager to reuse one than you are to see it reused.

Well, they may eventually be forced to. For the same sort of hierarchic issue
that causes the "shortage" of IPv4 address space even though there's far less
than 4 billion hosts online.

But as far as I can see today the only postal codes that are being retired are
rural areas that are being developed and have blocks of codes assigned instead
of having a single broad code.

> Basically this comes down to "I'm going to use some externally supplied
> primary key as my primary key.  Do I trust the upstream DBA to know what
> a primary key is?"

Well there's another issue here I think. Often people see something that looks
unique and is clearly intended to be a primary key and think "aha, nice
primary key" but they miss a subtle distinction between what the external
primary key represents and what their data model is tracking.

The typical example is social security numbers. SSNs are a perfectly
reasonable primary key -- as long as you're tracking Social Security accounts,
not people. Most people in the US have exactly one SS account, so people often
think it looks like a primary key for people. In fact not everyone has a
Social Security account (aliens who have never worked in the US, or for that
matter people who have never been in the US) and others have had multiple
Social Security accounts (victims of identity theft).

Another example that comes to mind is the local telephone company. When I
changed my phone number they created a new account without telling me, because
their billing system's primary key for accounts is... the phone number. So all
my automated bill payments started disappearing into the black hole of the old
account and my new account went negative. I wonder what they do for customers
who buy services from them but don't have a telephone line.

--
greg

Re: Data type to use for primary key

From
Alexandre Leclerc
Date:
On 24 Nov 2004 01:52:52 -0500, Greg Stark <gsstark@mit.edu> wrote:
> Alexandre Leclerc <alexandre.leclerc@gmail.com> writes:
>
> > Thanks for those tips. I'll print and keep them. So in my case, the
> > product_code being varchar(24) is:
> > 4 bytes + string size (so possibly up to 24) = possible 28 bytes. I
> > did the good thing using a serial. For my shorter keys (4 bytes + up
> > to 6 char) I will use the natural key.
>
> Realize that space usage is really only part of the issue.

Thank you for this additionnal information. This will help out in the
futur. In my situation this is a good thing to have integer key where
I decided to have them. Event if I was obliged to add UNIQUE
constraints to some other columns. I think they call this "candidate
key" and it's still 3NF (whatever; but only if my db is correctly
organised)... I try to be logical and efficient for good performance.
But in the end, the time (the db will get bigger) and good EXPLAIN
ANALYSE commands will help fine tuning later! This will give me more
experience at that point.

> Actually I see one interesting exception to my policy in my current database
> schema. And I don't think I would do this one differently given the choice
> either. The primary key of the postal code table is the postal code. (postal
> codes are up here in the great white north like zip codes down there.)

(I do understand this one, living in the province of Quebec. ;) And
the great white north is still not arrived; end november! - Still, not
very exceptionnal.)

Regards.

--
Alexandre Leclerc