Thread: Column order in multi column primary key

Column order in multi column primary key

From
"Craig Boucher"
Date:

PG 9.5

 

I’m in the process of converting our application from Sql Server to Postgresql.  I’m taking advantage of this process to make some database design changes. 

 

Our database contains data for many customers and I have a customer_id column in every customer related table.  One of the changes I made was to change all of the tables that use an auto incrementing number as the primary key to also include a customer_id in the pk.  I also changed all of the foreign keys to include both the id number column and the customer_id.  I made this change so when inserting a child record, I don’t have to look up each parent record to verify that it is owned by that customer.  The database handles this check for me now through foreign key constraints.  My question about multi column primary keys is should I have the customer_id column first (which will have many repeated rows) and then the auto incrementing id field (which will most likely be unique in the table), or should it be there other way around.  Will the pk index perform better one way or the other or will it no matter?

 

The number of customers are in the hundreds and the number of child records in some of the tables can be in the millions.

 

Thanks,

Craig

Re: Column order in multi column primary key

From
"David G. Johnston"
Date:
On Mon, Aug 8, 2016 at 1:47 PM, Craig Boucher <craig@wesvic.com> wrote:

PG 9.5

 

I’m in the process of converting our application from Sql Server to Postgresql.  I’m taking advantage of this process to make some database design changes. 

 

Our database contains data for many customers and I have a customer_id column in every customer related table.  One of the changes I made was to change all of the tables that use an auto incrementing number as the primary key to also include a customer_id in the pk.  I also changed all of the foreign keys to include both the id number column and the customer_id.  I made this change so when inserting a child record, I don’t have to look up each parent record to verify that it is owned by that customer.  The database handles this check for me now through foreign key constraints.  My question about multi column primary keys is should I have the customer_id column first (which will have many repeated rows) and then the auto incrementing id field (which will most likely be unique in the table), or should it be there other way around.  Will the pk index perform better one way or the other or will it no matter?

 

The number of customers are in the hundreds and the number of child records in some of the tables can be in the millions.



​The PK should be (child, parent) - selectivity is the most important aspect of the PK.

You will also want an index on just (parent) to support its FK nature.​


But, your description seems lacking...I'd suggest you show an example set of tables with names, PKs and FKs

David J.

Re: Column order in multi column primary key

From
"Craig Boucher"
Date:

Here is one of the tables that can have millions of rows and foreign key constraints to 5 other tables.

 

CREATE TABLE public.work_session

(

  work_session_id integer NOT NULL DEFAULT nextval('worksession_worksessionid_seq'::regclass),

  customer_id integer NOT NULL,

  store_id integer NOT NULL,

  department_id integer NOT NULL,

  station_id integer NOT NULL,

  start_date_time timestamp(6) without time zone NOT NULL,

  end_date_time timestamp(6) without time zone NOT NULL,

  job_id integer NOT NULL,

  goal smallint NOT NULL,

  employee_count smallint NOT NULL DEFAULT 1,

  pieces integer NOT NULL,

  end_reason_id integer,

  piece_counter_session_id integer,

  work_session_guid uuid,

  CONSTRAINT worksession_pkey PRIMARY KEY (work_session_id, customer_id),

  CONSTRAINT fk_worksession_department FOREIGN KEY (department_id, customer_id)

      REFERENCES public.department (department_id, customer_id) MATCH SIMPLE

      ON UPDATE NO ACTION ON DELETE NO ACTION,

  CONSTRAINT fk_worksession_endreason FOREIGN KEY (end_reason_id, customer_id)

      REFERENCES public.end_reason (end_reason_id, customer_id) MATCH SIMPLE

      ON UPDATE NO ACTION ON DELETE NO ACTION,

  CONSTRAINT fk_worksession_job FOREIGN KEY (job_id, customer_id)

      REFERENCES public.job (job_id, customer_id) MATCH SIMPLE

      ON UPDATE NO ACTION ON DELETE NO ACTION,

  CONSTRAINT fk_worksession_station FOREIGN KEY (station_id, customer_id)

      REFERENCES public.station (station_id, customer_id) MATCH SIMPLE

      ON UPDATE NO ACTION ON DELETE NO ACTION,

  CONSTRAINT fk_worksession_store FOREIGN KEY (store_id, customer_id)

      REFERENCES public.store (store_id, customer_id) MATCH SIMPLE

      ON UPDATE NO ACTION ON DELETE NO ACTION,

  CONSTRAINT chk_worksession_employeecount CHECK (employee_count > 0)

)

(indexes on each fk set of columns removed for brevity)

 

Here is the first related table.

 

CREATE TABLE public.department

(

  department_id integer NOT NULL DEFAULT nextval('department_departmentid_seq'::regclass),

  customer_id integer NOT NULL,

  department_name citext NOT NULL,

  track_redos boolean NOT NULL,

  reset_redos boolean NOT NULL,

  CONSTRAINT department_pkey PRIMARY KEY (department_id, customer_id),

  CONSTRAINT fk_department_customer FOREIGN KEY (customer_id)

      REFERENCES public.customer (customer_id) MATCH SIMPLE

      ON UPDATE NO ACTION ON DELETE NO ACTION,

  CONSTRAINT unq_department_customerid_departmentname UNIQUE (customer_id, department_name)

)

 

Thanks,

Craig

 

From: David G. Johnston [mailto:david.g.johnston@gmail.com]
Sent: Monday, August 8, 2016 11:33 AM
To: Craig Boucher <craig@wesvic.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Column order in multi column primary key

 

On Mon, Aug 8, 2016 at 1:47 PM, Craig Boucher <craig@wesvic.com> wrote:

PG 9.5

 

I’m in the process of converting our application from Sql Server to Postgresql.  I’m taking advantage of this process to make some database design changes. 

 

Our database contains data for many customers and I have a customer_id column in every customer related table.  One of the changes I made was to change all of the tables that use an auto incrementing number as the primary key to also include a customer_id in the pk.  I also changed all of the foreign keys to include both the id number column and the customer_id.  I made this change so when inserting a child record, I don’t have to look up each parent record to verify that it is owned by that customer.  The database handles this check for me now through foreign key constraints.  My question about multi column primary keys is should I have the customer_id column first (which will have many repeated rows) and then the auto incrementing id field (which will most likely be unique in the table), or should it be there other way around.  Will the pk index perform better one way or the other or will it no matter?

 

The number of customers are in the hundreds and the number of child records in some of the tables can be in the millions.

 

 

​The PK should be (child, parent) - selectivity is the most important aspect of the PK.

 

You will also want an index on just (parent) to support its FK nature.​

 

 

But, your description seems lacking...I'd suggest you show an example set of tables with names, PKs and FKs

 

David J.

 

Re: Column order in multi column primary key

From
"David G. Johnston"
Date:
On Mon, Aug 8, 2016 at 3:06 PM, Craig Boucher <craig@wesvic.com> wrote:

Here is one of the tables that can have millions of rows and foreign key constraints to 5 other tables.

 

CREATE TABLE public.work_session

(

  work_session_id integer NOT NULL DEFAULT nextval('worksession_worksessionid_seq'::regclass),

  customer_id integer NOT NULL,

  store_id integer NOT NULL,

  department_id integer NOT NULL,

  station_id integer NOT NULL,


 

CREATE TABLE public.department

(

  department_id integer NOT NULL DEFAULT nextval('department_departmentid_seq'::regclass),

  customer_id integer NOT NULL,



No low-hanging fruit that I can see.

This is just some thought food - much 
would depend on the usage patterns of the table and the exact relationship between the different attributes.  Specifically, you've modeled this as an intersection of 4 orthogonal aspects (customer, store, department, station) while in reality these items likely exist in a strict hierarchy. It would
​be ​
possible to have the only FK on work_session be (station_id, customer_id) - and have station have the FK (department_id, customer_id) - and so forth.

David J.
 

Re: Column order in multi column primary key

From
"Craig Boucher"
Date:

Thanks David.  I’ve thought about the hierarchy you mentioned but the hierarchy can change and I need to capture it as it was when the data was generated.

 

I should have pointed out in my last response that I was wondering if the performance of the pk index on work_session would be better if my primary key was (customer_id, work_session_id) or if (work_session_id,  customer_id) will be fine.  Customer_id will be repeated quite a bit in the table but work_session_id should be unique across the whole table.

 

Thanks,

Craig

 

From: David G. Johnston [mailto:david.g.johnston@gmail.com]
Sent: Monday, August 8, 2016 12:43 PM
To: Craig Boucher <craig@wesvic.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Column order in multi column primary key

 

On Mon, Aug 8, 2016 at 3:06 PM, Craig Boucher <craig@wesvic.com> wrote:

Here is one of the tables that can have millions of rows and foreign key constraints to 5 other tables.

 

CREATE TABLE public.work_session

(

  work_session_id integer NOT NULL DEFAULT nextval('worksession_worksessionid_seq'::regclass),

  customer_id integer NOT NULL,

  store_id integer NOT NULL,

  department_id integer NOT NULL,

  station_id integer NOT NULL,

 

 

CREATE TABLE public.department

(

  department_id integer NOT NULL DEFAULT nextval('department_departmentid_seq'::regclass),

  customer_id integer NOT NULL,

 

 

No low-hanging fruit that I can see.

 

This is just some thought food - much 

would depend on the usage patterns of the table and the exact relationship between the different attributes.  Specifically, you've modeled this as an intersection of 4 orthogonal aspects (customer, store, department, station) while in reality these items likely exist in a strict hierarchy. It would

​be ​

possible to have the only FK on work_session be (station_id, customer_id) - and have station have the FK (department_id, customer_id) - and so forth.

 

David J.

 

Re: Column order in multi column primary key

From
"David G. Johnston"
Date:
On Mon, Aug 8, 2016 at 4:35 PM, Craig Boucher <craig@wesvic.com> wrote:

Thanks David.  I’ve thought about the hierarchy you mentioned but the hierarchy can change and I need to capture it as it was when the data was generated.

 

I should have pointed out in my last response that I was wondering if the performance of the pk index on work_session would be better if my primary key was (customer_id, work_session_id) or if (work_session_id,  customer_id) will be fine.  Customer_id will be repeated quite a bit in the table but work_session_id should be unique across the whole table.

 


​I answered that in the original response...for PKs to operate most efficiently the most selective attribute ​should come first.

David J.

Re: Column order in multi column primary key

From
Tom Lane
Date:
"Craig Boucher" <craig@wesvic.com> writes:
> I should have pointed out in my last response that I was wondering if the performance of the pk index on work_session
wouldbe better if my primary key was (customer_id, work_session_id) or if (work_session_id,  customer_id) will be fine.
Customer_id will be repeated quite a bit in the table but work_session_id should be unique across the whole table. 

You almost certainly want the more-unique column first, so far as the
performance of the index itself goes.  See
https://www.postgresql.org/docs/9.5/static/indexes-multicolumn.html

Having said that, I'm pretty skeptical of the notion of redefining what
your PK is on performance grounds.  With this definition, you'd allow
two entries with the same work_session_id, if they chanced to have
different customer_ids.  Is that really OK?

            regards, tom lane


Re: Column order in multi column primary key

From
"Craig Boucher"
Date:
Thanks Tom for the link.

It could actually be beneficial if we need to migrate a customer from one
database to another because wouldn't have to worry about pk constraint
violations.

Craig

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, August 8, 2016 1:47 PM
To: Craig Boucher <craig@wesvic.com>
Cc: 'David G. Johnston' <david.g.johnston@gmail.com>;
pgsql-general@postgresql.org
Subject: Re: [GENERAL] Column order in multi column primary key

"Craig Boucher" <craig@wesvic.com> writes:
> I should have pointed out in my last response that I was wondering if the
performance of the pk index on work_session would be better if my primary
key was (customer_id, work_session_id) or if (work_session_id,  customer_id)
will be fine.  Customer_id will be repeated quite a bit in the table but
work_session_id should be unique across the whole table.

You almost certainly want the more-unique column first, so far as the
performance of the index itself goes.  See
https://www.postgresql.org/docs/9.5/static/indexes-multicolumn.html

Having said that, I'm pretty skeptical of the notion of redefining what your
PK is on performance grounds.  With this definition, you'd allow two entries
with the same work_session_id, if they chanced to have different
customer_ids.  Is that really OK?

            regards, tom lane



Re: Column order in multi column primary key

From
Kevin Grittner
Date:
On Mon, Aug 8, 2016 at 4:01 PM, Craig Boucher <craig@wesvic.com> wrote:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

>> I'm pretty skeptical of the notion of redefining what your PK
>> is on performance grounds.  With this definition, you'd allow
>> two entries with the same work_session_id, if they chanced to
>> have different customer_ids.  Is that really OK?

> It could actually be beneficial if we need to migrate a customer
> from one database to another because wouldn't have to worry about
> pk constraint violations.

Isn't "wouldn't have to worry about pk constraint violations"
another way of saying "we're OK with allowing the same logical row
to be present multiple times in the table with inconsistent data
among the various copies"?  You may want to read up on data
normalization and shoot for 3rd normal form and its benefits.

The short version you can use as a memory device when mentally
testing whether a key is write is that every non-key column should
be functionally dependent on the key, the whole key, and nothing
but the key ("so help you Codd").  That means that:
 (1) given the key values you can determine the values of all other
columns (1st normal form),
 (2) if you omit any of the key columns from the key you cannot
determine the values of all other columns (2nd normal form), and
 (3) no column's value may be determined from non-key columns, and
therefore only determined from the key indirectly (3rd normal form).

You are talking about degrading your normalization to 1st normal
form.  There is a wealth of literature on the problems that can
introduce.  What would be in line with well-established theory and
practice is looking for a "natural key" in each table -- some
combination of columns which naturally occur in the table which
uniquely identify the rows.  In some cases it is necessary to add
some "tie-breaker" column to the end of the key when you do this --
like a sequence within the group or a timestamp.

If you search the archives you will find periodic discussions of
the relative merits of this approach versus adding a meaningless
synthetic key (often called "id" in every table) to use by itself
as the primary key.  This is essentially the same as adding
"pointers" among the various records and constraining how
"navigation" among tables can happen.  It has a big performance
downside in generating statistics and large reports because it
requires a lot of pointer chasing.  Proponents will point out how
convenient it is to be able to change human-visible identifying
values, potentially on a very large scale, by modifying one column
of one row.  That is, of course, a double-edged sword -- in
discussing design alternatives with the CPAs who were going to be
auditing financial data stored in a database, they didn't tend to
see that as nearly as much of a plus as some programmers do.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Column order in multi column primary key

From
"Craig Boucher"
Date:
Thanks Kevin for your response.  I've Googled and debated natural vs surrogate keys and I just find surrogate keys
easierto work with (maybe I'm just being lazy).  It just seems that a description or name is most often the natural
key. I just can't see, In my case, using a department description as part of the primary key in the department table
andhaving it repeated in millions of rows.  Though I always look for ways to use natural keys where they work well. 

Thanks,
Craig

-----Original Message-----
From: Kevin Grittner [mailto:kgrittn@gmail.com]
Sent: Monday, August 8, 2016 2:44 PM
To: Craig Boucher <craig@wesvic.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; David G. Johnston <david.g.johnston@gmail.com>; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Column order in multi column primary key

On Mon, Aug 8, 2016 at 4:01 PM, Craig Boucher <craig@wesvic.com> wrote:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

>> I'm pretty skeptical of the notion of redefining what your PK is on
>> performance grounds.  With this definition, you'd allow two entries
>> with the same work_session_id, if they chanced to have different
>> customer_ids.  Is that really OK?

> It could actually be beneficial if we need to migrate a customer from
> one database to another because wouldn't have to worry about pk
> constraint violations.

Isn't "wouldn't have to worry about pk constraint violations"
another way of saying "we're OK with allowing the same logical row to be present multiple times in the table with
inconsistentdata among the various copies"?  You may want to read up on data normalization and shoot for 3rd normal
formand its benefits. 

The short version you can use as a memory device when mentally testing whether a key is write is that every non-key
columnshould be functionally dependent on the key, the whole key, and nothing but the key ("so help you Codd").  That
meansthat: 
 (1) given the key values you can determine the values of all other columns (1st normal form),
 (2) if you omit any of the key columns from the key you cannot determine the values of all other columns (2nd normal
form),and 
 (3) no column's value may be determined from non-key columns, and therefore only determined from the key indirectly
(3rdnormal form). 

You are talking about degrading your normalization to 1st normal form.  There is a wealth of literature on the problems
thatcan introduce.  What would be in line with well-established theory and practice is looking for a "natural key" in
eachtable -- some combination of columns which naturally occur in the table which uniquely identify the rows.  In some
casesit is necessary to add some "tie-breaker" column to the end of the key when you do this -- like a sequence within
thegroup or a timestamp. 

If you search the archives you will find periodic discussions of the relative merits of this approach versus adding a
meaninglesssynthetic key (often called "id" in every table) to use by itself as the primary key.  This is essentially
thesame as adding "pointers" among the various records and constraining how "navigation" among tables can happen.  It
hasa big performance downside in generating statistics and large reports because it requires a lot of pointer chasing.
Proponentswill point out how convenient it is to be able to change human-visible identifying values, potentially on a
verylarge scale, by modifying one column of one row.  That is, of course, a double-edged sword -- in discussing design
alternativeswith the CPAs who were going to be auditing financial data stored in a database, they didn't tend to see
thatas nearly as much of a plus as some programmers do. 

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Column order in multi column primary key

From
Kevin Grittner
Date:
On Mon, Aug 8, 2016 at 5:59 PM, Craig Boucher <craig@wesvic.com>
wrote:

> Thanks Kevin for your response.  I've Googled and debated natural
> vs surrogate keys and I just find surrogate keys easier to work
> with (maybe I'm just being lazy).  It just seems that a
> description or name is most often the natural key.  I just can't
> see, In my case, using a department description as part of the
> primary key in the department table and having it repeated in
> millions of rows.

I agree that would not make sense, but most organizations I've
worked with already have user-visible mnemonic or numeric codes for
such things.  Those make great keys, or columns within multi-column
keys.

Anyway, the synthetic key discussion was peripheral to my main
point, which was that you were looking at moving from something
that looked like 2nd or 3rd normal form down to 1st normal form,
which will open you up to whole new classes of data integrity
problems.  I strongly recommend you don't do that.  It looked like
the reason was to try to introduce more meaning into the key, which
is why I ventured into the synthetic key discussion in spite of it
being such an unfortunate trigger for rehashing old flame-wars.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company