Thread: Column order in multi column primary key
PG 9.5
I’m in the process of converting our application from Sql Server to Postgresql. I’m taking advantage of this process to make some database design changes.
Our database contains data for many customers and I have a customer_id column in every customer related table. One of the changes I made was to change all of the tables that use an auto incrementing number as the primary key to also include a customer_id in the pk. I also changed all of the foreign keys to include both the id number column and the customer_id. I made this change so when inserting a child record, I don’t have to look up each parent record to verify that it is owned by that customer. The database handles this check for me now through foreign key constraints. My question about multi column primary keys is should I have the customer_id column first (which will have many repeated rows) and then the auto incrementing id field (which will most likely be unique in the table), or should it be there other way around. Will the pk index perform better one way or the other or will it no matter?
The number of customers are in the hundreds and the number of child records in some of the tables can be in the millions.
Thanks,
Craig
PG 9.5
I’m in the process of converting our application from Sql Server to Postgresql. I’m taking advantage of this process to make some database design changes.
Our database contains data for many customers and I have a customer_id column in every customer related table. One of the changes I made was to change all of the tables that use an auto incrementing number as the primary key to also include a customer_id in the pk. I also changed all of the foreign keys to include both the id number column and the customer_id. I made this change so when inserting a child record, I don’t have to look up each parent record to verify that it is owned by that customer. The database handles this check for me now through foreign key constraints. My question about multi column primary keys is should I have the customer_id column first (which will have many repeated rows) and then the auto incrementing id field (which will most likely be unique in the table), or should it be there other way around. Will the pk index perform better one way or the other or will it no matter?
The number of customers are in the hundreds and the number of child records in some of the tables can be in the millions.
Here is one of the tables that can have millions of rows and foreign key constraints to 5 other tables.
CREATE TABLE public.work_session
(
work_session_id integer NOT NULL DEFAULT nextval('worksession_worksessionid_seq'::regclass),
customer_id integer NOT NULL,
store_id integer NOT NULL,
department_id integer NOT NULL,
station_id integer NOT NULL,
start_date_time timestamp(6) without time zone NOT NULL,
end_date_time timestamp(6) without time zone NOT NULL,
job_id integer NOT NULL,
goal smallint NOT NULL,
employee_count smallint NOT NULL DEFAULT 1,
pieces integer NOT NULL,
end_reason_id integer,
piece_counter_session_id integer,
work_session_guid uuid,
CONSTRAINT worksession_pkey PRIMARY KEY (work_session_id, customer_id),
CONSTRAINT fk_worksession_department FOREIGN KEY (department_id, customer_id)
REFERENCES public.department (department_id, customer_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_worksession_endreason FOREIGN KEY (end_reason_id, customer_id)
REFERENCES public.end_reason (end_reason_id, customer_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_worksession_job FOREIGN KEY (job_id, customer_id)
REFERENCES public.job (job_id, customer_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_worksession_station FOREIGN KEY (station_id, customer_id)
REFERENCES public.station (station_id, customer_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_worksession_store FOREIGN KEY (store_id, customer_id)
REFERENCES public.store (store_id, customer_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT chk_worksession_employeecount CHECK (employee_count > 0)
)
(indexes on each fk set of columns removed for brevity)
Here is the first related table.
CREATE TABLE public.department
(
department_id integer NOT NULL DEFAULT nextval('department_departmentid_seq'::regclass),
customer_id integer NOT NULL,
department_name citext NOT NULL,
track_redos boolean NOT NULL,
reset_redos boolean NOT NULL,
CONSTRAINT department_pkey PRIMARY KEY (department_id, customer_id),
CONSTRAINT fk_department_customer FOREIGN KEY (customer_id)
REFERENCES public.customer (customer_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT unq_department_customerid_departmentname UNIQUE (customer_id, department_name)
)
Thanks,
Craig
From: David G. Johnston [mailto:david.g.johnston@gmail.com]
Sent: Monday, August 8, 2016 11:33 AM
To: Craig Boucher <craig@wesvic.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Column order in multi column primary key
On Mon, Aug 8, 2016 at 1:47 PM, Craig Boucher <craig@wesvic.com> wrote:
PG 9.5
I’m in the process of converting our application from Sql Server to Postgresql. I’m taking advantage of this process to make some database design changes.
Our database contains data for many customers and I have a customer_id column in every customer related table. One of the changes I made was to change all of the tables that use an auto incrementing number as the primary key to also include a customer_id in the pk. I also changed all of the foreign keys to include both the id number column and the customer_id. I made this change so when inserting a child record, I don’t have to look up each parent record to verify that it is owned by that customer. The database handles this check for me now through foreign key constraints. My question about multi column primary keys is should I have the customer_id column first (which will have many repeated rows) and then the auto incrementing id field (which will most likely be unique in the table), or should it be there other way around. Will the pk index perform better one way or the other or will it no matter?
The number of customers are in the hundreds and the number of child records in some of the tables can be in the millions.
The PK should be (child, parent) - selectivity is the most important aspect of the PK.
You will also want an index on just (parent) to support its FK nature.
But, your description seems lacking...I'd suggest you show an example set of tables with names, PKs and FKs
David J.
Here is one of the tables that can have millions of rows and foreign key constraints to 5 other tables.
CREATE TABLE public.work_session
(
work_session_id integer NOT NULL DEFAULT nextval('worksession_
worksessionid_seq'::regclass), customer_id integer NOT NULL,
store_id integer NOT NULL,
department_id integer NOT NULL,
station_id integer NOT NULL,
CREATE TABLE public.department
(
department_id integer NOT NULL DEFAULT nextval('department_
departmentid_seq'::regclass), customer_id integer NOT NULL,
Thanks David. I’ve thought about the hierarchy you mentioned but the hierarchy can change and I need to capture it as it was when the data was generated.
I should have pointed out in my last response that I was wondering if the performance of the pk index on work_session would be better if my primary key was (customer_id, work_session_id) or if (work_session_id, customer_id) will be fine. Customer_id will be repeated quite a bit in the table but work_session_id should be unique across the whole table.
Thanks,
Craig
From: David G. Johnston [mailto:david.g.johnston@gmail.com]
Sent: Monday, August 8, 2016 12:43 PM
To: Craig Boucher <craig@wesvic.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Column order in multi column primary key
On Mon, Aug 8, 2016 at 3:06 PM, Craig Boucher <craig@wesvic.com> wrote:
Here is one of the tables that can have millions of rows and foreign key constraints to 5 other tables.
CREATE TABLE public.work_session
(
work_session_id integer NOT NULL DEFAULT nextval('worksession_worksessionid_seq'::regclass),
customer_id integer NOT NULL,
store_id integer NOT NULL,
department_id integer NOT NULL,
station_id integer NOT NULL,
CREATE TABLE public.department
(
department_id integer NOT NULL DEFAULT nextval('department_departmentid_seq'::regclass),
customer_id integer NOT NULL,
No low-hanging fruit that I can see.
This is just some thought food - much
would depend on the usage patterns of the table and the exact relationship between the different attributes. Specifically, you've modeled this as an intersection of 4 orthogonal aspects (customer, store, department, station) while in reality these items likely exist in a strict hierarchy. It would
be
possible to have the only FK on work_session be (station_id, customer_id) - and have station have the FK (department_id, customer_id) - and so forth.
David J.
Thanks David. I’ve thought about the hierarchy you mentioned but the hierarchy can change and I need to capture it as it was when the data was generated.
I should have pointed out in my last response that I was wondering if the performance of the pk index on work_session would be better if my primary key was (customer_id, work_session_id) or if (work_session_id, customer_id) will be fine. Customer_id will be repeated quite a bit in the table but work_session_id should be unique across the whole table.
"Craig Boucher" <craig@wesvic.com> writes: > I should have pointed out in my last response that I was wondering if the performance of the pk index on work_session wouldbe better if my primary key was (customer_id, work_session_id) or if (work_session_id, customer_id) will be fine. Customer_id will be repeated quite a bit in the table but work_session_id should be unique across the whole table. You almost certainly want the more-unique column first, so far as the performance of the index itself goes. See https://www.postgresql.org/docs/9.5/static/indexes-multicolumn.html Having said that, I'm pretty skeptical of the notion of redefining what your PK is on performance grounds. With this definition, you'd allow two entries with the same work_session_id, if they chanced to have different customer_ids. Is that really OK? regards, tom lane
Thanks Tom for the link. It could actually be beneficial if we need to migrate a customer from one database to another because wouldn't have to worry about pk constraint violations. Craig -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, August 8, 2016 1:47 PM To: Craig Boucher <craig@wesvic.com> Cc: 'David G. Johnston' <david.g.johnston@gmail.com>; pgsql-general@postgresql.org Subject: Re: [GENERAL] Column order in multi column primary key "Craig Boucher" <craig@wesvic.com> writes: > I should have pointed out in my last response that I was wondering if the performance of the pk index on work_session would be better if my primary key was (customer_id, work_session_id) or if (work_session_id, customer_id) will be fine. Customer_id will be repeated quite a bit in the table but work_session_id should be unique across the whole table. You almost certainly want the more-unique column first, so far as the performance of the index itself goes. See https://www.postgresql.org/docs/9.5/static/indexes-multicolumn.html Having said that, I'm pretty skeptical of the notion of redefining what your PK is on performance grounds. With this definition, you'd allow two entries with the same work_session_id, if they chanced to have different customer_ids. Is that really OK? regards, tom lane
On Mon, Aug 8, 2016 at 4:01 PM, Craig Boucher <craig@wesvic.com> wrote: > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >> I'm pretty skeptical of the notion of redefining what your PK >> is on performance grounds. With this definition, you'd allow >> two entries with the same work_session_id, if they chanced to >> have different customer_ids. Is that really OK? > It could actually be beneficial if we need to migrate a customer > from one database to another because wouldn't have to worry about > pk constraint violations. Isn't "wouldn't have to worry about pk constraint violations" another way of saying "we're OK with allowing the same logical row to be present multiple times in the table with inconsistent data among the various copies"? You may want to read up on data normalization and shoot for 3rd normal form and its benefits. The short version you can use as a memory device when mentally testing whether a key is write is that every non-key column should be functionally dependent on the key, the whole key, and nothing but the key ("so help you Codd"). That means that: (1) given the key values you can determine the values of all other columns (1st normal form), (2) if you omit any of the key columns from the key you cannot determine the values of all other columns (2nd normal form), and (3) no column's value may be determined from non-key columns, and therefore only determined from the key indirectly (3rd normal form). You are talking about degrading your normalization to 1st normal form. There is a wealth of literature on the problems that can introduce. What would be in line with well-established theory and practice is looking for a "natural key" in each table -- some combination of columns which naturally occur in the table which uniquely identify the rows. In some cases it is necessary to add some "tie-breaker" column to the end of the key when you do this -- like a sequence within the group or a timestamp. If you search the archives you will find periodic discussions of the relative merits of this approach versus adding a meaningless synthetic key (often called "id" in every table) to use by itself as the primary key. This is essentially the same as adding "pointers" among the various records and constraining how "navigation" among tables can happen. It has a big performance downside in generating statistics and large reports because it requires a lot of pointer chasing. Proponents will point out how convenient it is to be able to change human-visible identifying values, potentially on a very large scale, by modifying one column of one row. That is, of course, a double-edged sword -- in discussing design alternatives with the CPAs who were going to be auditing financial data stored in a database, they didn't tend to see that as nearly as much of a plus as some programmers do. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thanks Kevin for your response. I've Googled and debated natural vs surrogate keys and I just find surrogate keys easierto work with (maybe I'm just being lazy). It just seems that a description or name is most often the natural key. I just can't see, In my case, using a department description as part of the primary key in the department table andhaving it repeated in millions of rows. Though I always look for ways to use natural keys where they work well. Thanks, Craig -----Original Message----- From: Kevin Grittner [mailto:kgrittn@gmail.com] Sent: Monday, August 8, 2016 2:44 PM To: Craig Boucher <craig@wesvic.com> Cc: Tom Lane <tgl@sss.pgh.pa.us>; David G. Johnston <david.g.johnston@gmail.com>; pgsql-general@postgresql.org Subject: Re: [GENERAL] Column order in multi column primary key On Mon, Aug 8, 2016 at 4:01 PM, Craig Boucher <craig@wesvic.com> wrote: > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >> I'm pretty skeptical of the notion of redefining what your PK is on >> performance grounds. With this definition, you'd allow two entries >> with the same work_session_id, if they chanced to have different >> customer_ids. Is that really OK? > It could actually be beneficial if we need to migrate a customer from > one database to another because wouldn't have to worry about pk > constraint violations. Isn't "wouldn't have to worry about pk constraint violations" another way of saying "we're OK with allowing the same logical row to be present multiple times in the table with inconsistentdata among the various copies"? You may want to read up on data normalization and shoot for 3rd normal formand its benefits. The short version you can use as a memory device when mentally testing whether a key is write is that every non-key columnshould be functionally dependent on the key, the whole key, and nothing but the key ("so help you Codd"). That meansthat: (1) given the key values you can determine the values of all other columns (1st normal form), (2) if you omit any of the key columns from the key you cannot determine the values of all other columns (2nd normal form),and (3) no column's value may be determined from non-key columns, and therefore only determined from the key indirectly (3rdnormal form). You are talking about degrading your normalization to 1st normal form. There is a wealth of literature on the problems thatcan introduce. What would be in line with well-established theory and practice is looking for a "natural key" in eachtable -- some combination of columns which naturally occur in the table which uniquely identify the rows. In some casesit is necessary to add some "tie-breaker" column to the end of the key when you do this -- like a sequence within thegroup or a timestamp. If you search the archives you will find periodic discussions of the relative merits of this approach versus adding a meaninglesssynthetic key (often called "id" in every table) to use by itself as the primary key. This is essentially thesame as adding "pointers" among the various records and constraining how "navigation" among tables can happen. It hasa big performance downside in generating statistics and large reports because it requires a lot of pointer chasing. Proponentswill point out how convenient it is to be able to change human-visible identifying values, potentially on a verylarge scale, by modifying one column of one row. That is, of course, a double-edged sword -- in discussing design alternativeswith the CPAs who were going to be auditing financial data stored in a database, they didn't tend to see thatas nearly as much of a plus as some programmers do. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 8, 2016 at 5:59 PM, Craig Boucher <craig@wesvic.com> wrote: > Thanks Kevin for your response. I've Googled and debated natural > vs surrogate keys and I just find surrogate keys easier to work > with (maybe I'm just being lazy). It just seems that a > description or name is most often the natural key. I just can't > see, In my case, using a department description as part of the > primary key in the department table and having it repeated in > millions of rows. I agree that would not make sense, but most organizations I've worked with already have user-visible mnemonic or numeric codes for such things. Those make great keys, or columns within multi-column keys. Anyway, the synthetic key discussion was peripheral to my main point, which was that you were looking at moving from something that looked like 2nd or 3rd normal form down to 1st normal form, which will open you up to whole new classes of data integrity problems. I strongly recommend you don't do that. It looked like the reason was to try to introduce more meaning into the key, which is why I ventured into the synthetic key discussion in spite of it being such an unfortunate trigger for rehashing old flame-wars. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company