Thread: Referential integrity (foreign keys) across multiple tables
Simplified schema: create table hosts ( id serial primary key, hostname text not null ); create table pages ( id serial primary key, hostid int not null references hosts (id), url text not null, unique(hostid, url) ); create table page_contents ( pageid int not null references pages (id), section text not null ); (There are many hosts, many pages per host, and many page_contents sections per page). Now I want to add a column to page_contents, say called link_name, which is going to reference the pages.url column for the particular host that this page belongs to. Something like: alter table page_contents add link_name text; alter table page_contents add constraint foo foreign key (p.hostid, link_name) references pages (hostid, url) where p.id = pageid; Obviously that second statement isn't going to compile. I don't want to add the hostid column to page_contents table because I have a lot of old code accessing the database which would be hard to change (the old code would no longer be able to insert page_contents rows). Is this possible somehow? Perhaps by adding a second table? Do I have to use triggers, and if so is that as robust as referential integrity? Rich. -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com
On Sat, Jul 22, 2006 at 14:32:57 +0100, Richard Jones <rich@annexia.org> wrote: > > Now I want to add a column to page_contents, say called link_name, > which is going to reference the pages.url column for the particular > host that this page belongs to. What are you trying to accomplish by this? The information is available by doing a join. If you are trying to simplify things for applications, you can probably do it with a view or rules depending on whether you want to have an updatable view. If you are denormalizing for performance and want constraints to maintain consistancy, then you probably want to push the hostid down to page_contents as well as the url. These could both be set with a trigger. (I think a rule could be used as well.)
On Sun, Jul 23, 2006 at 01:32:37PM -0500, Bruno Wolff III wrote: > On Sat, Jul 22, 2006 at 14:32:57 +0100, > Richard Jones <rich@annexia.org> wrote: > > > > Now I want to add a column to page_contents, say called link_name, > > which is going to reference the pages.url column for the particular > > host that this page belongs to. > > What are you trying to accomplish by this? Data integrity. > The information is available by doing a join. If you are trying to > simplify things for applications, you can probably do it with a view > or rules depending on whether you want to have an updatable view. If > you are denormalizing for performance and want constraints to > maintain consistancy, then you probably want to push the hostid down > to page_contents as well as the url. These could both be set with a > trigger. (I think a rule could be used as well.) So if I get this right, I should use a trigger to ensure that the old code causes the hostid field to be set in page_contents? Rich. -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com
On 7/22/06, Richard Jones <rich@annexia.org> wrote:
The easiest, and not necessarily elegant, way to acomplish this is to create linkid rather than link_name and make it a foreign key to pages.id. Then add a trigger that checks to make sure the pages you link to from page_contents to pages is for the same host. If not, raise an exception.
Another option is to do this:
create table page_contents (
hostid int not null,
url text not null,
linkhostid int,
linkurl text,
section text not null,
foreign key (hostid, url) references pages (hostid, url),
foreign key (linkhostid, linkurl) references pages (hostid, url)
);
Or if you really want to restructure things:
create table hosts (
id serial primary key,
hostname text not null
);
create table pages (
id serial primary key,
url text not null,
unique (url)
);
create table page_contents (
pageid int not null references pages (id),
hostsid int not null references hosts (id),
linkpageid int references pages(id),
section text not null
);
That should give you some options to play with.
As a side comment, you should also name your primary key columns more meaningfully. Use hosts.hostsid and pages.pagesid, not hosts.id and pages.id. When you begin writing large queries, the use of the column name id all over the place will make your queries more prone to error, harder to read and harder to write.
==================================================================
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
==================================================================
Simplified schema:
create table hosts (
id serial primary key,
hostname text not null
);
create table pages (
id serial primary key,
hostid int not null references hosts (id),
url text not null,
unique (hostid, url)
);
create table page_contents (
pageid int not null references pages (id),
section text not null
);
(There are many hosts, many pages per host, and many page_contents
sections per page).
Now I want to add a column to page_contents, say called link_name,
which is going to reference the pages.url column for the particular
host that this page belongs to.
Something like:
alter table page_contents add link_name text;
alter table page_contents
add constraint foo foreign key (p.hostid, link_name)
references pages (hostid, url)
where p.id = pageid;
Another option is to do this:
create table page_contents (
hostid int not null,
url text not null,
linkhostid int,
linkurl text,
section text not null,
foreign key (hostid, url) references pages (hostid, url),
foreign key (linkhostid, linkurl) references pages (hostid, url)
);
Or if you really want to restructure things:
create table hosts (
id serial primary key,
hostname text not null
);
create table pages (
id serial primary key,
url text not null,
unique (url)
);
create table page_contents (
pageid int not null references pages (id),
hostsid int not null references hosts (id),
linkpageid int references pages(id),
section text not null
);
That should give you some options to play with.
As a side comment, you should also name your primary key columns more meaningfully. Use hosts.hostsid and pages.pagesid, not hosts.id and pages.id. When you begin writing large queries, the use of the column name id all over the place will make your queries more prone to error, harder to read and harder to write.
==================================================================
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
==================================================================
On Mon, Jul 24, 2006 at 09:59:07 +0100, Richard Jones <rich@annexia.org> wrote: > On Sun, Jul 23, 2006 at 01:32:37PM -0500, Bruno Wolff III wrote: > > On Sat, Jul 22, 2006 at 14:32:57 +0100, > > Richard Jones <rich@annexia.org> wrote: > > > > > > Now I want to add a column to page_contents, say called link_name, > > > which is going to reference the pages.url column for the particular > > > host that this page belongs to. > > > > What are you trying to accomplish by this? > > Data integrity. This doesn't make sense in isolation. If that is all you are trying to do, then you don't need to do anything to the database design as the information is already there. The application just needs to do a join when querying the data. > > The information is available by doing a join. If you are trying to > > simplify things for applications, you can probably do it with a view > > or rules depending on whether you want to have an updatable view. If > > you are denormalizing for performance and want constraints to > > maintain consistancy, then you probably want to push the hostid down > > to page_contents as well as the url. These could both be set with a > > trigger. (I think a rule could be used as well.) > > So if I get this right, I should use a trigger to ensure that the old > code causes the hostid field to be set in page_contents? No unless you are trying to do something else in addition to maintaining data integrity.
On Mon, Jul 24, 2006 at 12:51:48PM -0500, Bruno Wolff III wrote: > On Mon, Jul 24, 2006 at 09:59:07 +0100, > Richard Jones <rich@annexia.org> wrote: > > On Sun, Jul 23, 2006 at 01:32:37PM -0500, Bruno Wolff III wrote: > > > On Sat, Jul 22, 2006 at 14:32:57 +0100, > > > Richard Jones <rich@annexia.org> wrote: > > > > > > > > Now I want to add a column to page_contents, say called link_name, > > > > which is going to reference the pages.url column for the particular > > > > host that this page belongs to. > > > > > > What are you trying to accomplish by this? > > > > Data integrity. > > This doesn't make sense in isolation. If that is all you are trying to do, > then you don't need to do anything to the database design as the information > is already there. The application just needs to do a join when querying the > data. I'm not sure what this means. By "data integrity" I just meant that I don't want applications to create page_contents.link_name fields which could point to non-existent URLs. (A URL consists of a particular hostid and url, since you're not allowed to have one host pointing to pages on another, and this is where the requirement for a foreign key which spans two tables comes from). Perhaps I meant "data consistency"? Anyway without some sort of check, be it a reference or a trigger -- assuming a trigger is possible -- then an application might create such a bad link. Rich. -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com
On Mon, Jul 24, 2006 at 12:26:15PM -0500, Aaron Bono wrote: > On 7/22/06, Richard Jones <rich@annexia.org> wrote: > > create table hosts ( > > id serial primary key, > > hostname text not null > > ); > > > > create table pages ( > > id serial primary key, > > hostid int not null references hosts (id), > > url text not null, > > unique (hostid, url) > > ); > > > > create table page_contents ( > > pageid int not null references pages (id), > > section text not null > > ); > > > >(There are many hosts, many pages per host, and many page_contents > >sections per page). > > > >Now I want to add a column to page_contents, say called link_name, > >which is going to reference the pages.url column for the particular > >host that this page belongs to. > > > >Something like: > > > > alter table page_contents add link_name text; > > alter table page_contents > > add constraint foo foreign key (p.hostid, link_name) > > references pages (hostid, url) > > where p.id = pageid; > > The easiest, and not necessarily elegant, way to acomplish this is to create > linkid rather than link_name and make it a foreign key to pages.id. Unfortunately this isn't possible :-( My schema above is simplified a little too far. In the real schema we keep old versions of pages around in the pages table (we move the 'url' field to a 'url_deleted' field so that the unique (hostid, url) isn't violated by the new version of the page). This means that the pageid can be updated, so link_name must store a url, not a pageid. > Then add a trigger that checks to make sure the pages you link to > from page_contents to pages is for the same host. If not, raise an > exception. I think though that this suggestion is right. I'm not sure what difference it makes if it's link_name or linkid, but it looks like I'll have to write a trigger for this. It doesn't seem like there's a way using just ordinary foreign keys. Rich. -- Richard Jones, CTO Merjis Ltd. Merjis - web marketing and technology - http://merjis.com Team Notepad - intranets and extranets for business - http://team-notepad.com
Richard Jones wrote: > Simplified schema: > > create table hosts ( > id serial primary key, > hostname text not null > ); > > create table pages ( > id serial primary key, > hostid int not null references hosts (id), > url text not null, > unique (hostid, url) > ); > > create table page_contents ( > pageid int not null references pages (id), > section text not null > ); > > (There are many hosts, many pages per host, and many page_contents > sections per page). > > Now I want to add a column to page_contents, say called link_name, > which is going to reference the pages.url column for the particular > host that this page belongs to. Ah! It's only from reading later messages that I realise you're trying to ensure that a link in page_contents can only reference pages on the same host as itself. > Something like: > > alter table page_contents add link_name text; > alter table page_contents > add constraint foo foreign key (p.hostid, link_name) > references pages (hostid, url) > where p.id = pageid; > > Obviously that second statement isn't going to compile. > > I don't want to add the hostid column to page_contents table because I > have a lot of old code accessing the database which would be hard to > change (the old code would no longer be able to insert page_contents > rows). I'm sure you know, but for the benefit of the list the problem here is that the surrogate primary-key on "pages" has concealed valuable information (i.e. the hostid) from the "page_contents" table. Unfortunate that you can't update the application [* see rant below] > Is this possible somehow? Perhaps by adding a second table? Do I > have to use triggers, and if so is that as robust as referential > integrity? Write triggers to enforce the host dependency. Just make a table of all the columns involved on all three tables and decide what (if anything) should happen when a value is inserted/updated/deleted. Write your triggers then write a short test script. Oh, test it with concurrent transactions too - just to make sure it's doing what you'd expect. The only reason hand-built triggers would be less reliable than built-in foreign-keys is because they'll have had less testing. The other alternative would be to add the hostid column to page_contents and use a BEFORE trigger to set it based on the pageid the application inserts. Then you can have a straightforward foreign-key. If your application can't cope with having a column added to the table you might have to mask this with a view. [*minor rant follows: Why is it that when changes to the database schema break applications because the column-names are hard-wired, it is the RDBMS that is at fault? I'm prepared to bet good money (up to the value of say, a whole pound) that the primary key is just being used anonymously in the application and that if the database interface had been able to refer to it as "column(s) referencing pages primary-key" then Richard wouldn't have a problem to post about. And it's relational databases that aren't "agile" enough for the modern world, not most of your programming languages. ] -- Richard Huxton Archonet Ltd
On Mon, Jul 24, 2006 at 18:53:20 +0100, Richard Jones <rich@annexia.org> wrote: > > I'm not sure what this means. By "data integrity" I just meant that I > don't want applications to create page_contents.link_name fields which > could point to non-existent URLs. (A URL consists of a particular > hostid and url, since you're not allowed to have one host pointing to > pages on another, and this is where the requirement for a foreign key > which spans two tables comes from). Perhaps I meant "data > consistency"? Anyway without some sort of check, be it a reference or > a trigger -- assuming a trigger is possible -- then an application > might create such a bad link. I think part of the problem might have been over simplification of the problem. In your example there was no reason to create that new column since the information was available by doing a join between two tables and this would prevent a problem with data being out of sync. I see from another message in this thread that there is really more to this example. It may be that you still don't need the new column, but I haven't examined the new example carefully to say one way or the other.