Thread: Database migration and redesign

Database migration and redesign

From

"Brandon Aiken"

Date:

11 September 2006, 15:28:18

I’ve been tasked with the unenviable job or migrating a MySQL 4.0 database to something more usable (namely, PostgreSQL 8). MySQL 4.0 doesn’t even support basic things like subqueries, and in order to emulate the effects that RULEs, TRIGGERs and VIEWs bring, they had been using PHP scripts to replicate and build tables across the database (not from one server to another – within the DB itself). The database was built across 5 separate schemata simply to organize the 50 odd tables, and all the tables are using the MyISAM engine which means no transactions, no row-level locking, and no foreign key constraints.

Yeah. It’s ugly. You should see the front-end.

My question relates to primary keys. The vast majority of tables have a primary key on a single char or varchar field. Is it considered better practice to create a serial type id key to use as the primary key for the table, and then create a unique index on the char and varchar fields? Should foreign keys reference the new primary id or the old unique key? What about compound [primary] keys?

Also, any suggestions for good DB design books would be appreciated. I no longer have any good DB design books, and I don’t know what’s good anymore.

Brandon Aiken

CS/IT Systems Engineer

Confidentiality Notice

This email, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed. If the reader of this email is not the intended recipient or his/her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this email is prohibited. If you have received this email in error, please notify the sender by replying to this message and deleting this email immediately.

Re: Database migration and redesign

From

"Marco Bizzarri"

Date:

11 September 2006, 15:34:31

Since you're in the process of modifying a database, you could find this useful:

Agile Databases Techniques, by Scott Ambler.


Regards
Marco


On 9/11/06, Brandon Aiken <BAiken@winemantech.com> wrote:
>
>
>
>
> I've been tasked with the unenviable job or migrating a MySQL 4.0 database
> to something more usable (namely, PostgreSQL 8).  MySQL 4.0 doesn't even
> support basic things like subqueries, and in order to emulate the effects
> that RULEs, TRIGGERs and VIEWs bring, they had been using PHP scripts to
> replicate and build tables across the database (not from one server to
> another – within the DB itself).  The database was built across 5 separate
> schemata simply to organize the 50 odd tables, and all the tables are using
> the MyISAM engine which means no transactions, no row-level locking, and no
> foreign key constraints.
>
>
>
> Yeah.  It's ugly.  You should see the front-end.
>
>
>
> My question relates to primary keys.  The vast majority of tables have a
> primary key on a single char or varchar field.  Is it considered better
> practice to create a serial type id key to use as the primary key for the
> table, and then create a unique index on the char and varchar fields?
> Should foreign keys reference the new primary id or the old unique key?
> What about compound [primary] keys?
>
>
>
> Also, any suggestions for good DB design books would be appreciated.  I no
> longer have any good DB design books, and I don't know what's good anymore.
>
>
>
> --
>
>
> Brandon Aiken
>
>
> CS/IT Systems Engineer
>
>
>
>
>
> Confidentiality Notice
>
>
> This email, including attachments, may include confidential and/or
> proprietary information, and may be used only by the person or entity to
> which it is addressed.  If the reader of this email is not the intended
> recipient or his/her authorized agent, the reader is hereby notified that
> any dissemination, distribution or copying of this email is prohibited.  If
> you have received this email in error, please notify the sender by replying
> to this message and deleting this email immediately.
>
>


--
Marco Bizzarri
http://notenotturne.blogspot.com/

Re: Database migration and redesign

From

"Merlin Moncure"

Date:

11 September 2006, 17:08:23

On 9/11/06, Brandon Aiken <BAiken@winemantech.com> wrote:
> My question relates to primary keys.  The vast majority of tables have a
> primary key on a single char or varchar field.  Is it considered better
> practice to create a serial type id key to use as the primary key for the
> table, and then create a unique index on the char and varchar fields?
> Should foreign keys reference the new primary id or the old unique key?
> What about compound [primary] keys?

This is a somewhat controversial topic.  I personally am not a big fan
of Ambler or his approach to database design.  This is more or less
the surrogate/natural key debate which is a perennial flamewar in the
database world.

I tend to think natural keys are generally superior but surrogates are
useful in some situations.  Automatic use of ID column in every table
leads to lazy thinking and overcomplicated designs.  However at times
they can really optimize your database.  So I will lay down some
extremes and leave it up to you to find the middle ground.

A trivial case of when not to use a serial key is like this:
create table sex(id serial, sex char(1));
insert into sex(sex) select 'M' union all select 'F';

The reason is obvious, adding a join for no reason whenever you need
to know the sex, albeit a simple one, and expanding the 1 character
type to an int type.

A somewhat more interesting case is:
create table email(id serial, email text);

In this case, while it may seem like a waste to store the full email
in every table that references the email, you are optimizing the join
out in such cases, which can be a big win and since there is no other
properties of the email the email table only serves the purpose of
maintaining relational integrity, iow no duplicates.  In the event the
email changes, we allow RI to cleanup the other tables...an integer
proxy would be (at least to me) an meaningless abstraction of the
email.

A case of when not to use a natural key for relating is a bit more
complex, some times you  just get sick and tired of writing the key
fields over and over, or you have measured and determined the natural
key to be wasetful in terms of index performance.  Another reason is
if the p-key data changes frequently and the RI mechism is too
expensive.  In this case I would advise you to strictly key on the
natural and make a candidate, serial key.

create table foo
(
  nat_key1 text,
  nat_key2 int,
  foo_id serial,
  [...]
  primary key(nat_key1, nat_key2),
  unique(foo_id)  -- candidate unique key
);

This is my middle ground: there highly situational cases where an id
column is a win on pracital reasons.  however, a strong design around
natural keys tends to make you think the problem through much more
carefully and lead to a tighter database.

> Also, any suggestions for good DB design books would be appreciated.  I no
> longer have any good DB design books, and I don't know what's good anymore.

http://www.amazon.com/exec/obidos/ASIN/0201485559/databasede095-20?creative=327641&camp=14573&adid=07TEH0J3FS9SYN309QMS&link_code=as1

merlin

Re: Database migration and redesign

From

"Brandon Aiken"

Date:

11 September 2006, 18:55:58

Excellent, I managed to find one of the major sticking points all by
myself!  This is exactly what I was looking for.  Thanks!

There seems to be a lot of that in the DB world.  Practical vs
theoretical.  Or pragmatic vs strict.  It seems to be whether you came
from a math background -- in which case you're concerned with logical
data sets in the most effective theory possible -- or from a software
engineering background -- in which case you're concerned about the
usability and performance of the database software, particularly in how
other applications reference and access the database.

There are advantages to both schools.  A highly theoretical database can
also be highly normalized and therefore have very agile data models
built on top of them.  However, the practical limitation of computerized
relational databases means that they can't implement all features of the
relational database model very well or very efficiently.  Additionally,
while the logical models built on top of the theoretical DB structure
can be very agile, designing those models is very complex due to the
complex nature of the highly normalized relational structure involved.
If you have to do a multiple nested join of 10 tables just to run what
will be a query of the most basic object in the final app, you've
probably normalized way too far.

I think I can draw some lines of distinction now, though.  There are
some places where the index is just a means to relate the otherwise
unrelated tables (if that makes sense), and sometimes where using
natural keys will save several table joins on many queries.  This was
what I was seeing, and I was wondering if it was acceptable to mix it up
and do both where it seems to make the most sense.

My next task is a bit more difficult simply to define what the
relationship needs to be.  The app is business management software that
manages, among other things, Jobs and Quotes.  The relationship between
Jobs and Quotes is a little odd.  Quotes are sent to customers, and if
those customers accept the Quote then a Job is created and the Quote is
assigned to it.  So Quotes exist without Jobs.  Additionally, customers
will often request additional services after the original Quote and
these services get Quoted in the same way, so multiple Quotes can be
assigned to a single Job.

It gets worse.  We use Internal Jobs for cost tracking, and these Jobs
do not have Quotes at all.   Also, there are some Jobs that get 'verbal'
quotes (that is, quotes given outside the normal quoting system).  So
Jobs exist without Quotes.

So, one Job can be associated with many Quotes, so Quote is the child of
the relationship even though it gets created first.  There can be 0 or 1
parents, and 0, 1, or many children in any combination.  I can simply
define foreign keys normally and use NULL values where no relationship
exists, but isn't there a better way to do this?  Allowing NULLs is one
of the problems with many of these tables.

Should I create another table with two fields, one for the Quote number
and one for the Job number, and then have those two fields comprise a
compound primary key (and Quote having a unique constraint for itself as
well)?  That way I don't think I can get logically orphaned children
since both tables are parents to this third table.  Is that how it's
supposed to be done?

There are several places across the DB where this style relationship
occurs, and I'd like to try to conform to best practices (especially
since the last guy managed to miss just about every single one).

--
Brandon Aiken
CS/IT Systems Engineer

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure@gmail.com]
Sent: Monday, September 11, 2006 1:08 PM
To: Brandon Aiken
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database migration and redesign

On 9/11/06, Brandon Aiken <BAiken@winemantech.com> wrote:
> My question relates to primary keys.  The vast majority of tables have
a
> primary key on a single char or varchar field.  Is it considered
better
> practice to create a serial type id key to use as the primary key for
the
> table, and then create a unique index on the char and varchar fields?
> Should foreign keys reference the new primary id or the old unique
key?
> What about compound [primary] keys?

This is a somewhat controversial topic.  I personally am not a big fan
of Ambler or his approach to database design.  This is more or less
the surrogate/natural key debate which is a perennial flamewar in the
database world.

I tend to think natural keys are generally superior but surrogates are
useful in some situations.  Automatic use of ID column in every table
leads to lazy thinking and overcomplicated designs.  However at times
they can really optimize your database.  So I will lay down some
extremes and leave it up to you to find the middle ground.

A trivial case of when not to use a serial key is like this:
create table sex(id serial, sex char(1));
insert into sex(sex) select 'M' union all select 'F';

The reason is obvious, adding a join for no reason whenever you need
to know the sex, albeit a simple one, and expanding the 1 character
type to an int type.

A somewhat more interesting case is:
create table email(id serial, email text);

In this case, while it may seem like a waste to store the full email
in every table that references the email, you are optimizing the join
out in such cases, which can be a big win and since there is no other
properties of the email the email table only serves the purpose of
maintaining relational integrity, iow no duplicates.  In the event the
email changes, we allow RI to cleanup the other tables...an integer
proxy would be (at least to me) an meaningless abstraction of the
email.

A case of when not to use a natural key for relating is a bit more
complex, some times you  just get sick and tired of writing the key
fields over and over, or you have measured and determined the natural
key to be wasetful in terms of index performance.  Another reason is
if the p-key data changes frequently and the RI mechism is too
expensive.  In this case I would advise you to strictly key on the
natural and make a candidate, serial key.

create table foo
(
  nat_key1 text,
  nat_key2 int,
  foo_id serial,
  [...]
  primary key(nat_key1, nat_key2),
  unique(foo_id)  -- candidate unique key
);

This is my middle ground: there highly situational cases where an id
column is a win on pracital reasons.  however, a strong design around
natural keys tends to make you think the problem through much more
carefully and lead to a tighter database.

> Also, any suggestions for good DB design books would be appreciated.
I no
> longer have any good DB design books, and I don't know what's good
anymore.

http://www.amazon.com/exec/obidos/ASIN/0201485559/databasede095-20?creat
ive=327641&camp=14573&adid=07TEH0J3FS9SYN309QMS&link_code=as1

merlin