Re: Inheritance vs. LIKE - need advice - Mailing list pgsql-general

From Jeff Davis
Subject Re: Inheritance vs. LIKE - need advice
Date
Msg-id 42F52867.8000308@empires.org
Whole thread Raw
In response to Inheritance vs. LIKE - need advice  (William Bug <wb27@drexel.edu>)
List pgsql-general
William Bug wrote:
> Hi All,
>
> Sorry to bring up the topic of PostgreSQL inheritance again, but  after
> going through the archives and Google results, I still don't  have a
> clear sense of whether my plan to implement a schema I'm  working on is
> the preferred way to go.
>
> First, I'd like to find out if the way I'm thinking about Inheritance
> vs. the SQL DDL CREATE TABLE modifier LIKE is correct.
>
> The simplest analogy I can think of from OO techniques is PGSQL
> Inheritance corresponds to Class Inheritance, while LIKE is more like
> an inheritable Interface (Java) or Mixin (Ruby).  Inheritance  maintains
> strict hierarchical relationships propagating the "Class"  identity down
> through to all progeny. LIKE on the other hand simply  provides a means
> to re-use a set of fields in an unlimited number  tables without having
> to redefine those fields for each table you use  them in.

As far as the "database table = application object class" analogy goes,
I think you're correct.

> This view is incomplete and far from a perfect fit to the way PGSQL
> Inheritance & LIKE work, but I think it's a helpful way of thinking  of
> these 2 related mechanisms, when trying to decide how and when to  use
> them in their current form.  As has been mentioned many times in  posts
> here, as well as in the PGSQL docs, PGSQL Inheritance is only  partial.
> Table fields are propagated as well as the group identity,  but no other
> RDBMS objects created on the parent ( INDEXES,  CONSTRAINTS & SEQUENCES
> primarily) are inherited.  As has been  endlessly stated in posts here
> and elsewhere, this is a significant  short-coming for the PGSQL
> Inheritance mechanism which those of us  desirous of using Inheritance
> would love to see fixed (I understand  it has been on the TODO list for
> many years, as this mechanism has  been in the PGSQL code base for over
> 15 years).

For a variety of reasons, it seems Inheritance is basically on
life-support in PostgreSQL, only there for backwards-compatibility.
First, some of the problems like inheriting keys and constraints are the
exact same problems with table partitioning. When PG has a
mostly-complete table partitioning system, I would imagine those
features will appear in the inheritance facilities. Second, there is
limited demand, because inheritance isn't "needed" (more on that below).

> I don't agree this makes PGSQL Inheritance unusable.  There are
> situations where I think it can still be useful, and I describe one
> below.  I'd welcome feedback on that opinion, however, as I'd hate to
> have my relative ignorance doom the data schema I'm about to fill  with
> a few million rows of data to serious problems later.

It's not unusable, but because it's not all that well defined, and not
under active development and maintenance, you might be wary of using it.

> The following is an example of using both Inheritance and LIKE in the
> context described above.

<snip>

>
> This is not the best way to model book info (for instance, books are
> only allowed to have 1 author in this schema), but it will help me to
> make my point.
>
> Books, novels and textbooks will be considered equivalent in the
> context of many queries.  At the same time, there will be other  queries
> where it will be important to consider novels & textbooks as  distinct
> entities.  The PGSQL Inheritance mechanism easily supports  both of
> these situations.
>
> The curation fields listed in the 'curation_info' table are found
> ubiquitously in tables throughout many data schema.  However, it is  not
> likely there would be a circumstance where you would want to  consider
> all tables containing these fields "curatable entities" to  be queried
> as a group.  That simply makes no sense.  In this case,  LIKE seems to
> be the best way to propagate these fields, since it  doesn't couple all
> tables containing them to the parent  'curation_info' table.
>
> As I see it, there are at least 3 major problems with adopting such a
> schema - despite the obvious efficiencies it offers (most of which  have
> been reported elsewhere):
>     1) none of the parent table ('book') CONSTRAINTS or INDEXES are
> propagated to the children.  This means if you want the children to
> have the same CONSTRAINTS - as you probably will - you need to build
> them yourself for each child table.

That's because PostgreSQL hasn't really solved the table-partitioning
problem yet, as I mentioned above.

>     2) the primary keys generated across the book, novel & textbook
> tables are completely uncouple and will definitely collide.  In other
> words, due to the fact that neither the SEQUENCE behind the
> 'book.id_pk' SERIAL field, not the PK CONSTRAINTS & INDEX that comes
> with that field will automatically propagate to the child tables.   That
> is why the SQL DDL given above has an 'id_pk' SERIAL field in  all 3
> tables.  There may be some conditions where you want those PKs  to be
> independent from one another, but those will be much less  frequent than
> the times when you will require they all derive from  the same SEQUENCE.

You answered this one below.

>     3) The fields inherited from the 'curation_info' table via the  LIKE
> modifier are in no way linked back to the table from which they
> originated, unlike a an Interface (in Java) or Mixin (in Ruby) would
> be.  If the 'mod_date' field is remove from 'curation_info' it will
> still remain in all the tables created using 'curation_info' prior to
> making that change.  Same is true if a new field is added to
> 'curation_info'.  If you want that field to be represented in all  those
> tables that had previously been created using the LIKE  'curation_info'
> modifier, you will have to re-CREATE those tables  from scratch.

An interface can't be changed after compile-time, right? So that part of
the analogy is a moot point.

And also, PostgreSQL provides the ALTER TABLE ... ADD COLUMN ..., so you
don't have to recreate the tables from scratch. Of course if you have a
lot of tables that could be a problem.

> As I see it, '1' & '3' above are significant drawbacks with no  obvious
> work-around, but they are not deal breakers.  I would still  have reason
> to want to use both Inheritance and LIKE because of the  efficiencies
> they provide.
>

What efficiencies, specifically?

> '2' above has a simple and obvious work around which I've been
> surprised I've not been able to find posted anywhere (leading me to
> believe I must be missing something).  PGSQL automatically builds a
> SEQUENCE object on SERIAL fields.  The name of the SEQUENCE is simply
> the concatenated name of the table + SERIAL field with '_seq'  appended
> to the end - e.g., for the 'book' table, it would be  'book_id_pk_seq'.
> In order to guarantee the child tables use that  same sequence, you
> simply declare them as follows instead of using  the SERIAL type:
>
> CREATE TABLE textbook (
>     id_pk                    INT8     DEFAULT nextval
> ('book_id_pk_seq')         NOT NULL,
>     subject_id_fk      INT        NOT NULL
> ) INHERITS (book);
> ALTER TABLE textbook ADD CONSTRAINT textbook_pk_c PRIMARY KEY(id_pk);
>
> Is it a pain to have to write - and maintain - this extra SQL DDL?
> Yes.  Having said that, will it provide the desired behavior?  Most
> definitely yes - again, unless I'm missing something.
>
> So this is how I plan to use INHERIT & LIKE.
>

If there was no such thing as INHERITS, you could basically replace each
"INHERITS" with a foreign key.

To insert a book, just insert into "book". To insert a textbook, insert
into "book" and "textbook".

Now, anytime you want to select all books (textbook, novel, anything),
you select from "book", and you only get the set of attributes
associated with all books (title, pub_year, etc).

When you want to select all novels, you join "book" and "novel" and you
get all the novels and all of the attributes associated with a novel
(which include all attributes associated with a book).

If you want to exclude some inherited type, use WHERE NOT EXISTS.

This model seems to mostly agree with your inheritance model. To make it
more sytactically clean, you might want to use views and rules.

The book id is globally unique, solving problem 2. Problem 1 is solved.
And problem 3 is pretty much as you left it. You could solve problem 3
by using foreign keys for that as well, but problem 3 wasn't a big
problem to begin with, because LIKE is only shorthand for defining the
table to begin with.

The main new problem this might create is more joins, which could be a
performance problem. You can largely solve that problem by adjusting the
physical layout as you see fit, and providing views which look and act
like a sane design. Possible physical layouts might be:
(1) one giant table with enough information to distinguish a novel from
another book, for instance
(2) separate tables which have copied attributes

The great part about this is that the performance problem can be solved
after the application is already running, and only after the performance
problem has appeared. You can create views and rename tables in atomic
transactions, and the application would never miss a beat.

You might be careful of the assumption at the beginning, that "table
heading = object class", or the similar assumptions, that "tuple =
application object variable" and that "table contents = collection of
application object variables". If you go too far down that road, your
relational database ends up being nothing more than an object
persistance system. You might forego some of the great benefits of a
relational database.

One benefit of a relational system is that you radically change the way
a single application (or single application version) reads and writes
data to a common central database without changing the physical layout
of the data. And conversely, you can radically change the physical
layout of the data without affecting the way any of your applications
read or write data. PostgreSQL provides the best mechanisms to do this
around, because schema changes are atomic transactional changes (perhaps
some other databases do this, I don't know).

I advise you to think of application object variables as entities that
are represented in a group of relations*, that may be virtual relations
(views) or may be physical relations (tables).

The reason for this is because there may come a time when your
application object class hierarchy changes. For instance, you may
introduce an intermediate class between book and novel. If your layout
is too dependent on your class hierarchy, you will have trouble adapting
your data (keep in mind the new applications still want to access the
old data, and old versions of the application may still want to access
the new data).

If you adopt a relational model, and you use the application object
class hierarchy as merely a mental model or goal for the way you'd like
to read and write data, you'll be in good shape. You have a set of very
well-defined relational operators from which you can create virtual
relations to solve whatever needs your application has, and you don't
need to even consider the effects on other applications (because those
other applications can continue as they were before). Every relational
operator results in another valid relation (the set of relations is
closed wrt all relational operators).

I'll ignore LIKE because it's really just a syntactic shorthand. But
INHERITS could create severe problems if a new application does not
treat a novel as a direct descendent of a book. In fact, it will create
severe problems if all of your applications' class heirarchies aren't
identical to the database layout. When you use INHERITS, you're locked
in, or rather your data is locked in.

There are many similar pitfalls any time you take the analogy of
"database table = application object class" or similar analogies too far.

You'll notice I avoided using the bare word "object". It's a little
unclear, so I tried to specify what I was really talking about, e.g.
"object class" (a type/domain), "object variable" (a variable that holds
an "object value"), or "object value" (a value in the domain defined by
an "object class").

I hope this is helpful. My word is by no means definitive, so determine
what benefits inheritance can offer you, and determine if those benefits
outweigh any potential pitfalls that I mention.

Regards,
    Jeff Davis

*: I oversimplified in that description, but basically what I was
getting at was an E/R model of data. You can search online, but it's the
basic system of one-one, one-many, and many-many relationships that can
represent entities and their relationships to eachother.

PS: I'd be interested to hear some follow-up. If you think something
doesn't belong on -general, feel free to email me directly.


pgsql-general by date:

Previous
From: Zlatko Matić
Date:
Subject: Re: tables in public
Next
From: Mike Mascari
Date:
Subject: ACM Sigmod interview with Bruce Lindsay