Home > mailing lists

DB design advice: lots of small tables? - Mailing list pgsql-general

From	lender
Subject	DB design advice: lots of small tables?
Date	March 15, 2013 14:50:20
Msg-id	51430AF4.3090404@gmail.com Whole thread Raw
Responses	Re: DB design advice: lots of small tables?
List	pgsql-general

Tree view

Hello.

We are currently redesigning a medium/large office management web
application. There are 75 tables in our existing PostgreSQL database,
but that number is artificially low, due to some unfortunate design choices.

The main culprits are two tables named "catalog" and "catalog_entries".
They contain all those data sets that the previous designer deemed too
small for a separate table, so now they are all stored together. The
values in catalog_entries are typically used to populate dropdown select
fields.

The catalog table (simplified):

  id          SERIAL     -- artificial record ID
  code        VARCHAR    -- unique text identifier
  description VARCHAR    -- (info only, otherwise unused)

The catalog_entries table (simplified):

  id          SERIAL     -- artificial record ID
  catalog_id  INTEGER    -- parent catalog ID
  code        VARCHAR    -- unique (per catalog) text identifier
  rank        INTEGER    -- used for sorting the values
  text_short  VARCHAR    -- text for display (short version)
  text_long   TEXT       -- text for display (long version)

Here are some examples of what the tables contain:

Catalog: department
Entries: it, sales, accounting, cases, ...

Catalog: printers
Entries: ma_color, pa_color, pa_black, pdf, ...

Catalog: invoice_status
Entries: open, locked, entered, booked, cancelled, ...

Catalog: coverage
Entries: national, regional, international, obsolete

Records in other tables reference the values in catalog_entries by id.
For example, the "invoices" table has a "status_id" column pointing to a
record in catalog_entries. Of course, this leads to possible integrity
issues (there is nothing to prevent an invoice record referencing the
"ma_color" value instead of "open" in its status_id field).

There are 64 "catalogs" (data sets), in addition to the 75 real tables.

Now we have finally got the go-ahead to refactor this mess. Most of the
old "catalogs" will probably end up as separate tables. Others might be
replaced with ENUMs or booleans, especially the ones with only 2-3 values.

The reason why I'm hesitating and asking for advice now, is that after
refactoring, we'll end up with ~60 new tables, all of them rather small
and with practically identical columns. (Only five catalogs have more
than 20 entries; about half have five entries or less; five catalogs
have only two entries)

So, my first main question would be: is it "normal" or desirable to have
that many tiny tables? And is it a problem that many of the tables have
the same (or a similar) column definitions?


The second point is that we have redundant unique identifiers in
catalog_entries (id and code). The code value is used by the application
whenever we need to find to one of the values. For example, for a query
like "show all open invoices", we would either -

  1) select the id from catalog_entries where catalog_id refers to the
     "invoice_status" catalog and the code is "open"
  2) use that id to filter select * from invoices

- or do the same in one query using joins. This pattern occurs hundreds
of times in the application code. From a programming viewpoint, having
all-text ids would make things a lot simpler and cleaner (i.e., keep
only the "code" column).

The "id" column was used (AFAIK) to reduce the storage size. Most of the
data tables have less than 100k records, so the overhead wouldn't be too
dramatic, but a few tables (~10) have more; one of them has 1.2m
records. These tables can also refer to the old catalog_entries table
from more than one column. Changing all these references from INT to
VARCHAR would increase the DB size, and probably make scans less
performant. I'm not sure know how indexes on these columns would be
affected.

To summarize, the second question is whether we should ditch the
artificial numeric IDs and just use the "code" column as primary key in
the new tiny tables.


Thanks in advance for your advice.
crl

pgsql-general by date:

From: Bill Moran
Date: 15 March 2013, 14:17:05
Subject: unexpected lock waits (was Re: Do not understand why this happens)

From: Kevin Grittner
Date: 15 March 2013, 16:36:44
Subject: Re: DB design advice: lots of small tables?

DB design advice: lots of small tables? - Mailing list pgsql-general

Previous

Next