Re: DB design advice: lots of small tables? - Mailing list pgsql-general
From | Kevin Grittner |
---|---|
Subject | Re: DB design advice: lots of small tables? |
Date | |
Msg-id | 1363646970.31085.YahooMailNeo@web162906.mail.bf1.yahoo.com Whole thread Raw |
In response to | Re: DB design advice: lots of small tables? (Shaun Thomas <sthomas@optionshouse.com>) |
List | pgsql-general |
Shaun Thomas <sthomas@optionshouse.com> wrote: > On 03/15/2013 08:36 AM, Kevin Grittner wrote: > >> I occasionally hear someone maintaining that having a meaningless >> sequential ID column as the primary key of each table is required >> by the relational model. > > You know, I've heard you mention this a couple times, and I still don't > understand why you maintain such a position. Artificial the linkages may be, but > unless I'm missing something, the level of abstraction is often necessary to > avoid excessive storage requirements and bulky foreign keys. > > For me, it comes down to two things: > > 1. Copy a "natural" arbitrary-length text string possibly millions of > times, or use a 32/64-bit integer lookup value. > 2. Use multiple natural columns in a primary key, necessitating copying all > columns to child tables for foreign keys, or use one surrogate key for both. When I was working for the Wisconsin Courts the Circuit Court databases had some tables which were maintained by a central Court Operations group and some tables holding data generated by each county. The primary key of each county maintained table included a county number (smallint). Most county-maintained tables were related to court cases, and included a court case number as part of the primary key (varchar(14)). Beyond that, other fields were added to each table, and where there was a natural heirarchy to the data the "child" normally had the PK of its parent and something to make it unique within that set. Often this was a unique number that started at 1 for each parent. Some of these tables, when replicated to the central state-wide site, had hundreds of millions of rows. You would probably be surprised how many queries were able to optimize much better than they could have with a single-column ID in each table. The plans often surprised me, and when I would force the planner to take the plan which seemed obvious to me, 90% of the time the planner had me beat. The other times often suggested new indexes or adjustments to cost factors which benefited the whole workload. > Yes, it complicates the schema. That's not the point. > But until someone comes up with a database storage method that automatically deduplicates stored data, I > can never advocate using arbitrary strings as natural keys, no matter how unique > and meaningful they are. If micro-managing disk space usage it the primary concern, single-column synthetic keys are likely to win in many circumstances. I have no doubt that there are cases where it can benefit performance. I can guarantee you that many of the queries we ran at Wisconsin Courts would have been much slower with such synthetic keys because they limit the available plans and force some unnatural "navigation". I know because I saw cases where people forced the type of navigation the planner would need to do if synthetic keys were used, and the result was performance orders of magnitude worse. > Maybe I just say that because I was burned by just such a column in a previous > engagement. It was 64-characters of arbitrary text, and was used as a lookup > value for dozens of tables. Had it been mapped to a "meaningless" > surrogate key, several tables would have been halved (or more) in size. > Don't even get me started on indexing that horrible monstrosity in every > table it lived in. That is getting pretty extreme. > Small tables? Yeah, whatever. I don't care. But anything that has the > potential to be duplicated millions of times? You better bet that's going to > be a serial identifier. Like I said, Wis Cts. has a table that has hundreds of millions of rows and the primary key is a smallint, a varchar(14), and another smallint. You can poke around in it at this site by drilling down on the "Court Record Events" button for a case: http://wcca.wicourts.gov/ > Now, you've been doing this longer than I have, in a wider array of > contexts, so you clearly have some perspective I don't. From where I'm > sitting though, I don't get the barely suppressed rage. ;) Oh, every time a programmer who had been working in Access or MySQL was hired, I had to have this discussion all over again. You're ripping the scabs off the old wounds from those battles. :-) I usually had to pull out plans from complex (and fast!) queries to make my point. The other thing is that it brings back memories from the early '80s of working in a database product called TOTAL which required explicit navigation over links. Relational technology was such a breath of fresh air compared to that, I dread a pendulum swing back to that. That's been my experience, anyway. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-general by date: