On 8/01/2010 2:11 AM, Nikolas Everett wrote:
> This table is totally unnormalized. Normalize it and try again. You'll
> probably see a huge speedup. Maybe even 10x. My mantra has always been
> less data stored means less data to scan means faster scans.
Sometimes one intentionally denormalizes storage, though. JOIN costs can
be considerable too, and if most of the time you're interested in all
the data for a record not just a subset of it, storing it denormalized
is often faster and cheaper than JOINing for it or using subqueries to
fetch it.
Normalization or any other splitting of record into multiple separately
stored records also has costs in complexity, management, the need for
additional indexes, storage of foreign key references, all the extra
tuple headers you need to store, etc.
It's still generally the right thing to do, but it should be thought
about, not just tackled blindly. I only tend to view it as a no-brainer
if the alternative is storing numbered fields ("field0", "field1",
"field2", etc) ... and even then there are exceptions. One of my schema
at the moment has address_line_1 through address_line_4 in a `contact'
entity, and there's absolutely *no* way I'm splitting that into a
separate table of address_lines accessed by join and sort! (Arguably I
should be using a single `text' field with embedded newlines instead,
though).
Sometimes it's even better to hold your nose and embed an array in a
record rather than join to an external table. Purism can be taken too far.
Note that Pg's TOAST mechanism plays a part here, too. If you have a big
`text' field, it's probably going to get stored out-of-line (TOASTed)
anyway, and TOAST is going to be cleverer about fetching it than you
will be using a JOIN. So storing it in-line is likely to be the right
way to go. You can even force out-of-line storage if you're worried.
In the case of this benchmark, even if they split much of this data out
into other tables by reference, it's likely to be slower rather than
faster if they still want the data they've split out for most of their
queries.
--
Craig Ringer