Thread: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Roy Anderson
Date:
We have an OLTP database and no data warehouse. We are currently
planning out a build for a data warehouse however (possibly using
Hadoop). "X" is recommending that we convert our current, normalized
OLTP database into a flattened Star Schema.

The primary rationale for this OLTP flattening is that since most of
the app calls to the database are SELECT statements, the DB should be
optimized for that. A peripheral rationale I've heard is that the
1NF-3NF database is hard to understand sometimes.

I'm not going to state what I believe. I would like to hear my
esteemed colleagues take on this situation. What would you do? What
would you say? Is there anything you would investigate? In short, what
say you?

Thanks for your time.


Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Merlin Moncure
Date:
On Tue, Feb 4, 2014 at 10:06 PM, Roy Anderson <roy.anderson@gmail.com> wrote:
> We have an OLTP database and no data warehouse. We are currently
> planning out a build for a data warehouse however (possibly using
> Hadoop). "X" is recommending that we convert our current, normalized
> OLTP database into a flattened Star Schema.
>
> The primary rationale for this OLTP flattening is that since most of
> the app calls to the database are SELECT statements, the DB should be
> optimized for that. A peripheral rationale I've heard is that the
> 1NF-3NF database is hard to understand sometimes.
>
> I'm not going to state what I believe. I would like to hear my
> esteemed colleagues take on this situation. What would you do? What
> would you say? Is there anything you would investigate? In short, what
> say you?

couple quick thoughts:
*) what kinds of problems is the data warehouse going to solve the
current database is not solving?

*) Have you conclusively proven the current database can't solve those problems?

*) the phrase '1NF-3NF is poor for very general application style X'
is a typical comment from someone who read a book or an article on
databases without having a lot of practical experience in implementing
systems.  Experienced database guys rarely talk in such terms; they
slip in an out of 'normal forms' naturally in order to meet the
demands of whatever task they may be happening to solve.

*) a deficit in knowledge in scaling large data is typically made
worse, not better, by switching up the software stack, particularly if
you don't have very clear idea of what problems you are trying to
solve by switching and what problems you will run into with the new
stack

*) Hardware is on the march (SSD).  Large data scaling is often a
function of hardware, not software.  Scaling out to multiple systems
from a monolithic architecture is not a decision to be taken lightly
and the calculus of that decision must be constantly revisited in the
face of hardware improvements.  Scaling horizontally may ultimately
cost more in hardware in the end.

*) Anyone who has ever uttered the statement: "databases are slower
than technology X because databases utilize disks and technology X
stores everything in memory" has absolutely no concept of how
databases (or operating systems) work.

*) Star schemas are great for supporting certain kinds of OLAP tools
but are IMNSHO overrated.  If your reporting requirements are
extremely generic or not very well defined you may have to use that or
a similar pattern however.

*) There is no rule that says if you use databases there must only be
one.  Partitioning databases for large analytical queries is a
reasonable path if you have tons of data.  You can then have the
application or (my preference) dblink then aggregate data together.

*) Do not consider any advice to implement exotic storage backend from
someone that has not previously implemented that same technology on a
similar scale on a previous project, ever.  Data of large scale is
hard.  Installing magical tool X often besides not solving the problem
gives you another difficult problem to solve.

merlin


Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Scott Marlowe
Date:
On Thu, Feb 6, 2014 at 7:51 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

> *) Do not consider any advice to implement exotic storage backend from
> someone that has not previously implemented that same technology on a
> similar scale on a previous project, ever.  Data of large scale is
> hard.  Installing magical tool X often besides not solving the problem
> gives you another difficult problem to solve.

Merlin, this reminds me of the quote from Mencken: For every complex
problem there is an answer that is clear, simple, and wrong.


Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Merlin Moncure
Date:
On Thu, Feb 6, 2014 at 12:44 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Thu, Feb 6, 2014 at 7:51 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> *) Do not consider any advice to implement exotic storage backend from
>> someone that has not previously implemented that same technology on a
>> similar scale on a previous project, ever.  Data of large scale is
>> hard.  Installing magical tool X often besides not solving the problem
>> gives you another difficult problem to solve.
>
> Merlin, this reminds me of the quote from Mencken: For every complex
> problem there is an answer that is clear, simple, and wrong.

well said, heh.

merlin


Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Neil Tiffin
Date:
On Feb 6, 2014, at 12:44 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Merlin, this reminds me of the quote from Mencken: For every complex
> problem there is an answer that is clear, simple, and wrong.

Or as Niklaus Wirth said.

... complexity has and will maintain a strong fascination for many people.  It is true that we live in a complex world
andstrive to solve inherently complex problems, which often do require complex mechanisms.  However, this should not
diminishour desire for elegant solutions, which convince by their clarity and effectiveness. Simple, elegant solutions
aremore effective, but they are harder to find than complex ones, and they require more time, which we to often believe
tobe unaffordable. Communications of the ACM, Feb. 1985 

Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Andy Colson
Date:
On 2/4/2014 10:06 PM, Roy Anderson wrote:
> We have an OLTP database and no data warehouse. We are currently
> planning out a build for a data warehouse however (possibly using
> Hadoop). "X" is recommending that we convert our current, normalized
> OLTP database into a flattened Star Schema.
>
> The primary rationale for this OLTP flattening is that since most of
> the app calls to the database are SELECT statements, the DB should be
> optimized for that. A peripheral rationale I've heard is that the
> 1NF-3NF database is hard to understand sometimes.
>
> I'm not going to state what I believe. I would like to hear my
> esteemed colleagues take on this situation. What would you do? What
> would you say? Is there anything you would investigate? In short, what
> say you?
>
> Thanks for your time.
>
>

I think NF is more about correctness than about speed.  People saying
they denormalized a database to improve speed also de-corrected it, and
I doubt they improved the speed.

The only experience I have with data warehousing is storing my apache
logs in PG.  I have have many mega-millions of rows.  So many that I
don't run select count(*) because it takes forever.

I have an index on hitdate, so I can pull any small number of records
based on time very quickly, but I can never select all of them.  When I
add detail rows to the table I also add to a table of hourly summaries.
  My graphs are run off the summary tables and pulling a few months
worth of data is very quick.

So I guess it would come down to how you need to access your data
warehouse.  If you need to "learn" things by hitting every detail record
for all time, then PG isn't gonna work well for you.

If you can do indexed queries and return small subsets, PG will work
great.  If you can keep ongoing summaries you'll be fine too.
Rebuilding a summary table (or creating a new one) is painful, but
possible as long as your not in a huge hurry.

-Andy



Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Kevin Grittner
Date:
Roy Anderson <roy.anderson@gmail.com> wrote:

> We have an OLTP database and no data warehouse. We are currently
> planning out a build for a data warehouse however (possibly using
> Hadoop). "X" is recommending that we convert our current,
> normalized OLTP database into a flattened Star Schema.

I'm not going to repeat good advice you've already gotten in other
answers, but I will point out that complex reporting off of
normalized data is often much faster if you have been able to use
natural keys, even if you need to go to multi-column primary keys
to do so.  One of the biggest down-sides of synthetic primary keys
(where, for example, you might have a single-column PK column
called "id" in every table) is that forces one particular route to
"navigate" the tables.  With natural keys a complex query often
finds intriguing plans to give the results you ask for using plans
you might never have thought of, and which can be orders of
magnitude faster than the plans which would be possible if the
joins are all done using synthetic keys.

Beyond that, I would say that I would never jump to some "star"
schema automatically.  There are various ways data can be
summarized for faster reporting, and a flattened star schema is
only one option, which is not always the fastest option -- or even
an improvement over 3NF.  At the risk of repeating a little, I
recommend looking at what it would take to generate the reports you
want of the current data; and only denormalize where something is
too slow, using the summarization which appears to be the most
sensible for the use case.

As an aside, I had a case where auditors wanted a particular report
off of a 3 TB OLTP database.  One programmer tried to write it
using imperative code and looping.  Based on how far it got in the
first 5 hours, it would have taken a year to complete.  Rewritten
with a couple CTEs as a single declarative query (one SELECT
statement) it ran in ten minutes.  No star schema needed -- just
some clear thinking, and making use of the power of declarative
coding and a great optimizer.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Merlin Moncure
Date:
On Tue, Mar 4, 2014 at 2:15 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Roy Anderson <roy.anderson@gmail.com> wrote:
>
>> We have an OLTP database and no data warehouse. We are currently
>> planning out a build for a data warehouse however (possibly using
>> Hadoop). "X" is recommending that we convert our current,
>> normalized OLTP database into a flattened Star Schema.
>
> I'm not going to repeat good advice you've already gotten in other
> answers, but I will point out that complex reporting off of
> normalized data is often much faster if you have been able to use
> natural keys, even if you need to go to multi-column primary keys
> to do so.  One of the biggest down-sides of synthetic primary keys
> (where, for example, you might have a single-column PK column
> called "id" in every table) is that forces one particular route to
> "navigate" the tables.  With natural keys a complex query often
> finds intriguing plans to give the results you ask for using plans
> you might never have thought of, and which can be orders of
> magnitude faster than the plans which would be possible if the
> joins are all done using synthetic keys.

If we ever happen to meet, you just bought yourself a steak dinner
with this email.  Natural key database design has to my great
displeasure become something of a lost art.  Data modeling and
performance expectations have really suffered as a consequence of that
knowledge gap.  Now, natural keys have issues also -- update
performance on the key in particular -- so you have to be nimble and
adjust the model as appropriate to the task at hand.

merlin


Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Rob Sargent
Date:
On 03/04/2014 01:40 PM, Merlin Moncure wrote:
On Tue, Mar 4, 2014 at 2:15 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Roy Anderson <roy.anderson@gmail.com> wrote:

We have an OLTP database and no data warehouse. We are currently
planning out a build for a data warehouse however (possibly using
Hadoop). "X" is recommending that we convert our current,
normalized OLTP database into a flattened Star Schema.
I'm not going to repeat good advice you've already gotten in other
answers, but I will point out that complex reporting off of
normalized data is often much faster if you have been able to use
natural keys, even if you need to go to multi-column primary keys
to do so.  One of the biggest down-sides of synthetic primary keys
(where, for example, you might have a single-column PK column
called "id" in every table) is that forces one particular route to
"navigate" the tables.  With natural keys a complex query often
finds intriguing plans to give the results you ask for using plans
you might never have thought of, and which can be orders of
magnitude faster than the plans which would be possible if the
joins are all done using synthetic keys.
If we ever happen to meet, you just bought yourself a steak dinner
with this email.  Natural key database design has to my great
displeasure become something of a lost art.  Data modeling and
performance expectations have really suffered as a consequence of that
knowledge gap.  Now, natural keys have issues also -- update
performance on the key in particular -- so you have to be nimble and
adjust the model as appropriate to the task at hand.

merlin


Do you make a distinction between a key and an index?  I'm not picking up on design-by-natural-key and what that entails. Especially the notion that the natural key of a given item might be mutable. What stops it from colliding with the next item? (I have not had the pleasure of working in a domain where natural keys are obvious if they existed at all. "What's in a name", after all. )

Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Merlin Moncure
Date:
On Tue, Mar 4, 2014 at 3:08 PM, Rob Sargent <robjsargent@gmail.com> wrote:
> Do you make a distinction between a key and an index?  I'm not picking up on
> design-by-natural-key and what that entails. Especially the notion that the
> natural key of a given item might be mutable. What stops it from colliding
> with the next item? (I have not had the pleasure of working in a domain
> where natural keys are obvious if they existed at all. "What's in a name",
> after all. )

If your keys are mutable then you definitely have to take that into
consideration for key style choice...but not for your stated concern.
 Even though you can cascade (via RI) updated keys to various tables
performance can certainly suffer vs a surrogate.  This is the main
reason not to use natural keys: slow, perhaps even pathologically slow
update performance on the key.

However, collisions are a reason *to* use natural keys.  If you can'd
handle them with your proposed key then either:
a) you've misidentified the key
or
b) you'er allowing duplicate unique entries in the system and when you
should not be

Even when using surrogates, it's still a good practice to identify
what makes a record unique wherever possible and place unique
constraints on those fields.

merlin


Re: Offending My Tender Sensibilities -OR- OLTP on a Star Schema

From
Rob Sargent
Date:
On 03/06/2014 03:52 PM, Merlin Moncure wrote:
On Tue, Mar 4, 2014 at 3:08 PM, Rob Sargent <robjsargent@gmail.com> wrote:
Do you make a distinction between a key and an index?  I'm not picking up on
design-by-natural-key and what that entails. Especially the notion that the
natural key of a given item might be mutable. What stops it from colliding
with the next item? (I have not had the pleasure of working in a domain
where natural keys are obvious if they existed at all. "What's in a name",
after all. )
If your keys are mutable then you definitely have to take that into
consideration for key style choice...but not for your stated concern.Even though you can cascade (via RI) updated keys to various tables
performance can certainly suffer vs a surrogate.  This is the main
reason not to use natural keys: slow, perhaps even pathologically slow
update performance on the key.

However, collisions are a reason *to* use natural keys.  If you can'd
handle them with your proposed key then either:
a) you've misidentified the key
or
b) you'er allowing duplicate unique entries in the system and when you
should not be

Even when using surrogates, it's still a good practice to identify
what makes a record unique wherever possible and place unique
constraints on those fields.

merlin
Yes.  And in my mind therein resides the semantic difference between a primary key and a unique index.
Thanks.