Thread: shared data for different applications

shared data for different applications

From

Kent Tong

Date:

21 November 2010, 03:17:33

Hi,

Let's say that there is some data that should be logically shared by
many applications in the company, such as some core information about
its customers (name, address, contact info). In principle, such data
should be stored in a DB for sharing.

But then if a certain application needs to access such shared data, it
will need to access its own DB plus this shared DB, which is
troublesome and requires distributed transactions.

An alternative is to store everything into a single DB and use, say,
schemas to separate the data. The shared data can be put into one
schema and that application can have its own schema. This eliminates
the need for distributed transactions but then you'll have a single
DB for the whole company! This creates problems in:
1) availability: if the DB is down, everything is down.
2) maintenance: if we need to deploy a new application, we'll need to
create a new schema in that single DB, potentially impacting other
apps.
3) performance: all apps are access that single DB server.

I was wondering in practice, how people address this problem?

Thanks!

Re: shared data for different applications

From

Alban Hertroys

Date:

21 November 2010, 12:52:02

On 21 Nov 2010, at 5:08, Kent Tong wrote:

> Hi,
>
> Let's say that there is some data that should be logically shared by
> many applications in the company, such as some core information about
> its customers (name, address, contact info). In principle, such data
> should be stored in a DB for sharing.
>
> But then if a certain application needs to access such shared data, it
> will need to access its own DB plus this shared DB, which is
> troublesome and requires distributed transactions.

I think most companies have ended up at that point just by the progress of time. They have several different databases
(oftenfrom different vendors even) that they need to aggregate their information from. 

Usually the different databases contain different kinds of contents, so the need for distributed transactions and such
isquite minimal. Where there is a need to keep relational integrity, the related data is usually in one big central
database.You'd be amazed how much a database like that can handle! 

That said, separating things out would be an improvement. For example, for generating reports (big aggregations of many
datasources at once), you tend to generate heavy-weight queries that are likely to impact other queries (unless your
databaseserver is capable of limiting I/O and such per process, that helps some). 

> An alternative is to store everything into a single DB and use, say,
> schemas to separate the data. The shared data can be put into one
> schema and that application can have its own schema. This eliminates
> the need for distributed transactions but then you'll have a single
> DB for the whole company! This creates problems in:
> 1) availability: if the DB is down, everything is down.
> 2) maintenance: if we need to deploy a new application, we'll need to
> create a new schema in that single DB, potentially impacting other
> apps.
> 3) performance: all apps are access that single DB server.
>
> I was wondering in practice, how people address this problem?

You limit access to who can touch what and you replicate.

For example, the people in your sales department will need to be able to add and modify customer information, but the
guysin the IT department don't need to. So the first group gets access to a database server where the customer database
isa master, while the others get access to a read-only slave. 

I wouldn't go so far as to create a separate database for every business unit though, the distinction is more a
role-basedone than an organisational one - there will be overlap between who has access to what. 

That said, unless you're in a very large company, a central database will probably do for almost anything. The
exceptionsare more likely to be among the lines of reporting and frequent short but specialised queries for, for
example,machine performance statistics. 

At the company where I work we have a central DB2/400 database for production, shipping and sales. That database gets
replicatedfor reporting. We also have a few separate MS SQL databases where for example machines on our production
facilitysend their performance statistics, which they get back in an aggregated form every 5 minutes or so. 

It isn't ideal, but that 20-year old DB2/400 database (although the hardware's been upgraded to i5-series or so) can
handleits load just fine. 

I suspect that Postgres would actually perform better, but you can't just switch a big 24/7 company from one database
toanother (provided I had anything to say about it at all, which I don't). That's an expensive, time-consuming and
riskyprocess. 
Just saying, I don't know from experience how well Postgres would fare there, as it's not what we're using. I have no
reasonto suspect it to perform less well though. 

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

!DSPAM:737,4ce923f910421136214443!

Re: shared data for different applications

From

"Kent Tong"

Date:

22 November 2010, 02:44:27

Hi Alban,

Thanks a lot for your useful info!

> I think most companies have ended up at that point just by the progress
> of time. They have several different databases (often from different
> vendors even) that they need to aggregate their information from.

So, is this the result of lack of central coordination or a carefully
thought-out decision?

> Usually the different databases contain different kinds of contents, so
> the need for distributed transactions and such is quite minimal.

Let's say, if a customer would like to change his address through
a certain app (eg, the web interface for customers), and assuming
that customer info is shared across the whole company, then a
distributed transaction will be required, right? Or there is a better
approach?

> That said, separating things out would be an improvement. For example,
> for generating reports (big aggregations of many data sources at once),
> you tend to generate heavy-weight queries that are likely to impact
> other queries (unless your database server is capable of limiting I/O
> and such per process, that helps some).

I see. For reporting, I agree that using a replicated copy of the database
is a good way to do it.

> You limit access to who can touch what and you replicate.

Fully agree. I am just concerned that maintenance work for a schema may
have foreseen or un foreseen impacts on the other schemas as they're
in the same DB on the same server.

> I wouldn't go so far as to create a separate database for every business
> unit though, the distinction is more a role-based one than an
> organisational one - there will be overlap between who has access to
> what.

Fully agree. Therefore I've seen people suggesting doing it on a process
and data flow basis.

> That said, unless you're in a very large company, a central database will
> probably do for almost anything. The exceptions are more likely to be
> among the lines of reporting and frequent short but specialised queries
> for, for example, machine performance statistics.

Thanks for the clear conclusion. Wondering how common is it in practice
(I am not really familiar with the data management status quo)?

> At the company where I work we have a central DB2/400 database for
> production, shipping and sales. That database gets replicated for
> reporting. We also have a few separate MS SQL databases where for
> example machines on our production facility send their performance
> statistics, which they get back in an aggregated form every 5 minutes
> or so.
>
> It isn't ideal, but that 20-year old DB2/400 database (although the
> hardware's been upgraded to i5-series or so) can handle its load just
> fine.

Thanks a lot for sharing the practical info! It is very useful!

Re: shared data for different applications

From

Alban Hertroys

Date:

22 November 2010, 06:47:13

On 22 Nov 2010, at 4:43, Kent Tong wrote:

> Hi Alban,
>
> Thanks a lot for your useful info!
>
>> I think most companies have ended up at that point just by the progress
>> of time. They have several different databases (often from different
>> vendors even) that they need to aggregate their information from.
>
> So, is this the result of lack of central coordination or a carefully
> thought-out decision?

It's usually a mix of both. While a company grows, requirements change, for example. Also, there tend to be 3rd-party
applicationsthat don't work with the company's database of choice, necessitating to install a second database, etc. 

Being in a situation where you get to decide this without the historic requirements is pretty cool, but, no matter how
carefulyou plan now, history tends to catch up with you. That doesn't mean you shouldn't try your best to prevent it
to,though ;) 

>> Usually the different databases contain different kinds of contents, so
>> the need for distributed transactions and such is quite minimal.
>
> Let's say, if a customer would like to change his address through
> a certain app (eg, the web interface for customers), and assuming
> that customer info is shared across the whole company, then a
> distributed transaction will be required, right? Or there is a better
> approach?

I don't really see why that would require a distributed transaction. They can just directly change a record in the
masterdatabase, can't they? 

If I were you, I certainly wouldn't let them change their _live_ data directly in your production database though!
You'llwant somebody (in your company) to approve what they entered - people who're not familiar with a system (it's
yours,not theirs, after all) are bound to make mistakes, no matter how obvious you make the interface. 

There are all kinds of approaches to that though, it doesn't mean you need a separate database for their data. You
couldfor example keep an approval flag if the data is from the application that the customers use to update their data. 
You could also keep a separate database around, but then you're quickly moving in the direction of master-master
replication,which is really quite complicated due to the conflicting data it tends to generate between masters. 

Now, before you storm off to implement what I'm telling you; I'm not an expert on this issue. I know my databases and
all,but I haven't been in your position and I've never had an opportunity (or a reason) to put replication to practice.
I'veread a lot about it, mostly from this mailing list, but my knowledge in that respect is mostly theoretical. 

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4cea1ff710421896774915!