Thread: shared data for different applications
Hi, Let's say that there is some data that should be logically shared by many applications in the company, such as some core information about its customers (name, address, contact info). In principle, such data should be stored in a DB for sharing. But then if a certain application needs to access such shared data, it will need to access its own DB plus this shared DB, which is troublesome and requires distributed transactions. An alternative is to store everything into a single DB and use, say, schemas to separate the data. The shared data can be put into one schema and that application can have its own schema. This eliminates the need for distributed transactions but then you'll have a single DB for the whole company! This creates problems in: 1) availability: if the DB is down, everything is down. 2) maintenance: if we need to deploy a new application, we'll need to create a new schema in that single DB, potentially impacting other apps. 3) performance: all apps are access that single DB server. I was wondering in practice, how people address this problem? Thanks!
On 21 Nov 2010, at 5:08, Kent Tong wrote: > Hi, > > Let's say that there is some data that should be logically shared by > many applications in the company, such as some core information about > its customers (name, address, contact info). In principle, such data > should be stored in a DB for sharing. > > But then if a certain application needs to access such shared data, it > will need to access its own DB plus this shared DB, which is > troublesome and requires distributed transactions. I think most companies have ended up at that point just by the progress of time. They have several different databases (oftenfrom different vendors even) that they need to aggregate their information from. Usually the different databases contain different kinds of contents, so the need for distributed transactions and such isquite minimal. Where there is a need to keep relational integrity, the related data is usually in one big central database.You'd be amazed how much a database like that can handle! That said, separating things out would be an improvement. For example, for generating reports (big aggregations of many datasources at once), you tend to generate heavy-weight queries that are likely to impact other queries (unless your databaseserver is capable of limiting I/O and such per process, that helps some). > An alternative is to store everything into a single DB and use, say, > schemas to separate the data. The shared data can be put into one > schema and that application can have its own schema. This eliminates > the need for distributed transactions but then you'll have a single > DB for the whole company! This creates problems in: > 1) availability: if the DB is down, everything is down. > 2) maintenance: if we need to deploy a new application, we'll need to > create a new schema in that single DB, potentially impacting other > apps. > 3) performance: all apps are access that single DB server. > > I was wondering in practice, how people address this problem? You limit access to who can touch what and you replicate. For example, the people in your sales department will need to be able to add and modify customer information, but the guysin the IT department don't need to. So the first group gets access to a database server where the customer database isa master, while the others get access to a read-only slave. I wouldn't go so far as to create a separate database for every business unit though, the distinction is more a role-basedone than an organisational one - there will be overlap between who has access to what. That said, unless you're in a very large company, a central database will probably do for almost anything. The exceptionsare more likely to be among the lines of reporting and frequent short but specialised queries for, for example,machine performance statistics. At the company where I work we have a central DB2/400 database for production, shipping and sales. That database gets replicatedfor reporting. We also have a few separate MS SQL databases where for example machines on our production facilitysend their performance statistics, which they get back in an aggregated form every 5 minutes or so. It isn't ideal, but that 20-year old DB2/400 database (although the hardware's been upgraded to i5-series or so) can handleits load just fine. I suspect that Postgres would actually perform better, but you can't just switch a big 24/7 company from one database toanother (provided I had anything to say about it at all, which I don't). That's an expensive, time-consuming and riskyprocess. Just saying, I don't know from experience how well Postgres would fare there, as it's not what we're using. I have no reasonto suspect it to perform less well though. Alban Hertroys -- Screwing up is an excellent way to attach something to the ceiling. !DSPAM:737,4ce923f910421136214443!
Hi Alban, Thanks a lot for your useful info! > I think most companies have ended up at that point just by the progress > of time. They have several different databases (often from different > vendors even) that they need to aggregate their information from. So, is this the result of lack of central coordination or a carefully thought-out decision? > Usually the different databases contain different kinds of contents, so > the need for distributed transactions and such is quite minimal. Let's say, if a customer would like to change his address through a certain app (eg, the web interface for customers), and assuming that customer info is shared across the whole company, then a distributed transaction will be required, right? Or there is a better approach? > That said, separating things out would be an improvement. For example, > for generating reports (big aggregations of many data sources at once), > you tend to generate heavy-weight queries that are likely to impact > other queries (unless your database server is capable of limiting I/O > and such per process, that helps some). I see. For reporting, I agree that using a replicated copy of the database is a good way to do it. > You limit access to who can touch what and you replicate. Fully agree. I am just concerned that maintenance work for a schema may have foreseen or un foreseen impacts on the other schemas as they're in the same DB on the same server. > I wouldn't go so far as to create a separate database for every business > unit though, the distinction is more a role-based one than an > organisational one - there will be overlap between who has access to > what. Fully agree. Therefore I've seen people suggesting doing it on a process and data flow basis. > That said, unless you're in a very large company, a central database will > probably do for almost anything. The exceptions are more likely to be > among the lines of reporting and frequent short but specialised queries > for, for example, machine performance statistics. Thanks for the clear conclusion. Wondering how common is it in practice (I am not really familiar with the data management status quo)? > At the company where I work we have a central DB2/400 database for > production, shipping and sales. That database gets replicated for > reporting. We also have a few separate MS SQL databases where for > example machines on our production facility send their performance > statistics, which they get back in an aggregated form every 5 minutes > or so. > > It isn't ideal, but that 20-year old DB2/400 database (although the > hardware's been upgraded to i5-series or so) can handle its load just > fine. Thanks a lot for sharing the practical info! It is very useful!
On 22 Nov 2010, at 4:43, Kent Tong wrote: > Hi Alban, > > Thanks a lot for your useful info! > >> I think most companies have ended up at that point just by the progress >> of time. They have several different databases (often from different >> vendors even) that they need to aggregate their information from. > > So, is this the result of lack of central coordination or a carefully > thought-out decision? It's usually a mix of both. While a company grows, requirements change, for example. Also, there tend to be 3rd-party applicationsthat don't work with the company's database of choice, necessitating to install a second database, etc. Being in a situation where you get to decide this without the historic requirements is pretty cool, but, no matter how carefulyou plan now, history tends to catch up with you. That doesn't mean you shouldn't try your best to prevent it to,though ;) >> Usually the different databases contain different kinds of contents, so >> the need for distributed transactions and such is quite minimal. > > Let's say, if a customer would like to change his address through > a certain app (eg, the web interface for customers), and assuming > that customer info is shared across the whole company, then a > distributed transaction will be required, right? Or there is a better > approach? I don't really see why that would require a distributed transaction. They can just directly change a record in the masterdatabase, can't they? If I were you, I certainly wouldn't let them change their _live_ data directly in your production database though! You'llwant somebody (in your company) to approve what they entered - people who're not familiar with a system (it's yours,not theirs, after all) are bound to make mistakes, no matter how obvious you make the interface. There are all kinds of approaches to that though, it doesn't mean you need a separate database for their data. You couldfor example keep an approval flag if the data is from the application that the customers use to update their data. You could also keep a separate database around, but then you're quickly moving in the direction of master-master replication,which is really quite complicated due to the conflicting data it tends to generate between masters. Now, before you storm off to implement what I'm telling you; I'm not an expert on this issue. I know my databases and all,but I haven't been in your position and I've never had an opportunity (or a reason) to put replication to practice. I'veread a lot about it, mostly from this mailing list, but my knowledge in that respect is mostly theoretical. Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll see there is no forest. !DSPAM:737,4cea1ff710421896774915!