Thread: Research and EAV models
I've followed this list for quite a long time, and I think that I've discovered
a pattern that I would like to discuss.

It seems like there are two camps concerning EAV models. On the one hand, there
are researchers who think that EAV is a great way to meet their objectives. On
the other hand, there are the "business" guys who think that EAV is crap.

I've seen this pattern often enough and consistently enough that I think there
may be an underlying difference of objectives concerning the use of databases
itself that may be responsible for this divergence.

I'm a researcher type, and I've made an EAV model that suits me well in my
genealogy research. How can you associate an essentially unknown number of
sundry "events" to a "person" without an EAV model?

It seems to me that data models made for research are quite a different animal
from data models made for business. In research, we often need to register data
that may be hard to pin down in exactly the right pigeonhole, but that
nevertheless needs to be recorded. The most sensible way to do this, IMO, is
frequently to associate the data with some already-known or postulated entity.
That's where the EAV model comes in really handy.
--
Leif Biberg Kristensen | Registered Linux User #338009
Me And My Database: http://solumslekt.org/blog/
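A minimal sketch of the sort of EAV layout described above; the table and
column names here are illustrative guesses, not the poster's actual schema:

CREATE TABLE persons (
    person_id serial PRIMARY KEY,
    full_name text NOT NULL
);

CREATE TABLE person_events (
    event_id  serial PRIMARY KEY,
    person_id integer NOT NULL REFERENCES persons(person_id),
    attribute text NOT NULL,   -- e.g. 'birth', 'census', 'emigration'
    value     text             -- free-form: a date, a place, a source note
);

INSERT INTO persons (full_name) VALUES ('Ola Nordmann');

-- Any number and kind of "events" can now be attached to the person:
INSERT INTO person_events (person_id, attribute, value)
VALUES (1, 'emigration', 'Kristiania to Minnesota, 1882');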
"Leif B. Kristensen" <leif@solumslekt.org> writes: > It seems like there are two camps considering EAV models. On the one > hand, there are researchers who think that EAV is a great way to meet > their objectives. On the other hand, there are the "business" guys who > thnk that EAV is crap. Well, no, it's not that EAV is crap. It's that EAV is strong evidence that you're using the wrong tool for the job. If a SQL database is actually a good fit for your application, then it should be possible to extract a stronger schema for your data. If you cannot, then you probably should be storing your data in something else. Otherwise you'll spend untold man-hours beating your head against assorted walls while you try to persuade the SQL database to do things it was never meant for, and coping with performance issues because the cases you need are not optimized nor indeed optimizable. (I can just see that other guy trying to search on one of his "polymorphic" columns :-(.) SQL isn't the be-all and end-all of data storage. It does relational stuff well, and other stuff poorly. regards, tom lane
On Fri, Oct 23, 2009 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Leif B. Kristensen" <leif@solumslekt.org> writes:
>> It seems like there are two camps concerning EAV models. On the one
>> hand, there are researchers who think that EAV is a great way to meet
>> their objectives. On the other hand, there are the "business" guys who
>> think that EAV is crap.
>
> Well, no, it's not that EAV is crap. It's that EAV is strong evidence
> that you're using the wrong tool for the job. If a SQL database is
> actually a good fit for your application, then it should be possible to
> extract a stronger schema for your data. If you cannot, then you
> probably should be storing your data in something else. Otherwise
> you'll spend untold man-hours beating your head against assorted walls
> while you try to persuade the SQL database to do things it was never
> meant for, and coping with performance issues because the cases you
> need are not optimized nor indeed optimizable. (I can just see that
> other guy trying to search on one of his "polymorphic" columns :-(.)

I can certainly see where Tom is coming from on this, and I tend to agree in
general. However, the reality of the research world is never that clear-cut.
In particular, what you find is that some percentage of the domain is well
understood and can be placed in a conventional schema, while some portion is
not so well understood and requires something more flexible. You could try to
split the data between two databases, but that itself is very problematic.

My take on this, for the research world, is not to go pure EAV, but rather to
normalize by some more generic concepts within the domain, e.g. "measurement",
"evaluation", etc. You might ultimately end up with a sort of EAV model, but
the "V" portion is strongly typed within the database and you're not trying to
cast a string into 20 conventional data types. This still requires rigorous
metadata management on the EA side of the EAV model, but you can tackle that
in many ways.

> SQL isn't the be-all and end-all of data storage. It does relational
> stuff well, and other stuff poorly.

You can build variations on EAV that are closer to a regular relational
schema. These don't necessarily work well or poorly, but often, at least in
the research world, the middle ground is good enough. You are, after all,
talking about people who spit out MySQL databases at the drop of a hat....

--
Peter Hunsberger
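One possible reading of the "strongly typed V" idea, sketched in SQL. This is
not Peter's actual schema; every table and column name below is invented for
illustration:

CREATE TABLE attribute (
    attr_id    serial PRIMARY KEY,
    name       text NOT NULL UNIQUE,
    value_type text NOT NULL CHECK (value_type IN ('numeric', 'text', 'date'))
);

CREATE TABLE measurement (
    measurement_id serial PRIMARY KEY,
    subject_id     integer NOT NULL,   -- points at whatever entity is measured
    attr_id        integer NOT NULL REFERENCES attribute(attr_id),
    value_numeric  numeric,
    value_text     text,
    value_date     date,
    -- exactly one of the typed value columns may be filled in
    CHECK ((value_numeric IS NOT NULL)::int
         + (value_text    IS NOT NULL)::int
         + (value_date    IS NOT NULL)::int = 1)
);

The point is that numeric values stay numeric and dates stay dates, so the
database can still index, compare, and constrain them; only the choice of
attribute is left open.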
On Fri, Oct 23, 2009 at 11:53:26PM +0200, Leif B. Kristensen wrote:
> I'm a researcher type, and I've made an EAV model that suits me well in
> my genealogy research. How can you associate an essentially unknown
> number of sundry "events" to a "person" without an EAV model?

create table person (
	pk serial primary key,
	who_is_it text
);

create table event (
	pk serial primary key,
	fk_person integer references person(pk),
	what_happened text
);

I'm sure you meant something else, no?

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
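A quick usage sketch against the two tables above; the sample values are made
up:

INSERT INTO person (who_is_it) VALUES ('Ola Nordmann');

INSERT INTO event (fk_person, what_happened)
SELECT pk, 'baptized in Skien, 1843'
FROM   person
WHERE  who_is_it = 'Ola Nordmann';

SELECT p.who_is_it, e.what_happened
FROM   person p
JOIN   event  e ON e.fk_person = p.pk;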
On Fri, 23 Oct 2009, Leif B. Kristensen wrote:
> I'm a researcher type, and I've made an EAV model that suits me well in
> my genealogy research. How can you associate an essentially unknown
> number of sundry "events" to a "person" without an EAV model?

CREATE TABLE events(person text, key text, value text);

You seem to think there's some sort of fundamental difference between the two
models. There isn't. The tuples of a standard relational row are simply a
fixed set of key/value pairs, but you can always store EAV data explicitly
like this simple example. Similarly, you can always decompose relational data
into an equivalent EAV set, where the keys are the column names.

The difference between the two data models is that knowing the keys in advance
allows you to do optimizations when looking up the data that result in faster
queries than any really generic system can manage. Trying to optimize queries
against huge EAV sets will just end up reinventing concepts like indexes if
you want them to run well, and then you're back to having to specify which are
the important keys/columns in advance.

Your thinking that this is a business/research distinction isn't right; it
comes down to the size of the data set and how fast/expressive the queries
against it need to be. In the research realm, you can watch this struggle play
out in things like bioinformatics, where there are giant stacks of very
similar data they need to sort through. What often ends up happening is that
scientists start with a simple EAV system, then watch it completely fall down
under real-world load and complicated queries once it gets past prototype.
Then they try to patch it for a while by reinventing SQL query and storage
concepts (badly).

Eventually someone with a traditional database background comes along, notes
that the data you really need is for the most part predetermined, rewrites
that into something more amenable to standard SQL, and then the whole thing
performs better for that subset. But now you've got stuff like ALTER TABLE to
add a column every time you want to track something new, and people don't like
that when the schema is still fluid. So they start putting stuff into EAV
stores for their next project...and the cycle begins anew.

Ultimately you can be really flexible in how you store your data, or you can
get good query performance, but it's quite hard to do both.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
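A small illustration of the decomposition Greg describes, plus the index you
end up reinventing; the names and values are made up for this sketch, not
taken from any post in the thread:

-- A conventional row...
CREATE TABLE person_rel (
    person      text PRIMARY KEY,
    birth_date  date,
    birth_place text
);
INSERT INTO person_rel VALUES ('Ola Nordmann', '1843-05-01', 'Skien');

-- ...and its EAV decomposition, where the keys are the column names:
CREATE TABLE person_eav (person text, key text, value text);
INSERT INTO person_eav VALUES
    ('Ola Nordmann', 'birth_date',  '1843-05-01'),
    ('Ola Nordmann', 'birth_place', 'Skien');

-- To make lookups on a particular key tolerable, you end up declaring,
-- in effect, which "columns" matter in advance:
CREATE INDEX person_eav_key_value_idx ON person_eav (key, value);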
On Fri, Oct 23, 2009 at 5:30 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>
> Your thinking that this is a business/research distinction isn't right;
> it comes down to the size of the data set and how fast/expressive the
> queries against it need to be. In the research realm, you can watch this
> struggle play out in things like bioinformatics, where there are giant
> stacks of very similar data they need to sort through. What often ends
> up happening is that scientists start with a simple EAV system, then
> watch it completely fall down under real-world load and complicated
> queries once it gets past prototype. Then they try to patch it for a
> while by reinventing SQL query and storage concepts (badly).

I've got little idea what the OP is really dealing with, but there is a
fundamental difference between the research world and the more traditional
business world. On the research side the data models are not well known in
advance, there are few established best practices, and even data types are
not always well known.

In a conventional schema we would have about 2,000+ tables with 20,000+
columns. This is a 6-year-old system, with 75% of our metadata having been
added over the last 3 years. I expect growth to slow over the next 3 years,
but 25% growth would not be unreasonable. The current physical schema is some
40 tables with maybe some 300 columns (we also render about 6,000 customized
presentations of the data and manage workflow and validation with this
schema). Performance tuning can be a pain, but we've mostly got that mastered
at this stage in the game: it's a trade-off, but a conventional schema would
have consumed far more resources than we've spent over the last 6 years. The
first two versions of the system used a conventional schema, and new research
protocols were taking 6 months to a year to add. We can now add simple ones
in a day or two and complex ones in a couple of months (the largest having
some 60,000+ pieces of metadata, including validation rules and workflow).

>
> Eventually someone with a traditional database background comes along,
> notes that the data you really need is for the most part predetermined,
> rewrites that into something more amenable to standard SQL, and then the
> whole thing performs better for that subset. But now you've got stuff
> like ALTER TABLE to add a column every time you want to track something
> new, and people don't like that when the schema is still fluid. So they
> start putting stuff into EAV stores for their next project...and the
> cycle begins anew.
>
> Ultimately you can be really flexible in how you store your data, or you
> can get good query performance, but it's quite hard to do both.

Not much argument there! However, it's a three-way trade-off: such a schema
can get good performance for a small number of users and/or a small amount of
data (we're about 300 users and 45 GB of total data).

--
Peter Hunsberger
Hi Peter,

I agree 100% with you. EAV can be a good "middle of the road" approach, as you
suggest.

Peter Hunsberger wrote:
> My take on this, for the research world, is not to go pure EAV, but
> rather to normalize by some more generic concepts within the domain,
> e.g. "measurement", "evaluation", etc. You might ultimately end up with
> a sort of EAV model, but the "V" portion is strongly typed within the
> database and you're not trying to cast a string into 20 conventional
> data types. This still requires rigorous metadata management on the EA
> side of the EAV model, but you can tackle that in many ways.
>
>> SQL isn't the be-all and end-all of data storage. It does relational
>> stuff well, and other stuff poorly.
>
> You can build variations on EAV that are closer to a regular relational
> schema. These don't necessarily work well or poorly, but often, at
> least in the research world, the middle ground is good enough. You are,
> after all, talking about people who spit out MySQL databases at the
> drop of a hat....

I use a very similar approach in managing meta-data: normalize the data that
can be normalized and use EAV for the rest. The aim is to eliminate text
searches as much as possible; where they are still necessary, an additional
WHERE clause on some normalized columns can help a lot with performance.

One of my application meta-data frameworks uses only two tables to store all
meta-data about an application, with basically the following structure:

CREATE TABLE controls (
	ctrl_no SERIAL PRIMARY KEY NOT NULL,
	app_id varchar(30) NOT NULL,
	ctrl_type varchar(30) NOT NULL,
	ctrl_id varchar(30) NOT NULL,
	ctrl_property text, -- This could also be hstore to add some intelligence
	CONSTRAINT controls_unique UNIQUE (app_id, ctrl_type, ctrl_id));

CREATE TABLE members (
	ctrlmember_no SERIAL PRIMARY KEY NOT NULL,
	ctrl_no INTEGER NOT NULL REFERENCES controls(ctrl_no),
	member_no INTEGER NOT NULL REFERENCES controls(ctrl_no),
	member_type varchar(30) NOT NULL,
	member_property text, -- This could also be hstore to add more intelligence
	CONSTRAINT member_unique UNIQUE (ctrl_no, member_no));

ctrl_property is used to store meta-data based on ctrl_type. member_property
stores meta-data based on member_type and/or overrides default ctrl_property
values based on the parent ctrl_no it is associated with. Without this
approach I would have to create more than 100 tables to fully normalize this
meta-data.

Many people have pointed out that I am actually duplicating many of the
catalog features of a relational database, and I agree. However, it allows me
to port this meta-data onto any user-specified RDBMS without having to worry
about database specifics.

With two [recursive] queries on the above two tables I can answer most of the
fundamental questions regarding the meta-data:

1. Show me the members (parent -> child) of a specific feature.
2. Show me the owners (child -> parent) of a specific feature.

Extending the above allows for an easy development path toward a generic
application framework that not only manages the meta-data, but can also create
and run a user interface on the fly with a couple of variations on the above
two queries, using the EAV (%_property) columns to supply all the default
properties and behaviour of the application. Changing the behaviour of an
application becomes primarily a database management issue, with far less
application upgrade management in a distributed environment.
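A sketch of what the first of those two queries might look like against the
schema above, assuming PostgreSQL 8.4+ for WITH RECURSIVE; the starting values
are invented, and this is not Johan's actual query:

WITH RECURSIVE tree AS (
    -- anchor: the feature whose members we want
    SELECT c.ctrl_no, c.ctrl_type, c.ctrl_id, 0 AS depth
    FROM   controls c
    WHERE  c.app_id = 'myapp' AND c.ctrl_id = 'main_form'
    UNION ALL
    -- recursive step: follow members from parent ctrl_no to child member_no
    SELECT child.ctrl_no, child.ctrl_type, child.ctrl_id, t.depth + 1
    FROM   tree t
    JOIN   members  m     ON m.ctrl_no      = t.ctrl_no
    JOIN   controls child ON child.ctrl_no  = m.member_no
)
SELECT * FROM tree ORDER BY depth;

The "owners" query has the same shape with the join direction reversed: match
members on member_no and walk up through ctrl_no.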
To come back to the original message: yes, there is a place for EAV, not only
in research but even in business data. I have an environmental software
scenario, and EAV on the business data gives me an edge over my competitors:
far less time is needed to implement new features compared to doing the normal
functional decomposition and system development life cycle.

In conclusion, I include an extract from an article by Dan Appleton
(Datamation, 1983) that my approach is based on:

“The nature of end-user software development and maintenance will change
radically over the next five years simply because 500 000 programmers will not
be able to rewrite $400 billion of existing software (which is hostage to a
seven- to 10-year life cycle). They'll be further burdened by those new
applications in the known backlog, as well as by those applications in the
hidden backlog. To solve the problem, dp (data processing) shops must improve
productivity in generating end-user software and provide end-users with the
means of generating their own software without creating anarchy... The answer
to this is a data-driven (meta-data) prototyping approach, and companies that
do not move smoothly in this direction will either drown in their own
information pollution or lose millions on systems that are late, cost too
much, and atrophy too quickly.”

Regards,

Johan Nel
Pretoria, South Africa.
On Fri, Oct 23, 2009 at 5:53 PM, Leif B. Kristensen <leif@solumslekt.org> wrote:
> I've followed this list for quite a long time, and I think that I've
> discovered a pattern that I would like to discuss.
>
> It seems like there are two camps concerning EAV models. On the one
> hand, there are researchers who think that EAV is a great way to meet
> their objectives. On the other hand, there are the "business" guys who
> think that EAV is crap.

I think where people get into trouble with EAV is that they tend to simplify
data too much, so that the database can no longer deliver on the requirements
of the application. They then have to support those requirements in other
layers on top of the database, which adds complexity and eats performance. But
used judiciously to solve particular problems, it can be a fine solution.

ISTM the EAV debate is an example of a much broader debate going on in the
information management world: whether to keep data-managing logic in the
database or outside of it. The 'in the database' camp tends to prefer things
like views, rich constraints, stored procedures, and other database features
that support and maintain your data. The 'out of the database' people tend to
try to keep the database simple so they can manage things in an application
framework or an ORM. The EAV position taken to the extreme could be considered
the fundamentalist wing of the latter camp.

People here are obviously more database-biased...postgresql is probably the
best open source solution out there that provides rich database features.
Personally, I long ago came to the position that I don't like code that writes
sql (with certain limited exceptions)...generated sql tends to lead to many
problems IMO.

merlin
On Fri, 2009-10-23 at 23:53 +0200, Leif B. Kristensen wrote:
> I've followed this list for quite a long time, and I think that I've
> discovered a pattern that I would like to discuss.
>
> It seems like there are two camps concerning EAV models. On the one
> hand, there are researchers who think that EAV is a great way to meet
> their objectives. On the other hand, there are the "business" guys who
> think that EAV is crap.
>
> I've seen this pattern often enough and consistently enough that I think
> there may be an underlying difference of objectives concerning the use
> of databases itself that may be responsible for this divergence.
>
> I'm a researcher type, and I've made an EAV model that suits me well in
> my genealogy research. How can you associate an essentially unknown
> number of sundry "events" to a "person" without an EAV model?
>
> It seems to me that data models made for research are quite a different
> animal from data models made for business. In research, we often need to
> register data that may be hard to pin down in exactly the right
> pigeonhole, but that nevertheless needs to be recorded. The most
> sensible way to do this, IMO, is frequently to associate the data with
> some already-known or postulated entity. That's where the EAV model
> comes in really handy.

This problem is common in many different areas, not just research. In most
data models there will be parts that are well known and parts that are
changing rapidly. For example, in banking, a customer "account" has been
defined the same way for almost as long as computers have been around.
However, the customer characteristics that the bank wishes to track for fraud
detection are newly invented each week.

The way you model data should not be all one way or the other. You should
model the well-known portions using relational models and the faster-changing,
dynamically developing aspects using EAV models. You can put both into an
RDBMS, as Karsten shows. I've seen that implemented in various ways, from XML
blobs to text strings, EAV tables, or something like hstore.

--
Simon Riggs www.2ndQuadrant.com
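A small sketch of the hybrid approach Simon describes, using hstore for the
fast-changing part. It assumes the hstore contrib module is installed; the
table, columns, and sample keys are invented for illustration:

CREATE TABLE account (
    account_id  serial PRIMARY KEY,
    customer    text NOT NULL,
    balance     numeric(12,2) NOT NULL DEFAULT 0,  -- stable, well-known columns
    fraud_attrs hstore                             -- fast-changing characteristics
);

INSERT INTO account (customer, balance, fraud_attrs)
VALUES ('Jane Doe', 100.00,
        'login_country => NO, device_fingerprint => abc123'::hstore);

-- A GIN index keeps lookups on the dynamic part reasonably quick:
CREATE INDEX account_fraud_attrs_idx ON account USING gin (fraud_attrs);

SELECT customer
FROM   account
WHERE  fraud_attrs @> 'login_country => NO';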