Research and EAV models

From: "Leif B. Kristensen"
I've followed this list for quite a long time, and I think that I've
discovered a pattern that I would like to discuss.

It seems like there are two camps considering EAV models. On the one
hand, there are researchers who think that EAV is a great way to meet
their objectives. On the other hand, there are the "business" guys who
think that EAV is crap.

I've seen this pattern often enough and consistently enough that I think
there may be an underlying difference of objectives concerning the use
of databases itself that may be responsible for this divergence.

I'm a researcher type, and I've made an EAV model that suits me well in
my genealogy research. How can you associate an essentially unknown
number of sundry "events" to a "person" without an EAV model?

It seems to me that data models made for research are a quite different
animal than data models made for business. In research, we often need to
register data that may be hard to pin down in exactly the right
pigeonhole, but nevertheless need to be recorded. The most sensible way to
do this, IMO, is frequently to associate the data with some already-
known or postulated entity. That's where the EAV model comes in really
handy.
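
Schematically, what I have in mind is something like the following (a
simplified sketch with illustrative names, not my actual schema):

create table event (
    event_id serial primary key,
    person_id integer not null, -- the already-known or postulated entity
    event_type text not null,   -- the "attribute": birth, census, emigration, ...
    event_value text            -- the "value": whatever the source records
);

A newly discovered kind of event is just a new event_type value, with no
schema change.
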
--
Leif Biberg Kristensen | Registered Linux User #338009
Me And My Database: http://solumslekt.org/blog/

Re: Research and EAV models

From: Tom Lane
"Leif B. Kristensen" <leif@solumslekt.org> writes:
> It seems like there are two camps considering EAV models. On the one
> hand, there are researchers who think that EAV is a great way to meet
> their objectives. On the other hand, there are the "business" guys who
> think that EAV is crap.

Well, no, it's not that EAV is crap.  It's that EAV is strong evidence
that you're using the wrong tool for the job.  If a SQL database is
actually a good fit for your application, then it should be possible to
extract a stronger schema for your data.  If you cannot, then you
probably should be storing your data in something else.  Otherwise
you'll spend untold man-hours beating your head against assorted walls
while you try to persuade the SQL database to do things it was never
meant for, and coping with performance issues because the cases you need
are not optimized nor indeed optimizable.  (I can just see that other
guy trying to search on one of his "polymorphic" columns :-(.)
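
For instance (a hypothetical illustration, with invented names):

-- a "polymorphic" EAV table: every value crammed into text
CREATE TABLE eav (entity int, attribute text, value text);

-- searching by the value's intended type forces a cast on every row; a
-- plain index on "value" is useless here, and the cast can even raise
-- errors on rows belonging to other attributes, since the planner is
-- free to evaluate the WHERE clauses in any order
SELECT entity FROM eav
 WHERE attribute = 'weight_kg' AND value::numeric > 100;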

SQL isn't the be-all and end-all of data storage.  It does relational
stuff well, and other stuff poorly.

            regards, tom lane

Re: Research and EAV models

From: Peter Hunsberger
On Fri, Oct 23, 2009 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Leif B. Kristensen" <leif@solumslekt.org> writes:
>> It seems like there are two camps considering EAV models. On the one
>> hand, there are researchers who think that EAV is a great way to meet
>> their objectives. On the other hand, there are the "business" guys who
>> think that EAV is crap.
>
> Well, no, it's not that EAV is crap.  It's that EAV is strong evidence
> that you're using the wrong tool for the job.  If a SQL database is
> actually a good fit for your application, then it should be possible to
> extract a stronger schema for your data.  If you cannot, then you
> probably should be storing your data in something else.  Otherwise
> you'll spend untold man-hours beating your head against assorted walls
> while you try to persuade the SQL database to do things it was never
> meant for, and coping with performance issues because the cases you need
> are not optimized nor indeed optimizable.  (I can just see that other
> guy trying to search on one of his "polymorphic" columns :-(.)

I can certainly see where Tom is coming from on this and I tend to
agree in general.  However, the reality of the research world is never
that clear cut. In particular, what you find is that you have some
split where some percentage of the domain is well understood and can
be placed in a conventional schema and some portion is not so well
understood and requires something more flexible.  You could try to
split the data between two databases but that itself is very
problematic.

My take on this, for the research world, is to not go pure EAV, but
rather normalize by some more generic concepts within the domain, e.g.
"measurement" or "evaluation". You might ultimately end up with
a sort of EAV model, but the "V" portion is strongly typed within the
database and you're not trying to cast a string into 20 conventional
data types. This still requires rigorous metadata management on the EA
side of the EAV model, but you can tackle that in many ways.
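
Schematically, something like this (a rough sketch, all names invented):

-- one table per generic domain concept, with the "V" strongly typed
create table measurement (
    measurement_id serial primary key,
    subject_id     integer not null,
    attribute_id   integer not null, -- constrained by the EA-side metadata
    value_numeric  numeric not null,
    unit           text not null
);

create table evaluation (
    evaluation_id serial primary key,
    subject_id    integer not null,
    attribute_id  integer not null,
    value_code    text not null      -- a coded result rather than free text
);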

> SQL isn't the be-all and end-all of data storage.  It does relational
> stuff well, and other stuff poorly.

You can build variations on EAV that are closer to a regular
relational schema.  These don't necessarily work well or poorly but
often, at least in the research world, the middle ground is good
enough.  You are, after all, talking about people who spit out MySQL
databases at the drop of a hat...

--
Peter Hunsberger

Re: Research and EAV models

From: Karsten Hilbert
On Fri, Oct 23, 2009 at 11:53:26PM +0200, Leif B. Kristensen wrote:

> I'm a researcher type, and I've made an EAV model that suits me well in
> my genealogy research. How can you associate an essentially unknown
> number of sundry "events" to a "person" without an EAV model?

create table person (
    pk serial primary key,
    who_is_it text
);

create table event (
    pk serial primary key,
    fk_person integer references person(pk),
    what_happened text
);

I'm sure you meant something else, no?

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Research and EAV models

From: Greg Smith
On Fri, 23 Oct 2009, Leif B. Kristensen wrote:

> I'm a researcher type, and I've made an EAV model that suits me well in
> my genealogy research. How can you associate an essentially unknown
> number of sundry "events" to a "person" without an EAV model?

CREATE TABLE events(person text, key text, value text);

You seem to think there's some sort of fundamental difference between the
two models.  There isn't.  The tuples of a standard relational row are
simply a fixed set of key/value pairs, but you can always store EAV data
explicitly like this simple example.  Similarly, you can always decompose
relational data into an equivalent EAV set, where the keys are the column
names.
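
For example (a contrived sketch):

-- a relational row...
CREATE TABLE person_rel (person text, born text, died text);

-- ...decomposed into the equivalent EAV triples, keys = column names
SELECT person, 'born' AS key, born AS value FROM person_rel
UNION ALL
SELECT person, 'died', died FROM person_rel;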

The difference between the two data models is that knowing the keys in
advance allows you to do optimizations when looking up the data that
result in faster queries than any really generic system can do.  Trying to
optimize queries against huge EAV sets will just end up reinventing
concepts like indexes if you want them to run well, and then you're back
to having to specify which are the important keys/columns in advance.
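
In PostgreSQL terms, for each key that matters you end up writing
something like the following, which is just naming your important
"columns" in advance again ('birth_date' being a hypothetical key):

-- a partial index per important key: the "column definition" sneaks back in
CREATE INDEX events_birth_idx ON events (value) WHERE key = 'birth_date';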

Your thinking this is a business/research distinction isn't right; it
comes down to the size of the data set and how fast/expressive the queries
against them are.  In the research realm, you can watch this struggle play
out in things like bioinformatics, where there are giant stacks of very
similar data they need to sort through.  What often ends up happening is you have
scientists start with a simple EAV system, then watch it completely fall
down under real-world load and complicated queries once it gets past
prototype.  Then they try to patch it for a while by reinventing SQL query
and storage concepts (badly).

Eventually someone with traditional database background comes along, notes
that the data you really need is for the most part predetermined, rewrites
that into something more amenable to standard SQL, and then the whole
thing performs better for that subset.  But now you've got stuff like
ALTER TABLE to add a column every time you want to track something new,
and people don't like that when the schema is still fluid.  So they start
putting stuff into EAV stores for their next project...and the cycle
begins anew.

Ultimately you can be really flexible in how you store your data, or you
can get good query performance, but it's quite hard to do both.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Research and EAV models

From: Peter Hunsberger
On Fri, Oct 23, 2009 at 5:30 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>
> Your thinking this is a business/research distinction isn't right; it comes
> down to the size of the data set and how fast/expressive the queries against
> them are.  In the research realm, you can watch this struggle play out in
> things like bioinformatics, where there are giant stacks of very similar data they
> need to sort through.  What often ends up happening is you have scientists
> start with a simple EAV system, then watch it completely fall down under
> real-world load and complicated queries once it gets past prototype.  Then
> they try to patch it for a while by reinventing SQL query and storage
> concepts (badly).

I've got little idea what the OP is really dealing with but there is a
fundamental difference between the research world and the more
traditional business world.  On the research side the data models are
not well known in advance, there are few established best practices
and even data types are not always well known.  In a conventional
schema we would have about 2,000+ tables with 20,000+ columns.  This
is a 6-year-old system with 75% of our metadata having been added over
the last 3 years.  I expect growth to slow over the next 3 years, but
a 25% growth would not be unreasonable.  The current physical schema
is some 40 tables with maybe some 300 columns (we also render about
6000 customized presentations of the data and manage work flow and
validation with this schema).  Performance tuning can be a pain, but
we've mostly got that mastered at this stage in the game: it's a
trade-off, but using a conventional schema would have consumed far more
resources than we've spent over the last 6 years.  The first two
versions of the system used a conventional schema, and new research
protocols were taking 6 months to a year to add.  We can now add
simple ones in a day or two and complex ones in a couple of months (the
largest having some 60,000+ pieces of metadata, including validation
rules and workflow).

>
> Eventually someone with traditional database background comes along, notes
> that the data you really need is for the most part predetermined, rewrites
> that into something more amenable to standard SQL, and then the whole thing
> performs better for that subset.  But now you've got stuff like ALTER TABLE
> to add a column every time you want to track something new, and people don't
> like that when the schema is still fluid.  So they start putting stuff into
> EAV stores for their next project...and the cycle begins anew.
>
> Ultimately you can be really flexible in how you store your data, or you
> can get good query performance, but it's quite hard to do both.

Not much argument there!  However, it's a three-way trade-off: such a
schema can get good performance for a small number of users and/or a
small amount of data (we're about 300 users and 45 GB total data).

--
Peter Hunsberger

Re: Research and EAV models

From: Johan Nel
Hi Peter,

I agree 100% with you.  EAV can be a good "middle of the road" approach,
as you suggest.

Peter Hunsberger wrote:
> My take on this, for the research world, is to not go pure EAV, but
> rather normalize by some more generic concepts within the domain, e.g.
> "measurement" or "evaluation". You might ultimately end up with
> a sort of EAV model, but the "V" portion is strongly typed within the
> database and you're not trying to cast a string into 20 conventional
> data types. This still requires rigorous metadata management on the EA
> side of the EAV model, but you can tackle that in many ways.
>
>> SQL isn't the be-all and end-all of data storage.  It does relational
>> stuff well, and other stuff poorly.
>
> You can build variations on EAV that are closer to a regular
> relational schema.  These don't necessarily work well or poorly but
> often, at least in the research world, the middle ground is good
> enough.  You are, after all, talking about people who spit out MySQL
> databases at the drop of a hat...

I use a very similar approach in managing meta-data: normalize the data
that can be normalized and use EAV for the rest.  The aim is to eliminate
text search as far as possible; where it is still necessary, an additional
WHERE clause on some normalized columns can help a lot with performance.

One of my application meta-data frameworks uses only two tables to store
all meta-data about an application; they have basically the following structure:

CREATE TABLE controls (
    ctrl_no SERIAL PRIMARY KEY NOT NULL,
    app_id varchar(30) NOT NULL,
    ctrl_type varchar(30) NOT NULL,
    ctrl_id varchar(30) NOT NULL,
    ctrl_property text, -- this could also be hstore to add some intelligence
    CONSTRAINT controls_unique UNIQUE (app_id, ctrl_type, ctrl_id)
);

CREATE TABLE members (
    ctrlmember_no SERIAL PRIMARY KEY NOT NULL,
    ctrl_no INTEGER NOT NULL REFERENCES controls(ctrl_no),
    member_no INTEGER NOT NULL REFERENCES controls(ctrl_no),
    member_type varchar(30) NOT NULL,
    member_property text, -- this could also be hstore to add more intelligence
    CONSTRAINT member_unique UNIQUE (ctrl_no, member_no)
);

ctrl_property is used to store meta-data based on ctrl_type.

member_property stores meta-data based on member_type and/or overriding
default ctrl_property values based on the parent ctrl_no it is associated
with.
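
If the %_property columns are declared as hstore instead of text, that
overriding becomes a one-line merge, because hstore's || operator lets
the right-hand side win on duplicate keys (a sketch, assuming the hstore
extension is installed and hstore-typed property columns):

-- effective properties of a member control within its parent:
-- the member's own defaults, overridden by member_property
SELECT m.ctrl_no AS parent_no, c.ctrl_id,
       coalesce(c.ctrl_property, ''::hstore)
    || coalesce(m.member_property, ''::hstore) AS effective_property
  FROM members m
  JOIN controls c ON c.ctrl_no = m.member_no;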

Without this approach I would have to create more than 100 tables if I
want to fully normalize this meta-data.

Many people have indicated that I am actually duplicating many of the
catalog features of a relational database, and I agree.  However, it
allows me to port this meta-data onto any user specified RDBMS without
having to worry about database specifics.

With two [recursive] queries on the above two tables I can answer most of
the fundamental questions regarding the meta-data:

1. Show me the members (parent -> child) of a specific feature.
2. Show me the owners (child -> parent) of a specific feature.
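
For example, question 1 as a recursive CTE (a sketch; 42 stands in for
the ctrl_no of the feature of interest):

WITH RECURSIVE tree AS (
    -- direct members of the starting feature
    SELECT m.ctrl_no, m.member_no, 1 AS depth
      FROM members m
     WHERE m.ctrl_no = 42
    UNION ALL
    -- members of members, recursively
    SELECT m.ctrl_no, m.member_no, t.depth + 1
      FROM members m
      JOIN tree t ON m.ctrl_no = t.member_no
)
SELECT c.ctrl_type, c.ctrl_id, t.depth
  FROM tree t
  JOIN controls c ON c.ctrl_no = t.member_no
 ORDER BY t.depth;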

Extending the above allows for an easy development plan for writing a
generic application framework that not only manages the meta-data but
also runs/creates a user interface on the fly, using a couple of
variations on the above two queries and the EAV (%_property) columns to
supply all the default properties and behaviour of the application.

Changing the behaviour of an application becomes primarily a database
management issue, with a lot less application upgrade management needed
in a distributed environment.

To come back to the original message: yes, there is a place for EAV, not
only in research but even in business data.  I have an environmental
software scenario where EAV on the business data gives me an edge over
my competitors: a lot less time is needed to implement new features
compared to doing the normal functional decomposition and system
development life cycle.

In conclusion, I include an extract from an article by Dan Appleton
(Datamation, 1983) that my approach is based on:

“The nature of end-user software development and maintenance will change
radically over the next five years simply because 500 000 programmers will
not be able to rewrite $400 billion of existing software (which is hostage
to a seven- to 10-year life cycle).  They'll be further burdened by those
new applications in the known backlog, as well as by those applications in
the hidden backlog.  To solve the problem, dp (data processing) shops must
improve productivity in generating end-user software and provide end-users
with the means of generating their own software without creating
anarchy...  The answer to this is a data-driven (meta-data) prototyping
approach, and companies that do not move smoothly in this direction will
either drown in their own information pollution or lose millions on
systems that are late, cost too much, and atrophy too quickly.”

Regards,

Johan Nel
Pretoria, South Africa.

Re: Research and EAV models

From: Merlin Moncure
On Fri, Oct 23, 2009 at 5:53 PM, Leif B. Kristensen <leif@solumslekt.org> wrote:
> I've followed this list for quite a long time, and I think that I've
> discovered a pattern that I would like to discuss.
>
> It seems like there are two camps considering EAV models. On the one
> hand, there are researchers who think that EAV is a great way to meet
> their objectives. On the other hand, there are the "business" guys who
> think that EAV is crap.

I think where people get into trouble with EAV is they tend to
simplify data too much so that the database can no longer deliver on
the requirements of the application.  They then have to support those
requirements in other layers on top of the database which adds
complexity and eats performance.  But used judiciously to solve
particular problems, it can be a fine solution.

ISTM the EAV debate is an example of a much broader debate going on in
the information management world, which is whether to keep
data-managing logic in the database or outside of the database.  The
'in the database' camp tends to prefer things like views, rich
constraints, stored procedures and other database features that
support and maintain your data.  'out of the database' people tend to
try and keep the database simple so they can manage things in an
application framework or an ORM.  The EAV position taken to the
extreme could be considered the fundamentalists of the latter camp.
People here are obviously more database-biased... PostgreSQL is
probably the best open source solution out there that provides rich
database features.

Personally, I long ago came to the position that I don't like code that
writes SQL (with certain limited exceptions)... generated SQL tends to
lead to many problems, IMO.

merlin

Re: Research and EAV models

From: Simon Riggs
On Fri, 2009-10-23 at 23:53 +0200, Leif B. Kristensen wrote:
> I've followed this list for quite a long time, and I think that I've
> discovered a pattern that I would like to discuss.
>
> It seems like there are two camps considering EAV models. On the one
> hand, there are researchers who think that EAV is a great way to meet
> their objectives. On the other hand, there are the "business" guys who
> think that EAV is crap.
>
> I've seen this pattern often enough and consistently enough that I think
> there may be an underlying difference of objectives concerning the use
> of databases itself that may be responsible for this divergence.
>
> I'm a researcher type, and I've made an EAV model that suits me well in
> my genealogy research. How can you associate an essentially unknown
> number of sundry "events" to a "person" without an EAV model?
>
> It seems to me that data models made for research are a quite different
> animal than data models made for business. In research, we often need to
> register data that may be hard to pin down in exactly the right
> pigeonhole, but nevertheless need to be recorded. The most sensible way to
> do this, IMO, is frequently to associate the data with some already-
> known or postulated entity. That's where the EAV model comes in really
> handy.

This problem is common in many different areas, not just research.

In most data models there will be parts that are well known and parts
that are changing rapidly. For example, in banking, a customer "account"
has been defined the same way for almost as long as computers have been
around. However, customer characteristics that the bank wishes to track
for fraud detection are newly invented each week.

The way you model data need not be only one way or the other. You
should model your well-known portions using relational models and the
faster changing/dynamically developing aspects using EAV models. You can
put both into an RDBMS, as Karsten shows. I've seen that implemented in
various ways, from XML blobs and text strings to EAV tables or
something like hstore.
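
As a sketch of the banking example (invented names; the EAV side could
just as well be hstore or XML):

-- the stable, well-understood part: plain relational
CREATE TABLE account (
    account_id serial PRIMARY KEY,
    customer   text NOT NULL,
    balance    numeric(12,2) NOT NULL
);

-- the fast-changing part: this week's fraud characteristics, as EAV
CREATE TABLE account_trait (
    account_id  integer NOT NULL REFERENCES account(account_id),
    trait       text NOT NULL,
    trait_value text NOT NULL,
    PRIMARY KEY (account_id, trait)
);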

--
 Simon Riggs           www.2ndQuadrant.com