Thread: Relational Algebra and Aggregate Functions
I'm working on improving my background database theory, to aid in practice. I've found learning relational algebra to be very helpful. One thing which relational algebra doesn't cover is aggregate functions. Can anyone recommend any papers or web pages which provide some good theoretical background for aggregate functions?
On Sun, Jul 26, 2009 at 03:36:26PM -0400, Robert James wrote: > Can anyone > recommend any papers or web pages which provide some good theoretical > background for aggregate functions? My knowledge of relational algebra is somewhat non-existent as well; I tend to just think of them as a "fold" from normal functional programming. The Wikipedia page is a reasonable description: http://en.wikipedia.org/wiki/Fold_(higher-order_function) Not sure how helpful that is though! -- Sam http://samason.me.uk/
On Sun, 2009-07-26 at 15:36 -0400, Robert James wrote: > I'm working on improving my background database theory, to aid in > practice. I've found learning relational algebra to be very helpful. > One thing which relational algebra doesn't cover is aggregate > functions. Can anyone recommend any papers or web pages which provide > some good theoretical background for aggregate functions? When it comes to relational theory, C.J. Date is a good author. "An Introduction To Database Systems" covers pretty much everything. There's a formal definition of a relational algebra (including SUMMARIZE, which is the authors' version of an aggregate operator) defined with only two operators here: http://thethirdmanifesto.com/ (look for "Appendix A") Although Appendix A is not easy to understand without some basic familiarity with the authors' other works. Regards, Jeff Davis
Thanks for all the good replies (both on and off list). It seems the consensus is for me to read Christopher Date. I found two relevant Date books:
1) Introduction to Database Systems
and
2) Database in Depth: Relational Theory for Practitioners
Any recommendations as to which? From the titles, I'd be inclined towards the second, but not if the first is better. One thing I'm not interested in is polemics against SQL and lamentations on how ignorant all practitioners are.
On Mon, Jul 27, 2009 at 2:45 PM, Jeff Davis <pgsql@j-davis.com> wrote:
When it comes to relational theory, C.J. Date is a good author. "AnOn Sun, 2009-07-26 at 15:36 -0400, Robert James wrote:
> I'm working on improving my background database theory, to aid in
> practice. I've found learning relational algebra to be very helpful.
> One thing which relational algebra doesn't cover is aggregate
> functions. Can anyone recommend any papers or web pages which provide
> some good theoretical background for aggregate functions?
Introduction To Database Systems" covers pretty much everything.
There's a formal definition of a relational algebra (including
SUMMARIZE, which is the authors' version of an aggregate operator)
defined with only two operators here:
http://thethirdmanifesto.com/
(look for "Appendix A")
Although Appendix A is not easy to understand without some basic
familiarity with the authors' other works.
Regards,
Jeff Davis
On Mon, 2009-07-27 at 21:05 -0400, Robert James wrote: > 1) Introduction to Database Systems > http://www.amazon.com/Introduction-Database-Systems-Kannan-Swamynathan/dp/B001BVYKY4/ref=sr_1_5?ie=UTF8&s=books&qid=1248742811&sr=1-5 > > and > 2) Database in Depth: Relational Theory for Practitioners > http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7 I recommend #2. It's shorter and easier to read than "An Introduction to Database Systems", and I think it will answer your question about relational theory and aggregates (see "SUMMARIZE"). Appendix A is a part of "Databases, Types, and The Relational Model: The Third Manifesto". That's interesting, but it's not an easy read, it's describing a system in formal detail. Regards, Jeff Davis
Many wrote that the functional programming 'fold' is a good model for relational aggregate functions. I have a few difficulties with this:
1. fold doesn't offer any type of GROUP BY, which is an essential component of aggregation.
2. I don't believe fold can handle things like AVG() or STDDEV(). Can it? Conversely, fold can handle non-commutative and non-associative operators, which I don't believe can be used for aggregation.
3. fold is defined on sequences, not sets. This doesn't seem to be a problem until you think about cases where there a duplicates of the aggregated field. (For instance, there are 10 bags each weighing 5 lbs, and you want SUM(weight) - you need to project weight onto a collection which allows for 10 occurences, or define the aggregate function to work on the whole tuple somehow... I know a man named Krug worked out a formal theory for this...)
On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote: > Many wrote that the functional programming 'fold' is a good model for > relational aggregate functions. I have a few difficulties with this: > 1. fold doesn't offer any type of GROUP BY, which is an essential component > of aggregation. Not sure if I'd agree, a GROUP BY without any aggregate functions looks pretty indistinguishable from just a DISTINCT on the same columns to me. > 2. I don't believe fold can handle things like AVG() or STDDEV(). Can it? An "aggregate" in a database bundles together several things that you get direct access to in most functional languages; have a look at: http://www.postgresql.org/docs/current/static/sql-createaggregate.html In Haskell, to pick one particularly express language, I'd write AVG as: avg = uncurry (/) . foldl (\(t,n) v -> (t+v,n+1)) (0,0) running "avg [1,7,15]" gives me back approximately "7.6". If I move things around so they look like they would in PG, I get: CREATE AGGREGATE avg (float8) ( INITCOND = (0,0) STYPE = (float8,float8), SFUNC = foldl (\(t,n) v -> (t+v,n+1)), FINALFUNC = uncurry (/), ); > Conversely, fold can handle non-commutative and non-associative operators, > which I don't believe can be used for aggregation. Strictly speaking this is correct; but in practice PG doesn't care about this. array_accum is a reasonable example (I think) as the resulting array will be in the same order as it was given to the aggregation function. > 3. fold is defined on sequences, not sets. This doesn't seem to be a > problem until you think about cases where there a duplicates of the > aggregated field. (For instance, there are 10 bags each weighing 5 lbs, and > you want SUM(weight) - you need to project weight onto a collection which > allows for 10 occurences, or define the aggregate function to work on the > whole tuple somehow... I know a man named Krug worked out a formal theory > for this...) I don't see why this is a problem at all; could you give a concrete example? -- Sam http://samason.me.uk/
On Jul 27, 2009, at 21:05 , Robert James wrote: > 2) Database in Depth: Relational Theory for Practitioners > http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7 "Database in Depth" is good, though he's effectively rewritten it as "SQL and Relational Theory: How to Write Accurate SQL Code" http://www.amazon.com/SQL-Relational-Theory-Write-Accurate/dp/0596523068 Michael Glaesemann grzm seespotcode net
Thanks! "SQL and Relational Theory: How to Write Accurate SQL Code" looks like the best pick of the bunch.
On Tue, Jul 28, 2009 at 10:08 AM, Michael Glaesemann <grzm@seespotcode.net> wrote:
"Database in Depth" is good, though he's effectively rewritten it as "SQL and Relational Theory: How to Write Accurate SQL Code"
On Jul 27, 2009, at 21:05 , Robert James wrote:2) Database in Depth: Relational Theory for Practitioners
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7
http://www.amazon.com/SQL-Relational-Theory-Write-Accurate/dp/0596523068
Michael Glaesemann
grzm seespotcode net
On Tue, Jul 28, 2009 at 9:47 AM, Sam Mason <sam@samason.me.uk> wrote:
On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote:Not sure if I'd agree, a GROUP BY without any aggregate functions looks
> Many wrote that the functional programming 'fold' is a good model for
> relational aggregate functions. I have a few difficulties with this:
> 1. fold doesn't offer any type of GROUP BY, which is an essential component
> of aggregation.
pretty indistinguishable from just a DISTINCT on the same columns to me.
DISTINCT will collapse duplicates, which is not what we want when computing COUNT, SUM, or AVG - please see below.
> 3. fold is defined on sequences, not sets. This doesn't seem to be aI don't see why this is a problem at all; could you give a concrete
> problem until you think about cases where there a duplicates of the
> aggregated field. (For instance, there are 10 bags each weighing 5 lbs, and
> you want SUM(weight) - you need to project weight onto a collection which
> allows for 10 occurences, or define the aggregate function to work on the
> whole tuple somehow... I know a man named Krug worked out a formal theory
> for this...)
example?
Relation LUGGAGE = { (name:'ball', weight:3), (name:'bat', weight:3)}
How do we formalize SELECT SUM(weight) FROM LUGGAGE? We could project_weight(LUGGAGE) and then apply SUM, except that would give us {(weight:3), (weight:3)}, which is not a set (it has duplicates). We could define a new operation: project_to_list (allowing duplicates), or we could define SUM(weight) over the LUGGAGE relation as a whole - either way, we need to extend the theory a bit.
On Tue, Jul 28, 2009 at 11:26:01AM -0400, Robert James wrote: > On Tue, Jul 28, 2009 at 9:47 AM, Sam Mason <sam@samason.me.uk> wrote: > > On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote: > > > Many wrote that the functional programming 'fold' is a good model for > > > relational aggregate functions. I have a few difficulties with this: > > > 1. fold doesn't offer any type of GROUP BY, which is an essential > > component > > > of aggregation. > > > > Not sure if I'd agree, a GROUP BY without any aggregate functions looks > > pretty indistinguishable from just a DISTINCT on the same columns to me. > > DISTINCT will collapse duplicates, which is not what we want when computing > COUNT, SUM, or AVG - please see below. That's why I said "without any aggregate functions". E.g: SELECT a,b FROM foo GROUP BY a,b; vs. SELECT DISTINCT a,b FROM foo; I guess we're talking about different things. Folds are an example of something called a Catamorphism (as wikipedia seems to know these days). If you're really interested in this then I liked the "bananas in space" paper: http://citeseer.ist.psu.edu/293490.html wikipedia seems to like: http://citeseer.ist.psu.edu/meijer91functional.html I don't think these are the same, but citeseer seems to be down at the moment and I can't check. > > > 3. fold is defined on sequences, not sets. This doesn't seem to be a > > > problem until you think about cases where there a duplicates of the > > > aggregated field. (For instance, there are 10 bags each weighing 5 lbs, > > and > > > you want SUM(weight) - you need to project weight onto a collection which > > > allows for 10 occurences, or define the aggregate function to work on the > > > whole tuple somehow... I know a man named Krug worked out a formal theory > > > for this...) > > > > I don't see why this is a problem at all; could you give a concrete > > example? > > Relation LUGGAGE = { (name:'ball', weight:3), (name:'bat', weight:3)} > How do we formalize SELECT SUM(weight) FROM LUGGAGE? We could > project_weight(LUGGAGE) and then apply SUM, except that would give us > {(weight:3), (weight:3)}, which is not a set (it has duplicates). We could > define a new operation: project_to_list (allowing duplicates), or we could > define SUM(weight) over the LUGGAGE relation as a whole - either way, we > need to extend the theory a bit. Which "theory" are we talking about here? I assume you're talking about relational algebra, at which point I believe that aggregates have to be handled explicitly and this whole conversation becomes somewhat meaningless. I mentioned the analogy with folds because their semantics are somewhat more accessible to software people, if you're really interested in relational algebra then I'd stay away from folds and deal with aggregates directly. -- Sam http://samason.me.uk/
On Tuesday 28 July 2009 04:38:03 Jeff Davis wrote: > On Mon, 2009-07-27 at 21:05 -0400, Robert James wrote: > > 1) Introduction to Database Systems > > http://www.amazon.com/Introduction-Database-Systems-Kannan-Swamynathan/dp > >/B001BVYKY4/ref=sr_1_5?ie=UTF8&s=books&qid=1248742811&sr=1-5 > > > > and > > 2) Database in Depth: Relational Theory for Practitioners > > http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0 > >596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7 > > I recommend #2. It's shorter and easier to read than "An Introduction to > Database Systems", and I think it will answer your question about > relational theory and aggregates (see "SUMMARIZE"). Is it weird that "Database in Depth" is shorter and easier than "Introduction to Database Systems"? And they're by the same author, too.
On Wed, 2009-07-29 at 10:19 +0300, Peter Eisentraut wrote: > Is it weird that "Database in Depth" is shorter and easier than "Introduction > to Database Systems"? And they're by the same author, too. I agree that it's a little strange. The former is more conceptual and starts off assuming that you are familiar with things like normalization. The latter is more like a textbook: more complete and more formal. Regards, Jeff Davis