Thread: Relational Algebra and Aggregate Functions

Relational Algebra and Aggregate Functions

From
Robert James
Date:
I'm working on improving my background database theory, to aid in practice.  I've found learning relational algebra to be very helpful.  One thing which relational algebra doesn't cover is aggregate functions.  Can anyone recommend any papers or web pages which provide some good theoretical background for aggregate functions?

Re: Relational Algebra and Aggregate Functions

From
Sam Mason
Date:
On Sun, Jul 26, 2009 at 03:36:26PM -0400, Robert James wrote:
> Can anyone
> recommend any papers or web pages which provide some good theoretical
> background for aggregate functions?

My knowledge of relational algebra is somewhat non-existent as well;
I tend to just think of them as a "fold" from normal functional
programming.  The Wikipedia page is a reasonable description:

  http://en.wikipedia.org/wiki/Fold_(higher-order_function)

Not sure how helpful that is though!

--
  Sam  http://samason.me.uk/

Re: Relational Algebra and Aggregate Functions

From
Jeff Davis
Date:
On Sun, 2009-07-26 at 15:36 -0400, Robert James wrote:
> I'm working on improving my background database theory, to aid in
> practice.  I've found learning relational algebra to be very helpful.
>  One thing which relational algebra doesn't cover is aggregate
> functions.  Can anyone recommend any papers or web pages which provide
> some good theoretical background for aggregate functions?

When it comes to relational theory, C.J. Date is a good author. "An
Introduction To Database Systems" covers pretty much everything.

There's a formal definition of a relational algebra (including
SUMMARIZE, which is the authors' version of an aggregate operator)
defined with only two operators here:
http://thethirdmanifesto.com/
(look for "Appendix A")

Although Appendix A is not easy to understand without some basic
familiarity with the authors' other works.

Regards,
    Jeff Davis


Re: Relational Algebra and Aggregate Functions

From
Robert James
Date:
Thanks for all the good replies (both on and off list).  It seems the consensus is for me to read Christopher Date.  I found two relevant Date books:
1) Introduction to Database Systems
and
2) Database in Depth: Relational Theory for Practitioners
Any recommendations as to which? From the titles, I'd be inclined towards the second, but not if the first is better.  One thing I'm not interested in is polemics against SQL and lamentations on how ignorant all practitioners are.

On Mon, Jul 27, 2009 at 2:45 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Sun, 2009-07-26 at 15:36 -0400, Robert James wrote:
> I'm working on improving my background database theory, to aid in
> practice.  I've found learning relational algebra to be very helpful.
>  One thing which relational algebra doesn't cover is aggregate
> functions.  Can anyone recommend any papers or web pages which provide
> some good theoretical background for aggregate functions?

When it comes to relational theory, C.J. Date is a good author. "An
Introduction To Database Systems" covers pretty much everything.

There's a formal definition of a relational algebra (including
SUMMARIZE, which is the authors' version of an aggregate operator)
defined with only two operators here:
http://thethirdmanifesto.com/
(look for "Appendix A")

Although Appendix A is not easy to understand without some basic
familiarity with the authors' other works.

Regards,
       Jeff Davis


Re: Relational Algebra and Aggregate Functions

From
Jeff Davis
Date:
On Mon, 2009-07-27 at 21:05 -0400, Robert James wrote:
> 1) Introduction to Database Systems
>
http://www.amazon.com/Introduction-Database-Systems-Kannan-Swamynathan/dp/B001BVYKY4/ref=sr_1_5?ie=UTF8&s=books&qid=1248742811&sr=1-5
>
> and
> 2) Database in Depth: Relational Theory for Practitioners
>
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7

I recommend #2. It's shorter and easier to read than "An Introduction to
Database Systems", and I think it will answer your question about
relational theory and aggregates (see "SUMMARIZE").

Appendix A is a part of "Databases, Types, and The Relational Model: The
Third Manifesto". That's interesting, but it's not an easy read, it's
describing a system in formal detail.

Regards,
    Jeff Davis


Re: Relational Algebra and Aggregate Functions

From
Robert James
Date:
Many wrote that the functional programming 'fold' is a good model for relational aggregate functions.  I have a few difficulties with this:
1. fold doesn't offer any type of GROUP BY, which is an essential component of aggregation.
2. I don't believe fold can handle things like AVG() or STDDEV().  Can it?  Conversely, fold can handle non-commutative and non-associative operators, which I don't believe can be used for aggregation.
3. fold is defined on sequences, not sets.  This doesn't seem to be a problem until you think about cases where there a duplicates of the aggregated field.  (For instance, there are 10 bags each weighing 5 lbs, and you want SUM(weight) - you need to project weight onto a collection which allows for 10 occurences, or define the aggregate function to work on the whole tuple somehow... I know a man named Krug worked out a formal theory for this...)

Re: Relational Algebra and Aggregate Functions

From
Sam Mason
Date:
On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote:
> Many wrote that the functional programming 'fold' is a good model for
> relational aggregate functions.  I have a few difficulties with this:
> 1. fold doesn't offer any type of GROUP BY, which is an essential component
> of aggregation.

Not sure if I'd agree, a GROUP BY without any aggregate functions looks
pretty indistinguishable from just a DISTINCT on the same columns to me.

> 2. I don't believe fold can handle things like AVG() or STDDEV().  Can it?

An "aggregate" in a database bundles together several things that you
get direct access to in most functional languages; have a look at:

  http://www.postgresql.org/docs/current/static/sql-createaggregate.html

In Haskell, to pick one particularly express language, I'd write AVG as:

  avg = uncurry (/) . foldl (\(t,n) v -> (t+v,n+1)) (0,0)

running "avg [1,7,15]" gives me back approximately "7.6".  If I move
things around so they look like they would in PG, I get:

  CREATE AGGREGATE avg (float8) (
    INITCOND = (0,0)
    STYPE = (float8,float8),
    SFUNC = foldl (\(t,n) v -> (t+v,n+1)),
    FINALFUNC = uncurry (/),
  );

>  Conversely, fold can handle non-commutative and non-associative operators,
> which I don't believe can be used for aggregation.

Strictly speaking this is correct; but in practice PG doesn't care about
this. array_accum is a reasonable example (I think) as the resulting
array will be in the same order as it was given to the aggregation
function.

> 3. fold is defined on sequences, not sets.  This doesn't seem to be a
> problem until you think about cases where there a duplicates of the
> aggregated field.  (For instance, there are 10 bags each weighing 5 lbs, and
> you want SUM(weight) - you need to project weight onto a collection which
> allows for 10 occurences, or define the aggregate function to work on the
> whole tuple somehow... I know a man named Krug worked out a formal theory
> for this...)

I don't see why this is a problem at all; could you give a concrete
example?

--
  Sam  http://samason.me.uk/

Re: Relational Algebra and Aggregate Functions

From
Michael Glaesemann
Date:
On Jul 27, 2009, at 21:05 , Robert James wrote:

> 2) Database in Depth: Relational Theory for Practitioners
>
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7

"Database in Depth" is good, though he's effectively rewritten it as
"SQL and Relational Theory: How to Write Accurate SQL Code"

http://www.amazon.com/SQL-Relational-Theory-Write-Accurate/dp/0596523068

Michael Glaesemann
grzm seespotcode net




Re: Relational Algebra and Aggregate Functions

From
Robert James
Date:
Thanks! "SQL and Relational Theory: How to Write Accurate SQL Code" looks like the best pick of the bunch.

On Tue, Jul 28, 2009 at 10:08 AM, Michael Glaesemann <grzm@seespotcode.net> wrote:

On Jul 27, 2009, at 21:05 , Robert James wrote:

2) Database in Depth: Relational Theory for Practitioners
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7

"Database in Depth" is good, though he's effectively rewritten it as "SQL and Relational Theory: How to Write Accurate SQL Code"

http://www.amazon.com/SQL-Relational-Theory-Write-Accurate/dp/0596523068

Michael Glaesemann
grzm seespotcode net




Re: Relational Algebra and Aggregate Functions

From
Robert James
Date:

On Tue, Jul 28, 2009 at 9:47 AM, Sam Mason <sam@samason.me.uk> wrote:
On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote:
> Many wrote that the functional programming 'fold' is a good model for
> relational aggregate functions.  I have a few difficulties with this:
> 1. fold doesn't offer any type of GROUP BY, which is an essential component
> of aggregation.

Not sure if I'd agree, a GROUP BY without any aggregate functions looks
pretty indistinguishable from just a DISTINCT on the same columns to me.

DISTINCT will collapse duplicates, which is not what we want when computing COUNT, SUM, or AVG - please see below.
 
> 3. fold is defined on sequences, not sets.  This doesn't seem to be a
> problem until you think about cases where there a duplicates of the
> aggregated field.  (For instance, there are 10 bags each weighing 5 lbs, and
> you want SUM(weight) - you need to project weight onto a collection which
> allows for 10 occurences, or define the aggregate function to work on the
> whole tuple somehow... I know a man named Krug worked out a formal theory
> for this...)

I don't see why this is a problem at all; could you give a concrete
example?
Relation LUGGAGE = { (name:'ball', weight:3), (name:'bat', weight:3)}
How do we formalize SELECT SUM(weight) FROM LUGGAGE? We could project_weight(LUGGAGE) and then apply SUM, except that would give us {(weight:3), (weight:3)}, which is not a set (it has duplicates).  We could define a new operation: project_to_list (allowing duplicates), or we could define SUM(weight) over the LUGGAGE relation as a whole - either way, we need to extend the theory a bit.

Re: Relational Algebra and Aggregate Functions

From
Sam Mason
Date:
On Tue, Jul 28, 2009 at 11:26:01AM -0400, Robert James wrote:
> On Tue, Jul 28, 2009 at 9:47 AM, Sam Mason <sam@samason.me.uk> wrote:
> > On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote:
> > > Many wrote that the functional programming 'fold' is a good model for
> > > relational aggregate functions.  I have a few difficulties with this:
> > > 1. fold doesn't offer any type of GROUP BY, which is an essential
> > component
> > > of aggregation.
> >
> > Not sure if I'd agree, a GROUP BY without any aggregate functions looks
> > pretty indistinguishable from just a DISTINCT on the same columns to me.
>
> DISTINCT will collapse duplicates, which is not what we want when computing
> COUNT, SUM, or AVG - please see below.

That's why I said "without any aggregate functions".  E.g:

  SELECT a,b FROM foo GROUP BY a,b;

vs.

   SELECT DISTINCT a,b FROM foo;

I guess we're talking about different things.  Folds are an example of
something called a Catamorphism (as wikipedia seems to know these days).
If you're really interested in this then I liked the "bananas in space"
paper:

  http://citeseer.ist.psu.edu/293490.html

wikipedia seems to like:

  http://citeseer.ist.psu.edu/meijer91functional.html

I don't think these are the same, but citeseer seems to be down at the
moment and I can't check.

> > > 3. fold is defined on sequences, not sets.  This doesn't seem to be a
> > > problem until you think about cases where there a duplicates of the
> > > aggregated field.  (For instance, there are 10 bags each weighing 5 lbs,
> > and
> > > you want SUM(weight) - you need to project weight onto a collection which
> > > allows for 10 occurences, or define the aggregate function to work on the
> > > whole tuple somehow... I know a man named Krug worked out a formal theory
> > > for this...)
> >
> > I don't see why this is a problem at all; could you give a concrete
> > example?
>
> Relation LUGGAGE = { (name:'ball', weight:3), (name:'bat', weight:3)}
> How do we formalize SELECT SUM(weight) FROM LUGGAGE? We could
> project_weight(LUGGAGE) and then apply SUM, except that would give us
> {(weight:3), (weight:3)}, which is not a set (it has duplicates).  We could
> define a new operation: project_to_list (allowing duplicates), or we could
> define SUM(weight) over the LUGGAGE relation as a whole - either way, we
> need to extend the theory a bit.

Which "theory" are we talking about here?  I assume you're talking
about relational algebra, at which point I believe that aggregates have
to be handled explicitly and this whole conversation becomes somewhat
meaningless.

I mentioned the analogy with folds because their semantics are somewhat
more accessible to software people, if you're really interested
in relational algebra then I'd stay away from folds and deal with
aggregates directly.

--
  Sam  http://samason.me.uk/

Re: Relational Algebra and Aggregate Functions

From
Peter Eisentraut
Date:
On Tuesday 28 July 2009 04:38:03 Jeff Davis wrote:
> On Mon, 2009-07-27 at 21:05 -0400, Robert James wrote:
> > 1) Introduction to Database Systems
> > http://www.amazon.com/Introduction-Database-Systems-Kannan-Swamynathan/dp
> >/B001BVYKY4/ref=sr_1_5?ie=UTF8&s=books&qid=1248742811&sr=1-5
> >
> > and
> > 2) Database in Depth: Relational Theory for Practitioners
> > http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0
> >596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7
>
> I recommend #2. It's shorter and easier to read than "An Introduction to
> Database Systems", and I think it will answer your question about
> relational theory and aggregates (see "SUMMARIZE").

Is it weird that "Database in Depth" is shorter and easier than "Introduction
to Database Systems"?  And they're by the same author, too.

Re: Relational Algebra and Aggregate Functions

From
Jeff Davis
Date:
On Wed, 2009-07-29 at 10:19 +0300, Peter Eisentraut wrote:
> Is it weird that "Database in Depth" is shorter and easier than "Introduction
> to Database Systems"?  And they're by the same author, too.

I agree that it's a little strange. The former is more conceptual and
starts off assuming that you are familiar with things like
normalization. The latter is more like a textbook: more complete and
more formal.

Regards,
    Jeff Davis