Thread: writing a MIN(RECORD) aggregate

writing a MIN(RECORD) aggregate

From
Sam Mason
Date:
Hi,

I'm trying to write a version of the MIN aggregate for values of RECORD
type.  I'm somewhat stuck on getting type information about the argument
out, I can determine how many attributes it's got but I can't seem to do
any better than that.  Does anyone have any good pointers into the code
for places to help me understand what's happening?

The reason for doing this is mainly because I think it'd be nicer to be
doing things like:
 SELECT i, (MIN((j,k))).k FROM tbl GROUP BY i;

instead of:
 SELECT DISTINCT ON (i) i, k FROM tbl ORDER BY i,j,k;

Which as far as I can tell should produce identical results, except the
first has cleaner semantics.  It also allows you to combine MIN and MAX
in the same query, giving the value of k for the smallest and largest j
in this example--requiring two queries if it was done using the DISTINCT
ON method.

 Sam


Re: writing a MIN(RECORD) aggregate

From
Decibel!
Date:
On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:
> I'm trying to write a version of the MIN aggregate for values of  
> RECORD
> type.  I'm somewhat stuck on getting type information about the  
> argument
> out, I can determine how many attributes it's got but I can't seem  
> to do
> any better than that.  Does anyone have any good pointers into the  
> code
> for places to help me understand what's happening?
>
> The reason for doing this is mainly because I think it'd be nicer  
> to be
> doing things like:
>
>   SELECT i, (MIN((j,k))).k
>   FROM tbl
>   GROUP BY i;

How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?

> instead of:
>
>   SELECT DISTINCT ON (i) i, k
>   FROM tbl
>   ORDER BY i,j,k;
>
> Which as far as I can tell should produce identical results, except  
> the
> first has cleaner semantics.  It also allows you to combine MIN and  
> MAX
> in the same query, giving the value of k for the smallest and  
> largest j
> in this example--requiring two queries if it was done using the  
> DISTINCT
> ON method.

I don't see how min(record) even allows for that.

I'm not saying that min/avg/max/etc(RECORD) wouldn't be useful; I'm  
just failing to see the use in these examples.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: writing a MIN(RECORD) aggregate

From
Sam Mason
Date:
On Mon, Mar 24, 2008 at 05:27:04PM -0500, Decibel! wrote:
> On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:
> >  SELECT i, (MIN((j,k))).k
> >  FROM tbl
> >  GROUP BY i;
> 
> How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?

Because I want the value of k associated with the minimum value of j.
For example, if I have data looking like:
 i  j  k 1  3  7 1  4  8 2  5  10 2  6  9

I want to get this out:
 i  k 1  7 2  10

I would get this if I used the DISTINCT ON or if MIN was valid over
records.  With your code I'd get this:
 i  k 1  7 2  9

> I'm not saying that min/avg/max/etc(RECORD) wouldn't be useful; 

AVG wouldn't work, because it relies on treating it's parameter as a
numeric field over which summation and division are valid operations.
MIN/MAX just relies on there being a (total) ordering operator available
and with PG there pretty much always is.

> I'm just failing to see the use in these examples.

Did the example above make things any clearer?


I've also just realised that PG's current handling of NULLs inside
records is also going to cause problems.  The main problem seems to be
that the IS NULL operator isn't consistent with comparison operators.
For example:
 (1,NULL) IS NULL      --> FALSE (1,NULL) = (1,NULL)   --> NULL

I'm not sure if it's just my intuition is off, or whether there is an
invariant (e.g. a comparison returns NULL if-and-only-if either side
evaluate TRUE to IS NULL) that's being broken.


Thanks, Sam


Re: writing a MIN(RECORD) aggregate

From
Gregory Stark
Date:
"Sam Mason" <sam@samason.me.uk> writes:

> On Mon, Mar 24, 2008 at 05:27:04PM -0500, Decibel! wrote:
>> On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:
>> >  SELECT i, (MIN((j,k))).k
>> >  FROM tbl
>> >  GROUP BY i;
>> 
>> How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?
>
> Because I want the value of k associated with the minimum value of j.
> For example, if I have data looking like:

I have nothing against having min(record) and it does seem like it would let
you do this at least for reasonably simple cases.

But I'm more eager to see full OLAP window functions which would let you do
this and a whole lot else as well.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


Re: writing a MIN(RECORD) aggregate

From
Sam Mason
Date:
On Mar 25, 2008, at 4:43PM, Gregory Stark wrote:
> On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:
> >  SELECT i, (MIN((j,k))).k
> >  FROM tbl
> >  GROUP BY i;
> 
> I have nothing against having min(record) and it does seem like it would let
> you do this at least for reasonably simple cases.

The main reason for this was that I've needed min(record) a few times
before and thought it would be reasonably easy to code.

> But I'm more eager to see full OLAP window functions which would let you do
> this and a whole lot else as well.

I've never used window functions before so don't think about them when
solving my problems.  If they were available I'd probably start using
them.  From the small bit of reading that I've done around them, they
seem very imperative in nature.  I'm not sure if this is a good or a bad
thing.

In a database that did support them, how would I write my query with
them?  My first naive attempt was this:
 SELECT i, MIN(k) OVER (PARTITION BY j) FROM tbl GROUP BY i;

This is obviously wrong, but I don't see how to get to where I need to
be.

 Sam


Re: writing a MIN(RECORD) aggregate

From
Gregory Stark
Date:
"Sam Mason" <sam@samason.me.uk> writes:

>   SELECT i, MIN(k) OVER (PARTITION BY j)
>   FROM tbl
>   GROUP BY i;
>
> This is obviously wrong, but I don't see how to get to where I need to
> be.

I'm not entirely sure myself. I think it might involve RANK OVER j though.

I suspect it will look more like the DISTINCT ON solution than the min(record)
solution.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


Re: writing a MIN(RECORD) aggregate

From
Sam Mason
Date:
On Tue, Mar 25, 2008 at 06:58:06PM +0000, Gregory Stark wrote:
> "Sam Mason" <sam@samason.me.uk> writes:
> >   SELECT i, MIN(k) OVER (PARTITION BY j)
> >   FROM tbl
> >   GROUP BY i;
> >
> > This is obviously wrong, but I don't see how to get to where I need to
> > be.
> 
> I'm not entirely sure myself. I think it might involve RANK OVER j though.

The main thing I wanted to avoid was an explosion of sub-queries that
you get with DISTINCT ON style queries.  For example, with record style
syntax, I can do:
 SELECT i, (MIN((j,k))).k AS ka, (MIN((mycode(j),k))).k AS kb FROM tbl GROUP BY i;

whereas using DISTINCT ON I'd have to do:
 SELECT a.i, a.k AS ka, b.k as kb FROM (   SELECT DISTINCT ON (i) i, k   FROM tbl   ORDER BY i, j) a, (   SELECT
DISTINCTON (i) i, k   FROM tbl   ORDER BY i, mycode(j)) b WHERE a.i = b.i;
 

Which gets unmanageable quickly.  Any idea how window functions would
cope with this sort of complexity?  Or is this what you meant by:

> I suspect it will look more like the DISTINCT ON solution than the min(record)
> solution.


Thanks, Sam


Re: writing a MIN(RECORD) aggregate

From
Gregory Stark
Date:
"Sam Mason" <sam@samason.me.uk> writes:

> On Tue, Mar 25, 2008 at 06:58:06PM +0000, Gregory Stark wrote:
> The main thing I wanted to avoid was an explosion of sub-queries that
> you get with DISTINCT ON style queries.  For example, with record style
> syntax, I can do:
>
>   SELECT i, (MIN((j,k))).k AS ka, (MIN((mycode(j),k))).k AS kb
>   FROM tbl
>   GROUP BY i;
>
> whereas using DISTINCT ON I'd have to do:
...
> Which gets unmanageable quickly.  Any idea how window functions would
> cope with this sort of complexity?  Or is this what you meant by:
>
>> I suspect it will look more like the DISTINCT ON solution than the min(record)
>> solution.

The flip side is that if you want to get several fields based on min(j) the
min(record) approach requires you to write that expression several times (and
the database to calculate it several times).

I think the window functions might (assuming an ideal implementation) get the
best of both worlds. You would be able to do something with multiple
partitions so you could ask of a few columns where rank over j = 1 and a few
more columns where rank over k = 1.

But, uh, I'm not sure. I'll have to sit down with the spec and see if that's
true. Furthermore it may be wishful thinking to hope that the implementation
will do anything special with the special case where you're only selecting
records where rank = 1.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: writing a MIN(RECORD) aggregate

From
Sam Mason
Date:
On Tue, Mar 25, 2008 at 07:54:17PM +0000, Gregory Stark wrote:
> "Sam Mason" <sam@samason.me.uk> writes:
> > SELECT i, (MIN((j,k))).k AS ka, (MIN((mycode(j),k))).k AS kb
> > FROM tbl
> > GROUP BY i;
> 
> The flip side is that if you want to get several fields based on min(j) the
> min(record) approach requires you to write that expression several times (and
> the database to calculate it several times).

No.  My demos have only used one column because that's the smallest
useful demo.
 SELECT i, r.k, r.l FROM (   SELECT i, MIN((j,k,l)) AS r   FROM tbl   GROUP BY i) x;

The reason for the sub-select is only because SQL doesn't provide any
other way to name expressions.  Hum, or at least this should work...
There doesn't seem to be any nice way of getting fields out of a record!

If I really want to do this, it's going to turn into quite an overhaul
of record handling in PG.  It would also remove the nice syntactic trick
that a.b identifies the field "b" from table "a", and s.a.b means that
the above is in schema "s".

> I think the window functions might (assuming an ideal implementation) get the
> best of both worlds. You would be able to do something with multiple
> partitions so you could ask of a few columns where rank over j = 1 and a few
> more columns where rank over k = 1.
>
> But, uh, I'm not sure. I'll have to sit down with the spec and see if that's
> true. Furthermore it may be wishful thinking to hope that the implementation
> will do anything special with the special case where you're only selecting
> records where rank = 1.

I don't really understand what you're saying above.  Optimisation is
another can of worms that shouldn't be opened until we know how this
sort of thing is going to be used.

 Sam


Re: writing a MIN(RECORD) aggregate

From
Gregory Stark
Date:
"Sam Mason" <sam@samason.me.uk> writes:

> The reason for the sub-select is only because SQL doesn't provide any
> other way to name expressions.  Hum, or at least this should work...
> There doesn't seem to be any nice way of getting fields out of a record!
>
> If I really want to do this, it's going to turn into quite an overhaul
> of record handling in PG.  It would also remove the nice syntactic trick
> that a.b identifies the field "b" from table "a", and s.a.b means that
> the above is in schema "s".

Yeah, to disambiguate it you have to use (r).i


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: writing a MIN(RECORD) aggregate

From
Sam Mason
Date:
On Wed, Mar 26, 2008 at 01:03:18AM +0000, Gregory Stark wrote:
> "Sam Mason" <sam@samason.me.uk> writes:
> > The reason for the sub-select is only because SQL doesn't provide any
> > other way to name expressions.  Hum, or at least this should work...
> > There doesn't seem to be any nice way of getting fields out of a record!
> 
> Yeah, to disambiguate it you have to use (r).i

OK, that sort of makes sense.  The next problem is that PG doesn't
remember the column names:
 SELECT (ROW(i)).i FROM (SELECT 1) x(i);

Results in PG saying it doesn't know where "i" is inside the row, which
seems a little strange.  I think it's this detail that accounts for
my problems in trying to get this all working before.  This seems to
suggest that there are two record-like data structures in PG, one for
the records returned as part of the SELECT list and another that I'm
using here.

As a side case, would it be nice if:
 SELECT (SELECT 1 AS a, 2 AS b);

resulted in a record with two members?

 Sam


Re: writing a MIN(RECORD) aggregate

From
Decibel!
Date:
On Mar 25, 2008, at 11:33 AM, Sam Mason wrote:
> On Mon, Mar 24, 2008 at 05:27:04PM -0500, Decibel! wrote:
>> On Mar 20, 2008, at 2:23 PM, Sam Mason wrote:
>>>  SELECT i, (MIN((j,k))).k
>>>  FROM tbl
>>>  GROUP BY i;
>>
>> How is that any better than SELECT i, min(k) FROM tbl GROUP BY i ?
>
> Because I want the value of k associated with the minimum value of j.


Ahh, makes sense. FWIW...

SELECT i, (SELECT k FROM ... WHERE i = i.i ORDER BY j LIMIT 1)    FROM (SELECT DISTINCT i FROM ...) i
;

If you needed more than just k, I think there's a way to do that in  
the FROM clause, too.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828