Re: aggregation question - Mailing list pgsql-sql

From Samuel Gendler
Subject Re: aggregation question
Date
Msg-id AANLkTik3-kgfGdQx6iKTDW7te5fEFeVZtONQZLqJM-Ut@mail.gmail.com
Whole thread Raw
In response to aggregation question  (Samuel Gendler <sgendler@ideasculptor.com>)
List pgsql-sql


On Tue, Nov 30, 2010 at 7:47 PM, Samuel Gendler <sgendler@ideasculptor.com> wrote:
I have a fact table with a number of foreign keys to dimension data and some measure columns.  Let's assume I've got dimension columns a,b,c,d,e, and f and measure columns x, and y.

I need to be able to find the value of f, x, and y for the row with min(x/y) when e in (1,2) and max(x/y) when e not in (1,2) with grouping by a,b,c, and d.

A query which would identify the row in question would look like this:

select min(x/y), a, b, c, d, e from my_table group by a,b,c,d,e order by a,b,c,d 

But I don't know how to go about also extracting the value of x, y, and f for the row which is matched - I suppose I could join to a subquery, but that strikes me as probably not being the most efficient way to accomplish it.

Actually, it occurs to me that I've potentially got more than one row for any given a,b,c,d,e,f tuple, but I want a min for all rows aggregated by f

Basically, for each (a,b,c,d,e) I want to return sum(x), sum(y), and f (as well as a,b,c,d,e) for the f with min(sum(x)/sum(y))


I solved this, but I'm not sure if my solution is ideal

    -- select distinct causes multiple rows that tie for rank = 1
    -- to condense down to a single row. Outer query also strips 
    -- any rows from windows with less than 3 entries.
    select distinct count, total_ms, time, fk_e,
           fk_a, fk_b, fk_c
    from (        -- grab count and total_ms for each fk_g, grouped by        -- a,b,c, time, and e.  Use rank() window function        -- to rank each row in the window, ordered by total_ms/count        -- and use count(*) as window function to get total rows in        -- the window. 
        select sum(f.x) as count, sum(f.y) as total_ms,
               f.time as time,
               f.fk_a, f.fk_b, f.fk_c,
               f.fk_e, f.fk_f,
               rank() over (w_agg order by sum(f.x)/sum(f.y)) as rank,
               count(*) over (w_agg) as rows
        from fact_table f
        group by 3,4,5,6,7,8
        window w_agg as (partition by time,
                         fk_a, fk_b, fk_c,
                         fk_e)
    ) q
    where q.rank = 1 and q.rows >= 3
 

I will need to run this twice - once with rank() ordered descending and the inner query actually limited by fk_e in (4,14) and once with rank() ordered ascending for fk_e not in (4,14).   It is possible I can put a case statement in the inner select in order to handle that.  I haven't tried it yet.

Is there a better solution than the outer 'select distinct' in order to only see rows that have rank = 1 and to never have duplicate rows?

pgsql-sql by date:

Previous
From: Jeff Bland
Date:
Subject: DELETE WHERE EXISTS unexpected results
Next
From: Lee Hachadoorian
Date:
Subject: Re: OT - load a shp file