Thread: Count of records in a row
I have a table of event_id, event_time. Many times, several events happen in a row. I'd like a query which replaces all of those events with a single record, showing the count. Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1; D,1; A,2; D,2; B,1; C,2 How can I do that?
Robert James wrote
> I have a table of event_id, event_time. Many times, several events
> happen in a row. I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
>
> How can I do that?

<Theory Only>

Window functions are going to be your friend.

To solve the grouping problem I would assign the first row's value a group
value of zero (0). Using the "lag(...)" window function and an
appropriately defined frame you conditionally add one (1) to the prior row's
group value if the value of lag(1) does not equal the current row's value.
The result should be a new column where all sequential duplicates share the
same group number.

Distinct will give you a lookup relation for which letter belongs to which
group.
Group By + Count on the group will give you counts.
Use string_agg(...) to condense the above into a single row/column.

HTH

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Count-of-records-in-a-row-tp5775363p5775365.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
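A minimal sketch of the grouping idea described above, assuming an illustrative table letters(i, val) (the table name and ordering column are not from the thread). It builds the running group number with a sum over change flags rather than an explicit lag frame:

```sql
-- Sample data: the sequence A,A,A,B,C,A,D,A,A,D,D,B,C,C, ordered by i.
CREATE TABLE letters (i int, val text);
INSERT INTO letters VALUES
  (1,'A'),(2,'A'),(3,'A'),(4,'B'),(5,'C'),(6,'A'),(7,'D'),
  (8,'A'),(9,'A'),(10,'D'),(11,'D'),(12,'B'),(13,'C'),(14,'C');

-- Flag rows where the value changes from the previous row, then take a
-- running sum of the flags: all consecutive duplicates share one group number.
WITH flagged AS (
  SELECT i, val,
         CASE WHEN val IS DISTINCT FROM lag(val) OVER (ORDER BY i)
              THEN 1 ELSE 0 END AS chg
  FROM letters
), grouped AS (
  SELECT i, val, sum(chg) OVER (ORDER BY i) AS grp
  FROM flagged
)
SELECT val, count(*) AS run_length
FROM grouped
GROUP BY grp, val
ORDER BY min(i);
-- yields: A,3; B,1; C,1; A,1; D,1; A,2; D,2; B,1; C,2
```

The final GROUP BY collapses each run; string_agg(...) over that result would condense it further into a single value.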
Hey,
I tried something very similar to compute a generalized union of numeric ranges (which was inappropriate, anyway).

My conclusion was that it's not possible using window functions, as you need either memory (window functions are not allowed in UPDATE) or iteration to propagate information (window functions cannot be nested).

There may be a theoretical possibility of success using window functions and a recursive CTE.
(see the end of this mail for a taste of this kind of solution)

But it is immensely easier, and sometimes mandatory, to instead use
a plpgsql function with a cursor (or cursors).

It would be something like this in plpgsql:
cursor on table of letters, ordered
accum = 0;
loop on rows of table, ordered

  if letter = previous letter, new_id = accum
  else accum++; new_id = accum

  old letter = new letter
  new letter = next letter;

end of loop
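A sketch of that pseudocode as an actual plpgsql set-returning function. The function name and the letters(i, val) table are illustrative, not from the thread; it scans once in order and counts consecutive duplicates:

```sql
-- Sketch only: emits (val, run_length) pairs by walking the table in order,
-- mirroring the cursor pseudocode above.
CREATE OR REPLACE FUNCTION count_runs()
RETURNS TABLE (val text, run_length bigint) AS $$
DECLARE
  r record;
  prev text := NULL;   -- "old letter"
  cnt bigint := 0;     -- length of the current run
BEGIN
  FOR r IN SELECT l.val FROM letters l ORDER BY l.i LOOP
    IF r.val IS NOT DISTINCT FROM prev THEN
      cnt := cnt + 1;                      -- same letter: extend the run
    ELSE
      IF prev IS NOT NULL THEN             -- letter changed: emit finished run
        val := prev; run_length := cnt;
        RETURN NEXT;
      END IF;
      prev := r.val; cnt := 1;             -- start a new run
    END IF;
  END LOOP;
  IF prev IS NOT NULL THEN                 -- emit the final run
    val := prev; run_length := cnt;
    RETURN NEXT;
  END IF;
END;
$$ LANGUAGE plpgsql;

-- SELECT * FROM count_runs();
```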
Cheers,
Rémi-C
A possible route for solving it with window functions and a recursive CTE:

-- defining test env:
-- this query gives the result, but it needs to be iterated using a recursive CTE (not done here):
drop table if exists test_grouping;
create table test_grouping (
  id serial,
  letter text
  --,previous_letter text
  ,for_computation int
  --,previous_for_computation int
);
INSERT INTO test_grouping (letter) VALUES
  ('A'),('A'),('A'),('A'),('B'),('C'),('A'),('D'),
  ('A'),('A'),('D'),('D'),('B'),('C'),('C');
UPDATE test_grouping SET for_computation = id;
SELECT * FROM test_grouping;
--you can do it manually by executing it several times
WITH computation AS (
  SELECT id,
         letter,
         for_computation,
         lag(letter, 1, NULL) OVER w,
         CASE
           WHEN lag(letter, 1, NULL) OVER w = letter
             THEN lag(for_computation, 1, NULL) OVER w --NULL
           ELSE id
         END AS new_id
  FROM test_grouping AS tg
  WINDOW w AS (ORDER BY id ASC ROWS 1 PRECEDING)
)
UPDATE test_grouping
SET for_computation = computation.new_id
FROM computation
WHERE test_grouping.id = computation.id
RETURNING test_grouping.*;
2013/10/22 David Johnston <polobo@yahoo.com>
Robert James wrote
> I have a table of event_id, event_time. Many times, several events
> happen in a row. I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
>
> How can I do that?

<Theory Only>

Window functions are going to be your friend.
To solve the grouping problem I would assign the first row's value a group
value of zero (0). Using the "lag(...)" window function and an
appropriately defined frame you conditionally add one (1) to the prior row's
group value if the value of lag(1) does not equal the current row's value.
The result should be a new column where all sequential duplicates share the
same group number.
Distinct will give you a lookup relation for which letter belongs to which
group
Group By + Count on the group will give you counts
Use string_agg(...) to condense the above into single row/column
HTH
David J.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 10/22/13, Rémi Cura <remi.cura@gmail.com> wrote:
> But it is immensely easier and sometimes mandatory to use instead
> a plpgsql function using cursor (or cursors).
>
> It would be something like that in plpgsql :
>
> cursor on table of letter ordered
> accum = 0;
> loop on rows of table ordered
>
> if letter = previous letter, new_id = accum
> else accum ++ ; new_id = accum
>
> old letter = new_letter
> new letter = next letter;
>
> end of loop,

Shouldn't it be possible to do that with a FOR loop without a cursor?

It might be that procedural is the way to go. But I still believe
that relational algebra can handle this, even without a window
function. Something like:

SELECT event e, COUNT(
  SELECT event oe ... WHERE oe.event_time > e.event_time AND NOT EXISTS (
    SELECT event te WHERE te.event_time > e.event_time AND
      te.event_time < oe.event_time))
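For what it's worth, a purely relational version (no window functions) can be sketched with correlated subqueries: pick the rows that start a run, then count forward to the next value change. This assumes an illustrative table letters(i, val) with ordering column i (not from the thread); it is roughly quadratic and only meant to show feasibility:

```sql
-- A row starts a run when the immediately preceding row has a different value.
SELECT s.val,
       (SELECT count(*)
          FROM letters d
         WHERE d.i >= s.i
           AND d.i < COALESCE(
                 -- position of the next value change after s.i
                 (SELECT min(n.i) FROM letters n
                   WHERE n.i > s.i AND n.val <> s.val),
                 -- no later change: the run extends to the end of the table
                 (SELECT max(i) + 1 FROM letters))) AS run_length
  FROM letters s
 WHERE NOT EXISTS (
         SELECT 1
           FROM letters p
          WHERE p.val = s.val
            AND p.i = (SELECT max(q.i) FROM letters q WHERE q.i < s.i))
 ORDER BY s.i;
```

With the thread's sample sequence this produces the run-length pairs A,3; B,1; C,1; A,1; D,1; A,2; D,2; B,1; C,2 in order.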
Hey,
when you use a FOR you implicitly use a cursor (I think),
so this is the same; use FOR if you like it more.
It should be *very* fast to write !
As I wrote, relational algebra can handle it, but it is not practically feasible:
If you just execute the query I wrote 3 times, you will have your answer.
It is 3 times because the biggest sequence is A A A A.
That's the problem: the number of executions depends on the maximum run length.
The problem boils down to this: the answer for one row depends on the answer of the previous row, the row before that, etc.
You could succeed with ordering by id in a window function, and inside this window function ordering by new_id and putting nulls at the end, but such nested window function calls are not allowed.
Nevertheless if you find something purely relational please keep me posted !
Cheers,
Rémi-C
2013/10/22 Robert James <srobertjames@gmail.com>
On 10/22/13, Rémi Cura <remi.cura@gmail.com> wrote:
> But it is immensely easier and sometimes mandatory to use instead
> a plpgsql function using cursor (or cursors).
>
> It would be something like that in plpgsql :
>
> cursor on table of letter ordered
> accum = 0;
> loop on rows of table ordered
>
> if letter = previous letter, new_id = accum
> else accum ++ ; new_id = accum
>
> old letter = new_letter
> new letter = next letter;
>
> end of loop,

Shouldn't it be possible to do that with a FOR loop without a cursor?
It might be that procedural is the way to go. But I still believe
that relational algebra can handle this, even without a window
function. Something like:
SELECT event e, COUNT(
SELECT event oe ... WHERE oe.event_time > e.event_time AND NOT EXISTS (
SELECT event te WHERE te.event_time > e.event_time AND
te.event_time < oe.event_time))
On Tue, Oct 22, 2013 at 8:16 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Hey,
> when using a for you implicitly use a cursor (I think),
> so this is the same, use FOR if you like it more.
> It should be *very* fast to write !
>
> As I wrote, relational algebra can handle it, but it is not practically
> feasible :
>
> If you just execute 3 times the query I wrote, you will have your answer.
> It is 3 times because the biggest sequence is A A A A.
> That's the problem, your number of execution depends on the max size of
> sequence.
>
> The problems boils down to this : the answer for one row depends on the
> answer of the previous row, the row before , etc.
>
> You could succeed with ordering by id in a windows function, and in this
> window function order by new_id and putting null to the end, but such nested
> windows functions calls are not allowed.
>
> Nevertheless if you find something purely relational please keep me posted !

CREATE TYPE count_same_t AS
(
  item TEXT,
  count_item INT
);

CREATE OR REPLACE FUNCTION count_same_internal(state count_same_t, item TEXT)
RETURNS count_same_t AS
$$
BEGIN
  state.count_item := CASE WHEN item = state.item
    THEN state.count_item + 1
    ELSE 1
  END;
  state.item := item;
  RETURN state;
END;
$$ LANGUAGE PLPGSQL;

CREATE FUNCTION count_same_count(state count_same_t) RETURNS INT AS
$$
BEGIN
  RETURN state.count_item;
END;
$$ LANGUAGE PLPGSQL;

CREATE AGGREGATE count_same(TEXT)
(
  SFUNC=count_same_internal,
  STYPE=count_same_t,
  FINALFUNC=count_same_count,
  INITCOND='(,)'
);

WITH testdata AS
  (select s, chr((floor(random() * 3))::int + 65) as v
   from generate_series(1,50) s)
SELECT s, v, count_same(v) OVER (order by s) from testdata;

  s | v | count_same
 ---+---+------------
  1 | A | 1
  2 | B | 1
  3 | A | 1
  4 | A | 2
  5 | C | 1
  6 | A | 1
  7 | C | 1
  8 | C | 2
  9 | C | 3
 10 | C | 4
/snip

merlin :-D
On pon, paź 21, 2013 at 08:38:52 -0400, Robert James wrote:
> I have a table of event_id, event_time. Many times, several events
> happen in a row. I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
> How can I do that?

"A" or other letters don't really match your schema description.

Please provide sample schema (as in: create table ...), sample data, and
expected output.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
héhé,
nice snipping Merlin !
I guess you are almost there, but the output is still wrong (it should be:
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
)
I don't understand enough to make the modifications =)
Cheers,
Rémi-C
2013/10/22 hubert depesz lubaczewski <depesz@depesz.com>
On pon, paź 21, 2013 at 08:38:52 -0400, Robert James wrote:
> I have a table of event_id, event_time. Many times, several events
> happen in a row. I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
> How can I do that?

"A" or other letters don't really match your schema description.
Please provide sample schema (as in: create table ...), sample data, and
expected output.
Best regards,
depesz
--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
On Tue, Oct 22, 2013 at 8:41 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> héhé,
> nice snipping Merlin !
>
> I guess you are almost there, output is still wrong (should be) (
>> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
>> D,1; A,2; D,2; B,1; C,2
> )
>
> I don't understand enough to make the modifications =)

oh right -- whoops.

CREATE TYPE count_same_t AS
(
  item TEXT,
  item_group INT
);

CREATE OR REPLACE FUNCTION count_same_internal(state count_same_t, item TEXT)
RETURNS count_same_t AS
$$
BEGIN
  state.item_group := CASE WHEN item = state.item
    THEN state.item_group
    ELSE state.item_group + 1
  END;
  state.item := item;
  RETURN state;
END;
$$ LANGUAGE PLPGSQL;

CREATE OR REPLACE FUNCTION count_same_count(state count_same_t) RETURNS INT AS
$$
BEGIN
  RETURN state.item_group;
END;
$$ LANGUAGE PLPGSQL;

CREATE AGGREGATE count_same(TEXT)
(
  SFUNC=count_same_internal,
  STYPE=count_same_t,
  FINALFUNC=count_same_count,
  INITCOND='(,0)'
);

WITH testdata AS
  (select s, chr((floor(random() * 3))::int + 65) as v
   from generate_series(1,50) s)
select v, count(*)
from
  (SELECT s, v, count_same(v) OVER (order by s) grp from testdata) q
GROUP BY v, grp
order by grp;

 v | count
---+-------
 A | 1
 B | 1
 A | 1
 B | 2
 C | 1
 A | 1
 C | 2
 A | 1
 C | 2
 A | 3
 B | 3
 A | 1
 B | 1
/snip

aside: learn the technique. custom aggregates may seem awkward and
weird at first, but they can be used to solve all sorts of wonderful
problems.

merlin
Thanks for this good example Merlin !

I didn't know you could use variables inside custom aggregates, and this allows solving the problem!
In my own problem I couldn't use aggregates because:
- as it outputs at most one row, it would have meant a lot of useless computation (as in this example, I guess (please correct me if that's not the case): we do N computations of the aggregate, each "using" at most N rows)
- I couldn't cheat with arrays because of the cost of serialization/deserialization
I'll keep this custom-aggregate use in mind and try to learn more about it.
Cheers,
Rémi-C
On Tue, Oct 22, 2013 at 9:09 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>
> Thanks for this good example Merlin !
>
> I didn't know you could use variable inside custom aggregates, and this
> allow to solve the problem!
>
> In my own problem I couldn't use aggregates because
> _as it output at most one row, it would have mean a lots of useless
> computation (as in this example I guess, (please correct me if it's not the
> case) :

That is not the case. With the approach above what you 'pay' vs a
standard loop is basically one pl/pgsql function call per output row.
(you can do it in straight sql, but went with pl/pgsql to leverage
cached function parsing). What you 'get' is a more general function
because the loop structure is in the query itself as well as the
output structure. This cleanly separates action from the data.

Usually, the mechanics of executing the aggregate are not a huge part
of query execution time. Actually, the worst case is when the
aggregate is trivial, but no matter what it's O(n).

I'm not clear on what the issue is with your particular case, since
you didn't post it :-). Maybe post some extra detail?

merlin
Thanks again for the precision !
I still don't understand perfectly. We call the aggregate n times, and each time we compute the aggregate using (potentially) n rows, thus becoming (at most) O(n*n).
With a standard loop, I loop n times, and each time I only need the current row plus the previous row, which I keep in memory, thus O(n).
I posted a lot about my issue, but only about the fraction of the problem I was blocked by, and I got no conclusive answer.
My problem was to find a good way to have a plpgsql function taking set of rows as input and returning a set of rows.
I worked on ranges, and I wanted a polymorphic function (working with any range type).
Aggregates return at most one row, arrays are dangerous with big data, temp tables have to be created/deleted and used in the same session, cursors aren't well supported by client libraries, views can't be written to, and materialized views weren't available.
Anyway, I solved it using cursors; not optimal, but it works!
Cheers,
Rémi-C
2013/10/22 Merlin Moncure <mmoncure@gmail.com>
On Tue, Oct 22, 2013 at 9:09 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>
> Thanks for this good example Merlin !
>
> I didn't know you could use variable inside custom aggregates, and this
> allow to solve the problem!
>
> In my own problem I couldn't use aggregates because
> _as it output at most one row, it would have mean a lots of useless
> computation (as in this example I guess, (please correct me if it's not the
> case) :

That is not the case. With the approach above what you 'pay' vs
standard loop is basically one pl/pgsql function call per output row.
(you can do it in straight sql, but went with pl/pgsql to leverage
cached function parsing). What you 'get' is a more general function
because the loop structure is in the query itself as well as the
output structure. This cleanly separates action from the data.
Usually, the mechanics of executing the aggregate are not a huge part
of query execution time. Actually, the worst case is when the
aggregate is trivial but no matter what it's O(n).
I'm not clear on what the issue is with your particular case, since
you didn't post it :-). Maybe post some extra detail?
merlin
On 2013-10-21 20:38, Robert James wrote:
> I have a table of event_id, event_time. Many times, several events
> happen in a row. I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
>
> How can I do that?

It looks like you already found a solution, but here's one with a CTE. I
cobbled this together from an older query I had for doing something
similar, for which I unfortunately lost the original source of this
approach. Also, this implies that there is something that gives an
ordering to these rows (in this case, the field "i").

create temp table data (i int, val char);

insert into data (val, i)
values
('A',1),
('A',2),
('A',3),
('B',4),
('C',5),
('A',6),
('D',7),
('A',8),
('A',9),
('D',10),
('D',11),
('B',12),
('C',13),
('C',14)
;

with x
as
(
  select i,
         row_number() over () as xxx,
         val,
         row_number() over (partition by val order by i asc)
           - row_number() over () as d
  from data
  order by i
)
select val,
       count(*)
from x
group by d,
         val
order by min(i)
;
On Tue, Oct 22, 2013 at 10:01 AM, Elliot <yields.falsehood@gmail.com> wrote:
> On 2013-10-21 20:38, Robert James wrote:
>>
>> I have a table of event_id, event_time. Many times, several events
>> happen in a row. I'd like a query which replaces all of those events
>> with a single record, showing the count.
>>
>> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
>> D,1; A,2; D,2; B,1; C,2
>>
>> How can I do that?
>
> It looks like you already found a solution, but here's one with a CTE. I
> cobbled this together from an older query I had for doing something similar,
> for which I unfortunately lost the original source of this approach. Also,
> this implies that there is something that gives an ordering to these rows
> (in this case, the field "i").
>
> create temp table data (i int, val char);
>
> insert into data (val, i)
> values
> ('A',1),
> ('A',2),
> ('A',3),
> ('B',4),
> ('C',5),
> ('A',6),
> ('D',7),
> ('A',8),
> ('A',9),
> ('D',10),
> ('D',11),
> ('B',12),
> ('C',13),
> ('C',14)
> ;
>
> with x
> as
> (
>   select i,
>          row_number() over () as xxx,
>          val,
>          row_number() over (partition by val order by i asc)
>            - row_number() over () as d
>   from data
>   order by i
> )
> select val,
>        count(*)
> from x
> group by d,
>          val
> order by min(i)

wow, that's really clever.

merlin
Hmm exactly what I was thinking !
Thank you a lot; I spent many hours thinking about this, and this solution is very nice.
Cheers,
Rémi-C
2013/10/22 Merlin Moncure <mmoncure@gmail.com>
wow, that's really clever.

On Tue, Oct 22, 2013 at 10:01 AM, Elliot <yields.falsehood@gmail.com> wrote:
> On 2013-10-21 20:38, Robert James wrote:
>>
>> I have a table of event_id, event_time. Many times, several events
>> happen in a row. I'd like a query which replaces all of those events
>> with a single record, showing the count.
>>
>> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
>> D,1; A,2; D,2; B,1; C,2
>>
>> How can I do that?
>>
>>
>
> It looks like you already found a solution, but here's one with a CTE. I
> cobbled this together from an older query I had for doing something similar,
> for which I unfortunately lost the original source of this approach. Also,
> this implies that there is something that gives an ordering to these rows
> (in this case, the field "i").
>
> create temp table data (i int, val char);
>
> insert into data (val, i)
> values
> ('A',1),
> ('A',2),
> ('A',3),
> ('B',4),
> ('C',5),
> ('A',6),
> ('D',7),
> ('A',8),
> ('A',9),
> ('D',10),
> ('D',11),
> ('B',12),
> ('C',13),
> ('C',14)
> ;
>
> with x
> as
> (
> select i,
> row_number() over () as xxx,
> val,
> row_number() over (partition by val order by i asc)
> - row_number() over () as d
> from data
> order by i
> )
> select val,
> count(*)
> from x
> group by d,
> val
> order by min(i)
merlin
On Tue, Oct 22, 2013 at 9:43 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Thanks again for the precision !
>
> I still don't understand perfectly. We call the aggregate n times, and each
> time we compute the aggregate, using (potentially) n rows, thus becoming (at
> most) O(n*n).
>
> With a standard loop, I loop n times, and each times I only need the current
> row plus the previous row which I put in memory, thus O(n).

For posterity, the above is incorrect. Since the aggregate is ordered
through the window function, it gets executed exactly once per output
row. It behaves exactly like a loop. You know this because there is
no array in the aggregate state.

merlin
OK,
just out of pure curiosity,
is it always the case or is it due to this particular aggregate?
Cheers,
Rémi-C
2013/10/22 Merlin Moncure <mmoncure@gmail.com>
On Tue, Oct 22, 2013 at 9:43 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Thanks again for the precision !
>
> I still don't understand perfectly. We call the aggregate n times, and each
> time we compute the aggregate, using (potentially) n rows, thus becoming (at
> most) O(n*n).
>
> With a standard loop, I loop n times, and each times I only need the current
> row plus the previous row which I put in memory, thus O(n).

For posterity, the above is incorrect. Since the aggregate is ordered
through the window function, it gets executed exactly once per output
row. It behaves exactly like a loop. You know this because there is
no array in the aggregate state.
merlin
Wow, this is an excellent discussion - and I must admit, a bit beyond
my abilities. Is there a consensus as to the best approach to adopt?
Is Elliot's the best?

On 10/22/13, Rémi Cura <remi.cura@gmail.com> wrote:
> OK,
> just out of pure curiosity,
> is it always the case or is it due to this particular aggregate?
>
> Cheers,
> Rémi-C
>
> 2013/10/22 Merlin Moncure <mmoncure@gmail.com>
>
>> On Tue, Oct 22, 2013 at 9:43 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>> > Thanks again for the precision !
>> >
>> > I still don't understand perfectly. We call the aggregate n times, and
>> > each time we compute the aggregate, using (potentially) n rows, thus
>> > becoming (at most) O(n*n).
>> >
>> > With a standard loop, I loop n times, and each times I only need the
>> > current row plus the previous row which I put in memory, thus O(n).
>>
>> For posterity, the above is incorrect. Since the aggregate is ordered
>> through the window function, it gets executed exactly once per output
>> row. It behaves exactly like a loop. You know this because there is
>> no array in the aggregate state.
>>
>> merlin
> 2013/10/22 Merlin Moncure <mmoncure@gmail.com>
>> > With a standard loop, I loop n times, and each times I only need the
>> > current
>> > row plus the previous row which I put in memory, thus O(n).
>>
>> For posterity, the above is incorrect. Since the aggregate is ordered
>> through the window function, it gets executed exactly once per output
>> row. It behaves exactly like a loop. You know this because there is
>> no array in the aggregate state.
>
> just out of pure curiosity,
> is it always the case or is it due to this particular aggregate?

It is always the case. Generally speaking, aggregates, especially
user defined aggregates, are run once per input row. In this case
the main utility of window functions is to order the aggregate
execution calls and (especially) allow intermediate output per input
row, instead of per aggregate grouping.

On Tue, Oct 22, 2013 at 6:01 PM, Robert James <srobertjames@gmail.com> wrote:
> Wow, this is an excellent discussion - and I must admit, a bit beyond
> my abilities. Is there a consensus as to the best approach to adopt?
> Is Elliot's the best?

For this *specific* problem, I would give Elliot's (extremely clever)
query the nod on the basis that it does not require any supporting
infrastructure, which is always nice. That being said, once you start
getting the mojo of user defined aggregates + window functions it
starts to become clear that it's a cleaner way of doing many types of
things that are normally handled by loops.

merlin
OK, thanks for this precision Merlin.
Seems like aggregates are way more powerful than I thought.
Obviously I need a lot more reading about custom aggregates before fully understanding it.
Elliot's query is pure SQL so obviously very cool !
It could be improved at the margin, and aggregates/functions are certainly faster on big data.
But if you have no specific needs, I would say Elliot's is easier and more universal.
Cheers & thanks all for this good discussion.
Rémi-C
2013/10/23 Merlin Moncure <mmoncure@gmail.com>
> 2013/10/22 Merlin Moncure <mmoncure@gmail.com>
>> > With a standard loop, I loop n times, and each times I only need the
>> > current
>> > row plus the previous row which I put in memory, thus O(n).
>>
>> For posterity, the above is incorrect. Since the aggregate is ordered
>> through the window function, it gets executed exactly once per output
>> row. It behaves exactly like a loop. You know this because there is
>> no array in the aggregate state.
>
> just out of pure curiosity,
> is it always the case or is it due to this particular aggregate?

It is always the case. Generally speaking, aggregates, especially
user defined aggregates, are run once per input row. In this case
the main utility of window functions is to order the aggregate
execution calls and (especially) allow intermediate output per input
row, instead of per aggregate grouping.

On Tue, Oct 22, 2013 at 6:01 PM, Robert James <srobertjames@gmail.com> wrote:
> Wow, this is an excellent discussion - and I must admit, a bit beyond
> my abilities. Is there a consensus as to the best approach to adopt?
> Is Elliot's the best?

For this *specific* problem, I would give Elliot's (extremely clever)
query the nod on the basis that it does not require any supporting
infrastructure, which is always nice. That being said, once you start
getting the mojo of user defined aggregates + window functions it
starts to become clear that it's a cleaner way of doing many types of
things that are normally handled by loops.

merlin
On 10/22/13, Elliot <yields.falsehood@gmail.com> wrote:
> It looks like you already found a solution, but here's one with a CTE. I
> cobbled this together from an older query I had for doing something
> similar, for which I unfortunately lost the original source of this
> approach. Also, this implies that there is something that gives an
> ordering to these rows (in this case, the field "i").
>
> create temp table data (i int, val char);
>
> insert into data (val, i)
> values
> ('A',1),
> ('A',2),
> ('A',3),
> ('B',4),
> ('C',5),
>
> with x
> as
> (
>   select i,
>          row_number() over () as xxx,
>          val,
>          row_number() over (partition by val order by i asc)
>            - row_number() over () as d
>   from data
>   order by i
> )
> select val,
>        count(*)
> from x
> group by d,
>          val
> order by min(i)
> ;

Elliot - Thanks for this great solution; I've tested it on my data and
it gives great results.

I'd like to understand your code. I believe I understand most of it.
Can you explain what 'd' is?

And this clause "row_number() over (partition by val order by i asc) -
row_number() over () as d"?

(Hey, while I'm at it, is there a descriptive name for "x" too?)

Thanks
On 2013-10-24 17:09, Robert James wrote:
> On 10/22/13, Elliot <yields.falsehood@gmail.com> wrote:
>> It looks like you already found a solution, but here's one with a CTE. I
>> cobbled this together from an older query I had for doing something
>> similar, for which I unfortunately lost the original source of this
>> approach. Also, this implies that there is something that gives an
>> ordering to these rows (in this case, the field "i").
>>
>> create temp table data (i int, val char);
>>
>> insert into data (val, i)
>> values
>> ('A',1),
>> ('A',2),
>> ('A',3),
>> ('B',4),
>> ('C',5),
>>
>> with x
>> as
>> (
>>   select i,
>>          row_number() over () as xxx,
>>          val,
>>          row_number() over (partition by val order by i asc)
>>            - row_number() over () as d
>>   from data
>>   order by i
>> )
>> select val,
>>        count(*)
>> from x
>> group by d,
>>          val
>> order by min(i)
>> ;
> Elliot - Thanks for this great solution; I've tested it on my data and
> it gives great results.
>
> I'd like to understand your code. I believe I understand most of it.
> Can you explain what 'd' is?
>
> And this clause "row_number() over (partition by val order by i asc) -
> row_number() over () as d"?
>
> (Hey, while I'm at it, is there a descriptive name for "x" too?)
>
> Thanks

Glad I could help. It's easier to understand if you break apart the CTE.
I'm also moving around the "order by i" to clean this up a little. Sorry
for the formatting.

Running this:

select i,
       val,
       row_number() over (partition by val order by i asc) as class_i,
       row_number() over (order by i asc) as overall_i,
       row_number() over (partition by val order by i asc)
         - row_number() over () as d
from data

Yields this:

  i  val  class_i  overall_i    d
  1  A    1        1            0
  2  A    2        2            0
  3  A    3        3            0
  4  B    1        4           -3
  5  C    1        5           -4
  6  A    4        6           -2
  7  D    1        7           -6
  8  A    5        8           -3
  9  A    6        9           -3
 10  D    2        10          -8
 11  D    3        11          -8
 12  B    2        12         -10
 13  C    2        13         -11
 14  C    3        14         -11

class_i counts the row number within a class and overall_i counts the
overall row number in the sequence.
Here's just one class extracted to emphasize that:

  i  val  class_i  overall_i    d
  1  A    1        1            0
  2  A    2        2            0
  3  A    3        3            0
  6  A    4        6           -2
  8  A    5        8           -3
  9  A    6        9           -3

Within a given consecutive run of a particular class the difference
between class_i and overall_i will always be the same (because they're
both increasing by the same amount) but that difference between runs
will always be different (because each run starts the sequences at
different offsets). "d" is the difference of the two. Because that value
segments the runs, all that needs to be done is group by it and count
the items in the group to get the length of the runs.

The xxx column was junk left over from copying and pasting and
verifying. Apologies :). This is a cleaned up version:

with x
as
(
  select i,
         val,
         row_number() over (partition by val order by i asc)
           - row_number() over (order by i asc) as d
  from data
)
select val,
       count(*)
from x
group by d,
         val
order by min(i)
;
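As a footnote, the string_agg(...) condensation suggested at the top of the thread can be bolted onto this cleaned-up query to get the target output as a single value. A sketch, assuming the same temp table data(i, val):

```sql
-- Condense the run-length rows into one string, preserving run order.
with x as (
  select i,
         val,
         row_number() over (partition by val order by i asc)
           - row_number() over (order by i asc) as d
  from data
),
runs as (
  select val, count(*) as n, min(i) as first_i
  from x
  group by d, val
)
select string_agg(val || ',' || n, '; ' order by first_i) as condensed
from runs;
-- For the sample data this yields: A,3; B,1; C,1; A,1; D,1; A,2; D,2; B,1; C,2
```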
Ingenious! I actually think, however, there was a subtle bug in it,
though I see you fixed it. The line:

  - row_number() over () as d

needs to be:

  - row_number() over (order by i asc) as d

I discovered this when working your code into my application. I got
very, very weird results - with one order of columns in the select, I
got the correct answer, but with another one I didn't. After much
debugging, I realized that the ordering in the original version
("- row_number() over ()") wasn't defined! So, depending on how I wrote
the select statement, Postgres could pick different orders!

But I see your cleaned up version already fixed this!

On 10/25/13, Elliot <yields.falsehood@gmail.com> wrote:
> Glad I could help. It's easier to understand if you break apart the CTE.
> I'm also moving around the order by i to clean this up a little. Sorry
> for the formatting.
>
> Running this:
>
> select i,
>        val,
>        row_number() over (partition by val order by i asc) as class_i,
>        row_number() over (order by i asc) as overall_i,
>        row_number() over (partition by val order by i asc)
>          - row_number() over () as d
> from data
>
> Yields this:
>
>   i  val  class_i  overall_i    d
>   1  A    1        1            0
>   2  A    2        2            0
>   3  A    3        3            0
>   4  B    1        4           -3
>   5  C    1        5           -4
>   6  A    4        6           -2
>   7  D    1        7           -6
>   8  A    5        8           -3
>   9  A    6        9           -3
>  10  D    2        10          -8
>  11  D    3        11          -8
>  12  B    2        12         -10
>  13  C    2        13         -11
>  14  C    3        14         -11
>
> class_i counts the row number within a class and overall_i counts the
> overall row number in the sequence. Here's just one class extracted to
> emphasize that:
>
>   i  val  class_i  overall_i    d
>   1  A    1        1            0
>   2  A    2        2            0
>   3  A    3        3            0
>   6  A    4        6           -2
>   8  A    5        8           -3
>   9  A    6        9           -3
>
> Within a given consecutive run of a particular class the difference
> between class_i and overall_i will always be the same (because they're
> both increasing by the same amount) but that difference between runs
> will always be different (because each run starts the sequences at
> different offsets). "d" is the difference of the two. Because that value
> segments the runs, all that needs to be done is group by it and count
> the items in the group to get the length of the runs.
>
> The xxx column was junk left over from copying and pasting and
> verifying. Apologies :). This is a cleaned up version:
>
> with x
> as
> (
>   select i,
>          val,
>          row_number() over (partition by val order by i asc)
>            - row_number() over (order by i asc) as d
>   from data
> )
> select val,
>        count(*)
> from x
> group by d,
>          val
> order by min(i)
> ;