Thread: How to enumerate partitions from a window function?

How to enumerate partitions from a window function?

From
Alban Hertroys
Date:
We are trying to add some information to a query over data from a continuous process. Most of what we want can be done quite nicely using window functions, but I got stuck on this particular problem:

The data has records with a timestamp and a few properties that make subsequent rows belong to the same group. Say we have:
create table process_data (
timestamp timestamp not null,
property_A text not null,
property_B text not null,
value numeric(12, 3)
);

And a query like:
select datetime, property_A, property_B
, first_value(datetime)::time over run as swap_time
--, here I want to enumerate the runs themselves
, value
from process_data
window run as (partition by property_A, property_B order by datetime)
;

As stated above, I want to enumerate the runs, starting at 1 and incrementing by 1 every time a partition from the 'run' window closes, which would result in something like this:

datetime | property_A | property_B | swap_time | run_nr | value
================================================
2020-09-03 15:06 | tea | earl grey | 15:06 | 1 | 0.23
2020-09-03 15:07 | tea | earl grey | 15:06 | 1 | 0.22
2020-09-03 15:08 | tea | ceylon | 15:08 | 2 | 0.34
2020-09-03 15:09 | coffee | cappucino | 15:09 | 3 | 0.45
2020-09-03 15:10 | coffee | cappucino | 15:09 | 3 | 0.43
etc.

Is there a way to achieve this through window functions, or do we need to wrap the thing in a subquery to achieve this?

Regards,
Alban Hertroys
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Re: How to enumerate partitions from a window function?

From
Tom Lane
Date:
Alban Hertroys <haramrae@gmail.com> writes:
> As stated above, I want to enumerate the runs, starting at 1 and
> incrementing by 1 every time a partition from the 'run' window closes,
> Is there a way to achieve this through window functions, or do we need to
> wrap the thing in a subquery to achieve this?

I think this'll work:

select datetime, property_A, property_B
, first_value(datetime)::time over run as swap_time
, dense_rank() over (order by property_A, property_B)
, value
from process_data
window run as (partition by property_A, property_B order by datetime)
;

You can't do it with a window function over the "run" window because
no window function ever looks outside the current partition.  But
that's easy to fix by using a different window definition.  The
planner is smart enough to see that these windows are compatible
and only need one sort to be performed.

            regards, tom lane



Re: How to enumerate partitions from a window function?

From
Alban Hertroys
Date:

On Thu, 3 Sep 2020 at 16:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alban Hertroys <haramrae@gmail.com> writes:
> As stated above, I want to enumerate the runs, starting at 1 and
> incrementing by 1 every time a partition from the 'run' window closes,
> Is there a way to achieve this through window functions, or do we need to
> wrap the thing in a subquery to achieve this?

I think this'll work:

select datetime, property_A, property_B
, first_value(datetime)::time over run as swap_time
, dense_rank() over (order by property_A, property_B)
, value
from process_data
window run as (partition by property_A, property_B order by datetime)
;

You can't do it with a window function over the "run" window because
no window function ever looks outside the current partition.  But
that's easy to fix by using a different window definition.  The
planner is smart enough to see that these windows are compatible
and only need one sort to be performed.

                        regards, tom lane

Thanks Tom,

That gets us close, but it ignores the order of the runs over time. I think it also reassigns the same number to later runs at the same 'day' that happen to have the same values for property_A and _B. That's some crucial information that I forgot to include.

To expand on my original example:

datetime | property_A | property_B | swap_time | run_nr | value
================================================
2020-09-03 15:06 | tea | earl grey | 15:06 | 1 | 0.23
2020-09-03 15:07 | tea | earl grey | 15:06 | 1 | 0.22
2020-09-03 15:08 | tea | ceylon | 15:08 | 2 | 0.34
2020-09-03 15:09 | coffee | cappucino | 15:09 | 3 | 0.45
2020-09-03 15:10 | coffee | cappucino | 15:09 | 3 | 0.43
2020-09-03 15:11 | tea | earl grey | 15:11 | 4 | 0.23
etc.

Where the last row has the same characteristic properties as the first 2 rows (from run 1), but is in run 4 due to it having started after run 3.

The runs normally start at 1 hour before midnight, with run 1, and continue 24h from there (it's a shifted day-schedule). The above example starting at 15:06 is unlikely to occur in reality, although possible (with long downtime). That's mostly to clarify how the run numbers should function, it would require to at least partition run_nr by a date shifted 1 hour back, as long as they number their runs correctly along the time axis.

Regards,
Alban.
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Re: How to enumerate partitions from a window function?

From
Michael Lewis
Date:
It seems like you are maybe wanting this- If the previous row is the same, then get the previous row's run_nr. If it is different, then increment.

case when lag( property_A ) over() = property_A and lag( property_B ) over() = property_B then coalesce( lag( run_nr ) over(), 1 ) else lag( run_nr ) over() + 1 end

Perhaps there is a much simpler implementation though.

Re: How to enumerate partitions from a window function?

From
Alban Hertroys
Date:

On Thu, 3 Sep 2020 at 20:59, Michael Lewis <mlewis@entrata.com> wrote:
It seems like you are maybe wanting this- If the previous row is the same, then get the previous row's run_nr. If it is different, then increment.

case when lag( property_A ) over() = property_A and lag( property_B ) over() = property_B then coalesce( lag( run_nr ) over(), 1 ) else lag( run_nr ) over() + 1 end

Perhaps there is a much simpler implementation though.

That would work were it not that the very column we're defining is the one to be aliased run_nr. The data does not contain that information, it's what I'm trying to enrich it with and what I'm having trouble wrapping my head around.
Your query (adopted a tiny bit) unfortunately results in:

select datetime, property_A, property_B
, first_value(datetime::time) over run_win as swap_time
, case
when lag(property_A) over time_win = property_A
and lag(property_B) over time_win = property_B
then coalesce(lag(run_nr) over time_win, 1)
else lag(run_nr) over time_win +1
  end
, value
from process_data
window
time_win as (order by datetime)
, run_win as (partition by property_A, property_B order by datetime)
order by datetime
;

ERROR:  column "run_nr" does not exist
LINE 6:  then coalesce(lag(run_nr) over time_win, 1)
                           ^
SQL state: 42703
Character: 221

I turned my example into a proper test-case (better late than never):

CREATE TABLE process_data (
    datetime timestamp without time zone NOT NULL,
    property_a text NOT NULL,
    property_b text NOT NULL,
    value numeric(12,3)
);

COPY process_data (datetime, property_a, property_b, value) FROM stdin;
2020-09-03 15:06:00 tea earl grey 0.230
2020-09-03 15:07:00 tea earl grey 0.220
2020-09-03 15:08:00 tea ceylon 0.340
2020-09-03 15:09:00 coffee cappucino 0.450
2020-09-03 15:10:00 coffee cappucino 0.430
2020-09-03 15:11:00 tea earl grey 0.230
\.

With the desired result (note that swap_time and run_nr are calculated columns):
      datetime       | property_a | property_b | swap_time | run_nr | value
---------------------+------------+------------+-----------+--------+-------
 2020-09-03 15:06:00 | tea        | earl grey  | 15:06:00  |      1 | 0.230
 2020-09-03 15:07:00 | tea        | earl grey  | 15:06:00  |      1 | 0.220
 2020-09-03 15:08:00 | tea        | ceylon     | 15:08:00  |      2 | 0.340
 2020-09-03 15:09:00 | coffee     | cappucino  | 15:09:00  |      3 | 0.450
 2020-09-03 15:10:00 | coffee     | cappucino  | 15:09:00  |      3 | 0.430
 2020-09-03 15:11:00 | tea        | earl grey  | 15:06:00  |      4 | 0.230
(6 rows)


I've been looking around on the Internet in the meantime, and it seems people either solve this with a recursive CTE (referencing the previous row by row_number() over (...)) or by writing a set-returning function that walks over the data in datetime order using a cursor.

Since the actual query is growing more and more state-tracking flags, using a function has the added benefit that referencing state columns from the previous row gets a lot easier (lots of repeated window functions otherwise). It would become a procedural solution instead of a set-based one, but considering that this data is order-sensitive (on datetime), that's probably what a set-based solution would also end up doing anyway.

Regards,
Alban.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.