Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates) - Mailing list pgsql-general
From | mariusz |
---|---|
Subject | Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates) |
Date | |
Msg-id | 1519722982.32586.133.camel@mtvk.pl |
In response to | Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates) (mariusz <marius@mtvk.pl>) |
List | pgsql-general |
On Tue, 2018-02-27 at 10:03 +0100, mariusz wrote:
> On Fri, 2018-02-23 at 18:11 -0800, Ken Tanzer wrote:
> > On Fri, Feb 23, 2018 at 6:10 AM, mariusz <marius@mtvk.pl> wrote:
> > >
> > > i guess you can easily get the max continuous range for each row
> > > with something like this:
> > >
> > > CREATE OR REPLACE FUNCTION
> > >     append_daterange(d1 daterange, d2 daterange)
> > > RETURNS daterange
> > > LANGUAGE sql
> > > AS
> > > $$
> > >     SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2
> > >                 ELSE d2 END;
> > > $$;
> > >
> > > CREATE AGGREGATE agg_daterange (daterange) (
> > >     sfunc = append_daterange,
> > >     stype = daterange
> > > );
> > >
> > > SELECT dr,
> > >        lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
> > >        upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
> > >   FROM ...
> > >
> > > the above example is simplified to selecting only the daterange
> > > column "dr" for readability, which in your case should be something
> > > like
> > >
> > > daterange(staff_assign_date, staff_assign_date_end, '[)')
> > >
> > > please note that the daterange would be normalized to '[)' format,
> > > so upper() above returns exactly your max "staff_assign_date_end"
> > > for each continuous range when the dateranges are created in '[)'
> > > format.
> > >
> > > the key point is the "... ELSE d2" in append_daterange(), which
> > > starts over with the new value each time that value is
> > > discontinuous with the aggregate's state value, together with the
> > > order in which rows are processed (ASC for the lower bound of the
> > > daterange, DESC for the upper bound).
> > >
> > > unfortunately this involves reading all rows for "client_id" and an
> > > additional sort for each window. i recall reading that you already
> > > pay the price of reading all rows for client_id anyway, so the only
> > > question is the cost of the two additional sorts (maybe extracting
> > > the dateranges to a subset on which to do the windows, and
> > > rejoining the resulting continuous ranges to the original set,
> > > would help lower the cost).
> >
> > Thank you, and I wanted to follow up on this.  I couldn't quite get
> > your example working as described, but I also ended up trying
> > something very similar that got me very close but not quite there.
> > Basically, I can see that it is correctly calculating the ranges (in
> > the notices), but it is only returning the last range for each
> > client.  (Because I said PARTITION BY client_id.)
>
> sorry for the late reply, i was offline from sat to mon inclusive.
> i may have previously added some confusion, so i'll try to explain what
> i had in mind. see below for my suggested query.
>
> > So I'm not sure if I should be calling this differently, or if the
> > function needs to work differently, or if this just isn't possible.
> > Do I need to partition by something else, and if so what?  I don't
> > see what I could specify that would indicate a new episode.
>
> definitely you want to partition by client_id if you are calculating
> this for multiple client_ids, but what matters here for each client_id
> is the order of the dateranges in OVER()
>
> > Also, it's not clear to me how an aggregate might define/return
> > different values within a partition.  Although this must be possible,
> > since functions like rank() and row_number() seem to do it.
>
> that is because the function is defined like an aggregate, but its
> current state value (the return value of the aggregate function) is
> returned for every row within the given window, without grouping the
> result set
>
> > Hoping there is something easy that can be tweaked here.  See below
> > for copy/pastable test stuff.  It includes output from both
> > functions.  Both look to be returning the same results, which makes
> > me wonder if my passing in a start date was a waste of time, though
> > it seems to me it would be necessary.
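As an aside on the "returned for every row" behavior described above: any ordinary aggregate acts the same way once it is given an OVER() clause. A minimal, untested sketch (the values and aliases here are made up for illustration):

```sql
-- with ORDER BY in the window, the default frame is "unbounded
-- preceding to current row", so the aggregate's running state is
-- emitted for every row instead of one value per group:
SELECT x,
       sum(x) OVER (ORDER BY x) AS running_sum
  FROM (VALUES (1), (2), (3)) AS t(x);

--  x | running_sum
-- ---+-------------
--  1 |           1
--  2 |           3
--  3 |           6
```

agg_daterange() relies on exactly this: the state value each row sees is the range merged so far in the window's order.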
> > Cheers,
> > Ken
> >
> > BEGIN;
> > CREATE TEMP TABLE sample_data (
> >     client_id INTEGER,
> >     start_date DATE,
> >     end_date DATE,
> >     episode INTEGER -- Just a label, for clarity
> > );
> >
> > INSERT INTO sample_data VALUES
> > (1,'1990-01-01','1990-12-31',1),
> > (1,'1991-01-01','1991-12-31',1),
> >
> > (1,'2000-01-01','2000-12-31',2),
> > (1,'2001-01-01','2001-12-31',2),
> > (1,'2002-01-01','2002-12-31',2),
> > (1,'2003-01-01','2003-12-31',2),
> > (1,'2004-01-01','2004-12-31',2),
> > (1,'2005-01-01','2005-12-31',2),
> > (1,'2006-01-01','2006-12-31',2),
> >
> > (1,'2014-01-01','2014-12-31',3),
> > (1,'2015-01-01','2015-12-31',3),
> > (1,'2017-06-30','2017-12-31',4),
> > (1,'2018-01-01',NULL,4),
> >
> > (2,'2014-02-01','2015-01-31',1),
> > (2,'2015-02-01','2015-12-31',1),
> > (2,'2017-09-30','2018-01-31',2),
> > (2,'2018-02-01','2018-02-14',2)
> > ;
> >
> > CREATE OR REPLACE FUNCTION
> >     append_daterange(d1 daterange, d2 daterange)
> > RETURNS daterange
> > LANGUAGE sql
> > AS
> > $$
> >     SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2 ELSE d2 END;
> > $$;
> >
> > CREATE AGGREGATE agg_daterange (daterange) (
> >     sfunc = append_daterange,
> >     stype = daterange
> > );
>
> based on your sample data, see the query:
>
> select client_id, d,
>        daterange(lower(test_agg_daterange(d)
>                        over (partition by client_id
>                                  order by d)),
>                  upper(test_agg_daterange(d)
>                        over (partition by client_id
>                                  order by d desc)),
>                  '[)'
>        ) as cont_range
>   from (select *, daterange(start_date,end_date,'[]') as d
>           from sample_data
>        ) q
>  order by 1, 2;

that test_agg_daterange() is of course exactly the same as the
agg_daterange() aggregate above -- sorry for not cleaning up my
examples, lazily edited and copied from my history in psql.

regards,
mj

>  client_id |            d            |       cont_range
> -----------+-------------------------+-------------------------
>          1 | [1990-01-01,1991-01-01) | [1990-01-01,1992-01-01)
>          1 | [1991-01-01,1992-01-01) | [1990-01-01,1992-01-01)
>          1 | [2000-01-01,2001-01-01) | [2000-01-01,2007-01-01)
>          1 | [2001-01-01,2002-01-01) | [2000-01-01,2007-01-01)
>          1 | [2002-01-01,2003-01-01) | [2000-01-01,2007-01-01)
>          1 | [2003-01-01,2004-01-01) | [2000-01-01,2007-01-01)
>          1 | [2004-01-01,2005-01-01) | [2000-01-01,2007-01-01)
>          1 | [2005-01-01,2006-01-01) | [2000-01-01,2007-01-01)
>          1 | [2006-01-01,2007-01-01) | [2000-01-01,2007-01-01)
>          1 | [2014-01-01,2015-01-01) | [2014-01-01,2016-01-01)
>          1 | [2015-01-01,2016-01-01) | [2014-01-01,2016-01-01)
>          1 | [2017-06-30,2018-01-01) | [2017-06-30,)
>          1 | [2018-01-01,)           | [2017-06-30,)
>          2 | [2014-02-01,2015-02-01) | [2014-02-01,2016-01-01)
>          2 | [2015-02-01,2016-01-01) | [2014-02-01,2016-01-01)
>          2 | [2017-09-30,2018-02-01) | [2017-09-30,2018-02-15)
>          2 | [2018-02-01,2018-02-15) | [2017-09-30,2018-02-15)
> (17 rows)
>
> and now, time to explain the confusion i added previously regarding the
> daterange input format (i'm working with something similar, but with
> dateranges in the db, not ad hoc generated from start and end dates).
> you need to generate daterange(start_date,end_date,'[]') for the -|-
> operator to work correctly, but the output of daterange will be
> normalized to '[)' format, so it's up to you to extract the end_date
> of a continuous range (if you prefer dates over dateranges), which is
> going to be something like (upper(cont_range) - '1 day'::interval)::date
>
> the above example is for showing how it works; it returns the max
> continuous range containing a given row. you probably want
>
> select distinct on (client_id, cont_range) ...
>
> or something like that to extract the continuous ranges themselves.
>
> the first window (within the lower() func) extends its initial
> (earliest) range to the right for every next row continuous with the
> current range (and jumps to a new start on discontinuity), thus
> over(order by d ASC). the second window (within the upper() func)
> extends its initial (latest) range to the left for every next row
> continuous with the current range (and jumps to a new end on
> discontinuity), thus over(order by d DESC). partition by client_id
> within each over() is there so as not to mix the dateranges of
> different client_ids.
>
> i guess that is more or less what you wanted, at least as i understood
> you originally; if not, sorry for even more confusion or for
> unnecessarily explaining simple and obvious things you might already
> know
>
> regards,
> mariusz jadczak
>
> > CREATE OR REPLACE FUNCTION range_continuous_merge( daterange,
> > daterange, date ) RETURNS daterange AS $$
> >
> > DECLARE
> >     res daterange;
> >
> > BEGIN
> >     res := CASE
> >         WHEN $1 IS NULL AND NOT $2 @> $3 THEN NULL
> >         WHEN $1 IS NULL AND $2 @> $3 THEN $2
> >         WHEN ($1 && $2) OR ($1 -|- $2) THEN $1 + $2
> >         WHEN NOT $1 @> $3 THEN $2
> >         ELSE $1
> >     END;
> >     RAISE NOTICE 'Inputs: %,%,%.  Returning %', $1::text, $2::text,
> >         $3::text, res;
> >     RETURN res;
> > END;
> > $$ LANGUAGE plpgsql STABLE;
> >
> > CREATE AGGREGATE range_continuous( daterange, date ) (
> >     sfunc = range_continuous_merge,
> >     stype = daterange
> >     -- initcond = '{0,0,0}'
> > );
> >
> > SELECT client_id, episode, start_date, end_date,
> >        range_continuous(daterange(start_date,end_date,'[]'),
> >                         start_date) OVER (PARTITION BY client_id)
> >   FROM sample_data;
> >
> > SELECT client_id, episode, start_date, end_date,
> >        agg_daterange(daterange(start_date,end_date,'[]'))
> >            OVER (PARTITION BY client_id)
> >   FROM sample_data;
> >
> >
> > -- RANGE_CONTINUOUS_MERGE
> >
> > NOTICE:  Inputs: <NULL>,[1990-01-01,1991-01-01),1990-01-01.  Returning [1990-01-01,1991-01-01)
> > NOTICE:  Inputs: [1990-01-01,1991-01-01),[1991-01-01,1992-01-01),1991-01-01.  Returning [1990-01-01,1992-01-01)
> > NOTICE:  Inputs: [1990-01-01,1992-01-01),[2000-01-01,2001-01-01),2000-01-01.  Returning [2000-01-01,2001-01-01)
> > NOTICE:  Inputs: [2000-01-01,2001-01-01),[2001-01-01,2002-01-01),2001-01-01.  Returning [2000-01-01,2002-01-01)
> > NOTICE:  Inputs: [2000-01-01,2002-01-01),[2002-01-01,2003-01-01),2002-01-01.  Returning [2000-01-01,2003-01-01)
> > NOTICE:  Inputs: [2000-01-01,2003-01-01),[2003-01-01,2004-01-01),2003-01-01.  Returning [2000-01-01,2004-01-01)
> > NOTICE:  Inputs: [2000-01-01,2004-01-01),[2004-01-01,2005-01-01),2004-01-01.  Returning [2000-01-01,2005-01-01)
> > NOTICE:  Inputs: [2000-01-01,2005-01-01),[2005-01-01,2006-01-01),2005-01-01.  Returning [2000-01-01,2006-01-01)
> > NOTICE:  Inputs: [2000-01-01,2006-01-01),[2006-01-01,2007-01-01),2006-01-01.  Returning [2000-01-01,2007-01-01)
> > NOTICE:  Inputs: [2000-01-01,2007-01-01),[2014-01-01,2015-01-01),2014-01-01.  Returning [2014-01-01,2015-01-01)
> > NOTICE:  Inputs: [2014-01-01,2015-01-01),[2015-01-01,2016-01-01),2015-01-01.  Returning [2014-01-01,2016-01-01)
> > NOTICE:  Inputs: [2014-01-01,2016-01-01),[2017-06-30,2018-01-01),2017-06-30.  Returning [2017-06-30,2018-01-01)
> > NOTICE:  Inputs: [2017-06-30,2018-01-01),[2018-01-01,),2018-01-01.  Returning [2017-06-30,)
> > NOTICE:  Inputs: <NULL>,[2014-02-01,2015-02-01),2014-02-01.  Returning [2014-02-01,2015-02-01)
> > NOTICE:  Inputs: [2014-02-01,2015-02-01),[2015-02-01,2016-01-01),2015-02-01.  Returning [2014-02-01,2016-01-01)
> > NOTICE:  Inputs: [2014-02-01,2016-01-01),[2017-09-30,2018-02-01),2017-09-30.  Returning [2017-09-30,2018-02-01)
> > NOTICE:  Inputs: [2017-09-30,2018-02-01),[2018-02-01,2018-02-15),2018-02-01.  Returning [2017-09-30,2018-02-15)
> >
> >  client_id | episode | start_date | end_date   | range_continuous
> > -----------+---------+------------+------------+-------------------------
> >          1 |       1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
> >          1 |       1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
> >          1 |       2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
> >          1 |       2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
> >          1 |       2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
> >          1 |       2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
> >          1 |       2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
> >          1 |       2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
> >          1 |       2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
> >          1 |       3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
> >          1 |       3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
> >          1 |       4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
> >          1 |       4 | 2018-01-01 |            | [2017-06-30,)
> >          2 |       1 | 2014-02-01 | 2015-01-31 | [2017-09-30,2018-02-15)
> >          2 |       1 | 2015-02-01 | 2015-12-31 | [2017-09-30,2018-02-15)
> >          2 |       2 | 2017-09-30 | 2018-01-31 | [2017-09-30,2018-02-15)
> >          2 |       2 | 2018-02-01 | 2018-02-14 | [2017-09-30,2018-02-15)
> > (17 rows)
> >
> >
> > -- AGG_DATERANGE
> >
> >  client_id | episode | start_date | end_date   | agg_daterange
> > -----------+---------+------------+------------+-------------------------
> >          1 |       1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
> >          1 |       1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
> >          1 |       2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
> >          1 |       2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
> >          1 |       2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
> >          1 |       2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
> >          1 |       2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
> >          1 |       2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
> >          1 |       2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
> >          1 |       3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
> >          1 |       3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
> >          1 |       4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
> >          1 |       4 | 2018-01-01 |            | [2017-06-30,)
> >          2 |       1 | 2014-02-01 | 2015-01-31 | [2017-09-30,2018-02-15)
> >          2 |       1 | 2015-02-01 | 2015-12-31 | [2017-09-30,2018-02-15)
> >          2 |       2 | 2017-09-30 | 2018-01-31 | [2017-09-30,2018-02-15)
> >          2 |       2 | 2018-02-01 | 2018-02-14 | [2017-09-30,2018-02-15)
> > (17 rows)
> >
> > --
> > AGENCY Software
> > A Free Software data system
> > By and for non-profits
> > http://agency-software.org/
> > https://demo.agency-software.org/client
> >
> > ken.tanzer@agency-software.org
> > (253) 245-3801
> >
> > Subscribe to the mailing list to learn more about AGENCY or follow
> > the discussion.
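For completeness, here is an untested sketch of the "select distinct on (client_id, cont_range)" extraction mentioned above, combined with the inclusive-end-date conversion. It assumes the sample_data table and the agg_daterange() aggregate defined earlier in this thread; the alias names cont_range, q and r are arbitrary:

```sql
-- one row per continuous range per client, with the exclusive '[)'
-- upper bound converted back to an inclusive end_date (NULL when the
-- range is open-ended):
SELECT DISTINCT ON (client_id, cont_range)
       client_id,
       lower(cont_range)                             AS start_date,
       (upper(cont_range) - '1 day'::interval)::date AS end_date
  FROM (SELECT client_id,
               daterange(lower(agg_daterange(d) OVER (PARTITION BY client_id
                                                          ORDER BY d)),
                         upper(agg_daterange(d) OVER (PARTITION BY client_id
                                                          ORDER BY d DESC)),
                         '[)') AS cont_range
          FROM (SELECT *, daterange(start_date, end_date, '[]') AS d
                  FROM sample_data) q
       ) r
 ORDER BY client_id, cont_range;
```

DISTINCT ON keeps the first row of each (client_id, cont_range) group, which is all that is needed here since every row of a group carries the same cont_range.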