Thread: Query to find contiguous ranges on a column
Given a column of data resembling the following: col 2 3 4 5 11 12 13 14 15 16 17 18 19 23 32 33 34 35 36 37 I need a query to find the contiguous ranges within this column, in this case returning the result set: start, end 2, 5 11, 19 23, 23 32, 37 I have one solution that joins the table against itself and does (among other things) a subselect looking "not exists col +1" and "not exists col -1" on the two instances of the table to find the start and end. This is, as you might guess, is not very efficient (my actual data is some 6 million+ rows) and I'm guessing there has to be something more efficient with windowing or possibly grouping on min and max (though I can't see how to make sure they are part of a contiguous set). Anyone have any ideas? -- Peter Hunsberger
Peter Hunsberger <peter.hunsberger@gmail.com> wrote: > [...] > I have one solution that joins the table against itself and does > (among other things) a subselect looking "not exists col +1" and "not > exists col -1" on the two instances of the table to find the start and > end. This is, as you might guess, is not very efficient (my actual > data is some 6 million+ rows) and I'm guessing there has to be > something more efficient with windowing or possibly grouping on min > and max (though I can't see how to make sure they are part of a > contiguous set). Anyone have any ideas? You can either use a PL/pgSQL function ("SETOF TEXT" just for the convenience of the example): | CREATE FUNCTION SummarizeRanges () RETURNS SETOF TEXT AS $$ | DECLARE | CurrentFirst INT; | CurrentLast INT; | CurrentRecord RECORD; | BEGIN | FOR CurrentRecord IN SELECT col FROM t ORDER BY col LOOP | IF CurrentFirst IS NULL THEN | CurrentFirst := CurrentRecord.col; | CurrentLast := CurrentRecord.col; | ELSIF CurrentRecord.col = CurrentLast + 1 THEN | CurrentLast := CurrentRecord.col; | ELSE | RETURN NEXT CurrentFirst || ', ' || CurrentLast; | CurrentFirst := CurrentRecord.col; | CurrentLast := CurrentRecord.col; | END IF; | END LOOP; | IF CurrentFirst IS NOT NULL THEN | RETURN NEXT CurrentFirst || ', ' || CurrentLast; | END IF; | RETURN; | END; | $$ LANGUAGE plpgsql; or a recursive query (which I always find very hard to com- prehend): | WITH RECURSIVE RecCols (LeftBoundary, Value) AS | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t) | UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1) | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols | GROUP BY LeftBoundary | ORDER BY LeftBoundary; Could you run both against your data set and find out which one is faster for your six million rows? Tim
Peter Hunsberger wrote on 13.10.2009 23:23: > I need a query to find the contiguous ranges within this column, in > this case returning the result set: > > start, end > 2, 5 > 11, 19 > 23, 23 > 32, 37 > > I have one solution that joins the table against itself and does > (among other things) a subselect looking "not exists col +1" and "not > exists col -1" on the two instances of the table to find the start and > end. This is, as you might guess, is not very efficient (my actual > data is some 6 million+ rows) and I'm guessing there has to be > something more efficient with windowing or possibly grouping on min > and max (though I can't see how to make sure they are part of a > contiguous set). Anyone have any ideas? This is the best I can put together right now. Not very nice, but currently I can't think of a better solution: select * from ( select soi as start_of_interval, case when soi is not null and eoi is null then lead(eoi) over() when soi is not null and eoi is not null then eoi else null end as end_of_interval from ( select case when col - (lag(col,1,col + 1) over (order by col)) - 1 <> 0 then col else null end as soi, case when col - (lead(col,1,col + 1) over (order by col)) + 1 <> 0 then col else null end as eoi from numbers ) t1 where t1.soi is not null or t1.eoi is not null ) t2 where t2.start_of_interval is not null and t2.end_of_interval is not null; The outer-most select is needed to get rid of the "empty" rows. I couldn't find a way to push that into one of the sub-queries. The execution plan doesn't look too bad (probably better than your plan with a self join and a subselect), but it does sortthe whole table which might be a problem. Regards Thomas
On Tue, Oct 13, 2009 at 5:12 PM, Tim Landscheidt <tim@tim-landscheidt.de> wrote: > Peter Hunsberger <peter.hunsberger@gmail.com> wrote: > > You can either use a PL/pgSQL function ("SETOF TEXT" just > for the convenience of the example): That works well, takes about 20 seconds to do the 6M+ rows > > or a recursive query (which I always find very hard to com- > prehend): > > | WITH RECURSIVE RecCols (LeftBoundary, Value) AS > | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t) > | UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1) > | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols > | GROUP BY LeftBoundary > | ORDER BY LeftBoundary; > > Could you run both against your data set and find out which > one is faster for your six million rows? > Turns out the server is v 8.3, looks like I need to get them to upgrade it so I get recursive and windowing :-(. If this happens any time soon I'll let you know the results. Many thanks. -- Peter Hunsberger
Peter Hunsberger <peter.hunsberger@gmail.com> wrote: > [...] >> or a recursive query (which I always find very hard to com- >> prehend): >> | WITH RECURSIVE RecCols (LeftBoundary, Value) AS >> | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t) >> | UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1) >> | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols >> | GROUP BY LeftBoundary >> | ORDER BY LeftBoundary; >> Could you run both against your data set and find out which >> one is faster for your six million rows? > Turns out the server is v 8.3, looks like I need to get them to > upgrade it so I get recursive and windowing :-(. If this happens any > time soon I'll let you know the results. > Many thanks. After some tests with a data set of 7983 rows (and 1638 ran- ges): Don't! :-) The recursive solution seems to be more than double as slow as the iterative. I'll take it to -per- formance. Tim
On Wed, Oct 14, 2009 at 4:50 PM, Tim Landscheidt <tim@tim-landscheidt.de> wrote: > Peter Hunsberger <peter.hunsberger@gmail.com> wrote: > > After some tests with a data set of 7983 rows (and 1638 ran- > ges): Don't! :-) The recursive solution seems to be more > than double as slow as the iterative. I'll take it to -per- > formance. > Interesting, I've never liked recursive on Oracle but performance is usually reasonable... Thanks for the heads up... -- Peter Hunsberger