Re: Query to find contiguous ranges on a column - Mailing list pgsql-general

From Tim Landscheidt
Subject Re: Query to find contiguous ranges on a column
Date
Msg-id m3r5t754et.fsf@passepartout.tim-landscheidt.de
Whole thread Raw
In response to Query to find contiguous ranges on a column  (Peter Hunsberger <peter.hunsberger@gmail.com>)
Responses Re: Query to find contiguous ranges on a column
List pgsql-general
Peter Hunsberger <peter.hunsberger@gmail.com> wrote:

> [...]
> I have one solution that joins the table against itself and does
> (among other things) a subselect looking "not exists col +1" and "not
> exists col -1" on the two instances of the table to find the start and
> end.  This is, as you might guess, is not very efficient (my actual
> data is some 6 million+ rows) and I'm guessing there has to be
> something more efficient with windowing or possibly grouping on min
> and max (though I can't see how to make sure they are part of a
> contiguous set).  Anyone have any ideas?

You can either use a PL/pgSQL function ("SETOF TEXT" just
for the convenience of the example):

| CREATE FUNCTION SummarizeRanges () RETURNS SETOF TEXT AS $$
| DECLARE
|   CurrentFirst INT;
|   CurrentLast INT;
|   CurrentRecord RECORD;
| BEGIN
|   FOR CurrentRecord IN SELECT col FROM t ORDER BY col LOOP
|     IF CurrentFirst IS NULL THEN
|       CurrentFirst := CurrentRecord.col;
|       CurrentLast  := CurrentRecord.col;
|     ELSIF CurrentRecord.col = CurrentLast + 1 THEN
|       CurrentLast := CurrentRecord.col;
|     ELSE
|       RETURN NEXT CurrentFirst || ', ' || CurrentLast;
|       CurrentFirst := CurrentRecord.col;
|       CurrentLast := CurrentRecord.col;
|     END IF;
|   END LOOP;
|   IF CurrentFirst IS NOT NULL THEN
|     RETURN NEXT CurrentFirst || ', ' || CurrentLast;
|   END IF;
|   RETURN;
| END;
| $$ LANGUAGE plpgsql;

or a recursive query (which I always find very hard to com-
prehend):

| WITH RECURSIVE RecCols (LeftBoundary, Value) AS
|   (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
|    UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1)
|   SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
|     GROUP BY LeftBoundary
|     ORDER BY LeftBoundary;

Could you run both against your data set and find out which
one is faster for your six million rows?

Tim

pgsql-general by date:

Previous
From: Peter Hunsberger
Date:
Subject: Query to find contiguous ranges on a column
Next
From: Thomas Kellerer
Date:
Subject: Re: Query to find contiguous ranges on a column