Home > mailing lists
Re: Select ranges based on sequential breaks - Mailing list pgsql-general

From	Mike Toews
Subject	Re: Select ranges based on sequential breaks
Date	June 22, 2009 18:41:57
Msg-id	4A3FD068.30900@sfu.ca Whole thread Raw
In response to	Re: Select ranges based on sequential breaks (Joel Nothman <jnothman@student.usyd.edu.au>)
Responses	Re: Select ranges based on sequential breaks
List	pgsql-general
Tree view
Hi Joel,

Window functions appear to be the best solution for this style of
problem, and I'm looking forward to their applications. However, I'm
sticking with 8.3 for at least a year, so I'm not able to explore this
solution yet. For now, I can only post-process the output in a non-SQL
environment. I also need to do other fun stuff, like cumulative sums,
which is also challenging with SQL, but much easier and intuitive with R.

An excellent book that I recently stumbled on is /Joe Celko's SQL for
Smarties/ (recommended by someone on this list), which is a heavy read,
but has an amazing depth to ANSI SQL. Of particular interest is "Chapter
24: Regions, Runs, Gaps, Sequences, and Series". I'm slowly digesting this.

-Mike

Joel Nothman wrote:
> Hi Mike,
>
> I happened upon your query, which is related to some stuff I've been
> playing with.
>
> Firstly, David's solution below doesn't work. I haven't yet tried to work
> out why.
>
> Secondly, I was hoping to be able to solve your problem nicely with
> Postgres 8.4's window functions [1,2], which can provide access to
> data in sequentially-related rows.
>
> Given the following setup:
> CREATE TABLE foo (dt date, bin varchar(4));
> INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'),
> ('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'),
> ('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'),
> ('2009-01-09', 'red'), ('2009-01-10', 'red');
>
> I had hoped the following would suffice:
> SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW
> w AS (ORDER BY dt PARTITION BY bin);
>
> Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a
> WINDOW declaration, and yet I wanted a grouping of date-consecutive bins,
> which (PARTITION BY bin ORDER BY dt) would not give me.
>
> I was able to produce the required result with:
>
> SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin
>       FROM (SELECT dt, bin,
>                    CASE WHEN newbin IS NULL
>                    THEN 0
>                    ELSE SUM(newbin) OVER (ORDER BY dt)
>                    END AS binno
>               FROM (SELECT *,
>                            (bin !=3D lag(bin, 1) OVER (ORDER BY dt))::int A=
> S
> newbin
>                       FROM foo
>                    ) AS newbins
>               ) AS binnos
> GROUP BY binno
> ORDER BY first;
>
> This relies on a middle step in which I create an enumeration of the bins
> in sequence:
> SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER
> (ORDER BY dt) END AS binno FROM (SELECT *, (bin !=3D lag(bin, 1) OVER (ORDE=
> R
> BY dt))::int AS newbin FROM foo) AS newbins;
>          dt     | bin  | binno
> ------------+------+-------
>      2009-01-01 | red  |     0
>      2009-01-02 | red  |     0
>      2009-01-03 | blue |     1
>      2009-01-04 | blue |     1
>      2009-01-05 | blue |     1
>      2009-01-06 | red  |     2
>      2009-01-07 | blue |     3
>      2009-01-08 | blue |     3
>      2009-01-09 | red  |     4
>      2009-01-10 | red  |     4
>
> I would hope there is a neater way to do this with window functions.
>
> The best way to solve your problem may be with PL/SQL, which is also good
> at dealing with sequences (it's not as complicated as it looks!):
>
> CREATE TYPE bindates AS (first date, last date, bin varchar(5));
> CREATE OR REPLACE FUNCTION bindates()
> RETURNS SETOF bindates
> AS $$
> DECLARE
>       row record;
>       res bindates;
> BEGIN
>       FOR row IN SELECT * FROM foo ORDER BY dt
>       LOOP
>         IF res.bin IS NULL OR res.bin !=3D row.bin THEN
>           IF res.bin IS NOT NULL THEN
>             RETURN NEXT res;
>           END IF;
>           res.bin :=3D row.bin;
>           res.first :=3D row.dt;
>         END IF;
>         res.last :=3D row.dt;
>       END LOOP;
>       IF res.bin IS NOT NULL THEN
>         RETURN NEXT res;
>       END IF;
> END;
> $$ LANGUAGE plpgsql;
>
>
> Finally, I'll try to sort out David's solution. Once we correct his typo
> (t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get:
>        first    |    last    | bin
> ------------+------------+------
>      2009-01-03 | 2009-01-05 | blue
>      2009-01-06 | 2009-01-06 | red
>      2009-01-07 | 2009-01-08 | blue
>      2009-01-09 |            | red
>
> This includes correct output, but it fails on both edge cases. The
> non-appearance of the first row is due to the WHERE clause on the main
> SELECT statement:
> WHERE (SELECT bin
>            FROM foo t2
>            WHERE t2.dt < t1.dt
>            ORDER BY dt DESC LIMIT 1) <> t1.bin
>
> If we drop this WHERE clause, we get:
>        first    |    last    | bin
> ------------+------------+------
>      2009-01-01 | 2009-01-02 | red
>      2009-01-02 | 2009-01-02 | red
>      2009-01-03 | 2009-01-05 | blue
>      2009-01-04 | 2009-01-05 | blue
>      2009-01-05 | 2009-01-05 | blue
>      2009-01-06 | 2009-01-06 | red
>      2009-01-07 | 2009-01-08 | blue
>      2009-01-08 | 2009-01-08 | blue
>      2009-01-09 |            | red
>      2009-01-10 |            | red
>
> We can therefore get the result including the first row by selecting from
> this table with 'GROUP BY last, bin'.
>
> And we can hack in a value for those final NULLs as a special case. The
> following statement works:
>
> SELECT MIN(first),
>            CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last
> END,
>            bin
>     FROM (
>       SELECT dt AS first,
>              (SELECT dt
>               FROM foo t3
>               WHERE t3.dt < (
>                   SELECT dt
>                   FROM foo t5
>                   WHERE t5.dt > t1.dt
>                     AND t5.bin <> t1.bin
>                   ORDER BY dt ASC
>                   LIMIT 1)
>               ORDER BY dt DESC
>               LIMIT 1) AS last,
>               bin
>       FROM foo t1
> ) t0
> GROUP BY last, bin
> ORDER BY last;
>
>
> Finally, what's efficient? With 1,000,000 random rows, we get:
> Enumeration: 13s
> PL/SQL: 12s
> Modified David: minutes.
>
> [I used the following to create test data:
> CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$
> import random
> for i in xrange(n):
>       m =3D random.randint(1,12)
>       d =3D random.randint(1,28)
>       b =3D random.choice(('red', 'blue'))
>       yield '2009-%d-%d' % (m, d), b
> $$ LANGUAGE plpythonu;
> DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));]
>
> I hope that helps you in considering various approaches to the problem.
>
> - Joel
>
>
> [1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
> [2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html
>
>
> On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson
> <david.t.wilson@gmail.com> wrote:
>
>
pgsql-general by date:
From: John DeSoi
Date: 22 June 2009, 18:40:52
Subject: Re: Information about columns
From: Scott Marlowe
Date: 22 June 2009, 18:57:01
Subject: Re: Select ranges based on sequential breaks
Re: Select ranges based on sequential breaks - Mailing list pgsql-general

Previous

Next