Thread: interval origami
Hi, I am working on a hobby project that might eventually make its way into the public domain. The goal is to create a Media Annotation, Analysis and Playback Synthesis System (I call the project MAAPSS). Conceptually, it is somewhat similar to vcode[1,2] but more "hackable" with broader general applications. [1]: http://social.cs.uiuc.edu/projects/vcode.html [2]: https://youtu.be/Sv_OZ174wpg In the configuration that I am currently exploring, the database will contain many entries with intervals that could overlap in every possible way. The entries could be labeled in many ways other than "interesting" and "fail". This little example is a simplified attempt to isolate what I see as the core problem, which is probably mostly based in my current lack of understanding of SQL. CREATE TABLE Example (start REAL, stop REAL, tag TEXT); INSERT INTO Example VALUES (10, 30, 'interesting'), (12, 32, 'interesting'), (15, 20, 'fail'), (16, 39, 'interesting'), (17, 21, 'fail'), (50, 80, 'interesting'), (60, 65, 'fail'), (85, 90, 'fail'); The 'start' and 'stop' values stored in the database are numbers of type REAL, representing seconds. There is no need for calculations with, or transformations between, elaborate time representation formats. Here is a cognitive approach to a solution: 0. Starting with a dataset like this: 10.0|30.0|interesting 12.0|32.0|interesting 15.0|20.0|fail 16.0|39.0|interesting 17.0|21.0|fail 50.0|80.0|interesting 60.0|65.0|fail 85.0|90.0|fail 1. Find all of the "interesting" segments. 10.0|30.0|interesting 12.0|32.0|interesting 16.0|39.0|interesting 50.0|80.0|interesting 2. Find any "interesting" segments with overlaps and merge them. 10.0|39.0|interesting 50.0|80.0|interesting 3. Find all "fail" segments with a period that overlaps the period of any (step 2) "interesting" segment. 15.0|20.0|fail 17.0|21.0|fail 60.0|65.0|fail 4. Find any overlaps in those (step 3) "fail" segments and merge. 15.0|21.0|fail 60.0|65.0|fail 5. Derive a new list of "interesting" segments from (step 2) that omits all "fail" segments from (step 4). 10.0|15.0|interesting 21.0|39.0|interesting 50.0|60.0|interesting 65.0|80.0|interesting A relational database solution is needed. I am not yet very familiar with relational database design and SQL beyond the very basics. So any pointers to similar SQL problems with solutions and explanations would be tremendously useful at this point. Any hints at all, really, will be much appreciated. Beyond a basic solution, I suspect something like an R-Tree index might be useful. I gather this from the SQLite page on R-Trees: https://sqlite.org/rtree.html -- SELECT version(); version --------------------------------------------------------------------------------------------------------- PostgreSQL 11.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit (1 row)
On 2018-Nov-30, Adam Jensen wrote: > A relational database solution is needed. I am not yet very familiar > with relational database design and SQL beyond the very basics. So any > pointers to similar SQL problems with solutions and explanations would > be tremendously useful at this point. Any hints at all, really, will be > much appreciated. This sounds like something you can easily do with range types in Postgres. Probably not terribly portable to other DBMSs though. https://www.postgresql.org/docs/11/rangetypes.html -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/30/18 12:57 PM, Alvaro Herrera wrote: > This sounds like something you can easily do with range types in > Postgres. Probably not terribly portable to other DBMSs though. > https://www.postgresql.org/docs/11/rangetypes.html Nice hint. Thanks! PostgreSQL as the application platform is fine. Portability to other DBMS's isn't a requirement. The 'numrange' type with the 'overlaps' and 'intersection' operators seem to cover the fundamental computations in a very natural way. hanzer=# SELECT numrange(10, 30) && numrange(12, 32); ?column? ---------- t (1 row) hanzer=# SELECT numrange(10, 30) * numrange(12, 32); ?column? ---------- [12,30) (1 row) How might they be used in an SQL query to solve the problem? I'm an SQL noob. Seeing a few examples would be very informative. My impression is that it might require some serious kung-fu. :)
On 11/30/18 3:19 PM, Adam Jensen wrote: > The 'numrange' type with the 'overlaps' and 'intersection' operators > seem to cover the fundamental computations in a very natural way. Actually, those operators might not be entirely sufficient. Given two ranges like this: 10.0|39.0|interesting 15.0|21.0|fail Something like the negative or inverse of the intersection is needed: 10.0|15.0|interesting 21.0|39.0|interesting Hmm...
On 11/30/18 4:02 PM, Adam Jensen wrote: > On 11/30/18 3:19 PM, Adam Jensen wrote: >> The 'numrange' type with the 'overlaps' and 'intersection' operators >> seem to cover the fundamental computations in a very natural way. > > Actually, those operators might not be entirely sufficient. Given two > ranges like this: > > 10.0|39.0|interesting > 15.0|21.0|fail > > Something like the negative or inverse of the intersection is needed: > > 10.0|15.0|interesting > 21.0|39.0|interesting I've mapped out nine time segment overlap scenarios: 1. good(10, 40) | bad(05, 15) -> good(15, 40) 2. good(10, 40) | bad(10, 15) -> good(15, 40) 3. good(10, 40) | bad(20, 30) -> good(10, 20), good(30, 40) 4. good(10, 40) | bad(20, 40) -> good(10, 20) 5. good(10, 40) | bad(20, 45) -> good(10, 20) 6. good(10, 40) | bad(05, 40) -> good() 7. good(10, 40) | bad(05, 45) -> good() 8. good(10, 40) | bad(10, 40) -> good() 9. good(10, 40) | bad(10, 45) -> good() Letting gs/gf and bs/bf represent "good start-time"/"good finish-time" and so on, pseudo-code to remove the bad segments looks like this: find overlap: good(gs, gf) | bad(bs, bf) CASE WHEN ((bs <= gs) AND (bf < gf)) THEN # 1 & 2 -> (bf, gf) WHEN ((bs > gs) AND (bf < gf)) THEN # 3 -> (gs, bs), (bf, gf) WHEN ((bs > gs) AND (bf >= gf)) THEN # 4 & -> (gs, bs) WHEN ((bs <= gs) AND (bf >= gf)) THEN # 6 & 7 & 8 & 9 -> () END CASE; And my first attempt at writing a PostgreSQL function looks like this: CREATE FUNCTION find_overlap(gs REAL, gf REAL, bs REAL, bf REAL) RETURNS TABLE (start REAL, stop REAL) AS $$ BEGIN CASE WHEN ((bs <= gs) AND (bf < gf)) THEN RETURN NEXT (bf, gf); RETURN; WHEN ((bs > gs) AND (bf < gf)) THEN RETURN NEXT (gs, bs); RETURN NEXT (bf, gf); RETURN; WHEN ((bs > gs) AND (bf >= gf)) THEN RETURN NEXT (gs, bs); RETURN; WHEN ((bs <= gs) AND (bf >= gf)) THEN RETURN; END CASE; END; $$ LANGUAGE plpgsql; It results in: ERROR: RETURN NEXT cannot have a parameter in function with OUT parameters LINE 6: RETURN NEXT (bf, gf); Page 83 of the book "PostgreSQL Server Programming" mentions this situation but doesn't actually describe or explain anything; nor does it present a working example... Any ideas?
On 11/30/18 7:04 PM, Adam Jensen wrote: > On 11/30/18 4:02 PM, Adam Jensen wrote: >> On 11/30/18 3:19 PM, Adam Jensen wrote: >>> The 'numrange' type with the 'overlaps' and 'intersection' operators >>> seem to cover the fundamental computations in a very natural way. >> >> Actually, those operators might not be entirely sufficient. Given two >> ranges like this: >> >> 10.0|39.0|interesting >> 15.0|21.0|fail >> >> Something like the negative or inverse of the intersection is needed: >> >> 10.0|15.0|interesting >> 21.0|39.0|interesting > > I've mapped out nine time segment overlap scenarios: > > 1. good(10, 40) | bad(05, 15) -> good(15, 40) > 2. good(10, 40) | bad(10, 15) -> good(15, 40) > 3. good(10, 40) | bad(20, 30) -> good(10, 20), good(30, 40) > 4. good(10, 40) | bad(20, 40) -> good(10, 20) > 5. good(10, 40) | bad(20, 45) -> good(10, 20) > 6. good(10, 40) | bad(05, 40) -> good() > 7. good(10, 40) | bad(05, 45) -> good() > 8. good(10, 40) | bad(10, 40) -> good() > 9. good(10, 40) | bad(10, 45) -> good() > > Letting gs/gf and bs/bf represent "good start-time"/"good finish-time" > and so on, pseudo-code to remove the bad segments looks like this: > > find overlap: good(gs, gf) | bad(bs, bf) > > CASE > WHEN ((bs <= gs) AND (bf < gf)) THEN # 1 & 2 > -> (bf, gf) > WHEN ((bs > gs) AND (bf < gf)) THEN # 3 > -> (gs, bs), (bf, gf) > WHEN ((bs > gs) AND (bf >= gf)) THEN # 4 & > -> (gs, bs) > WHEN ((bs <= gs) AND (bf >= gf)) THEN # 6 & 7 & 8 & 9 > -> () > END CASE; > > And my first attempt at writing a PostgreSQL function looks like this: > > CREATE FUNCTION find_overlap(gs REAL, gf REAL, bs REAL, bf REAL) > RETURNS TABLE (start REAL, stop REAL) AS $$ > BEGIN > CASE > WHEN ((bs <= gs) AND (bf < gf)) THEN > RETURN NEXT (bf, gf); > RETURN; > WHEN ((bs > gs) AND (bf < gf)) THEN > RETURN NEXT (gs, bs); > RETURN NEXT (bf, gf); > RETURN; > WHEN ((bs > gs) AND (bf >= gf)) THEN > RETURN NEXT (gs, bs); > RETURN; > WHEN ((bs <= gs) AND (bf >= gf)) THEN > RETURN; > END CASE; > END; $$ > LANGUAGE plpgsql; > > It results in: > > ERROR: RETURN NEXT cannot have a parameter in function with OUT parameters > LINE 6: RETURN NEXT (bf, gf); > > Page 83 of the book "PostgreSQL Server Programming" mentions this > situation but doesn't actually describe or explain anything; nor does it > present a working example... > > Any ideas? > This seems to work. Trial and error got me there. CREATE FUNCTION find_overlap(gs REAL, gf REAL, bs REAL, bf REAL) RETURNS TABLE (start REAL, stop REAL) AS $$ BEGIN CASE WHEN ((bs <= gs) AND (bf < gf)) THEN RETURN QUERY VALUES (bf, gf); WHEN ((bs > gs) AND (bf < gf)) THEN RETURN QUERY VALUES (gs, bs), (bf, gf); WHEN ((bs > gs) AND (bf >= gf)) THEN RETURN QUERY VALUES (gs, bs); WHEN ((bs <= gs) AND (bf >= gf)) THEN RETURN; END CASE; END; $$ LANGUAGE plpgsql; 1. good(10, 40) | bad(05, 15) -> good(15, 40) 2. good(10, 40) | bad(10, 15) -> good(15, 40) 3. good(10, 40) | bad(20, 30) -> good(10, 20), good(30, 40) 4. good(10, 40) | bad(20, 40) -> good(10, 20) 5. good(10, 40) | bad(20, 45) -> good(10, 20) 6. good(10, 40) | bad(05, 40) -> good() 7. good(10, 40) | bad(05, 45) -> good() 8. good(10, 40) | bad(10, 40) -> good() 9. good(10, 40) | bad(10, 45) -> good() SELECT find_overlap(10,40, 05,15); SELECT find_overlap(10,40, 10,15); SELECT find_overlap(10,40, 20,30); SELECT find_overlap(10,40, 20,40); SELECT find_overlap(10,40, 20,45); SELECT find_overlap(10,40, 05,40); SELECT find_overlap(10,40, 05,45); SELECT find_overlap(10,40, 10,40); SELECT find_overlap(10,40, 10,45);
Given the original: CREATE TABLE Example (start REAL, stop REAL, tag TEXT); I suppose it might make more sense to do it like this: CREATE FUNCTION remove_bad(gs REAL, gf REAL, bs REAL, bf REAL) RETURNS SETOF Example AS $$ BEGIN CASE WHEN ((bs <= gs) AND (bf < gf)) THEN RETURN QUERY VALUES (bf, gf, 'interesting'); WHEN ((bs > gs) AND (bf < gf)) THEN RETURN QUERY VALUES (gs, bs, 'interesting'), (bf, gf, 'interesting'); WHEN ((bs > gs) AND (bf >= gf)) THEN RETURN QUERY VALUES (gs, bs, 'interesting'); WHEN ((bs <= gs) AND (bf >= gf)) THEN RETURN; END CASE; END; $$ LANGUAGE plpgsql; SELECT remove_bad(10,40, 05,15); SELECT remove_bad(10,40, 10,15); SELECT remove_bad(10,40, 20,30); SELECT remove_bad(10,40, 20,40); SELECT remove_bad(10,40, 20,45); SELECT remove_bad(10,40, 05,40); SELECT remove_bad(10,40, 05,45); SELECT remove_bad(10,40, 10,40); SELECT remove_bad(10,40, 10,45);
On 11/30/18 4:02 PM, Adam Jensen wrote: > On 11/30/18 3:19 PM, Adam Jensen wrote: >> The 'numrange' type with the 'overlaps' and 'intersection' operators >> seem to cover the fundamental computations in a very natural way. > > Actually, those operators might not be entirely sufficient. Given two > ranges like this: > > 10.0|39.0|interesting > 15.0|21.0|fail > > Something like the negative or inverse of the intersection is needed: > > 10.0|15.0|interesting > 21.0|39.0|interesting Perhaps overkill, but if you represent your timeline as actual line segments, perhaps PostGIS would be useful. E.g.: https://postgis.net/docs/manual-2.5/ST_Difference.html HTH, Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
On 12/1/18 8:24 AM, Joe Conway wrote: > Perhaps overkill, but if you represent your timeline as actual line > segments, perhaps PostGIS would be useful. E.g.: > > https://postgis.net/docs/manual-2.5/ST_Difference.html That's an interesting notion. Thanks, Joe! I think I learned enough about plpgsql programming last night to write the three basic functions that each operate on two time intervals: 1. determine if two intervals overlap 2. merge two overlapping intervals into one interval 3. given two overlapping intervals, produce the difference interval or interval set My thinking here is that since these functions seem to be computationally simple, it might be more convenient for exploration, development, distribution and maintenance if there is some uniformity in their style and control over their implementation and behavior. Currently, I am thinking about how these functions might be used to solve the problem. Since I am not familiar with the capabilities of SQL, my tendency is to think in terms of a function that iterates over the data set multiple times and eventually converges to produce the solution set. I am concerned that this might be a very goofy way to solve the problem in a relational database.
On Sat, Dec 1, 2018, 12:38 PM Adam Jensen <hanzer@riseup.net wrote:
On 12/1/18 8:24 AM, Joe Conway wrote:
> Perhaps overkill, but if you represent your timeline as actual line
> segments, perhaps PostGIS would be useful. E.g.:
>
> https://postgis.net/docs/manual-2.5/ST_Difference.html
That's an interesting notion. Thanks, Joe!
I think I learned enough about plpgsql programming last night to write
the three basic functions that each operate on two time intervals:
1. determine if two intervals overlap
2. merge two overlapping intervals into one interval
3. given two overlapping intervals, produce the difference interval or
interval set
My thinking here is that since these functions seem to be
computationally simple, it might be more convenient for exploration,
development, distribution and maintenance if there is some uniformity in
their style and control over their implementation and behavior.
Currently, I am thinking about how these functions might be used to
solve the problem. Since I am not familiar with the capabilities of SQL,
my tendency is to think in terms of a function that iterates over the
data set multiple times and eventually converges to produce the solution
set. I am concerned that this might be a very goofy way to solve the
problem in a relational database.
If you use postgis then you get all the fundamental primitives and (critically) indexing that is virtually guaranteed to function well. I used postgis awhile back to solve a data problem that was based in abstract dimensions and it worked incredibly well. I'm not an expert in the kind of extension programming you're attempting but I'd be worried that it's easy to make a mistake.
Postgis is going to give you maybe more guardrails to keep your work within safe design boundaries..
Just two cents from the gallery,
Steve