Ranges for well-ordered types - Mailing list pgsql-hackers

From Michael Glaesemann
Subject Ranges for well-ordered types
Date
Msg-id 10AC5536-280E-463F-A335-71FEC1FBB774@seespotcode.net
Whole thread Raw
Responses Re: Ranges for well-ordered types  ("Ian Caulfield" <ian.caulfield@gmail.com>)
Re: Ranges for well-ordered types  (Michael Glaesemann <grzm@seespotcode.net>)
Re: Ranges for well-ordered types  (Bruno Wolff III <bruno@wolff.to>)
Re: Ranges for well-ordered types  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
I've been interested in representing and manipulating time ranges in  
PostgreSQL, where a time range is defined by a start time and an end  
time. Time ranges are useful, for example, in representing when some  
predicate was known valid. Similarly, time ranges can be used to  
represent "transaction time": the version history of the data itself.  
This is similar to the "time travel" feature in previous versions of  
PostgreSQL. (Tables that include both valid time and transaction time  
information are sometimes called bitemporal tables.) Ranges can also  
be useful in scheduling applications.

The original use case that prompted me to explore this area was  
tracking what time periods teachers were assigned to schools (a valid- 
time table). Here's one way to represent this in PostgreSQL:

create table teachers__schools_1
(    teacher text not null    , school text not null    , from_date date not null    , to_date date not null    , check
(from_date<= to_date)    , unique (teacher, school, from_date)    , unique (teacher, school, to_date)
 
);

Each row of this table represents the time range (from from_date to  
to_date) during which a teacher was assigned to a particular school.  
(Teachers can be assigned to more than one school at a time.) The  
check constraint guarantees that [from_date, to_date] represents a  
valid closed-closed interval (where the end points are included in  
the range). Two unique constraints are necessary to guarantee  
uniqueness for each row. In my use case, it would also be desirable  
to apply constraints to prevent overlapping and ensure continuity-- 
that teachers are always at some school. This can be done using  
constraint triggers, though making sure all of the comparisons for  
four dates (beginning and end points for two ranges) are done  
properly can be a little daunting and prone to bugs.

The problem of representing time ranges in a relational database has  
been worked on for quite some time. Indeed, the PostgreSQL source  
still includes a tinterval type, where the beginning and end points  
are of type abstime, though this type is no longer included in the  
documentation. Drafts of SQL3 included the PERIOD constructor to  
represent date, time, and timestamp ranges as part of SQL/Temporal,  
but this section of the specification was not included in the final  
SQL:2003 standard. At least two relatively prominent books [1][2] as  
well as numerous papers have been written on the subject.

An aside regarding terminology: In the literature I've read, time  
ranges are most often referred to as intervals. However, SQL already  
uses INTERVAL to refer to another distinct type. As mentioned above,  
SQL3 used the term PERIOD for a constructor to create types with a  
beginning point and an end point. RANGE is a reserved keyword in SQL: 
2003 (related, I believe, to windowed tables). Range also has a  
distinct meaning in relational calculus. I'm at a bit of a loss as to  
how to refer to these structures with a beginning and an end point  
with a term that isn't already reserved in SQL or may be in the  
future. Suggestions welcome :) Span? Reuse tinterval? For the time  
being, I'll arbitrarily continue to use range.

In the first six months of 2006 alone, I've noted quite a few threads  
related to time ranges in the various PostgreSQL mailing lists, so it  
seems this issue is ripe for a solution. Just two weeks ago, Albert  
Hertroys started work on a vector type[3] , where he rightly notes  
that time ranges are just an application of a general range (or  
vector, to use his term) that could just as easily be used with  
"integers, reals, points, dates, etc". For the past couple of months,  
I've been working with composite types and PL/pgSQL to define and  
manipulate date ranges and integer ranges, and see what can be  
abstracted out to a general range type.

In the general case, a particular range value can be represented as  
two point values. For example, the date range [2006-01-01,  
2006-05-31] (using the closed-closed interval representation) is  
represented by the two date "point" values 2006-01-01 and 2006-05-31.  
The interval range [3,9] is represented by the two integer "point"  
values 3 and 9. A range can be formed for any point type, where a  
point type is any type that's well-ordered:* the range of values is bounded (the number of values in the type  
is finite)* comparisons are well-defined for any two values, and* for any point p, the next point can be found using a
successor 
 
function

Given a well-ordered "point" type, common questions of ranges can be  
be answered, such as, for two ranges r1 and r2* Do r1 and r2 overlap?* Does r1 meet r2?* Is the union of r1 and r2
defined,and if so, what is it?* Is the intersection of r1 and r2 defined, and if so, what is it?
 

The way I've thought of implementing ranges in PostgreSQL is through  
an additional system catalog table (called pg_ordered_type for  
convenience) that would store the necessary information for each  
point type:* ordtype: the point type (referencing pg_type.oid)* ordcmpfunc: the comparison function (referencing
pg_proc.oid)*ordsucc: the successor function (referencing pg_proc.oid)* ordmin, ordmax the minimum and maximum values
forthe point type  
 
(not sure how this would be represented)
From one point of view, the successor function and minimum and  
maximum values (and perhaps the comparison function) are intrinsic  
attributes of the type itself. However, I can see potential  
usefulness using the same "base" type with different successor  
functions or different maximum and minimum values. For example, to  
represent a point type of even numbers, the base type could be  
integer, and the successor function would "p + 2". (This raises the  
question of how to restrict the values of the type to even numbers:  
perhaps using domains would be a solution.) Another example would be  
using timestamps of different precision: timestamp(6) has a different  
successor function than timestamp(0).

It may be desirable to include in pg_ordered_type a "predecessor"  
function (returning the point prior to a point p) and a duration  
function (providing a more efficient way to return the number of  
points included in the interval than iterating over the successor  
function).

With comparison and the successor function, one can define the 13  
predicates (often called Allen operators) describing the relationship  
between any two range values: equals, before, meets, overlaps,  
starts, during, finishes, and inverses of the last six.

Note that two of the built-in numeric types--real and double- 
precision--are not well-ordered, and therefore ranges for real and  
double-precision would not be defined. The primary reason for this is  
that without a successor function, it's not possible to determine  
whether or not two range values meet (determined by the meets Allen  
operator). And without a definition for meets, it's not possible to  
define continuity--to use the above example, that a teacher must  
always be assigned to some school. In that respect, the  
implementation I'm proposing here differs from the vector type  
proposed by Albert Hertroys. Perhaps there's a way to have a "looser"  
form of ranges that do not include a successor function, and for  
those looser ranges, continuity constraints couldn't be defined.  
However, the issues arising from the inexactness of real and double  
precision calls into question the usefulness of functions that rely  
on testing for equality (e.g., equals, meets, starts, finishes, and  
inverses of the last three).

Another issue for a general range implementation is uniqueness.  
Currently btree indexes are used to enforce uniqueness in PostgreSQL.  
By somewhat arbitrarily defining a comparison result for the 13  
possible predicates, I've been able to define an opclass for ranges  
so btree indexes can be defined. Note that in defining an opclass,  
comparision is already required, so perhaps the comparison function
could be stored someplace other than the pg_ordered_type table.

As for other indexes, I've looked at some of the GiST material and  
think that it probably can be usefully applied to ranges: for example  
the ip4r type[4], which defines ranges of IPv4 addresses, uses GiST  
for indexing. However, GiST is currently well over my head, so this  
is speculation on my part.

Returning to my original example, with a "date_range" type, the table  
could be defined as:

create table teachers__schools_2
(    teacher text not null    , school text not null    , period date_range not null    , primary key (teacher, school,
period)
);

The original check constraint is handled by the date_range type and  
the two unique constraints are replaced by a single primary key  
constraint. Constraints for overlapping and continuity are still  
handled using constraint triggers, but are easier to implement using  
functions available to compare ranges rather than handling beginning  
and end points individually.

I believe ranges in PostgreSQL would be useful and satisfy a need  
expressed in numerous mailing list threads. I hope to do my part in  
their implementation. I look forward to your feedback.

Michael Glaesemann
grzm seespotcode net

[1] Richard T. Snodgrass, Developing Time-Oriented Database  
Applications in SQL, Morgan Kaufmann, 2000. ISBN 1-55860-436-7. Out  
of print, but available from the author as a PDF.
(http://www.cs.arizona.edu/people/rts/tdbbook.pdf)
[2] Chris Date, Hugh Darwen, Nikolas Lorentzos, Temporal Data and the  
Relational Model, Morgan Kaufmann, 2002. ISBN 1-55860-855-9 (http:// 
www.amazon.com/gp/product/1558608559/)
[3] Vector type (Re: [GENERAL] challenging constraint situation - how  
do I make it) (http://archives.postgresql.org/pgsql-general/2006-05/ 
msg01279.php)
[4] ip4r (http://pgfoundry.org/projects/ip4r/)



pgsql-hackers by date:

Previous
From: Paul Lindner
Date:
Subject: Latest timezone data in 8.1.4
Next
From: "Ian Caulfield"
Date:
Subject: Re: Ranges for well-ordered types