Thread: Range Types: empty ranges
Do we want empty ranges? The philosophy is that they are essentially the "zero" value of any range type. Like the number zero, it allows closure over operations that would otherwise return an error. For instance, the number zero is useful because you can do things like: f(x) = 5x + 3; And even if x is zero, the function is still defined, and even produces a more "real" number like 3. Sure, when you try to divide by zero, you have a problem, but otherwise it works. Similarly, "intersection" of ranges is somewhat analogous to multiplication of numbers. I have a feeling that there are a lot of applications that might use intersection in combination with other operators, like overlaps or contains. If we do allow empty ranges, I think it will be seamless in most of those situations because "overlaps" and "contains" are well-defined for empty ranges. But if we don't allow empty ranges, I suspect that it will cause some user surprise, because depending on the order of operations an empty range may be created (causing an error) or not. The cost, of course, is that not all operations are well-defined for empty ranges. I think those are mostly operators like those mentioned in the other thread: ">>" (strictly right of), "<<" (strictly left of), and "-|-" (adjacent); and perhaps "&>" and "&<". These are probably used a little less frequently, and should probably not be used in a context where empty ranges are permitted (if they are, it's likely a mistake and an error should be thrown). My feeling is that we should let the operation proceed as far as it is well-defined, and no further; and I think that means empty ranges should be allowed. Thoughts? Do the benefits outweigh the costs? Regards,Jeff Davis
On 2/11/11 10:11 AM, Jeff Davis wrote: > Thoughts? Do the benefits outweigh the costs? I guess I'm having trouble tying the concept of empty ranges to any reality external to the database. For example, what would the time range: '('15:15:00','15:15:00')' ... represent exactly? "A non-existant point in time which might or might not be near 3:15 PM"? For my personal use of ranges, I'm very reluctant to embrace a concept in the database which can't possibly depict something concrete. Really, it seems like you're trying to fix NULL by separating the concept of EMPTY from NULL. Which is a good idea in general, I'm just not sure that this is the way to do it. HOWEVER, I also recognize that range types might be used for scientific applications, in which the mathematical concepts of imaginary ranges and empty ranges might be valid. So I wouldn't want to prohibit this feature for the people who need it. BUT ... if I, in one of my applications, accidentally defined something as having the range '('15:15:00','15:15:00')', I would *want* the database to through an error and not accept it. So, if we allow empty ranges of this kind, I would want a GUC for "allow_empty_ranges". -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Jeff Davis <pgsql@j-davis.com> wrote: > The philosophy is that they are essentially the "zero" value of > any range type. Like the number zero, it allows closure over > operations that would otherwise return an error. Well, zero has a pretty well defined and easy to understand meaning. How many ostriches do you have in your back yard? In my case, zero. That's distinct from NULL, which would mean the answer is unknown. > Similarly, "intersection" of ranges is somewhat analogous to > multiplication of numbers. Ah, so the point is that the intersection of '[13:00,15:00)' and '[08:00,10:00)' is the empty range, not NULL -- because you *know* they don't overlap? That seems useful. What seems not useful is to say that such a value is both strictly to the right *and* strictly to the left of a range. That seems to fly in the face of the normal meaning of "strict". Nor does it seem reasonable to say that it is adjacent to a range. It seems like it would be useful to be able to say that if A is adjacent to B and B is adjacent to C that the three ranges could be considered as one contiguous range. You'd get in trouble with such an assumption if an empty B was considered adjacent to all ranges. I see similar problems with an empty B if A is strictly to the left of B and B is strictly to the left of C, where C is strictly to the left of A. I'm somewhat inclined to agree with Robert that such comparisons should not throw errors. At this point, my inclination is to say that UNKNOWN should be the result (which I believe is equivalent to NULL::boolean). -Kevin
Josh Berkus <josh@agliodbs.com> wrote: > if I, in one of my applications, accidentally defined something > as having the range '('15:15:00','15:15:00')', I would *want* the > database to through an error and not accept it. I can agree with that, but I think that range '[15:15:00,15:15:00)' should be valid as a zero-length range between, for example, '[15:00:00,15:15:00)' and '[15:15:00,15:30:00)'. -Kevin
On Fri, Feb 11, 2011 at 1:11 PM, Jeff Davis <pgsql@j-davis.com> wrote: > Similarly, "intersection" of ranges is somewhat analogous to > multiplication of numbers. I had a feeling that we might be going in this direction. It strikes me that this case is a bit like division by zero. It's kind of a nuisance that dividing by zero throws an error and we COULD fix that by making it return NULL or NaN or some new distinguished value DbZ. But then we'd have to define what happens when you feed DbZ into every other operation in the system, and similarly here. If we define two non-overlapping ranges as intersecting to NULL, or as throwing an error, then everything else is clear after that. I'm not sure it's worth complicating the representation and the definitions of other operations to cater to this case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 11, 2011 at 1:50 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Josh Berkus <josh@agliodbs.com> wrote: > >> if I, in one of my applications, accidentally defined something >> as having the range '('15:15:00','15:15:00')', I would *want* the >> database to through an error and not accept it. > > I can agree with that, but I think that range '[15:15:00,15:15:00)' > should be valid as a zero-length range between, for example, > '[15:00:00,15:15:00)' and '[15:15:00,15:30:00)'. How would that actually work? I kind of agree with Josh: I'd be inclined to make the type input function boot such constructs at the door. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 2011-02-11 at 10:28 -0800, Josh Berkus wrote: > I guess I'm having trouble tying the concept of empty ranges to any > reality external to the database. That's true, but in the same sense as zero has no meaning outside of the database. It's really that it has nice mathematical properties coming from set theory. Take the distributive law: A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C) If (B INTERSECT C) is empty, then the result should be A. With empty ranges, that holds on both sides. Without them, it doesn't hold on the left side. People implicitly use this kind of logic all the time when constructing queries. If one form works, and the other throws an error, it will cause confusion. > For example, what would the time range: > > '('15:15:00','15:15:00')' > > ... represent exactly? "A non-existant point in time which might or > might not be near 3:15 PM"? That's meaningless and will throw an error. An empty range is not anchored at any particular point, so any two empty ranges are equal. > BUT ... if I, in one of my applications, accidentally defined something > as having the range '('15:15:00','15:15:00')', I would *want* the > database to through an error and not accept it. Absolutely. That kind of input should throw an error (and does). > So, if we allow empty ranges of this kind, I would want a GUC for > "allow_empty_ranges". I think that would be the least desirable option. If we don't like empty ranges, let's prohibit them entirely. Or, there are always check constraints... Regards,Jeff Davis
Robert Haas <robertmhaas@gmail.com> wrote: >> I think that range '[15:15:00,15:15:00)' should be valid as a >> zero-length range between, for example, '[15:00:00,15:15:00)' and >> '[15:15:00,15:30:00)'. > > How would that actually work? I kind of agree with Josh: I'd be > inclined to make the type input function boot such constructs at > the door. It makes more sense in the context of a range of some type with a clearly defined granularity. Our accounting system, for example, can assign a new range of receipt IDs for each calendar year. If you want a variable to represent the receipts for traffic receipts for 2012, you might, in preparation for the upcoming year, define something to declare the range as '[12TR000001,12TR000001)'. When the first receipt is assigned as 12TR000001, that is updated to '[12TR000001,12TR000002)'. Just as an off-the-cuff example. Basically, with a type having well-defined granularity, a [) range could usefully represent, "start to last used", and start out empty. -Kevin
On Fri, Feb 11, 2011 at 2:06 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2011-02-11 at 10:28 -0800, Josh Berkus wrote: >> I guess I'm having trouble tying the concept of empty ranges to any >> reality external to the database. > > That's true, but in the same sense as zero has no meaning outside of the > database. > > It's really that it has nice mathematical properties coming from set > theory. Take the distributive law: > > A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C) But the basic range type isn't even closed under UNION. It seems like what you really need if you want this stuff to work out smoothly is a MULTIRANGE type. Then this problem goes away: the union or intersection of two ranges is a multirange, and after that everything is closed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > Basically, with a type having well-defined granularity, a [) range > could usefully represent, "start to last used", and start out > empty. I guess that would actually be "start to next available".... -Kevin
On Fri, 2011-02-11 at 13:08 -0600, Kevin Grittner wrote: > It makes more sense in the context of a range of some type with a > clearly defined granularity. Our accounting system, for example, > can assign a new range of receipt IDs for each calendar year. If > you want a variable to represent the receipts for traffic receipts > for 2012, you might, in preparation for the upcoming year, define > something to declare the range as '[12TR000001,12TR000001)'. When > the first receipt is assigned as 12TR000001, that is updated to > '[12TR000001,12TR000002)'. Just as an off-the-cuff example. > > Basically, with a type having well-defined granularity, a [) range > could usefully represent, "start to last used", and start out empty. I think this is trying to make a range into something that's not. A range is a set of values with the additional constraint that there are no "gaps". Trying to incorporate a "start value" is adding extra information in there, and it's not really a part of the same algebra. It sounds more like a contiguous sequence with a "start value" and a "current value" to me. Sequences have other useful operations, like "next", and things like "overlaps" don't really seem to make sense (at least not in a practical way that I can tell). Regards,Jeff Davis
On Feb 11, 2011, at 10:28 AM, Josh Berkus wrote: > So, if we allow empty ranges of this kind, I would want a GUC for > "allow_empty_ranges". GUC you, josh. ;-P D
On Fri, 2011-02-11 at 13:50 -0500, Robert Haas wrote: > On Fri, Feb 11, 2011 at 1:11 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > Similarly, "intersection" of ranges is somewhat analogous to > > multiplication of numbers. > > I had a feeling that we might be going in this direction. It strikes > me that this case is a bit like division by zero. Except that we do happen to allow the value zero and wait 'til someone divides by it before throwing an error. So I think that's more of a point toward allowing empty ranges than rejecting them. > It's kind of a > nuisance that dividing by zero throws an error and we COULD fix that > by making it return NULL or NaN or some new distinguished value DbZ. But empty ranges are actually quite well-defined, in a way similar to an empty set. * it can meaningfully result in a non-empty range at a later stage of computation * it increases the number of tautologies, rather than decreasing them like NULL I guess what I'm saying is that DbZ doesn't seem particularly useful to carry along, while and empty range plausibly is. > But then we'd have to define what happens when you feed DbZ into every > other operation in the system, and similarly here. If your point is that empty ranges need to be handled specially sometimes, I agree. That is the semantic cost which I identified in the original email. Are the benefits worth it? > If we define two > non-overlapping ranges as intersecting to NULL, or as throwing an > error, then everything else is clear after that. Well, there is a certain amount of localized clarity, I will agree with that. The complexity comes when you accidentally rely on some transformation which seems logically sound, but could result in a transient empty range, which then throws an error. Regards,Jeff Davis
On Fri, 2011-02-11 at 14:14 -0500, Robert Haas wrote: > > It's really that it has nice mathematical properties coming from set > > theory. Take the distributive law: > > > > A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C) > > But the basic range type isn't even closed under UNION. An excellent point. Allow me to move the target a little: WHERE A && B AND A && C and: WHERE A && (B INTERSECT C) That seems like a logically sound transformation, but if (B INTERSECT C) is empty, it relies on the empty range for those two to be equivalent. And that would be a runtime error, caught during testing only if you're lucky. Now, I agree that lack of closure on UNION exhibits many of the problems that I am pointing out related to forbidding empty ranges. However, I'm not sure if that means we should give up on either. Regards,Jeff Davis
On Fri, Feb 11, 2011 at 3:03 PM, Jeff Davis <pgsql@j-davis.com> wrote: > Well, there is a certain amount of localized clarity, I will agree with > that. The complexity comes when you accidentally rely on some > transformation which seems logically sound, but could result in a > transient empty range, which then throws an error. But by this argument you also need to support discontiguous ranges, don't you? I mean, if you want to insist that A intersect B has to still be a legal range, what about A union B? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Jeff Davis <pgsql@j-davis.com> wrote: > Trying to incorporate a "start value" is adding extra information > in there, and it's not really a part of the same algebra. It > sounds more like a contiguous sequence with a "start value" and a > "current value" to me. Well, in the receipt number example there are multiple ranges in use for each year, and ranges for multiple years. If we get to the idea of a multi-ranges, this would be very handy for certain types of reports -- especially for auditors. It's not that we can't do with with discrete begin and end columns -- we do that now; but it seemed a potentially beneficial use of ranges for us, if they can represent the needed states. People already talk about these as ranges, just in terms of the common English understanding of the word. Perhaps it was a mistake to get so concrete rather than conceptual -- basically, it seems like it could be a useful concept for any planned or scheduled range with an indeterminate end point, which you want to "reserve" up front and record in progress until complete. The alternative would be that such "ranges to be" have a parallel "planned start value" column of the same type as the range, to be used as the start of the range once it is not empty. Or, as another way to put it, it seems potentially useful to me to have an empty range which is pinned to a location, in *addition* to the unpinned empty ranges such as would be needed to represent the intersection of two non-overlapping ranges. Of course, the *most* useful places for our shop to have ranges are temporal. Many (most?) of those are situations where you start with a range with an unknown end and later (often years later) fill in the end of the range based on some event which finally closes it. Again, two discreet dates with a null-capable end-date work, but I can see where ranges could potentially be more powerful. -Kevin
On Fri, 2011-02-11 at 14:19 -0600, Kevin Grittner wrote: > Well, in the receipt number example there are multiple ranges in use > for each year, and ranges for multiple years. If we get to the idea > of a multi-ranges, this would be very handy for certain types of > reports -- especially for auditors. It's not that we can't do with > with discrete begin and end columns -- we do that now; but it seemed > a potentially beneficial use of ranges for us, if they can represent > the needed states. People already talk about these as ranges, just > in terms of the common English understanding of the word. I think that might indicate that the word "range" is a little too loose. The definition that I've been using is more like a mathematical interval. > Perhaps it was a mistake to get so concrete rather than conceptual > -- basically, it seems like it could be a useful concept for any > planned or scheduled range with an indeterminate end point, which > you want to "reserve" up front and record in progress until > complete. Maybe this is a range... would it be served by: (5, INF) or: [5, INF) ? That's already supported, and it means "all points greater than 5". > Of course, the *most* useful places for our shop to have ranges are > temporal. Many (most?) of those are situations where you start with > a range with an unknown end and later (often years later) fill in > the end of the range based on some event which finally closes it. > Again, two discreet dates with a null-capable end-date work, but I > can see where ranges could potentially be more powerful. Ranges support infinite boundaries, but do not support NULL (previous discussion concluded that NULL boundaries were likely to be confusing and served no obvious use case not handled by infinity). Regards,Jeff Davis
On Fri, 2011-02-11 at 15:14 -0500, Robert Haas wrote: > On Fri, Feb 11, 2011 at 3:03 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > Well, there is a certain amount of localized clarity, I will agree with > > that. The complexity comes when you accidentally rely on some > > transformation which seems logically sound, but could result in a > > transient empty range, which then throws an error. > > But by this argument you also need to support discontiguous ranges, don't you? > > I mean, if you want to insist that A intersect B has to still be a > legal range, what about A union B? > I responded to a similar question/point here: http://archives.postgresql.org/pgsql-hackers/2011-02/msg01073.php Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote: >> Perhaps it was a mistake to get so concrete rather than >> conceptual -- basically, it seems like it could be a useful >> concept for any planned or scheduled range with an indeterminate >> end point, which you want to "reserve" up front and record in >> progress until complete. > > Maybe this is a range... would it be served by: > (5, INF) > or: > [5, INF) > ? > > That's already supported, and it means "all points greater than > 5". Well that would create a range of infinite size until you started using it, at which point it would drop to tiny and start to increase until completion, which would be kinda weird as well as error-prone. At least for the uses I'm considering. I'm sure we can work around it if it isn't supported, but for our purposes, [x,x) ranges would be a useful construct. >> Of course, the *most* useful places for our shop to have ranges >> are temporal. Many (most?) of those are situations where you >> start with a range with an unknown end and later (often years >> later) fill in the end of the range based on some event which >> finally closes it. Again, two discreet dates with a null-capable >> end-date work, but I can see where ranges could potentially be >> more powerful. > > Ranges support infinite boundaries, but do not support NULL > (previous discussion concluded that NULL boundaries were likely to > be confusing and served no obvious use case not handled by > infinity). Yeah, infinity works fine as long as you realize it's only temporary. I don't know whether there are any practical uses where on the low end it doesn't mean "some particular date we don't know or don't care enough to determine" and on the high end doesn't mean "some particular date which we don't know or which hasn't yet been fixed". For the latter we have, so far, been using NULL, since we were trying to stay portable and not all products support infinity. We're committing to PostgreSQL now, so we could start to use infinity for these temporarily indefinite values, and I would expect to do so in ranges. -Kevin
FWIW, a very informal survey of probabilists didn't yield any reason for trying to put an order on the empty set ( unless the metric was cardinality or other equivalence relation ). I think the problem here is that the idea of union and intersection forming a ring over sets is being conflated with the order relation. Clearly, combining the two notions can be inconsistent. However... >> > A UNION (B INTERSECT C) = (A UNION B) INTERSECT (A UNION C) >> >> But the basic range type isn't even closed under UNION. > > An excellent point. Allow me to move the target a little: > > WHERE A && B AND A && C > and: > WHERE A && (B INTERSECT C) > > That seems like a logically sound transformation, but if (B INTERSECT C) > is empty, it relies on the empty range for those two to be equivalent. > > Now, I agree that lack of closure on UNION exhibits many of the problems > that I am pointing out related to forbidding empty ranges. However, I'm > not sure if that means we should give up on either. This seems potentially very useful, because we can transform WHERE A && B AND A && C from a bitmap scan into WHERE A && (B INTERSECT C), a simple index scan. In the union case ( even if we had a type that supported disjoint intervals), I doubt we would ever make that transformation because the index will probably still be over connected intervals. So, +1 for keeping it how it is ( but maybe with a better error message ). Best, Nathan
On Fri, Feb 11, 2011 at 10:11:45AM -0800, Jeff Davis wrote: > The cost, of course, is that not all operations are well-defined for > empty ranges. I think those are mostly operators like those mentioned in > the other thread: ">>" (strictly right of), "<<" (strictly left of), and > "-|-" (adjacent); and perhaps "&>" and "&<". These are probably used a > little less frequently, and should probably not be used in a context > where empty ranges are permitted (if they are, it's likely a mistake and > an error should be thrown). But surely this is just a matter of your definitions? If you define "A strictly left of B" as "all points in A are less than all points in B", then you might have a problem with "all points" of an empty range. If you define it as "not (any points in A to the right of any points in B)" then the answer for an empty range is well defined (namely true). I think once you make workable definitions then empty ranges should pose no problems. Sure, it may be that an empty range will be both to the left and to the right of every other set. That doesn't make it wrong. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On 2/11/2011 1:50 PM, Kevin Grittner wrote: > Josh Berkus<josh@agliodbs.com> wrote: > >> if I, in one of my applications, accidentally defined something >> as having the range '('15:15:00','15:15:00')', I would *want* the >> database to through an error and not accept it. > > I can agree with that, but I think that range '[15:15:00,15:15:00)' > should be valid as a zero-length range between, for example, > '[15:00:00,15:15:00)' and '[15:15:00,15:30:00)'. Does ['15:15:00','15:15:00') make any more sense? Doesn't this essentially mean >= '15:15:00' && < '15:15:00' which again doesn't include a single point on the time line? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> wrote: > Does ['15:15:00','15:15:00') make any more sense? Doesn't this > essentially mean > > >= '15:15:00' && < '15:15:00' > > which again doesn't include a single point on the time line? It defines a position in time with zero duration. Some of the graphics programming I've done in the past was based on a system where, at the pixel level, the coordinates referred to the boundaries *between* the pixels. If you were to have a number of horizontal bars, for example, the size of which you would adjust to represent fluctuating data, you might have an x coordinate of 20 for the left edge, and [20,30) would paint 10 pixels. I guess you *could* destroy and recreate the object when the number dropped to zero and came off it again; but the concept of [20,20) to draw zero pixels but maintain the positional anchor can be convenient. I see parallel concepts for some data domains in a database. -Kevin