Range Types, discrete and/or continuous - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Range Types, discrete and/or continuous |
Date | |
Msg-id | 1287956256.8516.792.camel@jdavis Whole thread Raw |
Responses |
Re: Range Types, discrete and/or continuous
Re: Range Types, discrete and/or continuous Re: Range Types, discrete and/or continuous |
List | pgsql-hackers |
I'd like to open the discussion for Range Types again. This is a fairly long email because several issues are intertwined, and I don't think they can be neatly pulled apart. Previous discussions: http://archives.postgresql.org/pgsql-hackers/2009-12/msg01162.php http://archives.postgresql.org/pgsql-hackers/2010-10/msg01126.php Wiki page: http://wiki.postgresql.org/wiki/RangeTypes Last development cycle, one of the questions that was unresolved was whether to handle ranges like a discrete set (that is, [1,5) = [1,4] ) or continuous or both. I think that discrete ranges are required. For instance, day range and IP address ranges are two examples where treating them continuously would clearly cause confusion. "Monday until Thursday" is the same as "Monday through Wednesday," and it would be dangerous to treat them as different values. Timestamp ranges are a little trickier because the resolution is much finer. There's a certain intuition for either representation:* We could look at them as continuous, and say that the resolutionis merely an implementation detail.* We could look at them as a discrete range, saying that it just happens to have a very fine granularity. I don't think intuition helps us much here, because different people have different intuitions for different problems. The second seems more mathematically sound to me, because the resolution will ultimately affect the semantics at some point (whether it's seconds or days), so it can't be purely an implementation detail. For instance, let's say you have two ranges: R1 = [ 2009-01-01 01:00:00, 2009-01-01 01:00:10 ]R2 = [ 2009-01-01 01:00:00, 2009-01-01 01:00:10.000001 ) If we treat those as discrete, then R1 = R2, R1 contains R2, R2 contains R1, and R2 - R1 = R1 - R2 = empty. However, if we treat those as continuous, then we get a contradiction: R2 contains R1R1 does not contain R2R2 - R1 = ( 2009-01-01 01:00:10, 2009-01-01 01:00:10.000001 ) = empty? So clearly, the granularity can make a difference in the semantics. It can only be treated as an implementation detail if we say "nobody will ever have timestamps that close," which seems pretty flimsy. I think everyone would agree that seconds are plenty large enough that conflicts are likely, and there are only a million microseconds per second. It's easy to imagine that such cases might appear if you are using interval math or if you are collecting data from multiple inputs constantly (network log data, sensor networks, etc). A couple problems though: one is float types. Technically they can be treated as discrete, but the granularity is variable, which may pose some implementation awkwardness (for instance, trying to make use of division or multiplication for granules won't work). I suppose we can write off float4/8 because the users should expect some fuzziness, and we can treat them as continuous. But if we want floating-point timestamps to be supported by range types at all, do we treat those as continuous while integer timestamps are discrete? Seems odd. Also, there's a good case for continuous ranges for types like NUMERIC, but still continuous ranges don't seem quite as important overall. One way to make NUMERIC ranges discrete is by specifying a granule size; and that has the added benefit that you can use that for other types as well (like timestamp ranges where you only care about second granularity). But then there are problems, like someone trying to say "float4 range with granularity 0.1", which is asking for trouble. I suppose we could say again that people using float4 are expecting some fuzziness. The way I see it, there are three approaches to resolving these issues: 1. Support only discrete ranges, possibly allowing the specification of the granule. 2. Support two different ways to specify range types: one for discrete and one for continuous. 3. Combine the discrete and continuous interfaces by making the API a higher level and forcing the type definition to make all of the decisions. I think it's possible to do this in such a way that postgres doesn't really care whether a type is continuous or discrete, based on some ideas from Nathan Boley. This is "Approach 2" on wiki page. #1 allows for #2 to happen later, so I don't think it makes sense to choose #2 at this stage. However, #1 may preclude doing #3 later. Right now I'm leaning toward #3. I like this approach quite a lot, and it only has two real disadvantages: * Harder to do sanity checks at the time of definition. * No easy way to declare a range type that's slightly differentfrom an existing type. For instance, you can't just specify a different granule with simple syntax, you'd need to handle that in the functions used to define the range type. I don't think either of those things are a major problem. Defining a new range type requires a bit of thought to get right, and I don't think we can come up with one interface that forces users to get it right. So, it's reasonable to say that defining a new range type that's not built-in should be an extension, rather than just a little DDL. The important part of this feature is that postgres will know what types are range types and how to manipulate them. Also, if we start with #3, we might be able to add some syntax sugar on top to make it easy to define new range types with simple DDL in some simple situations. Thoughts? Regards,Jeff Davis
pgsql-hackers by date: