Thread: WIP: Range Types

WIP: Range Types

From
Jeff Davis
Date:
I have been updating my work in progress here:

http://git.postgresql.org/gitweb?p=users/jdavis/postgres.git;a=log;h=refs/heads/rangetypes

Right now, it's not in a reviewable state, but those interested can
glance through the code.

Quick synopsis (for illustration purposes only; don't expect much from
the current code):
 CREATE TYPE numrange   AS RANGE (SUBTYPE=numeric, SUBTYPE_CMP=numeric_cmp);

Much of the previous discussion seemed to focus on two issues:


1. Reconciling discrete ranges (like ranges of integers) and continuous
ranges (like ranges of numeric).

I liked Robert's suggestion here:

http://archives.postgresql.org/message-id/AANLkTiks_x93_k82b4f_ga634wCi0oeb9fTrUrF28EGM@mail.gmail.com

which says that the user can just define a "canonicalize" function that
will take a range as input (or perhaps the logical pieces of a range)
and put it into an appropriate canonical representation. For instance,
int4range_canonical might take (1,4] and turn it into [2,4]. This is
similar to a few other ideas, but Robert's idea seems to require the
least effort by the person defining the range type, because postgresql
can still handle representation.

It doesn't allow for all of the suggested features. In particular, it
would not allow "granules" to be specified for discrete ranges. But on
balance, it seems like this is the most conceptually simple and I think
it satisfies the primary use cases.


2. Representational issues. There are many possibilities here: a. flags for inclusivity, start, and offset b. flags for
inclusivity,start, and end c. if it's a discrete range, start and end only might suffice d. if it's a discrete range,
perhapssomething involving "granules"
 

(a) might be interesting, and for some data types might be more compact,
but it introduces a new datatype that is distinct from the range's
subtype: the "difference type" (that is, for timestamps it's
"interval"). This approach seemed reasonable on paper, but it involves a
lot of extra complexity, and introduces some strange assumptions (using
an offset of "1 month" versus "30 days" can't be allowed).

(c) and (d) are rejected because they require different code paths for
discrete and continuous ranges.

I chose (b). This is the simplest. If desired, we could still allow the
user to specify their own serialize/deserialize functions, which can get
most of the benefits of the other ones anyway.


Other issues:

1. I plan to introduce an ANYRANGE type.

2. We need to use the subtype's IO functions, but those may not be
immutable. So, rather than create new IO functions for each range type,
I was thinking that I'd use just three (anyrange_i_in, anyrange_s_in,
and anyrange_v_in), and select the right one at definition time, based
on the subtype's IO functions' volatility. That seems like a bit of a
hack -- any better ideas?

3. Right now I allow user-defined parse/deparse functions to be
specified. In almost all cases, I would think that we want the text
format to be something like: [ 2010-01-01, 2011-01-01 )
where the brackets denote inclusivity, and the left and right sides can
be optionally double-quoted. Is it even worth having these parse/deparse
functions, or should we just force the "obvious" format?

4. For the GiST penalty function, and perhaps some picksplit algorithms,
it might be nice to know the length of a range, or do some other kinds
of math. It introduces a lot of complexity to try to define math
functions for each subtype, and try to make sure they behave sanely. So
I was thinking that the user might need to specify a function that
converts the subtype into a float that approximates a value's position
in the total order. Any better ideas?

Overall:

I think this is one of the simpler designs. Conceptually, defining new
ranges of different granularity with ease sounds like a great idea --
but it introduces a lot of complexity (and generated a lot of different
opinions), so it was not included in this design. Similarly, I am
leaning away from lots of user-specified options unless there is a real
use case.


Any suggestions or comments welcome.

Regards,Jeff Davis



Re: WIP: Range Types

From
Hitoshi Harada
Date:
2011/1/4 Jeff Davis <pgsql@j-davis.com>:
> I have been updating my work in progress here:
>
> http://git.postgresql.org/gitweb?p=users/jdavis/postgres.git;a=log;h=refs/heads/rangetypes
>
> Right now, it's not in a reviewable state, but those interested can
> glance through the code.
>
> Quick synopsis (for illustration purposes only; don't expect much from
> the current code):
>
>  CREATE TYPE numrange
>    AS RANGE (SUBTYPE=numeric, SUBTYPE_CMP=numeric_cmp);

I am interested in how you define increment/decrement operation of
range value in discrete types. The window functions and PARTITION also
want to represent RANGE but there's no clear solution.

Sorry if it's already been discussed since I didn't track the threads.

Regards


--
Hitoshi Harada


Re: WIP: Range Types

From
Florian Weimer
Date:
* Jeff Davis:

> 4. For the GiST penalty function, and perhaps some picksplit algorithms,
> it might be nice to know the length of a range, or do some other kinds
> of math. It introduces a lot of complexity to try to define math
> functions for each subtype, and try to make sure they behave sanely. So
> I was thinking that the user might need to specify a function that
> converts the subtype into a float that approximates a value's position
> in the total order.

Doesn't the eqsel hint already provide this information?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: WIP: Range Types

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> 2. We need to use the subtype's IO functions, but those may not be
> immutable. So, rather than create new IO functions for each range type,
> I was thinking that I'd use just three (anyrange_i_in, anyrange_s_in,
> and anyrange_v_in), and select the right one at definition time, based
> on the subtype's IO functions' volatility. That seems like a bit of a
> hack -- any better ideas?

You should just do what we do for arrays and records, ie, mark the I/O
functions stable.  There is no reason for anyrange to have a more
complicated approach to this than the existing composite-type structures
do.  See discussion thread here
http://archives.postgresql.org/pgsql-hackers/2010-07/msg00932.php
and commit here
http://archives.postgresql.org/pgsql-committers/2010-07/msg00307.php

> 3. Right now I allow user-defined parse/deparse functions to be
> specified. In almost all cases, I would think that we want the text
> format to be something like:
>   [ 2010-01-01, 2011-01-01 )
> where the brackets denote inclusivity, and the left and right sides can
> be optionally double-quoted. Is it even worth having these parse/deparse
> functions, or should we just force the "obvious" format?

+1 for forcing a single consistent format.  I compare this to the
Berkeley-era decision to let types specify nondefault array delimiters
--- that was flexibility that didn't help anybody, just resulted in
over-complicated code (or code that would fall over if someone tried
to actually use a delimiter other than comma...)
        regards, tom lane


Re: WIP: Range Types

From
Robert Haas
Date:
On Tue, Jan 4, 2011 at 2:29 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> I liked Robert's suggestion here:
>
> http://archives.postgresql.org/message-id/AANLkTiks_x93_k82b4f_ga634wCi0oeb9fTrUrF28EGM@mail.gmail.com
>
> which says that the user can just define a "canonicalize" function that
> will take a range as input (or perhaps the logical pieces of a range)
> and put it into an appropriate canonical representation. For instance,
> int4range_canonical might take (1,4] and turn it into [2,4]. This is
> similar to a few other ideas, but Robert's idea seems to require the
> least effort by the person defining the range type, because postgresql
> can still handle representation.
>
> It doesn't allow for all of the suggested features. In particular, it
> would not allow "granules" to be specified for discrete ranges. But on
> balance, it seems like this is the most conceptually simple and I think
> it satisfies the primary use cases.

Maybe I'm missing something, but it seems like this approach could
support granules.  You just have to define the canonicalize function
in terms of the granule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Jeff Davis
Date:
On Tue, 2011-01-04 at 12:21 -0500, Robert Haas wrote:
> > It doesn't allow for all of the suggested features. In particular, it
> > would not allow "granules" to be specified for discrete ranges. But on
> > balance, it seems like this is the most conceptually simple and I think
> > it satisfies the primary use cases.
> 
> Maybe I'm missing something, but it seems like this approach could
> support granules.  You just have to define the canonicalize function
> in terms of the granule.

I meant that it doesn't support them as an explicit, user-visible
concept.

The main drawback here is that only a select group of people will be
defining discrete range types at all, because it would require them to
define a function first. Perhaps that's for the best, because, (as Tom
pointed out) we don't want someone using floats and then specifying a
granule of '0.01'.

While we're talking about it, one question I had is: should the
canonicalize function be: /* works on the deserialized information right before serialization */ canonical(&flags,
&lower_bound,&upper_bound)
 
or /* works on the serialized form right after serialization */ range = canonical(range)

I would lean toward the latter because it's simpler on the user (and
allows non-C functions). But perhaps an efficiency argument could be
made for the former because it could avoid one round of
deserialize/reserialize when the representation is not already in
canonical form.

Regards,Jeff Davis





Re: WIP: Range Types

From
Jeff Davis
Date:
On Tue, 2011-01-04 at 14:18 +0000, Florian Weimer wrote:
> * Jeff Davis:
> 
> > 4. For the GiST penalty function, and perhaps some picksplit algorithms,
> > it might be nice to know the length of a range, or do some other kinds
> > of math. It introduces a lot of complexity to try to define math
> > functions for each subtype, and try to make sure they behave sanely. So
> > I was thinking that the user might need to specify a function that
> > converts the subtype into a float that approximates a value's position
> > in the total order.
> 
> Doesn't the eqsel hint already provide this information?
> 

Can you clarify what you mean? I don't know what the "eqsel hint" is.

Regards,Jeff Davis



Re: WIP: Range Types

From
Jeff Davis
Date:
On Tue, 2011-01-04 at 23:04 +0900, Hitoshi Harada wrote:
> 2011/1/4 Jeff Davis <pgsql@j-davis.com>:
> > I have been updating my work in progress here:
> >
> > http://git.postgresql.org/gitweb?p=users/jdavis/postgres.git;a=log;h=refs/heads/rangetypes
> >
> > Right now, it's not in a reviewable state, but those interested can
> > glance through the code.
> >
> > Quick synopsis (for illustration purposes only; don't expect much from
> > the current code):
> >
> >  CREATE TYPE numrange
> >    AS RANGE (SUBTYPE=numeric, SUBTYPE_CMP=numeric_cmp);
> 
> I am interested in how you define increment/decrement operation of
> range value in discrete types. The window functions and PARTITION also
> want to represent RANGE but there's no clear solution.
> 
> Sorry if it's already been discussed since I didn't track the threads.

The user would specify a "canonical" function like:
  CREATE TYPE int4range AS RANGE (SUBTYPE=int4, SUBTYPE_CMP=btint4cmp,    CANONICAL=my_int4range_canonical);

That function would be called when constructing ranges on input or after
a computation, and could change something like (1,4] into [2,4] if you
prefer the latter form.

So the range types would not have increments, decrements, granules, or
knowledge about the "difference" type (e.g. "interval" is the difference
type for timestamp).

What support do you need/want from range types to help with new window
function features?

Also, partitioning might have some use for range types to represent
range partitions. Comments are welcome.

Regards,Jeff Davis



Re: WIP: Range Types

From
Robert Haas
Date:
On Tue, Jan 4, 2011 at 1:18 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2011-01-04 at 12:21 -0500, Robert Haas wrote:
>> > It doesn't allow for all of the suggested features. In particular, it
>> > would not allow "granules" to be specified for discrete ranges. But on
>> > balance, it seems like this is the most conceptually simple and I think
>> > it satisfies the primary use cases.
>>
>> Maybe I'm missing something, but it seems like this approach could
>> support granules.  You just have to define the canonicalize function
>> in terms of the granule.
>
> I meant that it doesn't support them as an explicit, user-visible
> concept.
>
> The main drawback here is that only a select group of people will be
> defining discrete range types at all, because it would require them to
> define a function first. Perhaps that's for the best, because, (as Tom
> pointed out) we don't want someone using floats and then specifying a
> granule of '0.01'.
>
> While we're talking about it, one question I had is: should the
> canonicalize function be:
>  /* works on the deserialized information right before serialization */
>  canonical(&flags, &lower_bound, &upper_bound)
> or
>  /* works on the serialized form right after serialization */
>  range = canonical(range)
>
> I would lean toward the latter because it's simpler on the user (and
> allows non-C functions).

Yeah, me too.

> But perhaps an efficiency argument could be
> made for the former because it could avoid one round of
> deserialize/reserialize when the representation is not already in
> canonical form.

I believe this might be an appropriate time to apply Knuth's Law.  I'm
not thrilled with the amount of palloc overhead we have in the
backend, but absent some evidence that this case is going to be
particularly significant, I'd be disinclined to contort the interface.I suspect that if you run oprofile this won't be
thebottleneck. 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Josh Berkus
Date:
On 1/4/11 10:18 AM, Jeff Davis wrote:
> The main drawback here is that only a select group of people will be
> defining discrete range types at all, because it would require them to
> define a function first. Perhaps that's for the best, because, (as Tom
> pointed out) we don't want someone using floats and then specifying a
> granule of '0.01'.

Frankly, I'm still not convinced that *anyone* will really need discrete
range types -- as opposed to continuous range types, which I'm already
using in production ala "temporal".  So I'm completely OK with making
discrete range types hard to use, as long as continous range types are
easy to use.

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: WIP: Range Types

From
Jeff Davis
Date:
On Tue, 2011-01-04 at 16:45 -0800, Josh Berkus wrote:
> On 1/4/11 10:18 AM, Jeff Davis wrote:
> > The main drawback here is that only a select group of people will be
> > defining discrete range types at all, because it would require them to
> > define a function first. Perhaps that's for the best, because, (as Tom
> > pointed out) we don't want someone using floats and then specifying a
> > granule of '0.01'.
> 
> Frankly, I'm still not convinced that *anyone* will really need discrete
> range types 

Well, *need* is a standard that can never be met. But with something
like a date range, it's very possible that a discrete version matches
the real-world problem more closely than a continuous one.

If you use only continuous ranges, then be careful to stick with exactly
one convention, or you will likely get wrong results (I think this point
has already been established). That sounds easy, but consider:* If you want to know whether two ranges are adjacent (a
common
requirement), then you need to use "[ )" or "( ]".* If you need to map a single point into a range, the only thing
that
makes sense is "[ ]".* If your query contains current_date, you'll probably want ranges that
are either in "( ]" or "[ ]" form.* If you are mixing data sets, they may use different conventions.

You can work around all of these problems by making the query more
complex (and more error-prone). But I wouldn't like to give up on
discrete ranges for types where it really makes sense (dates, IPs,
integers).

Regards,Jeff Davis



Re: WIP: Range Types

From
Robert Haas
Date:
On Wed, Jan 5, 2011 at 12:54 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2011-01-04 at 16:45 -0800, Josh Berkus wrote:
>> On 1/4/11 10:18 AM, Jeff Davis wrote:
>> > The main drawback here is that only a select group of people will be
>> > defining discrete range types at all, because it would require them to
>> > define a function first. Perhaps that's for the best, because, (as Tom
>> > pointed out) we don't want someone using floats and then specifying a
>> > granule of '0.01'.
>>
>> Frankly, I'm still not convinced that *anyone* will really need discrete
>> range types
>
> Well, *need* is a standard that can never be met. But with something
> like a date range, it's very possible that a discrete version matches
> the real-world problem more closely than a continuous one.
>
> If you use only continuous ranges, then be careful to stick with exactly
> one convention, or you will likely get wrong results (I think this point
> has already been established). That sounds easy, but consider:
>  * If you want to know whether two ranges are adjacent (a common
> requirement), then you need to use "[ )" or "( ]".
>  * If you need to map a single point into a range, the only thing that
> makes sense is "[ ]".
>  * If your query contains current_date, you'll probably want ranges that
> are either in "( ]" or "[ ]" form.
>  * If you are mixing data sets, they may use different conventions.
>
> You can work around all of these problems by making the query more
> complex (and more error-prone). But I wouldn't like to give up on
> discrete ranges for types where it really makes sense (dates, IPs,
> integers).

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Hitoshi Harada
Date:
2011/1/5 Jeff Davis <pgsql@j-davis.com>:
> On Tue, 2011-01-04 at 23:04 +0900, Hitoshi Harada wrote:
>> >  CREATE TYPE numrange
>> >    AS RANGE (SUBTYPE=numeric, SUBTYPE_CMP=numeric_cmp);
>>
>> I am interested in how you define increment/decrement operation of
>> range value in discrete types. The window functions and PARTITION also
>> want to represent RANGE but there's no clear solution.
>
> The user would specify a "canonical" function like:
>
>   CREATE TYPE int4range AS RANGE (SUBTYPE=int4, SUBTYPE_CMP=btint4cmp,
>     CANONICAL=my_int4range_canonical);
>
> That function would be called when constructing ranges on input or after
> a computation, and could change something like (1,4] into [2,4] if you
> prefer the latter form.
>
> So the range types would not have increments, decrements, granules, or
> knowledge about the "difference" type (e.g. "interval" is the difference
> type for timestamp).

To canonicalize, it might be ok. I wonder if you won't operate on the
range types like extending their bounds or iterate/enum values from
start to end. In such situation, I bet you'll need to know how to walk
values step by step.

> What support do you need/want from range types to help with new window
> function features?
>
My argument is here:
http://archives.postgresql.org/message-id/AANLkTimFmQmbzJ5CTXvE_PwT_zmCuHPoet3gaQq6Pvo8@mail.gmail.com

For any type to calculate boundary based on RANGE clause in window
functions, we need some type interface mechanism in the core to know
how to add / subtract values to reach the boundary from the current
value. For example,

SELECT count(*) OVER (ORDER BY n_int RANGE BETWEEN 10 PRECEDING AND 5
FOLLOWING) FROM tbl;

In the standard, the types allowed in RANGE are only int, float, date,
timestamp, etc. but we have various extensible data types as you know
and we couldn't assume '+' / '-' operators tell add /subtract
operation absolutely.

> Also, partitioning might have some use for range types to represent
> range partitions. Comments are welcome.

I heard about partitioning which may have the same problem in RANGE
clause from Itagaki-san, but have not looked so much yet.

Regards,

--
Hitoshi Harada


Re: WIP: Range Types

From
Hitoshi Harada
Date:
2011/1/5 Jeff Davis <pgsql@j-davis.com>:
> On Tue, 2011-01-04 at 23:04 +0900, Hitoshi Harada wrote:
>> >  CREATE TYPE numrange
>> >    AS RANGE (SUBTYPE=numeric, SUBTYPE_CMP=numeric_cmp);
>>
>> I am interested in how you define increment/decrement operation of
>> range value in discrete types. The window functions and PARTITION also
>> want to represent RANGE but there's no clear solution.
>
> The user would specify a "canonical" function like:
>
>   CREATE TYPE int4range AS RANGE (SUBTYPE=int4, SUBTYPE_CMP=btint4cmp,
>     CANONICAL=my_int4range_canonical);
>
> That function would be called when constructing ranges on input or after
> a computation, and could change something like (1,4] into [2,4] if you
> prefer the latter form.
>
> So the range types would not have increments, decrements, granules, or
> knowledge about the "difference" type (e.g. "interval" is the difference
> type for timestamp).

To canonicalize, it might be ok. I wonder if you won't operate on the
range types like extending their bounds or iterate/enum values from
start to end. In such situation, I bet you'll need to know how to walk
values step by step.

> What support do you need/want from range types to help with new window
> function features?
>
My argument is here:
http://archives.postgresql.org/message-id/AANLkTimFmQmbzJ5CTXvE_PwT_zmCuHPoet3gaQq6Pvo8@mail.gmail.com

For any type to calculate boundary based on RANGE clause in window
functions, we need some type interface mechanism in the core to know
how to add / subtract values to reach the boundary from the current
value. For example,

SELECT count(*) OVER (ORDER BY n_int RANGE BETWEEN 10 PRECEDING AND 5
FOLLOWING) FROM tbl;

In the standard, the types allowed in RANGE are only int, float, date,
timestamp, etc. but we have various extensible data types as you know
and we couldn't assume '+' / '-' operators tell add /subtract
operation absolutely.

> Also, partitioning might have some use for range types to represent
> range partitions. Comments are welcome.

I heard about partitioning which may have the same problem in RANGE
clause from Itagaki-san, but have not looked so much yet.

Regards,

--
Hitoshi Harada


Re: WIP: Range Types

From
David Fetter
Date:
On Thu, Jan 06, 2011 at 02:25:01AM +0900, Hitoshi Harada wrote:
> 2011/1/5 Jeff Davis <pgsql@j-davis.com>:
> > On Tue, 2011-01-04 at 23:04 +0900, Hitoshi Harada wrote:
> >> >  CREATE TYPE numrange
> >> >    AS RANGE (SUBTYPE=numeric, SUBTYPE_CMP=numeric_cmp);
> >>
> >> I am interested in how you define increment/decrement operation
> >> of range value in discrete types. The window functions and
> >> PARTITION also want to represent RANGE but there's no clear
> >> solution.
> >
> > The user would specify a "canonical" function like:
> >
> >   CREATE TYPE int4range AS RANGE (SUBTYPE=int4, SUBTYPE_CMP=btint4cmp,
> >     CANONICAL=my_int4range_canonical);
> >
> > That function would be called when constructing ranges on input or after
> > a computation, and could change something like (1,4] into [2,4] if you
> > prefer the latter form.
> >
> > So the range types would not have increments, decrements, granules, or
> > knowledge about the "difference" type (e.g. "interval" is the difference
> > type for timestamp).
> 
> To canonicalize, it might be ok.  I wonder if you won't operate on
> the range types like extending their bounds or iterate/enum values
> from start to end.  In such situation, I bet you'll need to know how
> to walk values step by step.
> 
> > What support do you need/want from range types to help with new window
> > function features?
> >
> My argument is here:
> http://archives.postgresql.org/message-id/AANLkTimFmQmbzJ5CTXvE_PwT_zmCuHPoet3gaQq6Pvo8@mail.gmail.com
> 
> For any type to calculate boundary based on RANGE clause in window
> functions, we need some type interface mechanism in the core to know
> how to add / subtract values to reach the boundary from the current
> value.  For example,
> 
> SELECT count(*) OVER (ORDER BY n_int RANGE BETWEEN 10 PRECEDING AND 5
> FOLLOWING) FROM tbl;

I'm not sure I get the connection between this type of range and the
"range types" Jeff is working on.  Jeff's work involves a way to
create types which represent ranges over types which have some kind of
ordering, although not necessarily a successor operation.

Had you planned to cast to an integer range in the process of doing
this window?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: WIP: Range Types

From
Jeff Davis
Date:
On Wed, 2011-01-05 at 10:41 -0800, David Fetter wrote:
> On Thu, Jan 06, 2011 at 02:25:01AM +0900, Hitoshi Harada wrote:
> > For any type to calculate boundary based on RANGE clause in window
> > functions, we need some type interface mechanism in the core to know
> > how to add / subtract values to reach the boundary from the current
> > value.  For example,
> > 
> > SELECT count(*) OVER (ORDER BY n_int RANGE BETWEEN 10 PRECEDING AND 5
> > FOLLOWING) FROM tbl;
> 
> I'm not sure I get the connection between this type of range and the
> "range types" Jeff is working on.  Jeff's work involves a way to
> create types which represent ranges over types which have some kind of
> ordering, although not necessarily a successor operation.
> 
> Had you planned to cast to an integer range in the process of doing
> this window?

I don't think Harada-san needs the type infrastructure itself, just the
interface to find the "difference type" (e.g. "interval" from
"timestamp") as well as functions like add and subtract (potentially two
interesting subtract functions). Without knowing which function to call,
there is no way to find the window boundaries given the current row.

The current design for range types doesn't ask for add or subtract.
Although it might be interesting to try to use such an interface for
range types, it introduces a lot of complexity and makes it easier to
cause subtle problems (consider that addition of timestamps and
intervals is not commutative).

Even if add and subtract were associated with a range type, there's no
way to tell which range type to pick given the window function syntax
(multiple range types could be defined over the same subtype).

I think the interface question should be addressed more directly with a
"type interfaces" patch.

Regards,Jeff Davis





Re: WIP: Range Types

From
Hitoshi Harada
Date:
2011/1/6 Jeff Davis <pgsql@j-davis.com>:
> Even if add and subtract were associated with a range type, there's no
> way to tell which range type to pick given the window function syntax
> (multiple range types could be defined over the same subtype).
>
> I think the interface question should be addressed more directly with a
> "type interfaces" patch.

I agree the current canonical approach fits range type's demand, and
I'm inclined that the type interface discussion is another point.
IIRC, Robert Haas originally began to propose the idea of type
interface to get together three of KNN-GIST, range type and window
frame issue. For KNN-GIST, it was committed by extending pg_amop
without considering others and range type will be as well. Not getting
them together might be the answer.

Regards,

-- 
Hitoshi Harada


Re: WIP: Range Types

From
Jeff Davis
Date:
On Thu, 2011-01-06 at 09:30 +0900, Hitoshi Harada wrote:
> Robert Haas originally began to propose the idea of type
> interface to get together three of KNN-GIST, range type and window
> frame issue. For KNN-GIST, it was committed by extending pg_amop
> without considering others and range type will be as well. Not getting
> them together might be the answer.

We may end up combining all of these concepts into type interfaces
later. Now that we have multiple potential users of type interfaces, it
will be easier to design type interfaces to work well for all of them.

Regards,Jeff Davis



Re: WIP: Range Types

From
Jeff Davis
Date:
On Wed, 2011-01-05 at 12:07 -0800, Jeff Davis wrote:
> The current design for range types doesn't ask for add or subtract.
> Although it might be interesting to try to use such an interface for
> range types, it introduces a lot of complexity and makes it easier to
> cause subtle problems (consider that addition of timestamps and
> intervals is not commutative).

A consequence of this design is that some generic range functions, like
"length" or "distance" would need to rely on the polymorphism of "+" and
"-" to work.

I'm also not sure if a constructor like "range(start, offset) returns
anyrange" could be made to work generically at all, because the start
and offset may be two different types (and a function that takes
ANYELEMENT requires that all ANYELEMENT arguments are the same type).

Does anyone see a problem with that?

Regards,Jeff Davis



Re: WIP: Range Types

From
Robert Haas
Date:
On Thu, Jan 6, 2011 at 12:32 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2011-01-05 at 12:07 -0800, Jeff Davis wrote:
>> The current design for range types doesn't ask for add or subtract.
>> Although it might be interesting to try to use such an interface for
>> range types, it introduces a lot of complexity and makes it easier to
>> cause subtle problems (consider that addition of timestamps and
>> intervals is not commutative).
>
> A consequence of this design is that some generic range functions, like
> "length" or "distance" would need to rely on the polymorphism of "+" and
> "-" to work.
>
> I'm also not sure if a constructor like "range(start, offset) returns
> anyrange" could be made to work generically at all, because the start
> and offset may be two different types (and a function that takes
> ANYELEMENT requires that all ANYELEMENT arguments are the same type).
>
> Does anyone see a problem with that?

Seems like you could make people who want that write range(start,
start+offset) instead without too much pain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Florian Weimer
Date:
* Jeff Davis:

> On Tue, 2011-01-04 at 14:18 +0000, Florian Weimer wrote:
>> * Jeff Davis:
>>
>> > 4. For the GiST penalty function, and perhaps some picksplit algorithms,
>> > it might be nice to know the length of a range, or do some other kinds
>> > of math. It introduces a lot of complexity to try to define math
>> > functions for each subtype, and try to make sure they behave sanely. So
>> > I was thinking that the user might need to specify a function that
>> > converts the subtype into a float that approximates a value's position
>> > in the total order.
>>
>> Doesn't the eqsel hint already provide this information?

> Can you clarify what you mean? I don't know what the "eqsel hint" is.

Uhm, it's not not eqsel, but the RESTRICT clause on operators:

<http://www.postgresql.org/docs/8.4/static/xoper-optimization.html>

I'm wondering if one of these hint functions can be reused to compute
range lengths.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: WIP: Range Types

From
Jeff Davis
Date:
On Fri, 2011-01-07 at 08:21 +0000, Florian Weimer wrote:
> <http://www.postgresql.org/docs/8.4/static/xoper-optimization.html>
> 
> I'm wondering if one of these hint functions can be reused to compute
> range lengths.

Interesting idea.

However, I don't really see a way to make that work. These functions are
tied to the operator, so it would be awkward to try to connect it to the
GiST support functions. Also, it doesn't seem to be an exact fit,
because the RESTRICT function is used to compute the selectivity as of
right now using current statistics.

Regards,Jeff Davis



Re: WIP: Range Types

From
Jeff Davis
Date:
When writing the generic range output function, it needs to know the
specific range type in order to call the subtype's output function.

Records accomplish this by using a special cache based on the typmod,
apparently, which looks like a hack to me.

Arrays accomplish this by storing the specific type in every array
value. That seems very wasteful in the case of range types (which only
hold two values).

I thought I could get away with using get_fn_expr_argtype() for most of
the generic functions, but apparently that can't always provide an
answer.

Any ideas? Maybe, with alignment and a "flags" byte (to hold
inclusivity, infinite boundaries, etc.), the extra 4 bytes doesn't cost
much, anyway?

Regards,Jeff Davis



Re: WIP: Range Types

From
Robert Haas
Date:
On Sat, Jan 8, 2011 at 3:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Any ideas? Maybe, with alignment and a "flags" byte (to hold
> inclusivity, infinite boundaries, etc.), the extra 4 bytes doesn't cost
> much, anyway?

I'd be really reluctant to bloat the range representation by 4 bytes
to support an anyrange type.  Better to defer this until the great day
when we get a better typmod system, at least IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Jeff Davis
Date:
On Sat, 2011-01-08 at 15:47 -0500, Robert Haas wrote:
> On Sat, Jan 8, 2011 at 3:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > Any ideas? Maybe, with alignment and a "flags" byte (to hold
> > inclusivity, infinite boundaries, etc.), the extra 4 bytes doesn't cost
> > much, anyway?
> 
> I'd be really reluctant to bloat the range representation by 4 bytes
> to support an anyrange type.  Better to defer this until the great day
> when we get a better typmod system, at least IMHO.

Can you elaborate? How can we have generic functions without ANYRANGE?

And without generic functions, how do we make it easy for users to
specify a new range type?

Regards,Jeff Davis



Re: WIP: Range Types

From
Jeff Davis
Date:
On Sat, 2011-01-08 at 13:05 -0800, Jeff Davis wrote:
> On Sat, 2011-01-08 at 15:47 -0500, Robert Haas wrote:
> > On Sat, Jan 8, 2011 at 3:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > > Any ideas? Maybe, with alignment and a "flags" byte (to hold
> > > inclusivity, infinite boundaries, etc.), the extra 4 bytes doesn't cost
> > > much, anyway?
> > 
> > I'd be really reluctant to bloat the range representation by 4 bytes
> > to support an anyrange type.  Better to defer this until the great day
> > when we get a better typmod system, at least IMHO.
> 
> Can you elaborate? How can we have generic functions without ANYRANGE?
> 
> And without generic functions, how do we make it easy for users to
> specify a new range type?

Another thought:

If we use timestamps, then that's 8 bytes each, meaning 16 bytes. Then,
there is the VARHDRSZ (now we're at 20), the flag byte (21), and the
range type oid (25). With alignment (if it's aligned at all), that's
either 28 or 32 bytes, which is starting to seem ridiculous.

Making it always varlena is kind of nice, because then if the upper or
lower bound is special (NULL or infinity), then we can omit it and save
some space. But I'm starting to think that it's not worth it, and we
should detect whether the subtype is fixed, and if so, make the range
type fixed length. That will save on the varlena header.

Any suggestions on how to represent/align these ranges?

Regards,Jeff Davis



Re: WIP: Range Types

From
Robert Haas
Date:
On Sat, Jan 8, 2011 at 4:05 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sat, 2011-01-08 at 15:47 -0500, Robert Haas wrote:
>> On Sat, Jan 8, 2011 at 3:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > Any ideas? Maybe, with alignment and a "flags" byte (to hold
>> > inclusivity, infinite boundaries, etc.), the extra 4 bytes doesn't cost
>> > much, anyway?
>>
>> I'd be really reluctant to bloat the range representation by 4 bytes
>> to support an anyrange type.  Better to defer this until the great day
>> when we get a better typmod system, at least IMHO.
>
> Can you elaborate? How can we have generic functions without ANYRANGE?
>
> And without generic functions, how do we make it easy for users to
> specify a new range type?

Oh, hmm.  What generic functions did you have in mind?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Robert Haas
Date:
On Sat, Jan 8, 2011 at 6:06 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> If we use timestamps, then that's 8 bytes each, meaning 16 bytes. Then,
> there is the VARHDRSZ (now we're at 20), the flag byte (21), and the
> range type oid (25). With alignment (if it's aligned at all), that's
> either 28 or 32 bytes, which is starting to seem ridiculous.

It'll use the 1-byte varlena header format, which is unaligned.  So
you'll end up with 8 + 8 + 2 bytes = 18 bytes, unaligned.  Maybe you
could cram that down to 17 bytes unaligned with sufficient work, but
I'm not sure it's worth the complexity.  If you end up having to
include the type OID though that's pretty horrible; it adds another 4
bytes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Jeff Davis
Date:
On Sat, 2011-01-08 at 20:32 -0500, Robert Haas wrote:
> On Sat, Jan 8, 2011 at 4:05 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > On Sat, 2011-01-08 at 15:47 -0500, Robert Haas wrote:
> >> On Sat, Jan 8, 2011 at 3:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> >> > Any ideas? Maybe, with alignment and a "flags" byte (to hold
> >> > inclusivity, infinite boundaries, etc.), the extra 4 bytes doesn't cost
> >> > much, anyway?
> >>
> >> I'd be really reluctant to bloat the range representation by 4 bytes
> >> to support an anyrange type.  Better to defer this until the great day
> >> when we get a better typmod system, at least IMHO.
> >
> > Can you elaborate? How can we have generic functions without ANYRANGE?
> >
> > And without generic functions, how do we make it easy for users to
> > specify a new range type?
> 
> Oh, hmm.  What generic functions did you have in mind?

Well, input/output, comparisons, overlaps, intersection, minus, and all
the necessary GiST support functions.

Without generic functions, the only choices we have are:* force the user to write and specify them all -- which doesn't
leave
much left of my feature (I think the interface would be all that's
left).* somehow generate the functions at type creation time

Any other ideas?

Regards,Jeff Davis



Re: WIP: Range Types

From
Robert Haas
Date:
On Sat, Jan 8, 2011 at 9:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> Oh, hmm.  What generic functions did you have in mind?
>
> Well, input/output, comparisons, overlaps, intersection, minus, and all
> the necessary GiST support functions.
>
> Without generic functions, the only choices we have are:
>  * force the user to write and specify them all -- which doesn't leave
> much left of my feature (I think the interface would be all that's
> left).
>  * somehow generate the functions at type creation time
>
> Any other ideas?

Do they have to be good ideas?

I mean, one semi-obvious possibility is to write one set of C
functions that can have multiple SQL-level definitions bound to it.
Then when the function is called, it can peek at flinfo->fn_oid to
figure out which incarnation was called and then get the typo info
from there.  That's ugly, though.

It'd be really nice if we could just arrange for the info on which
type anyrange actually is at the moment to be available in the right
place.  Storing it on disk to work around that is pretty horrible, but
maybe there's no other reasonable option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Jeff Davis
Date:
On Sat, 2011-01-08 at 21:58 -0500, Robert Haas wrote:
> I mean, one semi-obvious possibility is to write one set of C
> functions that can have multiple SQL-level definitions bound to it.
> Then when the function is called, it can peek at flinfo->fn_oid to
> figure out which incarnation was called and then get the typo info
> from there.  That's ugly, though.

That would work, but it is pretty ugly. It means it would be very
difficult for users to write new generic functions, because they would
need to add a new catalog entry for every existing range type.

Then again, wasting 4 bytes per row is not ideal, either. And maybe
users could still write some useful generic functions if they were a
combination of other generic functions, using the polymorphic system.

> It'd be really nice if we could just arrange for the info on which
> type anyrange actually is at the moment to be available in the right
> place.  Storing it on disk to work around that is pretty horrible, but
> maybe there's no other reasonable option.

I was surprised when I saw the solution for records. Maybe we should
consider something like that as a last resort (if it's possible for
non-record types)? I'd rather give up typmod than anyrange.

It also might be worth figuring out why input functions get the type oid
and output functions do not. I see this comment above getTypeIOParam():
* As of PostgreSQL 8.1, output functions receive only the value
itself                                                                    * and not any auxiliary parameters, so the
nameof this routine is
 
now                                                                    * a bit of a misnomer ... it should be
getTypeInputParam.                                                                                 

So, why was it eliminated? If the type output function got the type OID,
would it be enough to use fn_expr_get_argtype() for the other generic
functions?

Regards,Jeff Davis



Re: WIP: Range Types

From
Robert Haas
Date:
On Sun, Jan 9, 2011 at 4:03 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> It also might be worth figuring out why input functions get the type oid
> and output functions do not. I see this comment above getTypeIOParam():
>
>  * As of PostgreSQL 8.1, output functions receive only the value
> itself
>  * and not any auxiliary parameters, so the name of this routine is
> now
>  * a bit of a misnomer ... it should be
> getTypeInputParam.
>
> So, why was it eliminated?

Good question.  The relevant commit is here:

commit 6c412f0605afeb809014553ff7ad28cf9ed5526b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sun May 1 18:56:19 2005 +0000
   Change CREATE TYPE to require datatype output and send functions to have   only one argument.  (Per recent
discussion,the option to accept multiple   arguments is pretty useless for user-defined types, and would be a likely
sourceof security holes if it was used.)  Simplify call sites of   output/send functions to not bother passing more
thanone argument. 

...but I don't understand the motivation behind it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Martijn van Oosterhout
Date:
On Tue, Jan 11, 2011 at 09:24:08PM -0500, Robert Haas wrote:
> commit 6c412f0605afeb809014553ff7ad28cf9ed5526b
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sun May 1 18:56:19 2005 +0000
>
>     Change CREATE TYPE to require datatype output and send functions to have
>     only one argument.  (Per recent discussion, the option to accept multiple
>     arguments is pretty useless for user-defined types, and would be a likely
>     source of security holes if it was used.)  Simplify call sites of
>     output/send functions to not bother passing more than one argument.
>
> ...but I don't understand the motivation behind it.

IIRC, the issue is that a type output function has to interpret a
Datum. Type output functions can also be called by users, so it is not
guarenteed that the given OID would actually match the type of the
Datum given. Hence the decision that the output function must be able
to determine itself what kind of Datum it is dealing with.

Thought experiment: the datum is an integer, but the oid says it's a
pass-by-ref datum. Now the code may now to use the integer to derefence
an arbitrary place in memory.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: WIP: Range Types

From
Robert Haas
Date:
On Wed, Jan 12, 2011 at 2:16 AM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> On Tue, Jan 11, 2011 at 09:24:08PM -0500, Robert Haas wrote:
>> commit 6c412f0605afeb809014553ff7ad28cf9ed5526b
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   Sun May 1 18:56:19 2005 +0000
>>
>>     Change CREATE TYPE to require datatype output and send functions to have
>>     only one argument.  (Per recent discussion, the option to accept multiple
>>     arguments is pretty useless for user-defined types, and would be a likely
>>     source of security holes if it was used.)  Simplify call sites of
>>     output/send functions to not bother passing more than one argument.
>>
>> ...but I don't understand the motivation behind it.
>
> IIRC, the issue is that a type output function has to interpret a
> Datum. Type output functions can also be called by users, so it is not
> guarenteed that the given OID would actually match the type of the
> Datum given. Hence the decision that the output function must be able
> to determine itself what kind of Datum it is dealing with.
>
> Thought experiment: the datum is an integer, but the oid says it's a
> pass-by-ref datum. Now the code may now to use the integer to derefence
> an arbitrary place in memory.

I guess that begs the question of why we need to allow users to call
type output functions directly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
"David E. Wheeler"
Date:
On Jan 12, 2011, at 8:48 AM, Robert Haas wrote:

> I guess that begs the question of why we need to allow users to call
> type output functions directly.

I've used them quite a lot in the past; less so on 8.4+ where casting everything to text became a lot easier.

Best,

David



Re: WIP: Range Types

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of mié ene 12 13:48:27 -0300 2011:

> I guess that begs the question of why we need to allow users to call
> type output functions directly.

It used to be the case that that was the only way to run certain casts.
For example, see the pre-8.2 version of this:
http://wiki.postgresql.org/wiki/Pg_depend_display

I haven't needed to use that in a long time, but I am not sure if the
need has completely disappeared.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: WIP: Range Types

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Robert Haas's message of mié ene 12 13:48:27 -0300 2011:
>> I guess that begs the question of why we need to allow users to call
>> type output functions directly.

> It used to be the case that that was the only way to run certain casts.
> For example, see the pre-8.2 version of this:
> http://wiki.postgresql.org/wiki/Pg_depend_display

> I haven't needed to use that in a long time, but I am not sure if the
> need has completely disappeared.

The general point is that any out-of-band data transmitted to an output
function has to be trustworthy, and it has to be available at any place
that is going to call the output function.  The latter point tends to
put a crimp in any ideas of this sort anyway: if you can derive the info
you want at any arbitrary place in the system, why not derive it inside
the output function to start with?
        regards, tom lane


Re: WIP: Range Types

From
Robert Haas
Date:
On Wed, Jan 12, 2011 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Excerpts from Robert Haas's message of mié ene 12 13:48:27 -0300 2011:
>>> I guess that begs the question of why we need to allow users to call
>>> type output functions directly.
>
>> It used to be the case that that was the only way to run certain casts.
>> For example, see the pre-8.2 version of this:
>> http://wiki.postgresql.org/wiki/Pg_depend_display
>
>> I haven't needed to use that in a long time, but I am not sure if the
>> need has completely disappeared.
>
> The general point is that any out-of-band data transmitted to an output
> function has to be trustworthy, and it has to be available at any place
> that is going to call the output function.  The latter point tends to
> put a crimp in any ideas of this sort anyway: if you can derive the info
> you want at any arbitrary place in the system, why not derive it inside
> the output function to start with?

Under what circumstances would it be necessary to call a type output
function without knowing the data type?  I mean, you had to decide
which type output function you were going to call in the first place,
so...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Range Types

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 12, 2011 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The general point is that any out-of-band data transmitted to an output
>> function has to be trustworthy, and it has to be available at any place
>> that is going to call the output function. �The latter point tends to
>> put a crimp in any ideas of this sort anyway: if you can derive the info
>> you want at any arbitrary place in the system, why not derive it inside
>> the output function to start with?

> Under what circumstances would it be necessary to call a type output
> function without knowing the data type?  I mean, you had to decide
> which type output function you were going to call in the first place,
> so...

If the out-of-band info is going to be limited to the type OID, you
might as well put it in the object, ie, follow the same solution we're
already using for arrays.  Jeff's problems are already plenty large
enough without insisting that he invent a new, precedent-free solution
for this problem (*and* break every single output-function call site in
both core and third-party modules in order to do so...)
        regards, tom lane


Re: WIP: Range Types

From
Robert Haas
Date:
On Wed, Jan 12, 2011 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jan 12, 2011 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The general point is that any out-of-band data transmitted to an output
>>> function has to be trustworthy, and it has to be available at any place
>>> that is going to call the output function.  The latter point tends to
>>> put a crimp in any ideas of this sort anyway: if you can derive the info
>>> you want at any arbitrary place in the system, why not derive it inside
>>> the output function to start with?
>
>> Under what circumstances would it be necessary to call a type output
>> function without knowing the data type?  I mean, you had to decide
>> which type output function you were going to call in the first place,
>> so...
>
> If the out-of-band info is going to be limited to the type OID, you
> might as well put it in the object, ie, follow the same solution we're
> already using for arrays.  Jeff's problems are already plenty large
> enough without insisting that he invent a new, precedent-free solution
> for this problem (*and* break every single output-function call site in
> both core and third-party modules in order to do so...)

In terms of solving the immediate problem, you're probably correct.
But this isn't the first time this issue has come up, and it probably
won't be the last.  It's pretty lame to waste 4+ bytes on disk for
every value of a given type due to a parameter-passing convention.
And I suppose it's worth pointing out that if Jeff does adopt that
solution, we'll be stuck with it for approximately forever due to
pg_upgrade.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company