Thread: Interval->day proposal

Interval->day proposal

From
Michael Glaesemann
Date:
There has been discussion in the past of including number of days as  
part of the Interval structure in addition to the current months and  
time components. Here are some mailing list threads where the topic  
has arisen.

[Re: [GENERAL] '1 year' = '360 days' ????](http:// 
archives.postgresql.org/pgsql-general/2004-10/msg01104.php)
[Re: [HACKERS] timestamp with time zone a la sql99](http:// 
archives.postgresql.org/pgsql-hackers/2004-10/msg00843.php)

One advantage of this is that it would allow '1 day' to have a  
different meaning that '24 hours', which would be meaningful when  
crossing daylight saving time changes. For example, PostgreSQL  
returns the following results:

test=# set time zone 'CST7CDT';
SET
test=# select '2005-04-01 12:00-07'::timestamptz + '1 day'::interval;        ?column?
------------------------
2005-04-02 12:00:00-07
(1 row)

test=# select '2005-04-02 12:00-07'::timestamptz + '1 day'::interval;        ?column?
------------------------
2005-04-03 13:00:00-06
(1 row)

A daylight saving time change occurred at 2005-04-03 02:00.  
Internally, the '1 day' interval is converted to 24 hours, which is  
then added to '2005-04-02 12:00-07', returning '2005-04-03 13:00-06'.  
Distinguishing between '1 day' and '24 hours' would allow '2005-04-02  
12:00-07'::timestamptz + '1 day'::interval to return '2005-04-03  
12:00-06', while '2005-04-02 12:00-07'::timestamptz + '1  
day'::interval would return '2005-04-03 13:00-06' as it does now.

I'm interested in implementing at least part of this. In my mind  
there are two steps: the first is to modify the Interval data type  
implementation to include days as well as months and time; the second  
is to modify the functions used to add interval to timestamptz to  
take into account the difference between '1 day' and '24 hours' when  
working across daylight saving time changes. I'm thinking of it in  
two steps primarily because this is my first venture into the backend  
code and I'm breaking it into pieces that I hopefully deal with.  
Arguably, implementing only the first step isn't useful in and of  
itself, but I have to start somewhere, and that seems like the  
logical place to start.

I've been looking at the current interval implementation, and am  
beginning to form a plan for my first cut at the code. Before  
starting out, I'd appreciate any feedback on my understanding of the  
code or my plan.

In interval_in, (backend/utils/adt/timestamp.c) strings representing  
intervals are parsed using DecodeInterval() which is defined in  
backend/utils/adt/datetime.c. DecodeInterval() assigns values to the  
appropriate components of a pg_tm structure, which includes tm_year,  
tm_mon, tm_mday, tm_hour, tm_min, and tm_sec. This pg_tm data is  
passed to tm2interval, where the tm_year and tm_mon components are  
used to determine the interval->month value, and the tm_mday,  
tm_hour, tm_min, and tm_sec values are used to determine the interval- >time value.

When the string is read by DecodeInterval, the "days" component is  
assigned to tm_mday. It seems relatively straightforward to use this  
input to provide the interval->day value. However, I'm wondering what  
range of days this the interval->day component can be expected to  
handle. tm_mday is an int value, which is only guaranteed to be 2  
bytes (though it may be larger), if I understand correctly. This  
means that tm_mday isn't able to hold the full span of days, which is  
nearly 2.15 billion over the range of valid timestamps (from BC 4713  
to AD 5,874,897).

However, this range of days may not be necessary in practice. The  
(minimum) 2 bytes guaranteed by int is enough to hold +/- 44 years of  
days. Is it likely that DST issues are going to be important over  
anything larger? I'm not sure, and I welcome others' thoughts on  
this. If a wider range of days is deemed required, I'm not sure how  
this would be accomplished outside of changing the pg_tm datatype.

If interval->day can be stored in an int16, then the interval struct  
would be widened from 12 to 14 bytes. Are there concerns about  
widening the interval datatype? This is perhaps a naive approach, so  
if anyone has other ideas, I'd love to hear them.

Is my understanding correct? Any corrections or pointers appreciated.

Michael Glaesemann
grzm myrealbox com



Re: Interval->day proposal

From
Tom Lane
Date:
Michael Glaesemann <grzm@myrealbox.com> writes:
> When the string is read by DecodeInterval, the "days" component is  
> assigned to tm_mday. It seems relatively straightforward to use this  
> input to provide the interval->day value. However, I'm wondering what  
> range of days this the interval->day component can be expected to  
> handle. tm_mday is an int value, which is only guaranteed to be 2  
> bytes (though it may be larger), if I understand correctly.

Actually, practically all of the Postgres code assumes int is at least
32 bits.  Feel free to change pg_tm's field to be declared int32 instead
of just int if that bothers you, but it is really quite academic.

> If interval->day can be stored in an int16, then the interval struct  
> would be widened from 12 to 14 bytes. Are there concerns about  
> widening the interval datatype?

I'd make the on-disk field an int32, taking the struct to 16 bytes.
Given that it already requires double alignment for the embedded
double-or-int8 field, it's likely that it effectively takes 16 bytes
anyway after you count padding effects.  So I don't see any percentage
in trying to shave a couple bytes.
        regards, tom lane


Re: Interval->day proposal

From
Josh Berkus
Date:
Michael,

> One advantage of this is that it would allow '1 day' to have a
> different meaning that '24 hours', which would be meaningful when
> crossing daylight saving time changes. For example, PostgreSQL
> returns the following results:

I've been stumping for this for years.  See my arguments with Thomas Lockhart 
in 2000.   A "calendar day" is not the same as 24 hours, and DST behavior has 
forced me to use TIMESTAMP WITHOUT TIME ZONE on many a calendaring 
application.

Unfortunately, it appears that tri-partitioning INTERVAL ( year/month ; 
week/day ; hour/minute/second ) is a violation of the SQL spec which has only 
the two partitions ( year/month ; week/day/hour/minute/second ).   Have they 
changed this in SQL 2003?    If not, do we want to do it anyway, perhaps 
using a 2nd interval type? 

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Interval->day proposal

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Unfortunately, it appears that tri-partitioning INTERVAL ( year/month ; 
> week/day ; hour/minute/second ) is a violation of the SQL spec which has only
> the two partitions ( year/month ; week/day/hour/minute/second ).

I think it's an extension of the spec, not a violation.  In
particular, if you were working in a daylight-savings-less timezone,
you could not tell the difference (could you?)  The spec's worldview
essentially corresponds to daylight-savings-less all the time, and
so they are already a subset of what we do.
        regards, tom lane


Re: Interval->day proposal

From
Michael Glaesemann
Date:
On May 31, 2005, at 12:48 AM, Tom Lane wrote:



> Michael Glaesemann <grzm@myrealbox.com> writes:
>
>
>
>>  tm_mday is an int value, which is only guaranteed to be 2
>> bytes (though it may be larger), if I understand correctly.
>>
>>
>>
>
> Actually, practically all of the Postgres code assumes int is at least
> 32 bits.  Feel free to change pg_tm's field to be declared int32  
> instead
> of just int if that bothers you, but it is really quite academic.
>
>

Thanks for the clarification. My instinct would be to change so that  
it's no longer just an assumption. Is there any benefit to changing  
the other pg_tm int fields to int32? I imagine int is used quite a  
bit throughout the code, and I'd think assuming 32-bit ints would  
have bitten people in the past if it were invalid, so perhaps  
changing them is unnecessary.



> I'd make the on-disk field an int32, taking the struct to 16 bytes.
>
>

Will do.

Thanks for you comments.

Michael Glaesemann
grzm myrealbox com




Re: Interval->day proposal

From
Michael Glaesemann
Date:
On May 31, 2005, at 1:40 AM, Tom Lane wrote:


> Josh Berkus <josh@agliodbs.com> writes:
>
>
>> Unfortunately, it appears that tri-partitioning INTERVAL ( year/ 
>> month ;
>> week/day ; hour/minute/second ) is a violation of the SQL spec  
>> which has only
>> the two partitions ( year/month ; week/day/hour/minute/second ).
>>
>>
>
> I think it's an extension of the spec, not a violation.  In
> particular, if you were working in a daylight-savings-less timezone,
> you could not tell the difference (could you?)
>

I've started working on this change, and one difference has shown up  
immediately in the regression tests. v8.0.3 currently returns:
  SELECT INTERVAL '10 years -11 month -12 days +13:14' AS "9 years...";              9 years...
----------------------------------
!  9 years 1 mon -11 days -10:46:00  (1 row)

With my first round of changes,
  SELECT INTERVAL '10 years -11 month -12 days +13:14' AS "9 years...";              9 years...
----------------------------------
!  9 years 1 mon -12 days +13:14:00  (1 row)

These are equivalent in both CVS and my branch, as '1 day'::interval  
= '24 hours'::interval. I haven't checked the SQL spec yet (and  
intend to do so), but is there a canonical form for intervals that we  
need to return? I can imagine there might be, and if the spec  
considers only months and time, I could see where they might want  
days and time to have the same sign, and put results in "simplest  
form". Even if the spec doesn't require it, the behavior is  
definitely changed even outside of DST-aware code.

-- v8.0.3
test=# select '9 years 1 mon -11 days -10:46:00'::interval;             interval
----------------------------------
9 years 1 mon -11 days -10:46:00
(1 row)

test=# select '9 years 1 mon -12 days +13:14:00'::interval;             interval
----------------------------------
9 years 1 mon -11 days -10:46:00
(1 row)

test=# select '25 hours'::interval;    interval
----------------
1 day 01:00:00
(1 row)

-- new interval code
test=# select ' 9 years 1 mon -11 days -10:46:00'::interval;             interval
----------------------------------
9 years 1 mon -11 days -10:46:00
(1 row)

test=# select '9 years 1 mon -12 days +13:14:00'::interval;             interval
----------------------------------
9 years 1 mon -12 days +13:14:00
(1 row)

test=# select '25 hours'::interval;
interval
----------
25:00:00
(1 row)

I'll be digging into the spec later and post what I find. Thoughts?

Michael Glaesemann
grzm myrealbox com




Re: Interval->day proposal

From
Tom Lane
Date:
Michael Glaesemann <grzm@myrealbox.com> writes:
> On May 31, 2005, at 12:48 AM, Tom Lane wrote:
>> Actually, practically all of the Postgres code assumes int is at least
>> 32 bits.  Feel free to change pg_tm's field to be declared int32  
>> instead
>> of just int if that bothers you, but it is really quite academic.

> Thanks for the clarification. My instinct would be to change so that  
> it's no longer just an assumption. Is there any benefit to changing  
> the other pg_tm int fields to int32? I imagine int is used quite a  
> bit throughout the code, and I'd think assuming 32-bit ints would  
> have bitten people in the past if it were invalid, so perhaps  
> changing them is unnecessary.

As I understand it, the received wisdom of the C community is that
"int" means the machine's natural, most efficient word width.  The
C specification was written at a time when a fair percentage of hardware
thought that meant int16 (and I do remember programming such hardware).
But there are no longer any machines ... or at least none on which you'd
want to run Postgres ... for which int means int16; today I'd assume
that int means "probably int32, maybe int64 if that's really faster
on this machine".
        regards, tom lane


Re: Interval->day proposal

From
Michael Glaesemann
Date:
On Jun 1, 2005, at 1:42 PM, Michael Glaesemann wrote:


> -- v8.0.3
> test=# select '25 hours'::interval;
>     interval
> ----------------
> 1 day 01:00:00
> (1 row)
>
> -- new interval code
> test=# select '25 hours'::interval;
> interval
> ----------
> 25:00:00
> (1 row)
>
> I'll be digging into the spec later and post what I find. Thoughts?
>


I've dug a bit, and this is definitely not spec compliant, as
interval hours must be in the range 0-23. From Section 4.6.3 of SQL/
Foundation (2003 draft)

Table 6 — Valid values for fields in INTERVAL values
Keyword    Valid values of INTERVAL fields
YEAR     Unconstrained except by <interval leading field precision>
MONTH    Months (within years) (0-11)
DAY    Unconstrained except by <interval leading field precision>
HOUR    Hours (within days) (0-23)
MINUTE    Minutes (within hours) (0-59)
SECOND    Seconds (within minutes) (0-59.999...)

I'm not quite sure what the parenthetical "within days" adds to the
definition. However, the point of keeping a separate day field is so
we can keep the number of days separate from the number of 24-hour
units, and limiting the range of HOUR between 0 and 23 makes that
impossible, it appears.

I'd be happy if someone can let me know I'm reading the spec wrong!

Michael Glaesemann
grzm myrealbox com




Re: Interval->day proposal

From
Tom Lane
Date:
Michael Glaesemann <grzm@myrealbox.com> writes:
> On Jun 1, 2005, at 1:42 PM, Michael Glaesemann wrote:
>> -- v8.0.3
>> test=# select '25 hours'::interval;
>> interval
>> ----------------
>> 1 day 01:00:00
>> (1 row)
>> 
>> -- new interval code
>> test=# select '25 hours'::interval;
>> interval
>> ----------
>> 25:00:00
>> (1 row)
>> 
>> I'll be digging into the spec later and post what I find. Thoughts?

> I've dug a bit, and this is definitely not spec compliant, as  
> interval hours must be in the range 0-23.

Doesn't bother me.  The spec says what results you must get from
spec-compliant input; I don't think it says we may take only
spec-compliant input.  (If we were to read it that way, we'd have
to rip out every PG extension, not only the interval-related ones.)

The entire *point* of this change is to be able to distinguish
"25 hours" from "1 day 1 hour", so you can hardly argue that being
able to do that is not what we want it to do...
        regards, tom lane


Re: Interval->day proposal

From
Tom Lane
Date:
Michael Glaesemann <grzm@myrealbox.com> writes:
> I've started working on this change, and one difference has shown up  
> immediately in the regression tests. v8.0.3 currently returns:

>    SELECT INTERVAL '10 years -11 month -12 days +13:14' AS "9 years...";
>                9 years...
>    ----------------------------------
> !  9 years 1 mon -11 days -10:46:00
>    (1 row)

> With my first round of changes,

>    SELECT INTERVAL '10 years -11 month -12 days +13:14' AS "9 years...";
>                9 years...
>    ----------------------------------
> !  9 years 1 mon -12 days +13:14:00
>    (1 row)

Right, because it is now possible for day and hour to have different signs.
Comes with the territory I think.

> These are equivalent in both CVS and my branch, as '1 day'::interval  
> = '24 hours'::interval. I haven't checked the SQL spec yet (and  
> intend to do so), but is there a canonical form for intervals that we  
> need to return?

AFAICS, the spec only allows one sign for the entire interval value,
so this behavior is already outside the spec.  As long as we don't
generate different field signs in cases where that was not already
present in the input, I don't think it's a spec violation.
        regards, tom lane