Strategies for converting UTC data to local windows for arbitrary resolutions and timezones - Mailing list pgsql-general

From: Lincoln Swaine-Moore
Subject: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones
Date:
Msg-id: CABcidkJzoCFTX1Bkn8aCSPgH-3BwR8MO8=uHY2r48s1ynNbbJQ@mail.gmail.com
Responses: Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones
List: pgsql-general
Hi all,
I'm attempting to generate some reports using user-driven timezones for UTC data, and I'm having trouble writing a query that meets the following criteria:
1) Data should be averaged at one of daily, hourly, or 15 minute granularities (user-driven).
2) Data over a given period should reflect the localized value of the UTC data for a user-provided time zone. E.g.:
a) The 1 hour period starting at '2023-10-03 12:00:00.000000-0400' (America/New_York) should reflect data with timestamps between '2023-10-03 16:00:00.000000Z' and '2023-10-03 17:00:00.000000Z'.
b) The 1 day period starting at '2023-10-03 00:00:00.000000-0400' should reflect data with timestamps between '2023-10-03 04:00:00.000000Z' and '2023-10-04 04:00:00.000000Z'.
3) When a period interacts with a DST change in the given timezone, the data should not be clumped together. E.g.:
a) Data points occurring at 2023-11-05 05:30:00.000000Z and 2023-11-05 06:30:00.000000Z should be treated as falling into separate buckets when the time zone is America/New_York: (2023-11-05 01:30:00.000000-0400 and 2023-11-05 01:30:00.000000-0500, respectively). This should be true for either the 15 minute or 1 hour intervals.
b) Some clumping for day resolution seems ok! E.g. the 1 day period starting at '2023-11-05 00:00:00.000000-0400' can and probably should contain 25 hours' worth of data. It certainly should not reflect the data falling between '2023-11-05 04:00:00.000000Z' and '2023-11-05 04:00:00.000000Z' + '24 hours'::interval (= '2023-11-06 04:00:00.000000Z'), because those correspond to the local times '2023-11-05 00:00:00.000000-0400' and '2023-11-05 23:00:00.000000-0500', i.e. the bucket would end an hour short of the next local midnight.
4) It would be relatively simple to handle the 15 minute and 1 hour periods if all timezone offsets were multiples of 1 hour (in that case, all operations could be done in UTC and converted after the fact), but unfortunately some time zones have 30-minute-based offsets, which interferes with this approach.
5) Ideally, the solution would not involve messing with the server/connection's value of timezone. (Though I would be interested if there were a solution that relaxed this constraint and was relatively safe/compatible with transactions and psycopg2; see the sketch just after this list.)
6) Ideally, my query would return periods that are missing data (though I could plausibly fill these in at the layer above). This points toward generate_series, but...
7) Sigh: I can't upgrade to 16. I gather that 16 adds a timezone parameter to generate_series, which I believe might help. But tragically, DigitalOcean doesn't yet support 16, and it's not practical to migrate elsewhere. Based on historical release-to-support timing, I'd imagine they won't support it until Q2 2024, which is too late for this feature. If anyone has the inside scoop on when they're likely to support it, I'd welcome it!
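Regarding point 5: the closest I've come to relaxing that constraint safely is to scope the change with SET LOCAL, which Postgres reverts automatically at the end of the transaction. A minimal sketch, assuming psycopg2 is issuing these statements inside its usual transaction:
```
begin;
set local time zone 'America/New_York';  -- reverted automatically at commit or rollback
select date_trunc('hour', '2023-11-05 06:30:00Z'::timestamptz);  -- truncates in the local zone
commit;
```
Since SET LOCAL never outlives the transaction, it at least shouldn't leak across pooled connections the way a plain SET could, but I'd still prefer a solution that doesn't touch the setting at all.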
This looks pretty hairy written out as above, but I actually think it reflects most people's intuitions about what data a local period "should" correspond to (though I'd welcome feedback on this point).
Here are some thoughts about approaches that I've tried, and what their drawbacks seem to be. For all these, I'll use the following CTE to demonstrate some data that crosses a DST boundary:
```
with original_data as (
select
('2023-11-05 00:00:00.000000Z'::timestamptz) + (15 * x || ' minutes')::interval as "t"
from
generate_series(0, 1000) as x
)
```
1) date_trunc: it seems that as of v12, date_trunc accepts a third, time zone argument, which essentially plays the role of the session timezone setting for the scope of the function call. This is very handy, and *I believe* solves my issues for the hour/day periods:
```
[etc]
select
date_trunc('day', t, 'America/New_York'),
min(t),
max(t),
count(*)
from original_data
group by 1
order by 1;
date_trunc | min | max | count
------------------------+------------------------+------------------------+-------
2023-11-04 04:00:00+00 | 2023-11-05 00:00:00+00 | 2023-11-05 03:45:00+00 | 16
2023-11-05 04:00:00+00 | 2023-11-05 04:00:00+00 | 2023-11-06 04:45:00+00 | 100
2023-11-06 05:00:00+00 | 2023-11-06 05:00:00+00 | 2023-11-07 04:45:00+00 | 96
2023-11-07 05:00:00+00 | 2023-11-07 05:00:00+00 | 2023-11-08 04:45:00+00 | 96
2023-11-08 05:00:00+00 | 2023-11-08 05:00:00+00 | 2023-11-09 04:45:00+00 | 96
2023-11-09 05:00:00+00 | 2023-11-09 05:00:00+00 | 2023-11-10 04:45:00+00 | 96
[etc]
```
This checks out, but unfortunately doesn't seem to help for the 15-minute case, since date_trunc only truncates to named fields (minute, hour, day, etc.) and has no 15-minute granularity. I think, by the way, that this behavior is identical to what I would've gotten if my server timezone were "America/New_York" and I ran it without the third argument, though I'd be curious to hear if there are discrepancies.
On this point, I've read up on some of the history around this feature, and was a little puzzled by this assertion in this thread: https://www.postgresql.org/message-id/87in1k73nr.fsf@news-spur.riddles.org.uk:
> If you have a timestamp-in-UTC column and want to do a date_trunc in some other specified zone (that's not the session timezone), you need FOUR uses of AT TIME ZONE to do it correctly:
> date_trunc('day', col AT TIME ZONE 'UTC' AT TIME ZONE 'Asia/Kathmandu') AT TIME ZONE 'Asia/Kathmandu' AT TIME ZONE 'UTC'
I *think* that this doesn't comport with the behavior I've seen, because at a DST boundary the initial `col AT TIME ZONE 'UTC' AT TIME ZONE 'Asia/Kathmandu'` would drop the offset, which from then on would be unrecoverable. For example:
```
select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC';
timezone
---------------------
2023-11-05 06:00:00
select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC';
timezone
---------------------
2023-11-05 06:00:00
```
This is what I meant above by "clumping" in point (3). I believe this issue is ameliorated by setting the session timezone:
```
set timezone to 'America/New_York';
select date_trunc('hour', '2023-11-05 06:30:00.000000'::timestamp AT TIME ZONE 'UTC');
date_trunc
------------------------
2023-11-05 01:00:00-05
select date_trunc('hour', '2023-11-05 05:30:00.000000'::timestamp AT TIME ZONE 'UTC');
date_trunc
------------------------
2023-11-05 01:00:00-04
```
The same (correct, in my opinion) behavior seems to occur when using the third argument to date_trunc, so all is well on that front. I'm being pedantic about this because, if I were misunderstanding and the four-`at time zone` approach were a legitimate strategy, it could plausibly be applied to the approaches below. Hoping to be incorrect here!
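As a quick spot check of that, comparing the two ambiguous instants with the three-argument form directly (session timezone left at UTC), I'd expect two distinct buckets rather than the clumping above:
```
select
    date_trunc('hour', '2023-11-05 05:30:00Z'::timestamptz, 'America/New_York') as edt_bucket,  -- 01:30 EDT
    date_trunc('hour', '2023-11-05 06:30:00Z'::timestamptz, 'America/New_York') as est_bucket;  -- 01:30 EST
-- expected: 2023-11-05 05:00:00+00 | 2023-11-05 06:00:00+00
```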
2) date_bin: this takes a "stride", which should accommodate all the periods I'm interested in, and an "origin" to help deal with initial day/hour offsets. But I'm having trouble getting it to respect time zone/DST behavior. For example:
```
[etc]
select
date_bin('24 hours', t, '2023-11-03 00:00:00.000000-0400'::timestamptz),
min(t),
max(t),
count(*)
from
original_data
group by 1
order by 1;
date_bin | min | max | count
------------------------+------------------------+------------------------+-------
2023-11-04 00:00:00-04 | 2023-11-04 20:00:00-04 | 2023-11-04 23:45:00-04 | 16
2023-11-05 00:00:00-04 | 2023-11-05 00:00:00-04 | 2023-11-05 22:45:00-05 | 96
2023-11-05 23:00:00-05 | 2023-11-05 23:00:00-05 | 2023-11-06 22:45:00-05 | 96
2023-11-06 23:00:00-05 | 2023-11-06 23:00:00-05 | 2023-11-07 22:45:00-05 | 96
2023-11-07 23:00:00-05 | 2023-11-07 23:00:00-05 | 2023-11-08 22:45:00-05 | 96
2023-11-08 23:00:00-05 | 2023-11-08 23:00:00-05 | 2023-11-09 22:45:00-05 | 96
2023-11-09 23:00:00-05 | 2023-11-09 23:00:00-05 | 2023-11-10 22:45:00-05 | 96
[etc]
```
As you can see, the days get "bumped" by DST and end up off by one hour (until the spring transition). I actually think this makes intuitive sense for the "stride" concept--it just means it's not as useful for time zones that observe DST.
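One idea I've been toying with for the 15-minute case (only a sketch, not heavily tested): use date_trunc's timezone argument to find each row's *local* hour start, and pass that to date_bin as a per-row origin. Because the origin is the local hour start and 15 minutes divides the hour evenly, the bins should land on local :00/:15/:30/:45 boundaries, and the DST disambiguation is inherited from date_trunc keeping each row's original offset:
```
with original_data as (
    select ('2023-11-05 00:00:00.000000Z'::timestamptz) + (15 * x || ' minutes')::interval as "t"
    from generate_series(0, 1000) as x
)
select
    -- 15-minute bins anchored at each row's local hour start
    date_bin('15 minutes', t, date_trunc('hour', t, 'America/New_York')) as bucket,
    min(t),
    max(t),
    count(*)
from original_data
group by 1
order by 1;
```
The hour and day cases would still go through date_trunc directly, as in (1); I'd welcome any holes poked in this.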
3) generate_series: If I could get this working, it would be ideal, because it would also help fill in gaps in my data with null rows. But short of v16, I can't seem to make it respect time zones other than the session's:
```
set timezone to 'UTC';
select
days as start_time,
lead(days) over (order BY days) as end_time
from generate_series(
'2023-11-03 00:00:00.000000-0400'::timestamptz,
'2023-11-07 00:00:00.000000-0500'::timestamptz,
'1 day'::interval
) days;
start_time | end_time
------------------------+------------------------
2023-11-03 04:00:00+00 | 2023-11-04 04:00:00+00
2023-11-04 04:00:00+00 | 2023-11-05 04:00:00+00
2023-11-05 04:00:00+00 | 2023-11-06 04:00:00+00
2023-11-06 04:00:00+00 | 2023-11-07 04:00:00+00
[etc.]
set timezone to 'America/New_York';
[same as above]
start_time | end_time
------------------------+------------------------
2023-11-03 00:00:00-04 | 2023-11-04 00:00:00-04
2023-11-04 00:00:00-04 | 2023-11-05 00:00:00-04
2023-11-05 00:00:00-04 | 2023-11-06 00:00:00-05
2023-11-06 00:00:00-05 | 2023-11-07 00:00:00-05
[etc.]
```
The latter is correct for these purposes, but doesn't seem especially reliable or practical (see point 5).
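One pre-16 workaround I've been considering (a sketch, and it assumes local midnight is never skipped or ambiguous in the target zone, which holds for America/New_York since its transitions happen at 02:00): generate the series over plain timestamps, i.e. local wall-clock boundaries, and convert each boundary to timestamptz with AT TIME ZONE, so the session timezone never comes into play:
```
select
    d at time zone 'America/New_York' as start_time,
    (lead(d) over (order by d)) at time zone 'America/New_York' as end_time
from generate_series(
    '2023-11-03 00:00:00'::timestamp,  -- local wall-clock start
    '2023-11-07 00:00:00'::timestamp,  -- local wall-clock end
    '1 day'::interval
) as d;
-- I'd expect the 2023-11-05 bucket to run from 2023-11-05 04:00:00+00 to
-- 2023-11-06 05:00:00+00, i.e. 25 hours across the fall-back, matching (3b).
```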
Can anyone think of any other approaches to this problem? I'd be thrilled to manage it without resorting to manual post-processing in pandas or the like, because I have a large number of data points, and each individual row is quite wide.
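For completeness, here's roughly how I'd imagined stitching those wall-clock boundaries together with the data for the daily case. It's only a sketch (min/max/count stand in for my real aggregation), but it does yield rows for empty periods (point 6):
```
with original_data as (
    select ('2023-11-05 00:00:00.000000Z'::timestamptz) + (15 * x || ' minutes')::interval as "t"
    from generate_series(0, 1000) as x
),
buckets as (
    select
        d at time zone 'America/New_York' as start_time,
        (lead(d) over (order by d)) at time zone 'America/New_York' as end_time
    from generate_series(
        '2023-11-03 00:00:00'::timestamp,
        '2023-11-20 00:00:00'::timestamp,
        '1 day'::interval
    ) as d
)
select
    b.start_time,
    min(o.t),
    max(o.t),
    count(o.t)  -- 0 for buckets with no data
from buckets b
left join original_data o
    on o.t >= b.start_time and o.t < b.end_time
where b.end_time is not null  -- drop the open-ended final row from lead()
group by b.start_time
order by b.start_time;
```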
Apologies for the length of this message--just trying to be thorough. I sincerely appreciate any help or pointers!
Best,
Lincoln
--
Lincoln Swaine-Moore