Thread: Querying sporadic time series type data.
Hi all. I am warehousing my health data in PG. Some of this data comes from my phone/watch but some I enter manually. Basically this is similar to a time series metric collection type of scenario. The table looks like this: timestamp, measure, value, unit. So for example the data could look like

insert into metrics (timestamp, measure, value, unit) values (now(), 'blood glucose', 6.0, 'mmol/L');

Some metrics have a lot of data, such as steps, while others, such as blood pressure readings, weight, etc., are much rarer. I have two problems I am trying to solve.

Question one is "what was the state of my health on XXXX date?" This would be based on "last known data as of that date". The problem is that for some metrics the value will be a sum (step count for example), for others it will be an average (blood glucose for example), and for still others it might be a min and/or max (blood pressure). Also it might be wise to report null for metrics that haven't been measured for a long time (weight for example). I tried to do this with CTEs but I am not sure how to write one without having to include every metric ahead of time, which means I would have to rewrite the query every time I add a metric. I get the feeling this might be possible with some kind of correlated subquery but I can't come up with one that works.

The other issue is how to deal with metrics with dual values like blood pressure. It's easy enough to log systolic and diastolic as separate rows, but how do I consolidate them?
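For concreteness, a minimal sketch of the table and a few sample rows as described above (the exact column types are assumptions, not something stated in the thread):

-- Sketch of the schema described above; types are assumptions.
-- "timestamp" is quoted because it is also a type name.
create table metrics (
    "timestamp" timestamptz not null,
    measure     text        not null,
    value       numeric     not null,
    unit        text
);

insert into metrics ("timestamp", measure, value, unit) values
    (now(), 'blood glucose', 6.0,  'mmol/L'),
    (now(), 'steps',         4200, 'count'),
    (now(), 'systolic',      120,  'mmHg'),
    (now(), 'diastolic',     80,   'mmHg');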
On Fri, Oct 2, 2020 at 10:45 PM Tim Uckun <timuckun@gmail.com> wrote:
I am warehousing my health data in PG. Some of this data comes from my
phone/watch but some I enter in manually. Basically this is similar to
a time series metric collection type of scenario.
My first impression is that you should use a timeseries database for this, not PostgreSQL.
If you do try to use SQL to do this you indeed run into the problem of handling SQL's requirement that every column be explicitly defined. You can probably work around this somewhat by writing a query generator instead of hard-coding queries. Though for your current level of knowledge I'd suggest putting forth working queries for the current model and forgetting about how easy or hard dealing with change would be. Or at least be more convincing that change adaptation is an important requirement.
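One rough sketch of that query-generator idea, using an assumed catalog table that records how each measure should be summarised (all names here, and the per-day window, are illustrative only):

-- Which aggregate applies to which measure.
create table metric_catalog (
    measure text primary key,
    agg     text not null check (agg in ('sum', 'avg', 'min', 'max'))
);

create or replace function metric_on(p_measure text, p_day date)
returns numeric
language plpgsql stable as $$
declare
    v_agg text;
    v_out numeric;
begin
    select agg into v_agg from metric_catalog where measure = p_measure;
    if v_agg is null then
        return null;   -- unknown metric
    end if;

    -- The check constraint above limits v_agg to a known-safe list,
    -- so interpolating it with format() is acceptable here.
    execute format(
        'select %s(value) from metrics
          where measure = $1 and "timestamp"::date = $2', v_agg)
    into v_out
    using p_measure, p_day;

    return v_out;
end;
$$;

-- e.g. select metric_on('steps', date '2020-10-01');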
Your data type issue is self-inflicted. Create a composite data type for blood pressure. Expanding upon that, you should have a table for each metric - or at least be willing to find a middle ground where some are grouped in one table (either as rows or columns) and some are separate tables.
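A minimal sketch of that composite-type suggestion (type, table and column names here are illustrative, not something the thread defines):

-- Composite type holding both halves of a reading.
create type blood_pressure as (systolic int, diastolic int);

create table bp_readings (
    "timestamp" timestamptz    not null,
    reading     blood_pressure not null
);

insert into bp_readings ("timestamp", reading)
values (now(), row(120, 80)::blood_pressure);

-- Individual fields are reached with a parenthesised reference:
select "timestamp"::date as day, (reading).systolic, (reading).diastolic
from bp_readings;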
None of this is impossible in PostgreSQL. It's a matter of development effort and performance. Put forth something that works - or at least kinda works, and note where specifically in the code you are stuck - if you want non-conceptual help overcoming a specific problem.
David J.
Given the scale I am working at I thought a specific time scale database would be overkill but I'll look into it nevertheless. Even if I do write the queries with the known metrics I am still trying to figure out how to efficiently query "last known value of this metric on or before X time" without a correlated subquery which would be a massively inefficient query.
The convention here is to inline or bottom-post.
Given the scale I am working at I thought a specific time scale
database would be overkill but I'll look into it nevertheless. Even
if I do write the queries with the known metrics I am still trying to
figure out how to efficiently query "last known value of this metric
on or before X time" without a correlated subquery which would be a
massively inefficient query.
Yes, if the time is arbitrary you are stuck with a where clause, sort-and-limit. DISTINCT ON can be useful for returning multiple entities over the same metric.
Indexes, possibly partial (if you don’t normalize the model), should help.
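For example, a sketch of the sort-and-limit / DISTINCT ON approach against the single metrics table from the original post (column names assumed as sketched earlier):

-- Last known value of every metric on or before a given moment,
-- one row per measure:
select distinct on (measure) measure, "timestamp", value, unit
from   metrics
where  "timestamp" <= timestamptz '2020-10-01'
order  by measure, "timestamp" desc;

-- An index matching that access pattern:
create index metrics_measure_ts_idx on metrics (measure, "timestamp" desc);

-- Or a partial index for one rarely recorded metric:
create index metrics_weight_ts_idx on metrics ("timestamp" desc)
    where measure = 'weight';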
Even massively inefficient can be efficient enough...
David J.
Couldn't he also use a daterange (or tsrange) in the where clause? Something like:
select ... from metrics where daterange(NULL, xdate::date, '(]') @> datecolumn and measure = 'measure you want' order by datecolumn desc limit 1;
Even with queries of this form, based on how you describe your data I might make two stored procedures: one where you pass in the date and the metric of interest and it returns the single value, and a second that takes the same input and does the averaging you want. Then you can call them all in a single select query:
select singlemetric('date'::date, 'blood pressure'), avgmetric('date'::date, 'insulin'), ...
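A rough sketch of what those two functions might look like, assuming the single metrics table from the original post and a per-calendar-day window for the average (the signatures are guesses that match the call above):

create or replace function singlemetric(p_day date, p_measure text)
returns numeric
language sql stable as $$
    -- last known value of the metric on or before the given day
    select value
    from   metrics
    where  measure = p_measure
      and  "timestamp"::date <= p_day
    order  by "timestamp" desc
    limit  1;
$$;

create or replace function avgmetric(p_day date, p_measure text)
returns numeric
language sql stable as $$
    -- average of the metric over that calendar day
    select avg(value)
    from   metrics
    where  measure = p_measure
      and  "timestamp"::date = p_day;
$$;

-- e.g. select singlemetric('2020-10-01', 'weight'), avgmetric('2020-10-01', 'blood glucose');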
Back to the design of your db, it sounds like you have a fixed set of metrics you are recording. If this is so, I think making them distinct columns in your data set is preferable to the scheme you are using. But if you are going to be adding many new metrics frequently then your design makes sense to me.
> select ... from metrics where daterange(NULL, xdate::date, '(]') @> datecolumn and measure = 'measure you want' order by datecolumn desc limit 1

In the correlated subquery a simple

select date, value from metrics where metric_id = x order by datetime desc limit 1

would get you the figure you want, but of course this has to be repeated for every metric you collect, and if you wanted to do something fancy like plot some measure over a year based on monthly figures it's massively inefficient.

> Back to the design of your db, it sounds like you have a fixed set of metrics you are recording. If this is so, I think making them distinct columns in your data set is preferable to the scheme you are using. But if you are going to be adding many new metrics frequently then your design makes sense to me.

The problem with this is that most of the rows would only have one column filled in; in fact probably all of them would. I am starting to think the best way to handle this is by creating some time-slot tables with all the columns and populating them with batch jobs: a daily table, a weekly table, a monthly table, etc. Another option might be to go ahead and write that slow and IO-heavy query and save it as a materialized view.
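A hedged sketch of that materialized-view idea, one row per day per metric with purely illustrative aggregations, refreshed by a scheduled job:

create materialized view metrics_daily as
select "timestamp"::date as day,
       measure,
       sum(value)         as total,
       avg(value)         as mean,
       min(value)         as low,
       max(value)         as high,
       max("timestamp")   as last_seen
from   metrics
group  by 1, 2;

-- The unique index is what allows REFRESH ... CONCURRENTLY.
create unique index metrics_daily_pk on metrics_daily (day, measure);

-- Run from cron (or similar) after each batch of new data:
-- refresh materialized view concurrently metrics_daily;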