Thread: Querying sporadic time series type data.


From
Tim Uckun
Date:
Hi all.

I am warehousing my health data in PG. Some of this data comes from my
phone/watch but some I enter in manually. Basically this is similar to
a time series metric collection type of scenario.  The table looks
like this

timestamp, measure, value, unit
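
Spelled out as DDL, that table would be roughly the following (the column types are my guesses, not stated in the message):

```sql
CREATE TABLE metrics (
    timestamp timestamptz NOT NULL,  -- when the reading was taken
    measure   text        NOT NULL,  -- e.g. 'blood glucose', 'steps'
    value     numeric     NOT NULL,
    unit      text                   -- e.g. 'mmol/L'
);
```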

So for example the data could look like

INSERT INTO metrics (timestamp, measure, value, unit) VALUES (now(),
'blood glucose', 6.0, 'mmol/L');

Some metrics, such as steps, have a lot of data, while others, such as
blood pressure readings and weight, are much rarer.

I have two problems I am trying to solve.  Question one is

"what was the state of my health on XXXX date"  This would be based on
"last known data as of that date".  The problem is that for some
metrics the value will be a sum (step count for example) but for
others it will be an average (blood glucose for example) for still
others it might be min and/or max (blood pressure).  Also it might be
wise to report null for metrics that haven't been measured for a long
time (weight for example).

I tried to do this with CTEs but I am not sure how to write one
without having to include every metric ahead of time.  This means I
would have to rewrite the query every time I add a metric.

I get the feeling this might be possible with some kind of correlated
subquery but I can't come up with one that works.

The other issue is how to deal with metrics with dual values like
blood pressure. It's easy enough to log systolic and diastolic as
separate rows but how do I consolidate them?



Re: Querying sporadic time series type data.

From
"David G. Johnston"
Date:
On Fri, Oct 2, 2020 at 10:45 PM Tim Uckun <timuckun@gmail.com> wrote:
I am warehousing my health data in PG. Some of this data comes from my
phone/watch but some I enter in manually. Basically this is similar to
a time series metric collection type of scenario.

My first impression is that you should use a timeseries database for this, not PostgreSQL.

If you do try to use SQL for this, you indeed run into SQL's requirement that every column be explicitly defined.  You can probably work around this somewhat by writing a query generator instead of hard-coding queries.  Though for your current level of knowledge I'd suggest putting forth working queries for the current model and forgetting about how easy or hard dealing with change would be.  Or at least be more convincing that change adaptation is an important requirement.

Your data type issue is self-inflicted.  Create a composite data type for blood pressure.  Expanding upon that, you should have a table for each metric - or at least be willing to find middle ground where some are grouped in one table (either as rows or columns) and some are in separate tables.
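
A minimal sketch of the composite-type idea (all names here are illustrative, not from the original schema):

```sql
-- One value that holds both halves of a blood pressure reading.
CREATE TYPE blood_pressure AS (systolic integer, diastolic integer);

CREATE TABLE bp_readings (
    taken_at timestamptz NOT NULL,
    reading  blood_pressure NOT NULL
);

INSERT INTO bp_readings VALUES (now(), ROW(120, 80)::blood_pressure);

-- Fields come back out with the (row).field syntax:
SELECT taken_at, (reading).systolic, (reading).diastolic
FROM bp_readings;
```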

None of this is impossible in PostgreSQL.  It's a matter of development effort and performance.  Put forth something that works - or at least kinda works, noting where specifically in the code you are stuck - if you want non-conceptual help overcoming a specific problem.

David J.

Re: Querying sporadic time series type data.

From
Tim Uckun
Date:
Given the scale I am working at I thought a dedicated time-series
database would be overkill, but I'll look into it nevertheless.  Even
if I do write the queries with the known metrics, I am still trying to
figure out how to efficiently query "last known value of this metric
on or before X time" without a correlated subquery, which would be
massively inefficient.



Re: Querying sporadic time series type data.

From
"David G. Johnston"
Date:
The convention here is to inline or bottom-post.

On Saturday, October 3, 2020, Tim Uckun <timuckun@gmail.com> wrote:
Given the scale I am working at I thought a dedicated time-series
database would be overkill, but I'll look into it nevertheless.  Even
if I do write the queries with the known metrics, I am still trying to
figure out how to efficiently query "last known value of this metric
on or before X time" without a correlated subquery, which would be
massively inefficient.

Yes, if the time is arbitrary you are stuck with a where clause, sort-and-limit.  DISTINCT ON can be useful for returning multiple entities over the same metric.

Indexes, possibly partial (if you don’t normalize the model), should help.
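
Both suggestions sketched against the metrics table from the first message (the index name and cutoff date are just illustrations):

```sql
-- Index so each per-metric latest-value lookup is a short index scan:
CREATE INDEX metrics_measure_ts_idx ON metrics (measure, timestamp DESC);

-- DISTINCT ON: the latest row per metric on or before an arbitrary date,
-- covering every metric in one pass instead of one query per metric.
SELECT DISTINCT ON (measure) measure, value, unit, timestamp
FROM metrics
WHERE timestamp <= TIMESTAMP '2020-10-01'
ORDER BY measure, timestamp DESC;
```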

Even massively inefficient can be efficient enough...

David J.

Re: Querying sporadic time series type data.

From
Steven Pousty
Date:
Couldn't he also use a daterange (or tsrange) in the where clause? Something like:

select ... from table where '[start_date,xdate]'::daterange @> datecolumn and metric = 'measure you want' order by datecolumn desc limit 1

Even with queries of this form, based on how you describe your data I might make two stored procedures. One where you pass in the date and the metric of interest, which returns the single value. The second procedure would take the same input and do the averaging you want. Then you can call them all in a single select query:

select singlemetric('date'::date, 'bloodpressure'), avgmetric('date'::date, 'insulin')...
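
A sketch of what those two helpers might look like (the signatures and the averaging window are my guesses, against the metrics table from the first message):

```sql
-- Hypothetical helper: latest value of a metric on or before a date.
CREATE FUNCTION singlemetric(as_of date, metric text) RETURNS numeric
LANGUAGE sql STABLE AS $$
    SELECT value
    FROM metrics
    WHERE measure = metric
      AND timestamp <= as_of
    ORDER BY timestamp DESC
    LIMIT 1
$$;

-- Hypothetical helper: average of a metric over the day itself.
CREATE FUNCTION avgmetric(as_of date, metric text) RETURNS numeric
LANGUAGE sql STABLE AS $$
    SELECT avg(value)
    FROM metrics
    WHERE measure = metric
      AND timestamp::date = as_of
$$;
```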

Back to the design of your db, it sounds like you have a fixed set of metrics you are recording. If this is so, I think making them distinct columns in your data set is preferable to the scheme you are using. But if you are going to be adding many new metrics frequently then your design makes sense to me.


Re: Querying sporadic time series type data.

From
Tim Uckun
Date:
> select ... from table where '[start_date,xdate]'::daterange @> datecolumn and metric = 'measure you want' order by datecolumn desc limit 1

In the correlated subquery a simple select date, value from metrics
where metric_id = x order by datetime desc limit 1 would get you the
figure you want, but of course this has to be repeated for every
metric you collect, and if you wanted to do something fancy like plot
some measure over a year based on monthly figures it's massively
inefficient.
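
For the plotting case, one way to avoid a lookup per row is a LATERAL join driven by the month boundaries, so there is one indexed lookup per month rather than per data row (a sketch, assuming the metrics schema from the first message and 'weight' as an example metric):

```sql
-- Last known value at each month boundary over a year.
SELECT months.m, latest.value
FROM generate_series(DATE '2020-01-01', DATE '2020-12-01',
                     INTERVAL '1 month') AS months(m)
LEFT JOIN LATERAL (
    SELECT value
    FROM metrics
    WHERE measure = 'weight'
      AND timestamp <= months.m
    ORDER BY timestamp DESC
    LIMIT 1
) AS latest ON true;
```

With an index on (measure, timestamp DESC) each LATERAL lookup is a short index scan, so the plot query stays cheap even as the table grows.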


> Back to the design of your db, it sounds like you have a fixed set of metrics you are recording. If this is so I
thinkmaking them distinct columns in your data set is preferable than the scheme you are using. But if you are going to
beadding many new metrics frequently then your design makes sense to me. 


The problem with this is that most of the rows would only have one
column filled in. In fact, probably all of them would.

I am starting to think the best way to handle this is by creating some
time-slot tables with all the columns and populating them with batch
jobs: a daily table, weekly table, monthly table, etc.  Another option
might be to go ahead and write that slow and IO-heavy query and save
it as a materialized view.
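
The materialized-view option might look something like this (a sketch against the metrics table from the first message; which aggregate is meaningful depends on the metric, as noted earlier in the thread):

```sql
-- Daily aggregates per metric, refreshed by a batch job.
CREATE MATERIALIZED VIEW daily_metrics AS
SELECT timestamp::date AS day,
       measure,
       sum(value) AS total,    -- meaningful for e.g. steps
       avg(value) AS average,  -- meaningful for e.g. blood glucose
       min(value) AS low,      -- meaningful for e.g. blood pressure
       max(value) AS high
FROM metrics
GROUP BY 1, 2;

-- Run periodically from the batch job:
REFRESH MATERIALIZED VIEW daily_metrics;
```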