Thread: Design Question (Time Series Data)
Hello,

I'm currently designing an application that will retrieve economic data
(mainly time series) from different sources and distribute it to clients.
It is supposed to manage around 20,000 different series with differing
numbers of observations (some have only a few dozen observations, others
several thousand), and I'm now faced with the decision of where and how to
store the data.

So far, I've come up with 3 possible solutions:

1) Storing the observations in one big table with fields for the series,
   position within the series and the value (float)
2) Storing the observations in an array (either in the same table as the
   series or in an extra data table)
3) Storing the observations in CSV files on the hard disk and only
   putting a reference to them in the database

I expect that around 50 series will be updated daily - which would mean
that for solution no. 1 around 50,000 rows would be deleted and appended
(again) every day.

I personally prefer solution 1, because it is the easiest to implement
(I need to make different calculations and be able to transform the data
easily), but I'm concerned about performance and overhead. It effectively
triples the space needed (over solution no. 2) and will result in huge
index files.

Are there any other storage methods which are better suited for this
kind of data? How can I avoid trouble resulting from the daily updates
(high number of deleted rows)? Which method would you prefer?

Thanks in advance!

Andreas
2007/10/4, Andreas Strasser <kontakt@andreas-strasser.com>:
> So far, I've come up with 3 possible solutions
>
> 1) Storing the observations in one big table with fields for the series,
>    position within the series and the value (float)
> 2) Storing the observations in an array (either in the same table as the
>    series or in an extra data table)
> 3) Storing the observations in CSV files on the hard disk and only
>    putting a reference to them in the database

I have had good experience with variant 2. PostgreSQL needs 24 bytes of
header for every row, so storing one value per row is not very
space-efficient. You can easily transform between the array and table
representations now.

Pavel
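[For illustration, a minimal sketch of the array variant and the array/table
transformation Pavel mentions. Table and column names are invented, and
unnest()/array_agg() are the built-ins on later PostgreSQL releases (8.4+,
WITH ORDINALITY needs 9.4+); this is not code from the thread.]

```sql
-- One row per series; all observations in an ordered float8 array.
CREATE TABLE series_data (
    series_id   integer PRIMARY KEY,
    obs         float8[]   -- observation values, ordered by position
);

-- Array -> table: expand one series into (position, value) rows.
SELECT series_id, pos, val
FROM series_data,
     unnest(obs) WITH ORDINALITY AS t(val, pos)
WHERE series_id = 42;

-- Table -> array: collapse a normalized layout back into arrays.
SELECT series_id, array_agg(val ORDER BY pos) AS obs
FROM observations_normalized     -- hypothetical (series_id, pos, val) table
GROUP BY series_id;
```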
On Thursday 04 October 2007 06:20:19 Pavel Stehule wrote:
>
> I have had good experience with variant 2. PostgreSQL needs 24 bytes of
> header for every row, so storing one value per row is not very
> space-efficient. You can easily transform between the array and table
> representations now.

But then you'll make all SQL operations complex, and you will have problems
using aggregates, etc. For example, with a normalized design one can query
the average value of a specific series using simple commands, and given the
use of indexes this can be highly optimized. With an array, he'll be doing
a seqscan on every row, because he needs to find whether there is a value
for the given series, then extract those values and finally calculate the
average. (I know you can select an element of the array, but it won't be
easy on the planner or on the loop that calculates the average, because
they'll need to do all of that on every row.)

I'd use the same solution he was going to: a normalized table including a
timestamp (with TZ, because of daylight saving time...), a column with a FK
to a series table, and the value itself. Index the first two columns (if
you're searching using the value as a parameter, then index it as well),
and this would be the basis of my design for this specific condition.

Having good statistics and tuning autovacuum will also help a lot in
handling the new inserts and deletes.

--
Jorge Godoy <jgodoy@gmail.com>
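[A sketch of the normalized design Jorge describes - table, column and id
values are invented for illustration, not taken from the thread:]

```sql
CREATE TABLE series (
    series_id   serial PRIMARY KEY,
    name        text NOT NULL
);

CREATE TABLE observation (
    series_id   integer NOT NULL REFERENCES series,
    observed_at timestamptz NOT NULL,   -- with TZ, as Jorge suggests
    value       float8 NOT NULL,
    PRIMARY KEY (series_id, observed_at)
);

-- The index behind the primary key makes a per-series aggregate cheap:
SELECT avg(value)
FROM observation
WHERE series_id = 42
  AND observed_at >= '1987-01-01'
  AND observed_at <  '2007-07-01';
```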
2007/10/4, Jorge Godoy <jgodoy@gmail.com>:
> I'd use the same solution he was going to: a normalized table including a
> timestamp (with TZ, because of daylight saving time...), a column with a
> FK to a series table, and the value itself. Index the first two columns
> (if you're searching using the value as a parameter, then index it as
> well), and this would be the basis of my design for this specific
> condition.
>
> Having good statistics and tuning autovacuum will also help a lot in
> handling the new inserts and deletes.

It depends on the workload. In some cases the normalized solution can be
better, in others not. But I believe that if you have a lot of time series,
then arrays are better. But I repeat, it depends on the task.

Pavel
Pavel Stehule schrieb:
> It depends on the workload. In some cases the normalized solution can be
> better, in others not. But I believe that if you have a lot of time
> series, then arrays are better. But I repeat, it depends on the task.

Thanks for your input so far. Maybe I should add a few things about what I
will do with the data. There are only a few operations that will be done in
the database:

a) retrieving a slice or the whole series
b) changing the frequency of the series
c) grouping several series (with the same time frame/frequency) together
   in a result set
d) calculating moving averages and other econometrics stuff :-)

I will always know which series I want (i.e. there will be no case where
I'm searching for a value within the series).

Two questions regarding the arrays: Do you know if these are really dynamic
(e.g. if I have two rows, one with an array with 12 values and the other
one with 1,000 values - will Postgres pad the shorter row?), and is there a
built-in function to retrieve arrays as rows (I know that you can build
your own function for that, but I wonder whether there is a faster native
function)?

Thank you very much!

Andreas
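[An aside, not from the thread: both questions can be checked directly.
PostgreSQL arrays are variable-length values stored per row, so a short
array in one row is not padded to match a longer array in another, and
unnest() was later added as a built-in set-returning function (8.4+). A
small sketch with invented names:]

```sql
-- Rows may hold arrays of different lengths; no padding occurs.
CREATE TABLE t (id int, obs float8[]);
INSERT INTO t VALUES (1, ARRAY[1.0, 2.0, 3.0]),
                     (2, ARRAY[4.0]);

SELECT id, array_length(obs, 1) FROM t;   -- lengths 3 and 1

-- Built-in "array as rows" function (PostgreSQL 8.4 and later):
SELECT id, unnest(obs) AS val FROM t;
```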
--- Andreas Strasser <kontakt@andreas-strasser.com> wrote:
> I'm currently designing an application that will retrieve economic data
> (mainly time series) from different sources and distribute it to clients.
> It is supposed to manage around 20,000 different series with differing
> numbers of observations (some have only a few dozen observations, others
> several thousand), and I'm now faced with the decision of where and how
> to store the data.

If you really have such a disparity among your series, then it is a mistake
to blend them into a single table. You really need to spend more time
analyzing what the data means. If one data set is comprised of the daily
close price of a suite of stocks or mutual funds, then it makes sense to
include all such series in a given table, but if some of the series are
daily close prices and others are monthly averages, then it is a mistake to
combine them in a single table, and two or more would be warranted. Or if
the data come from different data feed vendors, then you have to think very
carefully about whether or not the data can logically be combined.

> So far, I've come up with 3 possible solutions
>
> 1) Storing the observations in one big table with fields for the series,
>    position within the series and the value (float)
> 2) Storing the observations in an array (either in the same table as the
>    series or in an extra data table)
> 3) Storing the observations in CSV files on the hard disk and only
>    putting a reference to them in the database

I don't much like any of the above. When I have had to process data for
financial consultants, I applied a few simple filters to ensure the data
was clean (e.g. tests to ensure data hadn't been corrupted during
transmission, proper handling of missing data, &c.), and then bulk loaded
the data into a suite of tables designed specifically to match the vendor's
definitions of what the data means. Only then did we apply specific
analyses designed in consultation with the financial consultant's
specialists; the folk best qualified to help us understand the data and
especially how it can be combined in a meaningful way. If the data are
stored in a suite of well-defined tables, subsequent analyses are much more
easily designed, implemented and executed.

I do not know if PostgreSQL, or any other RDBMS, includes the ability to
call on software such as "R" to do specific statistical analysis, but if I
had to do some time series analysis, I would do it in a client application
that retrieves the appropriate data from the database and either does the
analysis in custom code I have written (usually in C++, as some of my
favourite analyses have not made it into commonly available open source or
commercial statistical software) or invokes the appropriate functions from
statistical software I have at my disposal. The strategy I describe above
makes the SQL required for much of this dirt simple.

HTH

Ted
On Oct 4, 2007, at 9:30 , Ted Byers wrote:

> I do not know if PostgreSQL, or any other RDBMS, includes the ability to
> call on software such as "R"

See PL/R:

http://www.joeconway.com/plr/

Michael Glaesemann
grzm seespotcode net
Ted Byers wrote:
> If you really have such a disparity among your series, then it is a
> mistake to blend them into a single table. You really need to spend more
> time analyzing what the data means. If one data set is comprised of the
> daily close price of a suite of stocks or mutual funds, then it makes
> sense to include all such series in a given table, but if some of the
> series are daily close prices and others are monthly averages, then it is
> a mistake to combine them in a single table, and two or more would be
> warranted. Or if the data come from different data feed vendors, then you
> have to think very carefully about whether or not the data can logically
> be combined.

Thank you for your long and detailed answer. It would be possible for me to
group those series by frequency or vendor (or even by size), but this would
make SQL queries really complicated. Grouping the series by theme would
probably work, but it would be a real pain, and I doubt whether it has a
real advantage for my application. The real analysis of the data will
happen on the client side; the server only has to ensure that the time
series are up to date and not corrupted - it doesn't care about the meaning
of the value of any given field (I was probably a little bit unclear about
that). It will (with the help of R - via PL/R) do a few simple calculations
and transformations, but those can be applied without much knowledge about
the properties of the data.

> I do not know if PostgreSQL, or any other RDBMS, includes the ability to
> call on software such as "R" to do specific statistical analysis, but if
> I had to do some time series analysis, I would do it in a client
> application that retrieves the appropriate data from the database and
> either does the analysis in custom code I have written or invokes the
> appropriate functions from statistical software I have at my disposal.

Exactly. I'm looking for a simple and clean way to store the data (and
possibly transform its frequency) and serve it to the clients (in different
formats). A typical request would be "Give me quarterly GDP and quarterly
import figures for the US between 1987-01 and 2007-2, plus the effective
federal funds rate (transformed to quarterly averages) for the same
period". The server therefore has to grab the values for the first two
series (both are much longer - they go back to the 1930s or 40s), extract
the values for the requested period (and pad them if they're too short). It
will do the same for the third series and transform it to quarterly values
(it is only available for months/weeks). Finally, it will return a table
with a date column and 3 columns for the requested data. Most of the stuff
will be done in stored procedures, since I don't want to mess around with
it in PHP (which acts as a proxy for the clients).

Andreas
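[A sketch of how that example request could look over a normalized layout.
All table names and series ids are hypothetical; the LEFT JOIN is what
yields NULL "padding" where the shorter series has no value:]

```sql
-- Quarterly average of a higher-frequency series (the funds rate),
-- aligned with natively quarterly series on the quarter start date.
WITH quarterly_rate AS (
    SELECT date_trunc('quarter', observed_at) AS quarter,
           avg(value) AS funds_rate
    FROM observation
    WHERE series_id = 3           -- hypothetical id: effective funds rate
    GROUP BY 1
)
SELECT g.observed_at AS quarter,
       g.value       AS gdp,
       i.value       AS imports,
       q.funds_rate                -- NULL where the rate series is short
FROM observation g
JOIN observation i
  ON i.series_id = 2               -- hypothetical id: US imports
 AND i.observed_at = g.observed_at
LEFT JOIN quarterly_rate q
  ON q.quarter = g.observed_at
WHERE g.series_id = 1              -- hypothetical id: US GDP
  AND g.observed_at >= '1987-01-01'
  AND g.observed_at <  '2007-07-01'
ORDER BY 1;
```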
--- Michael Glaesemann <grzm@seespotcode.net> wrote:

> On Oct 4, 2007, at 9:30 , Ted Byers wrote:
>
> > I do not know if PostgreSQL, or any other RDBMS, includes the ability
> > to call on software such as "R"
>
> See PL/R:
>
> http://www.joeconway.com/plr/

Thanks. Good to know.

Ted
On 10/4/07, Ted Byers <r.ted.byers@rogers.com> wrote:
> > See PL/R:
> >
> > http://www.joeconway.com/plr/
>
> Thanks. Good to know.

See also RdbiPgSQL
(http://bioconductor.org/packages/2.0/bioc/html/RdbiPgSQL.html).

PL/R lets you write R functions as procedural functions you can call from
pgsql (e.g. SELECT my_r_function(myfield) FROM mytable). RdbiPgSQL creates
R functions you can use to query pgsql from within R.

-Josh/eggyknap
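[A small illustration of the PL/R side, following the arg1/arg2 argument
convention from the PL/R documentation. The moving-average body (using R's
stats::filter) and the usage line are assumptions, not code from the
thread:]

```sql
-- A PL/R function: the body is R code; a float8[] argument arrives as an
-- R numeric vector. Computes a trailing moving average of width arg2.
CREATE OR REPLACE FUNCTION moving_avg(float8[], integer)
RETURNS float8[] AS $$
    as.numeric(filter(arg1, rep(1/arg2, arg2), sides = 1))
$$ LANGUAGE plr;

-- Hypothetical usage against an array-per-series layout:
-- SELECT moving_avg(obs, 4) FROM series_data WHERE series_id = 42;
```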