Thread: Design Question (Time Series Data)
Hello,

I'm currently designing an application that will retrieve economic data
(mainly time series) from different sources and distribute it to clients.
It is supposed to manage around 20,000 different series with differing
numbers of observations (some have only a few dozen observations, others
several thousand), and I'm now faced with the decision of where and how to
store the data.

So far, I've come up with 3 possible solutions:

1) Storing the observations in one big table with fields for the series,
   position within the series and the value (float)
2) Storing the observations in an array (either in the same table as the
   series or in an extra data table)
3) Storing the observations in CSV files on the hard disk and only
   putting a reference to them in the database

I expect that around 50 series will be updated daily - which would mean
that for solution no. 1 around 50,000 rows would be deleted and appended
(again) every day.

I personally prefer solution 1, because it is the easiest to implement
(I need to make different calculations and be able to transform the data
easily), but I'm concerned about performance and overhead. It effectively
triples the space needed (over solution no. 2) and will result in huge
index files.

Are there any other storage methods which are better suited for this
kind of data? How can I avoid trouble resulting from the daily updates
(high number of deleted rows)? Which method would you prefer?

Thanks in advance!

Andreas
2007/10/4, Andreas Strasser <kontakt@andreas-strasser.com>:
> So far, I've come up with 3 possible solutions
>
> 1) Storing the observations in one big table with fields for the series,
>    position within the series and the value (float)
> 2) Storing the observations in an array (either in the same table as the
>    series or in an extra data table)
> 3) Storing the observations in CSV files on the hard disk and only
>    putting a reference to them in the database

I have had good experience with variant 2. PostgreSQL needs 24 bytes of
header for every row, so storing one value per row is not very
space-efficient. You can easily transform between the array and table
representations now.

Pavel
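[For illustration, a minimal sketch of the array variant and the array/table
transformation Pavel mentions. Table and column names are invented, and
unnest()/array_agg() are the built-ins on later PostgreSQL releases (8.4+,
WITH ORDINALITY needs 9.4+); this is not code from the thread.]

```sql
-- One row per series; all observations in an ordered float8 array.
CREATE TABLE series_data (
    series_id   integer PRIMARY KEY,
    obs         float8[]   -- observation values, ordered by position
);

-- Array -> table: expand one series into (position, value) rows.
SELECT series_id, pos, val
FROM series_data,
     unnest(obs) WITH ORDINALITY AS t(val, pos)
WHERE series_id = 42;

-- Table -> array: collapse a normalized layout back into arrays.
SELECT series_id, array_agg(val ORDER BY pos) AS obs
FROM observations_normalized     -- hypothetical (series_id, pos, val) table
GROUP BY series_id;
```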
On Thursday 04 October 2007 06:20:19 Pavel Stehule wrote:
>
> I have had good experience with variant 2. PostgreSQL needs 24 bytes of
> header for every row, so storing one value per row is not very
> space-efficient. You can easily transform between the array and table
> representations now.

But then you'll make all SQL operations complex, and you will have problems
using aggregates, etc. For example, with a normalized design one can query
the average value of a specific series using simple commands, and given the
use of indexes this can be highly optimized. With an array, he'll be doing
a seqscan on every row, because he needs to find whether there is a value
for the given series, then extract those values and finally calculate the
average. (I know you can select an element of the array, but it won't be
easy on the planner or on the loop that calculates the average, because
they'll need to do all of that on every row.)

I'd use the same solution he was going to: a normalized table including a
timestamp (with TZ, because of daylight saving time...), a column with a FK
to a series table, and the value itself. Index the first two columns (if
you're searching using the value as a parameter, then index it as well),
and this would be the basis of my design for this specific condition.

Having good statistics and tuning autovacuum will also help a lot in
handling the new inserts and deletes.

--
Jorge Godoy <jgodoy@gmail.com>
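[A sketch of the normalized design Jorge describes - table, column and id
values are invented for illustration, not taken from the thread:]

```sql
CREATE TABLE series (
    series_id   serial PRIMARY KEY,
    name        text NOT NULL
);

CREATE TABLE observation (
    series_id   integer NOT NULL REFERENCES series,
    observed_at timestamptz NOT NULL,   -- with TZ, as Jorge suggests
    value       float8 NOT NULL,
    PRIMARY KEY (series_id, observed_at)
);

-- The index behind the primary key makes a per-series aggregate cheap:
SELECT avg(value)
FROM observation
WHERE series_id = 42
  AND observed_at >= '1987-01-01'
  AND observed_at <  '2007-07-01';
```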
2007/10/4, Jorge Godoy <jgodoy@gmail.com>:
> I'd use the same solution he was going to: a normalized table including a
> timestamp (with TZ, because of daylight saving time...), a column with a
> FK to a series table, and the value itself. Index the first two columns
> (if you're searching using the value as a parameter, then index it as
> well), and this would be the basis of my design for this specific
> condition.
>
> Having good statistics and tuning autovacuum will also help a lot in
> handling the new inserts and deletes.

It depends on the workload. In some cases the normalized solution can be
better, in others not. But I believe that if you have a lot of time series,
then arrays are better. But I repeat, it depends on the task.

Pavel
Pavel Stehule schrieb:
> It depends on the workload. In some cases the normalized solution can be
> better, in others not. But I believe that if you have a lot of time
> series, then arrays are better. But I repeat, it depends on the task.

Thanks for your input so far. Maybe I should add a few things about what I
will do with the data. There are only a few operations that will be done in
the database:

a) retrieving a slice or the whole series
b) changing the frequency of the series
c) grouping several series (with the same time frame/frequency) together
   in a result set
d) calculating moving averages and other econometrics stuff :-)

I will always know which series I want (i.e. there will be no case where
I'm searching for a value within the series).

Two questions regarding the arrays: Do you know if these are really dynamic
(e.g. if I have two rows, one with an array with 12 values and the other
one with 1,000 values - will Postgres pad the shorter row?), and is there a
built-in function to retrieve arrays as rows (I know that you can build
your own function for that, but I wonder whether there is a faster native
function)?

Thank you very much!

Andreas
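[An aside, not from the thread: both questions can be checked directly.
PostgreSQL arrays are variable-length values stored per row, so a short
array in one row is not padded to match a longer array in another, and
unnest() was later added as a built-in set-returning function (8.4+). A
small sketch with invented names:]

```sql
-- Rows may hold arrays of different lengths; no padding occurs.
CREATE TABLE t (id int, obs float8[]);
INSERT INTO t VALUES (1, ARRAY[1.0, 2.0, 3.0]),
                     (2, ARRAY[4.0]);

SELECT id, array_length(obs, 1) FROM t;   -- lengths 3 and 1

-- Built-in "array as rows" function (PostgreSQL 8.4 and later):
SELECT id, unnest(obs) AS val FROM t;
```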
--- Andreas Strasser <kontakt@andreas-strasser.com> wrote:
> I'm currently designing an application that will retrieve economic data
> (mainly time series) from different sources and distribute it to clients.
> It is supposed to manage around 20,000 different series with differing
> numbers of observations (some have only a few dozen observations, others
> several thousand), and I'm now faced with the decision of where and how
> to store the data.

If you really have such a disparity among your series, then it is a mistake
to blend them into a single table. You really need to spend more time
analyzing what the data means. If one data set is comprised of the daily
close price of a suite of stocks or mutual funds, then it makes sense to
include all such series in a given table, but if some of the series are
daily close prices and others are monthly averages, then it is a mistake to
combine them in a single table, and two or more would be warranted. Or if
the data come from different data feed vendors, then you have to think very
carefully about whether or not the data can logically be combined.

> So far, I've come up with 3 possible solutions
>
> 1) Storing the observations in one big table with fields for the series,
>    position within the series and the value (float)
> 2) Storing the observations in an array (either in the same table as the
>    series or in an extra data table)
> 3) Storing the observations in CSV files on the hard disk and only
>    putting a reference to them in the database

I don't much like any of the above. When I have had to process data for
financial consultants, I applied a few simple filters to ensure the data
was clean (e.g. tests to ensure data hadn't been corrupted during
transmission, proper handling of missing data, &c.), and then bulk loaded
the data into a suite of tables designed specifically to match the vendor's
definitions of what the data means. Only then did we apply specific
analyses designed in consultation with the financial consultant's
specialists; the folk best qualified to help us understand the data and
especially how it can be combined in a meaningful way. If the data are
stored in a suite of well-defined tables, subsequent analyses are much more
easily designed, implemented and executed.

I do not know if PostgreSQL, or any other RDBMS, includes the ability to
call on software such as "R" to do specific statistical analysis, but if I
had to do some time series analysis, I would do it in a client application
that retrieves the appropriate data from the database and either does the
analysis in custom code I have written (usually in C++, as some of my
favourite analyses have not made it into commonly available open source or
commercial statistical software) or invokes the appropriate functions from
statistical software I have at my disposal. The strategy I describe above
makes the SQL required for much of this dirt simple.

HTH

Ted
On Oct 4, 2007, at 9:30 , Ted Byers wrote:

> I do not know if PostgreSQL, or any other RDBMS, includes the ability to
> call on software such as "R"

See PL/R:

http://www.joeconway.com/plr/

Michael Glaesemann
grzm seespotcode net
Ted Byers wrote:
> If you really have such a disparity among your series, then it is a
> mistake to blend them into a single table. You really need to spend more
> time analyzing what the data means. If one data set is comprised of the
> daily close price of a suite of stocks or mutual funds, then it makes
> sense to include all such series in a given table, but if some of the
> series are daily close prices and others are monthly averages, then it is
> a mistake to combine them in a single table, and two or more would be
> warranted. Or if the data come from different data feed vendors, then you
> have to think very carefully about whether or not the data can logically
> be combined.

Thank you for your long and detailed answer. It would be possible for me to
group those series by frequency or vendor (or even by size), but this would
make SQL queries really complicated. Grouping the series by theme would
probably work, but it would be a real pain, and I doubt whether it has a
real advantage for my application. The real analysis of the data will
happen on the client side; the server only has to ensure that the time
series are up to date and not corrupted - it doesn't care about the meaning
of the value of any given field (I was probably a little bit unclear about
that). It will (with the help of R - via PL/R) do a few simple calculations
and transformations, but those can be applied without much knowledge about
the properties of the data.

> I do not know if PostgreSQL, or any other RDBMS, includes the ability to
> call on software such as "R" to do specific statistical analysis, but if
> I had to do some time series analysis, I would do it in a client
> application that retrieves the appropriate data from the database and
> either does the analysis in custom code I have written or invokes the
> appropriate functions from statistical software I have at my disposal.

Exactly. I'm looking for a simple and clean way to store the data (and
possibly transform its frequency) and serve it to the clients (in different
formats). A typical request would be "Give me quarterly GDP and quarterly
import figures for the US between 1987-01 and 2007-2, plus the effective
federal funds rate (transformed to quarterly averages) for the same
period". The server therefore has to grab the values for the first two
series (both are much longer - they go back to the 1930s or 40s), extract
the values for the requested period (and pad them if they're too short). It
will do the same for the third series and transform it to quarterly values
(it is only available for months/weeks). Finally, it will return a table
with a date column and 3 columns for the requested data. Most of the stuff
will be done in stored procedures, since I don't want to mess around with
it in PHP (which acts as a proxy for the clients).

Andreas
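[A sketch of how that example request could look over a normalized layout.
All table names and series ids are hypothetical; the LEFT JOIN is what
yields NULL "padding" where the shorter series has no value:]

```sql
-- Quarterly average of a higher-frequency series (the funds rate),
-- aligned with natively quarterly series on the quarter start date.
WITH quarterly_rate AS (
    SELECT date_trunc('quarter', observed_at) AS quarter,
           avg(value) AS funds_rate
    FROM observation
    WHERE series_id = 3           -- hypothetical id: effective funds rate
    GROUP BY 1
)
SELECT g.observed_at AS quarter,
       g.value       AS gdp,
       i.value       AS imports,
       q.funds_rate                -- NULL where the rate series is short
FROM observation g
JOIN observation i
  ON i.series_id = 2               -- hypothetical id: US imports
 AND i.observed_at = g.observed_at
LEFT JOIN quarterly_rate q
  ON q.quarter = g.observed_at
WHERE g.series_id = 1              -- hypothetical id: US GDP
  AND g.observed_at >= '1987-01-01'
  AND g.observed_at <  '2007-07-01'
ORDER BY 1;
```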
--- Michael Glaesemann <grzm@seespotcode.net> wrote:

> On Oct 4, 2007, at 9:30 , Ted Byers wrote:
>
> > I do not know if PostgreSQL, or any other RDBMS, includes the ability
> > to call on software such as "R"
>
> See PL/R:
>
> http://www.joeconway.com/plr/

Thanks. Good to know.

Ted
On 10/4/07, Ted Byers <r.ted.byers@rogers.com> wrote:
> > See PL/R:
> >
> > http://www.joeconway.com/plr/
>
> Thanks. Good to know.

See also RdbiPgSQL
(http://bioconductor.org/packages/2.0/bioc/html/RdbiPgSQL.html).

PL/R lets you write R functions as procedural functions you can call from
pgsql (e.g. SELECT my_r_function(myfield) FROM mytable). RdbiPgSQL creates
R functions you can use to query pgsql from within R.

-Josh/eggyknap
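[A small illustration of the PL/R side, following the arg1/arg2 argument
convention from the PL/R documentation. The moving-average body (using R's
stats::filter) and the usage line are assumptions, not code from the
thread:]

```sql
-- A PL/R function: the body is R code; a float8[] argument arrives as an
-- R numeric vector. Computes a trailing moving average of width arg2.
CREATE OR REPLACE FUNCTION moving_avg(float8[], integer)
RETURNS float8[] AS $$
    as.numeric(filter(arg1, rep(1/arg2, arg2), sides = 1))
$$ LANGUAGE plr;

-- Hypothetical usage against an array-per-series layout:
-- SELECT moving_avg(obs, 4) FROM series_data WHERE series_id = 42;
```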