Thread: Stock Market Price Data & postgreSQL? HELLPPP Please
Hello All

I want to store stock market price data in PostgreSQL, and some of my associates are saying this is not workable, not only in PostgreSQL but in any database. Their arguments are listed below. Kindly comment and, if possible, suggest solutions and pointers to further reading.

- This is time series data (I don't know what that is, except that it relates to data marked/related to time) and is not suited to an RDBMS.
- You need it in Esper (a CEP engine used by Marketcetera, an open source trading platform), which needs to get a data stream, whereas an RDBMS will return a recordset.
- An RDBMS, PostgreSQL included, simply cannot handle the amount of real-time data coming in from a stock exchange.
- An RDBMS simply cannot provide data, even if you build streams from the recordsets returned, at the speeds required by any technical analysis charting program.

My level of knowledge is basic sysadmin & programming concepts, not actual programming. I will be getting someone to build a solution for me. I need the correct direction, please. Again, shoestring budget: all open source pointers please, unless they are for reading & general knowledge.

I am hoping for an adapted PostgreSQL solution; I can't spend more than a few hundred dollars on this... so please... HELLLPPP ;-)

With best regards.
Sanjay.
On Wed, Aug 19, 2009 at 9:22 PM, Sanjay Arora <sanjay.k.arora@gmail.com> wrote:
> - This is Time Series Data (I don't know that is except that it
> relates to data marked/related to time) and not suited to a RDBMS.
> - You need it in Esper (a CEP engine used by Marketcetera, an open
> source trading platform) which needs to get a data stream, whereas an
> RDBMS will return a recordset.
> - A RDBMS including postgreSQL simply cannot handle the amount of
> real-time data coming in from a stock exchange.
> - A RDBMS simply cannot provide data, even if you build streams from
> recordsets returned, at the speeds required by any technical analysis
> charting program.

There are whole companies doing nothing but providing streaming RDBMSs used by traders, airline reservation systems, etc.; cf. StreamBase and Truviso, amongst others.

I'm not aware of any open source streaming databases, so on a shoestring budget you're going to be limited to using existing tools. There are certainly people who scale Postgres up to high OLTP traffic, but it does require time and money as you scale up, TANSTAAFL... I'm sure it can be done; Postgres is very flexible, though whether it's a perfect fit for your needs I can't say.

You might also consider whether one of the existing streaming database providers would give you a development license for free or on a trial basis, if you hope to be making big money down the road.

--
greg
http://mit.edu/~gsstark/resume.pdf
On Wed, Aug 19, 2009 at 4:22 PM, Sanjay Arora <sanjay.k.arora@gmail.com> wrote:
> Hello All
>
> I want to store stock market price data in postgreSQL and some of my
> associates are saying this is not workable, not only in postgreSQL but
> in any database. Arguments given are listed below. Kindly comment and
> if possible suggest solutions and pointers to further reading.
>
> - This is Time Series Data (I don't know what that is except that it
> relates to data marked/related to time) and not suited to a RDBMS.
> - You need it in Esper (a CEP engine used by Marketcetera, an open
> source trading platform) which needs to get a data stream, whereas an
> RDBMS will return a recordset.
> - A RDBMS including postgreSQL simply cannot handle the amount of
> real-time data coming in from a stock exchange.
> - A RDBMS simply cannot provide data, even if you build streams from
> recordsets returned, at the speeds required by any technical analysis
> charting program.
>
> My level of knowledge is basic SysAdmin & Programming Concepts, not
> actual programming. I will be getting someone to build a solution for
> me. Need the correct direction please. Again, shoestring budget, all
> open source pointers please, unless they are for reading & general
> knowledge.
>
> I am hoping for an adapted postgreSQL solution, can't spend more than
> few hundred dollars for this....so please...HELLLPPP ;-)

I think your first step is to stand back, think about the problem as a whole, and see if you can't define your requirements better:

*) how much data comes in each day? how much rolls out?
*) what types of analysis of the data would you like to do?
*) what are your performance expectations?

For example, try to describe what you would like to do in detail: the data you would import, how you would like it structured, and a typical query. While the amount of price data a stock market produces is immense, these types of problems tend to scale very well horizontally.

So let's define the basic amount of work you'd like to do, then build off that.

merlin
----- "Greg Stark" <gsstark@mit.edu> wrote:
> On Wed, Aug 19, 2009 at 9:22 PM, Sanjay Arora <sanjay.k.arora@gmail.com> wrote:
> > - This is Time Series Data (I don't know what that is except that it
> > relates to data marked/related to time) and not suited to a RDBMS.
> > - You need it in Esper (a CEP engine used by Marketcetera, an open
> > source trading platform) which needs to get a data stream, whereas an
> > RDBMS will return a recordset.
> > - A RDBMS including postgreSQL simply cannot handle the amount of
> > real-time data coming in from a stock exchange.
> > - A RDBMS simply cannot provide data, even if you build streams from
> > recordsets returned, at the speeds required by any technical analysis
> > charting program.
>
> There are whole companies doing nothing but providing streaming RDBMS
> used by traders, airline reservation systems, etc. cf. StreamBase and
> Truviso amongst others.
>
> I'm not aware of any open source streaming databases so for a
> shoestring budget you're going to be limited to using existing tools.
> There are certainly people who scale up Postgres to high OLTP traffic
> but it does require time and money as you scale up, TANSTAAFL... I'm
> sure it can be done, Postgres is very flexible, though whether it's a
> perfect fit with your needs or not I can't say.
>
> You might also consider whether one of the existing streaming database
> system providers would give you a development license for free or on a
> trial basis if you hope to be making big money down the road.

In the it's-a-small-world category, I just ran across a streaming database called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/) yesterday. It is based on Postgres 7.3.2, so approach with caution.

Adrian Klaver
aklaver@comcast.net
On Thu, Aug 20, 2009 at 2:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
> I think your first step is to stand back, think about the problem as a
> whole and see if you can't define your requirements better:
>
> *) how much data comes in each day? how much rolls out?

Need to get some data... will have this tomorrow.

> *) what types of analysis of the data would you like to do?

Basically, what I need to do is feed the tick data to the CEP engine as a data stream. The CEP engine creates various data-bar streams: say 2 minute OHLC (open/high/low/close) bars, 3 min, 5 min, 15 min, 30 min, or 20 tick, 2000 tick, 5000 tick, and so on. These are fed to strategies which then trade. At least, this is how it is done when the data arrives from the online data provider.

I am sure this could be done in the database. I had originally thought of pre-populated views of prebuilt data bars, so that building the bars on demand would not be required; they would be built by triggers as soon as the data arrives in Postgres. But then everyone advised that it was exactly because of slow results on this type of data that CEP engines were built. I was told that with this approach, so many triggers would be running that my processors and RAM would max out, not to mention the exponential increase in disk space requirements. And this is for one symbol; on average 2000-plus scrips need to be analysed.

Again, the analysis is backtesting trade simulation results according to various strategies, so that means at least three years of data, five to six years if one wants to be a stickler, but one year is the bare minimum.

> *) what are your performance expectations?

Ideally, the simulations should finish overnight with three years' worth of data.

> For example, try and describe what you would like to do in detail.
> The data you would import, how you would like it structured, and a
> typical query. While the amount of price data a stock market produces
> is immense, these types of problems tend to scale very well
> horizontally. So let's define the basic amount of work you'd like to
> do, then build off that.

I'll get back on this by tomorrow. Thanks.

With best regards.
Sanjay.
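The bar building described above is straightforward to sketch outside of any engine. A minimal Python illustration of aggregating a time-ordered tick stream into fixed-width OHLC bars (the Tick/Bar record shapes and the 2-minute default are invented for the example, not Esper or Marketcetera structures):

```python
from collections import namedtuple

# Invented record shapes for illustration; ts is seconds since midnight.
Tick = namedtuple("Tick", "ts price")
Bar = namedtuple("Bar", "start open high low close")

def build_bars(ticks, width_secs=120):
    """Aggregate a time-ordered tick stream into fixed-width OHLC bars."""
    bars = []
    for t in ticks:
        start = t.ts - (t.ts % width_secs)   # bucket this tick falls into
        if bars and bars[-1].start == start:
            b = bars[-1]                     # extend the current bar
            bars[-1] = Bar(start, b.open, max(b.high, t.price),
                           min(b.low, t.price), t.price)
        else:                                # tick opens a new bar
            bars.append(Bar(start, t.price, t.price, t.price, t.price))
    return bars
```

The same loop works unchanged for tick-count bars (20 tick, 2000 tick, ...) if you bucket by a running tick counter instead of the timestamp.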
On Thu, Aug 20, 2009 at 3:11 AM, Adrian Klaver <aklaver@comcast.net> wrote:
>
> In the its a small world category I just ran across a streaming database called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/) yesterday. It is based on Postgres 7.3.2, so approach with caution.

Thanks Adrian. Very interesting. The pages seem to have moved; I had to read them in Google's cache. I will email them regarding the project roadmap and ask whether they will be pursuing it for some time to come, or whether it is only a minor research project that will stop as soon as the research paper is submitted. The project's mailing list, though, shows traffic decreasing to almost one mail per day.

However, it raised a very interesting question for me. It is based on Postgres itself, v. 7.3.2, but that's beside the point. What exactly is the difference, in layman sysadmin terms, between our everyday Postgres and this time series data? Postgres supports geo data types and many others. What is the big issue in not having this sort of time series data type, and this sort of query, supported natively by Postgres? It would be very useful for stock market data storage and analysis, and I am sure there would be other major scientific applications, like real-time data acquisition and analysis. So why is this sort of application not supported natively in Postgres? This is the first time in my 9 years of using Postgres that I've come across a db application I cannot handle with it.

Best regards.
Sanjay.
Sanjay Arora <sanjay.k.arora@gmail.com> writes:
> I want to store stock market price data in postgreSQL and some of my
> associates are saying this is not workable, not only in postgreSQL but
> in any database. Arguments given are listed below. Kindly comment and
> if possible suggest solutions and pointers to further reading.

Actually, the way I came to use Postgres was from a company that was working on technical analysis of market data. I would recommend looking at a hybrid approach.

Storing raw tick data in a DBMS at one row per tick is indeed not very bright. You could possibly make it work if you throw beefy hardware at the problem, but there's not much point because you're really not playing to a DBMS's strengths when you do that. Once it's arrived, the data is static (barring very-infrequent corrections, which in practice you might never do at all anyway). And the access patterns for it (at least if you're doing the same type of technical analysis we were doing) are extremely predictable and linear. So you aren't doing anything wherein SQL will shine.

On the other hand, there are definitely components of the big picture where SQL *can* shine. I'd strongly recommend tracking your live trading positions in an RDBMS, for example. It might be worth managing your tick data using DBMS entries that represent, say, monthly tick data files. Activities like removing old data would get a lot easier that way. (I wish this had occurred to me twelve years ago, but it's water over the dam now.) Any performance issues can be avoided by having the technical analysis processes read the static tick data files directly.

This is generally pretty similar to the frequently-asked question "should I store a lot of big image or document files directly in the database, or just use it as an index of external files?". If you dig around in the PG list archives you'll find more than you want to read about that on both sides. If you've aggregated the ticks in suitable volumes you can solve it either way, really, but an external file is a bit easier to append to during raw data collection.

Don't fall into the trap of assuming that all your requirements must be solved by a single tool. You'll spend all your time building the perfect tool, and go broke before you finish it. Use a DBMS for the parts of the problem it's good for, and go outside it for what it isn't.

BTW, we were doing full-speed tick data collection and real-time trading analysis in the nineties, on machines that my current cell phone would leave in the dust. The market data volume has grown a lot since then of course, but the price of hardware has fallen a lot more. I'd not recommend designing your system on the assumption that raw per-tick speed is the be-all and end-all.

regards, tom lane
Thanks Tom, for this reply and many earlier ones that have guided me around Postgres ;-)

On Thu, Aug 20, 2009 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It might be worth managing your
> tick data using DBMS entries that represent, say, monthly tick data
> files. Activities like removing old data would get a lot easier that
> way. (I wish this had occurred to me twelve years ago, but it's water
> over the dam now.)

Do you mean something like storing one month's worth of tick data in a blob-type field and giving the contents to the CEP engine for further building of the required data streams? This approach would let big chunks of data be handled without caring about the location, naming, indexing, etc. of tick data files? Anything else you can suggest?

Best Regards.
Sanjay.
Sanjay Arora <sanjay.k.arora@gmail.com> writes:
> Do you mean something like storing one month worth tick data in a blob
> type field and giving the contents to the CEP engine for further
> building of required data streams?

Either that, or put the tick data in an external file and store that file's pathname in the database row. As I said, you can find plenty of argumentation on both sides of that in the PG archives.

When SSS were doing this, we had the raw tick data in one set of files and pre-aggregated bar data in other files (I think we stored 5-min and daily bars, but it was a long time ago). The analysis functions would automatically build the bar width they wanted from the widest stored form that divided the desired width, so as to minimize what they had to read from disk.

It does sound like you are doing pretty much exactly what we were doing. One thing to think about is that the real-time case is actually much slower and easier to deal with than back-testing. When you are back-testing, you'd like to test a trading strategy over say a year's worth of data, and you'd like that to require a bit less than a year to run, no? So the path you have to optimize is the one feeding stored, pre-aggregated data into your test engine. The people hollering about real-time performance are just betraying that they've never built one of these things. I'm not familiar with this CEP software, but it sounds to me like you want that as part of the real-time data collection process and nowhere near your back-testing data path.

Another little tip: if your trading strategies are like ours were, they need to ramp up for actual trading by pre-scanning some amount of historical data. So you're going to need a mechanism that starts by reading the historical data at high speed and smoothly segues into reading the real-time feed (at which point the passage of time in the model suddenly slows to one-to-one with real time).

Also consider what happens when your machine crashes (it happens) and you need to not only redo that startup process, but persuade the model that its actual trading position is whatever you had open. Or you changed the model a bit and restart as above: the model might now wish it was in a different position, but it has to cope with your actual position.

The crash reliability of a DBMS is a strong reason why you track your live positions in one, btw... your brokerage isn't going to forget you were short IBM, even if your model crashes.

regards, tom lane
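The width-derivation trick mentioned above (build the bars you want from the widest stored width that divides the target) is concrete enough to sketch. A Python illustration, with bars as (start, open, high, low, close) tuples at some fixed width; the stored widths below are made-up examples:

```python
def best_stored_width(stored_widths, target):
    """Pick the widest stored bar width that evenly divides the target,
    so the least data has to be read back and recombined."""
    return max(w for w in stored_widths if target % w == 0)

def widen_bars(bars, factor):
    """Combine consecutive fixed-width OHLC bars into bars `factor`
    times wider, e.g. 5-min bars -> 15-min bars with factor=3.
    A trailing partial group is dropped."""
    out = []
    usable = len(bars) - len(bars) % factor
    for i in range(0, usable, factor):
        starts, opens, highs, lows, closes = zip(*bars[i:i + factor])
        out.append((starts[0], opens[0], max(highs), min(lows), closes[-1]))
    return out
```

To serve a 15-minute request from stores of 1-minute, 5-minute, and daily bars, best_stored_width([60, 300, 86400], 900) picks the 5-minute store, and widen_bars on those bars with factor 3 does the rest.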
On Thu, Aug 20, 2009 at 4:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It does sound like you are doing pretty much exactly what we were
> doing. One thing to think about is that the real-time case is actually
> much slower and easier to deal with than back-testing. When you are
> back-testing, you'd like to test a trading strategy over say a year's
> worth of data, and you'd like that to require a bit less than a year
> to run, no? So the path you have to optimize is the one feeding stored,
> pre-aggregated data into your test engine. The people hollering about
> real-time performance are just betraying that they've never built one of
> these things. I'm not familiar with this CEP software, but it sounds to
> me like you want that as part of the real-time data collection process
> and nowhere near your back-testing data path.

CEP (Complex Event Processing... just check out the explanatory para at http://esper.codehaus.org/) is basically running queries on data streams in memory (I understood that as RDBMS tables in memory) over time series data (whatever that is)... there is no concept of storage etc.

BTW, would you please look back a bit in this thread at a post (by Adrian Klaver, see http://archives.postgresql.org/pgsql-general/2009-08/msg00745.php) about a database built by a Berkeley computer science department project on Postgres, which stores and queries time series data just like this and returns data in a stream instead of a recordset. It is said that this type of database is made for analysing this type of data. It has a normal Postgres mode for returning records and a streams mode for returning results in a stream... which I think is nothing but a TCP/IP socket that receives results on an ongoing basis.

I have also posted a question in its reply (see http://archives.postgresql.org/pgsql-general/2009-08/msg00748.php) about why Postgres itself does not have this (whatever this time series data is in mathematical terms), when that project itself extends Postgres. Will you please have a look at these posts and this project, which is built on Postgres itself, and explain in layman terms what these guys are doing differently, such that we cannot have time series data types (we do have GIS and tons of other data types) and stream data returns? Maybe it is not immediately of use to me, but I understand ODBMS and RDBMS, and I want to understand this at a conceptual level at least.

> Another little tip: if your trading strategies are like ours were,
> they need to ramp up for actual trading by pre-scanning some amount of
> historical data. So you're going to need a mechanism that starts by
> reading the historical data at high speed and smoothly segues into
> reading the real-time feed (at which the passage of time in the model
> suddenly slows to one-to-one with real time). Also consider what
> happens when your machine crashes (it happens) and you need to not only
> redo that startup process, but persuade the model that its actual
> trading position is whatever you had open. Or you changed the model a
> bit and restart as above. The model might now wish it was in a different
> position, but it has to cope with your actual position.
>
> The crash reliability of a DBMS is a strong reason why you track your
> live positions in one, btw ... your brokerage isn't going to forget you
> were short IBM, even if your model crashes.

That's very good advice... we had already planned it that way. And yes, the brokerages are not going to care that we ran out of money while our server was restarting ;-)
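On the recordset-versus-stream question raised above: in layman terms, a recordset query scans finished data and answers once, while a streaming query keeps a little state and emits a fresh answer per arriving event. A Python generator makes the distinction concrete (pure illustration; no Esper or TelegraphCQ API is implied):

```python
def recordset_avg(prices):
    """Recordset style: one answer, computed after scanning all stored rows."""
    return sum(prices) / len(prices)

def streaming_avg(price_stream):
    """Stream style: a running answer emitted for every arriving tick,
    keeping only constant state (a count and a running sum)."""
    n, total = 0, 0.0
    for price in price_stream:
        n += 1
        total += price
        yield total / n
```

A consumer of streaming_avg never waits for the data to "finish"; it reacts to each value as it arrives, which is the essential difference from a query that returns a recordset.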
Adrian Klaver wrote:
> In the its a small world category I just ran across a streaming database called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/) yesterday. It is based on Postgres 7.3.2, so approach with caution.

TelegraphCQ became Truviso. I clearly remember somebody talking about how Truviso was all set up to solve the stock ticker problem, but if you go to their site now they don't mention stock at all. Does this mean anything?

--
Alvaro Herrera                       http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Wed, 19 Aug 2009, Adrian Klaver wrote:
> In the its a small world category I just ran across a streaming database
> called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/)
> yesterday. It is based on Postgres 7.3.2, so approach with caution.

TelegraphCQ was the research project that Truviso, the company where I work, originated from. Having taken considerably more mature code of this type than what's posted there and applied it to market data, I can tell you that just having a streaming database doesn't make that many of the problems go away. Streaming database technology certainly helps compute technical indicators and the like more efficiently than having to pull the data back out of the database, but that's only a small subset of what you have to do in order to build a complete trading system.

As for Sanjay's project, it sounds like it's underfunded by between two and four orders of magnitude. Before getting involved with Truviso, I once wrote a PostgreSQL-based automatic trading system that pulled in streaming market data and made trading decisions based on it, with backtesting and everything. It took about six months of full-time work to get the first version working, and I doubt that approach would scale up to more than a few hundred active symbols even on modern hardware. The libraries provided by Marketcetera would give you a big head start on the coding compared to where I began, but without a serious development budget you're going to be stuck taking somebody's entire software stack as-is.

You should be asking programmatic traders whether there are complete applications ready to go here, not asking about the database details, because you're really not going to have any choice besides just using whatever the app is designed around. Given your lack of development background, you sound more like a Tradestation or Metastock customer than someone who is going to assemble this app yourself.

Sorry to sound like a downer here, but your problem is much bigger than you think it is, and I'm not sure where to even begin sorting out the bits you're going to need but don't have yet.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 20 Aug 2009, Sanjay Arora wrote:
> What exactly the difference in layman sys admin terms between our
> everyday postgres and this time series data.

There is a good introduction to where regular databases fail to solve time-series data problems at http://cs.nyu.edu/shasha/papers/jagtalk.html

As mentioned there, one of the very hard problems to solve in a general way is how to deal with all the gaps in your data when the market isn't open, which really complicates how you compute indicators. For example, in regular SQL you might compute an average over some period using something like:

  select sum(x) / count(*) from ticks where ts >= t1 and ts <= t2

You might think that you could pick t1 and t2 here based on the number of samples you want to average; let's say you want an average over the last minute of data, so you try this:

  t1 = <something>
  t2 = <something> + interval '1 minute'

This completely falls apart when the data isn't continuous. If the market was closed for some period between t1 and t2, you need to use a rolling window over the data you've got instead. As of PostgreSQL 8.4's support for SQL window functions it's easier to compute this sort of thing, but there are still plenty of rough spots to get nailed by.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
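A sample-count window of the kind described above sidesteps the market-closed gap because it is defined by the last N ticks observed rather than by wall-clock time. A minimal Python sketch (the window length of 3 in the usage below is arbitrary):

```python
from collections import deque

def rolling_avg(prices, n):
    """Average over the last n observed samples, regardless of how much
    wall-clock time they span -- a market-closed gap simply contributes
    no samples instead of skewing the window."""
    window = deque(maxlen=n)   # old samples fall out automatically
    out = []
    for p in prices:
        window.append(p)
        out.append(sum(window) / len(window))
    return out
```

Note that the window warms up gradually: until n samples have arrived, the average is taken over however many are available.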
On Wed, 19 Aug 2009, Tom Lane wrote:
> BTW, we were doing full-speed tick data collection and real-time trading
> analysis in the nineties, on machines that my current cell phone would
> leave in the dust. The market data volume has grown a lot since then
> of course, but the price of hardware has fallen a lot more.

Trading volume always grows to where it's just barely possible to keep up with it using a well-designed app on the fastest hardware available. If you're always faster than someone else, there's money to be made from them using that fact. The continuous arms race for the fastest scanning and execution platform keeps volume moving upward in lockstep with what hardware is capable of. The last mainstream news article on this topic was http://www.nytimes.com/2009/07/24/business/24trading.html

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD