Thread: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Sanjay Arora
Date:
Hello All

I want to store stock market price data in postgreSQL and some of my
associates are saying this is not workable, not only in postgreSQL but
in any database. Arguments given are listed below. Kindly comment and
if possible suggest solutions and pointers to further reading.

- This is Time Series Data (I don't know what that is except that it
relates to data marked/related to time) and not suited to a RDBMS.
- You need it in Esper (a CEP engine used by Marketcetera, an open
source trading platform) which needs to get a data stream, whereas an
RDBMS will return a recordset.
- A RDBMS including postgreSQL simply cannot handle the amount of
real-time data coming in from a stock exchange.
- A RDBMS simply cannot provide data, even if you build streams from
recordsets returned, at the speeds required by any technical analysis
charting program.

My level of knowledge is basic SysAdmin & Programming Concepts, not
actual programming. I will be getting someone to build a solution for
me. Need the correct direction please. Again, shoestring budget, all
open source pointers please, unless they are for reading & general
knowledge.

I am hoping for an adapted postgreSQL solution; I can't spend more than
a few hundred dollars on this....so please...HELLLPPP ;-)

With best regards.
Sanjay.

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Greg Stark
Date:
On Wed, Aug 19, 2009 at 9:22 PM, Sanjay Arora<sanjay.k.arora@gmail.com> wrote:
> - This is Time Series Data (I don't know what that is except that it
> relates to data marked/related to time) and not suited to a RDBMS.
> - You need it in Esper (a CEP engine used by Marketcetera, an open
> source trading platform) which needs to get a data stream, whereas an
> RDBMS will return a recordset.
> - A RDBMS including postgreSQL simply cannot handle the amount of
> real-time data coming in from a stock exchange.
> - A RDBMS simply cannot provide data, even if you build streams from
> recordsets returned, at the speeds required by any technical analysis
> charting program.

There are whole companies doing nothing but providing streaming RDBMSs
used by traders, airline reservation systems, etc.; cf. StreamBase and
Truviso, among others.

I'm not aware of any open source streaming databases, so on a
shoestring budget you're going to be limited to using existing tools.
There are certainly people who scale up Postgres to high OLTP traffic,
but it does require time and money as you scale up, TANSTAAFL... I'm
sure it can be done, Postgres is very flexible, though whether it's a
perfect fit with your needs or not I can't say.

You might also consider whether one of the existing streaming database
system providers would give you a development license for free or on a
trial basis if you hope to be making big money down the road.

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Merlin Moncure
Date:
On Wed, Aug 19, 2009 at 4:22 PM, Sanjay Arora<sanjay.k.arora@gmail.com> wrote:
> Hello All
>
> I want to store stock market price data in postgreSQL and some of my
> associates are saying this is not workable, not only in postgreSQL but
> in any database. Arguments given are listed below. Kindly comment and
> if possible suggest solutions and pointers to further reading.
>
> - This is Time Series Data (I don't know what that is except that it
> relates to data marked/related to time) and not suited to a RDBMS.
> - You need it in Esper (a CEP engine used by Marketcetera, an open
> source trading platform) which needs to get a data stream, whereas an
> RDBMS will return a recordset.
> - A RDBMS including postgreSQL simply cannot handle the amount of
> real-time data coming in from a stock exchange.
> - A RDBMS simply cannot provide data, even if you build streams from
> recordsets returned, at the speeds required by any technical analysis
> charting program.
>
> My level of knowledge is basic SysAdmin & Programming Concepts, not
> actual programming. I will be getting someone to build a solution for
> me. Need the correct direction please. Again, shoestring budget, all
> open source pointers please, unless they are for reading & general
> knowledge.
>
> I am hoping for an adapted postgreSQL solution, can't spend more than
> few hundred dollars for this....so please...HELLLPPP ;-)

I think your first step is to stand back, think about the problem as a
whole and see if you can't define your requirements better:

*) how much data comes in each day? how much rolls out?
*) what types of analysis of the data would you like to do?
*) what are your performance expectations?

For example, try and describe what you would like to do in detail.
The data you would import, how you would like it structured, and a
typical query.  While the amount of price data a stock market produces
is immense, these types of problems tend to scale very well
horizontally.  So let's define the basic amount of work you'd like to
do, then build off that.

merlin

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Adrian Klaver
Date:
----- "Greg Stark" <gsstark@mit.edu> wrote:

> On Wed, Aug 19, 2009 at 9:22 PM, Sanjay Arora<sanjay.k.arora@gmail.com> wrote:
> > - This is Time Series Data (I don't know what that is except that it
> > relates to data marked/related to time) and not suited to a RDBMS.
> > - You need it in Esper (a CEP engine used by Marketcetera, an open
> > source trading platform) which needs to get a data stream, whereas an
> > RDBMS will return a recordset.
> > - A RDBMS including postgreSQL simply cannot handle the amount of
> > real-time data coming in from a stock exchange.
> > - A RDBMS simply cannot provide data, even if you build streams from
> > recordsets returned, at the speeds required by any technical analysis
> > charting program.
>
> There are whole companies doing nothing but providing streaming RDBMSs
> used by traders, airline reservation systems, etc.; cf. StreamBase and
> Truviso, among others.
>
> I'm not aware of any open source streaming databases, so on a
> shoestring budget you're going to be limited to using existing tools.
> There are certainly people who scale up Postgres to high OLTP traffic,
> but it does require time and money as you scale up, TANSTAAFL... I'm
> sure it can be done, Postgres is very flexible, though whether it's a
> perfect fit with your needs or not I can't say.
>
> You might also consider whether one of the existing streaming database
> system providers would give you a development license for free or on a
> trial basis if you hope to be making big money down the road.
>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf

In the it's-a-small-world category, I just ran across a streaming
database called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/)
yesterday. It is based on Postgres 7.3.2, so approach with caution.


Adrian Klaver
aklaver@comcast.net

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Sanjay Arora
Date:
On Thu, Aug 20, 2009 at 2:45 AM, Merlin Moncure<mmoncure@gmail.com> wrote:

>
> I think your first step is to stand back, think about the problem as a
> whole and see if you can't define your requirements better:
>
> *) how much data comes in each day? how much rolls out?

Need to get some data...will have this tomorrow.

> *) what types of analysis of the data would you like to do?

Basically, what I need to do is feed the tick data to the CEP engine
as a data stream. The CEP engine creates various bar streams from it,
say 2-minute OHLC (open/high/low/close), 3-min, 5-min, 15-min, and
30-min bars, or 20-tick, 2000-tick, 5000-tick bars, and so on. These
are then fed to strategies which trade. At least this is how it is done
when the data arrives from an online data provider.
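The bar building described here can be sketched in plain Python, purely
for illustration (hypothetical `(timestamp, price)` ticks and a made-up
`ohlc_bars` helper; this is not Esper's actual API):

```python
from collections import namedtuple

Tick = namedtuple("Tick", ["ts", "price"])  # ts = seconds since epoch

def ohlc_bars(ticks, width_secs):
    """Group ticks into fixed-width OHLC bars keyed by bar start time."""
    bars = {}
    for t in ticks:
        bucket = t.ts - (t.ts % width_secs)  # start of the bar this tick falls in
        if bucket not in bars:
            bars[bucket] = [t.price, t.price, t.price, t.price]  # O, H, L, C
        else:
            bar = bars[bucket]
            bar[1] = max(bar[1], t.price)  # high
            bar[2] = min(bar[2], t.price)  # low
            bar[3] = t.price               # close = last tick seen
    return bars

ticks = [Tick(0, 10.0), Tick(30, 12.0), Tick(90, 9.0), Tick(130, 11.0)]
print(ohlc_bars(ticks, 120))
# -> {0: [10.0, 12.0, 9.0, 9.0], 120: [11.0, 11.0, 11.0, 11.0]}
```

Tick-count bars (20-tick, 2000-tick, etc.) work the same way, except the
bucket key is the running tick count divided by the bar size.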

I am sure that this could be done in a database. I had originally
thought of pre-populating views of prebuilt data bars so that building
the bars at query time would not be required; they would be built by
triggers as soon as the data arrives in Postgres. But then everyone
advised that it was exactly because of slow results on this type of
data that CEP engines were built. I was told that if I tried this, so
many triggers would be running that my processors and RAM would max
out, not to mention the exponential increase in disk space
requirements.

And this is for one symbol; on average, 2000-plus scrips need to be analysed.

Again, the analysis is backtesting trade-simulation results for various
strategies, so that means at least three years of data (five to six
years if one wants to be a stickler), though one year is the bare
minimum.

> *) what are your performance expectations?
>
Well, the simulations need to finish overnight with three years' worth
of data.

> For example, try and describe what you would like to do in detail.
> The data you would import, how you would like it structured, and a
> typical query.  While the amount of price data a stock market produces
> is immense, these types of problems tend to scale very well
> horizontally.  So let's define the basic amount of work you'd like to
> do, then build off that.
>

I'll get back on this by tomorrow. Thanks.

With best regards.
Sanjay.

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Sanjay Arora
Date:
On Thu, Aug 20, 2009 at 3:11 AM, Adrian Klaver<aklaver@comcast.net> wrote:
>
>
> In the it's-a-small-world category, I just ran across a streaming
> database called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/)
> yesterday. It is based on Postgres 7.3.2, so approach with caution.
>
Thanks Adrian,

Very interesting. The pages seem to have moved; I had to read them from
Google's cache. I will email them about the project roadmap and whether
they will be pursuing it for some time to come, or whether it is only a
minor research project that will stop as soon as the research paper is
submitted. The project's mailing list, though, shows traffic dwindling
to almost one mail per day.

However, it raised a very interesting question for me. It is based on
Postgres itself (v. 7.3.2, but that's beside the point).

What exactly is the difference, in layman sysadmin terms, between our
everyday Postgres and this time-series data? Postgres supports geo data
types and many others. What is the big issue in not having this sort of
time-series data type, and queries over it, supported natively by
Postgres?

It would be very useful for stock market data storage and analysis, and
I am sure there would be other major scientific applications, such as
real-time data acquisition and analysis.

So why is this sort of application not supported natively in Postgres?
This is the first time in my nine years of using Postgres that I've
come across a database application I cannot handle with it.

Best regards.
Sanjay.

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Tom Lane
Date:
Sanjay Arora <sanjay.k.arora@gmail.com> writes:
> I want to store stock market price data in postgreSQL and some of my
> associates are saying this is not workable, not only in postgreSQL but
> in any database. Arguments given are listed below. Kindly comment and
> if possible suggest solutions and pointers to further reading.

Actually, the way I came to use Postgres was from a company that was
working on technical analysis of market data.  I would recommend looking
at a hybrid approach.  Storing raw tick data in a DBMS at one row per
tick is indeed not very bright.  You could possibly make it work if
you throw beefy hardware at the problem, but there's not much point
because you're really not playing to a DBMS's strengths when you do
that.  Once it's arrived the data is static (barring very-infrequent
corrections, which in practice you might never do at all anyway).
And the access patterns for it (at least if you're doing the same type
of technical analysis we were doing) are extremely predictable and
linear.  So you aren't doing anything wherein SQL will shine.  On the
other hand, there are definitely components of the big picture where
SQL *can* shine.  I'd strongly recommend tracking your live trading
positions in an RDBMS, for example.  It might be worth managing your
tick data using DBMS entries that represent, say, monthly tick data
files.  Activities like removing old data would get a lot easier that
way.  (I wish this had occurred to me twelve years ago, but it's water
over the dam now.)  Any performance issues can be avoided by having
the technical analysis processes read the static tick data files
directly.  This is generally pretty similar to the frequently-asked
question "should I store a lot of big image or document files directly
in the database, or just use it as an index of external files?".
If you dig around in the PG list archives you'll find more than you want
to read about that on both sides.  If you've aggregated the ticks in
suitable volumes you can solve it either way, really, but an external
file is a bit easier to append to during raw data collection.
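Tom's file-per-month idea can be sketched as follows; sqlite3 stands in
for PostgreSQL so the example is self-contained, and the table and
column names are made up for illustration:

```python
import sqlite3

# One row per monthly tick file: the DBMS indexes the files, and the
# analysis code reads the files directly off disk.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE tick_files (
        symbol TEXT NOT NULL,
        month  TEXT NOT NULL,   -- e.g. '2009-08'
        path   TEXT NOT NULL,   -- external file holding the raw ticks
        PRIMARY KEY (symbol, month)
    )
""")
db.execute("INSERT INTO tick_files VALUES ('IBM', '2009-08', '/data/ticks/IBM/2009-08.dat')")

# Removing a whole month of old data becomes a one-row DELETE plus an
# unlink, instead of deleting millions of per-tick rows.
row = db.execute(
    "SELECT path FROM tick_files WHERE symbol = ? AND month = ?",
    ("IBM", "2009-08")).fetchone()
print(row[0])
# -> /data/ticks/IBM/2009-08.dat
```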

Don't fall into the trap of assuming that all your requirements must be
solved by a single tool.  You'll spend all your time building the
perfect tool, and go broke before you finish it.  Use a DBMS for the
parts of the problem it's good for, and go outside it for what it isn't.

BTW, we were doing full-speed tick data collection and real-time trading
analysis in the nineties, on machines that my current cell phone would
leave in the dust.  The market data volume has grown a lot since then
of course, but the price of hardware has fallen a lot more.  I'd not
recommend designing your system on the assumption that raw per-tick
speed is the be-all and end-all.

            regards, tom lane

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Sanjay Arora
Date:
Thanks Tom

For this reply and many earlier ones that have guided me around postgres ;-)

On Thu, Aug 20, 2009 at 3:34 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Sanjay Arora <sanjay.k.arora@gmail.com> writes:

>  It might be worth managing your
> tick data using DBMS entries that represent, say, monthly tick data
> files.  Activities like removing old data would get a lot easier that
> way.  (I wish this had occurred to me twelve years ago, but it's water
> over the dam now.)

Do you mean something like storing one month's worth of tick data in a
blob-type field and giving the contents to the CEP engine for further
building of the required data streams? That approach would allow big
chunks of data to be handled without caring about the location, naming,
and indexing of tick data files. Anything else you can suggest?

Best Regards.
Sanjay.

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Tom Lane
Date:
Sanjay Arora <sanjay.k.arora@gmail.com> writes:
> Do you mean something like storing one month worth tick data in a blob
> type field and giving the contents to the CEP engine for further
> building of required data streams?

Either that or put the tick data in an external file and store that
file's pathname in the database row.  As I said, you can find plenty
of argumentation on both sides of that in the PG archives.

When SSS were doing this, we had the raw tick data in one set of files
and pre-aggregated bar data in other files (I think we stored 5-min
and daily bars, but it was a long time ago).  The analysis functions
would automatically build the bar width they wanted from the widest
stored form that divided the desired width, so as to minimize what they
had to read from disk.
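The width-selection rule Tom describes (build from the widest stored
form that divides the desired width) can be sketched like this,
assuming hypothetical stored widths of 5 minutes and 1 day:

```python
def best_source_width(target_secs, stored_widths=(300, 86400)):
    """Pick the widest pre-aggregated bar width that evenly divides the
    target width, falling back to raw ticks (width 1) if none does."""
    candidates = [w for w in stored_widths if target_secs % w == 0]
    return max(candidates) if candidates else 1

print(best_source_width(900))     # 15-min bars built from 5-min bars -> 300
print(best_source_width(604800))  # weekly bars built from daily bars -> 86400
print(best_source_width(120))     # 2-min bars: no stored width fits  -> 1
```

The payoff is that a backtest asking for weekly bars reads 1/288th the
data it would need if it always started from the 5-minute bars.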

It does sound like you are doing pretty much exactly what we were
doing.  One thing to think about is that the real-time case is actually
much slower and easier to deal with than back-testing.  When you are
back-testing, you'd like to test a trading strategy over say a year's
worth of data, and you'd like that to require a bit less than a year
to run, no?  So the path you have to optimize is the one feeding stored,
pre-aggregated data into your test engine.  The people hollering about
real-time performance are just betraying that they've never built one of
these things.  I'm not familiar with this CEP software, but it sounds to
me like you want that as part of the real-time data collection process
and nowhere near your back-testing data path.

Another little tip: if your trading strategies are like ours were,
they need to ramp up for actual trading by pre-scanning some amount of
historical data.  So you're going to need a mechanism that starts by
reading the historical data at high speed and smoothly segues into
reading the real-time feed (at which the passage of time in the model
suddenly slows to one-to-one with real time).  Also consider what
happens when your machine crashes (it happens) and you need to not only
redo that startup process, but persuade the model that its actual
trading position is whatever you had open.  Or you changed the model a
bit and restart as above.  The model might now wish it was in a different
position, but it has to cope with your actual position.
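The historical-to-live handover can be sketched as a generator that
drains the stored ticks first, then filters the live feed for anything
the replay already covered (hypothetical `(timestamp, price)` tuples):

```python
def spliced_feed(historical, live):
    """Replay stored ticks at full speed, then hand over to the live
    feed.  Live ticks no newer than the last historical tick are dropped
    so the model never sees out-of-order or duplicate data."""
    last_ts = None
    for tick in historical:           # runs as fast as disk allows
        last_ts = tick[0]
        yield tick
    for tick in live:                 # now paced by real time
        if last_ts is None or tick[0] > last_ts:
            yield tick

hist = [(1, 10.0), (2, 10.5)]
live = [(2, 10.5), (3, 11.0)]         # the overlap at ts=2 is skipped
print(list(spliced_feed(hist, live)))
# -> [(1, 10.0), (2, 10.5), (3, 11.0)]
```

To the consuming model the two phases look identical; only the pace of
arriving ticks changes at the handover point.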

The crash reliability of a DBMS is a strong reason why you track your
live positions in one, btw ... your brokerage isn't going to forget you
were short IBM, even if your model crashes.

            regards, tom lane

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Sanjay Arora
Date:
On Thu, Aug 20, 2009 at 4:23 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

> It does sound like you are doing pretty much exactly what we were
> doing.  One thing to think about is that the real-time case is actually
> much slower and easier to deal with than back-testing.  When you are
> back-testing, you'd like to test a trading strategy over say a year's
> worth of data, and you'd like that to require a bit less than a year
> to run, no?  So the path you have to optimize is the one feeding stored,
> pre-aggregated data into your test engine.  The people hollering about
> real-time performance are just betraying that they've never built one of
> these things.  I'm not familiar with this CEP software, but it sounds to
> me like you want that as part of the real-time data collection process
> and nowhere near your back-testing data path.
>
CEP (Complex Event Processing; see the explanatory paragraph at
http://esper.codehaus.org/) is basically running queries on data
streams in memory (I understood that as RDBMS tables in memory), on
time-series data (whatever that is)... there is no concept of storage.

BTW, would you please look back a bit in this thread at a post (by
Adrian Klaver, see
http://archives.postgresql.org/pgsql-general/2009-08/msg00745.php)
about a database built by a Berkeley computer science department
project on Postgres, which stores and queries time-series data just
like this and returns data in a stream instead of a recordset. It is
said that this type of database is made for analyzing this type of
data. It has a normal Postgres mode for returning records and a
streaming mode for returning results on an ongoing basis, which I think
is nothing but a TCP/IP socket receiving results continuously.

I have also posted a question in reply (see
http://archives.postgresql.org/pgsql-general/2009-08/msg00748.php)
about why Postgres itself does not have this (whatever this time-series
data is in mathematical terms) when that project itself extends
Postgres 7.3.2. Will you please have a look at those posts and that
project, which builds on Postgres itself?

And please explain in layman terms what these people are doing
differently, and why we cannot have time-series data types (we do have
GIS and tons of other data types) and stream-based result returns.
Maybe it is not of immediate use to me, but I understand ODBMS and
RDBMS, and I want to understand this at a conceptual level, at least.
>
> Another little tip: if your trading strategies are like ours were,
> they need to ramp up for actual trading by pre-scanning some amount of
> historical data.  So you're going to need a mechanism that starts by
> reading the historical data at high speed and smoothly segues into
> reading the real-time feed (at which the passage of time in the model
> suddenly slows to one-to-one with real time).  Also consider what
> happens when your machine crashes (it happens) and you need to not only
> redo that startup process, but persuade the model that its actual
> trading position is whatever you had open.  Or you changed the model a
> bit and restart as above.  The model might now wish it was in a different
> position, but it has to cope with your actual position.
>
> The crash reliability of a DBMS is a strong reason why you track your
> live positions in one, btw ... your brokerage isn't going to forget you
> were short IBM, even if your model crashes.

That's very good advice... we had already planned it that way. And yes,
the brokerages are not going to care that we ran out of money while our
server was restarting ;-)

>
>                        regards, tom lane
>

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Alvaro Herrera
Date:
Adrian Klaver escribió:

> In the it's-a-small-world category, I just ran across a streaming
> database called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/)
> yesterday. It is based on Postgres 7.3.2, so approach with caution.

TelegraphCQ became Truviso.

I clearly remember somebody talking about how Truviso was all set up to
solve the stock ticker problem, but if you go to their site now they
don't mention stock at all.  Does this mean anything?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Greg Smith
Date:
On Wed, 19 Aug 2009, Adrian Klaver wrote:

> In the it's-a-small-world category, I just ran across a streaming
> database called Telegraph (http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/)
> yesterday. It is based on Postgres 7.3.2, so approach with caution.

TelegraphCQ was the research project that Truviso, the company where I
work, originated from.  Having taken considerably more mature code of this
type than what's posted there and applied it to market data, I can tell
you that just having a streaming database doesn't make that many of the
problems go away.  Streaming database technology certainly helps compute
technical indicators and such more efficiently than having to pull the
data back out of the database, but that's only a small subset of what you
have to do in order to build a complete trading system.

As for Sanjay's project, it sounds like it's underfunded by between two
and four orders of magnitude.  Before getting involved with Truviso, I
once wrote a PostgreSQL based automatic trading system that pulled in
streaming market data and made trading decisions based on it, with
backtesting and everything.  Took about six months of full time work to
get the first version working, and I doubt that approach would scale up to
more than a few hundred active symbols even on modern hardware.

The libraries provided by Marketcetera would give you a big head start
on the coding compared to where I began, but without a serious
development budget you're going to be stuck taking somebody's entire
software stack as-is. You should be asking programmatic traders whether
there are complete applications ready to go here, not asking about the
database details, because you're really not going to have any choice
besides using whatever the app is designed around. Given your lack of
development background, you sound more like a Tradestation or Metastock
customer than someone who is going to assemble this app yourself.

Sorry to sound like a downer here, but your problem is much bigger than
you think it is and I'm not sure where to even begin sorting out the bits
you're going to need but don't have yet.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Greg Smith
Date:
On Thu, 20 Aug 2009, Sanjay Arora wrote:

> What exactly is the difference, in layman sysadmin terms, between our
> everyday Postgres and this time-series data?

There is a good introduction to where regular databases fail to solve
time-series data problems at http://cs.nyu.edu/shasha/papers/jagtalk.html

As mentioned there, one of the very hard problems to solve in a general
way is how to deal with all the gaps in your data when the market isn't
open, which really complicates how you compute indicators.  For example,
in regular SQL you might compute an average over some period using
something like:

select sum(x) / count(*) from ticks where ts >= t1 and ts <= t2

You might think that you could pick t1 and t2 here based on the number of
samples you want to average; let's say you want an average over the last
minute of data, so you try this:

t1 = <something>
t2 = t1 + interval '1 minute'

This completely falls apart when the data isn't continuous.  If the
market was closed for some period between t1 and t2, you need to use a
rolling window over the data you've got instead.  As of PostgreSQL 8.4's
support for SQL window functions, it's easier to compute this sort of
thing, but there are still plenty of rough spots to get nailed by.
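One workaround is to window by observation count rather than wall-clock
time; a minimal sketch in Python, assuming ticks as hypothetical
`(timestamp, price)` tuples:

```python
from collections import deque

def rolling_mean_by_count(ticks, n):
    """Average over the last n observations, regardless of how much
    wall-clock time they span -- immune to market-closed gaps, unlike a
    fixed time-interval average."""
    window = deque(maxlen=n)   # deque evicts the oldest price automatically
    out = []
    for ts, price in ticks:
        window.append(price)
        out.append(sum(window) / len(window))
    return out

# Friday close (ts 100, 160) followed by Monday open (ts 172900): a
# one-minute time window finds no earlier data at Monday's first tick,
# but a 2-observation window still produces a value.
ticks = [(100, 10.0), (160, 12.0), (172900, 14.0)]
print(rolling_mean_by_count(ticks, 2))
# -> [10.0, 11.0, 13.0]
```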

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Stock Market Price Data & postgreSQL? HELLPPP Please

From
Greg Smith
Date:
On Wed, 19 Aug 2009, Tom Lane wrote:

> BTW, we were doing full-speed tick data collection and real-time trading
> analysis in the nineties, on machines that my current cell phone would
> leave in the dust.  The market data volume has grown a lot since then
> of course, but the price of hardware has fallen a lot more.

Trading volume always grows to where it's just possible to keep up with it
using a well designed app on the fastest hardware available.  If you're
always faster than someone else there's money to be made from them using
that fact.  The continuous arms race for the fastest scanning and
execution platform keeps volume moving upward in lock step with what
hardware is capable of.  The last mainstream news article on this topic
was http://www.nytimes.com/2009/07/24/business/24trading.html

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD