Home > mailing lists

Optimal time series sampling. - Mailing list pgsql-general

From	Ted Byers
Subject	Optimal time series sampling.
Date	November 8, 2007 13:10:23
Msg-id	157205.25819.qm@web88305.mail.re4.yahoo.com Whole thread Raw
Responses	Re: Optimal time series sampling.
List	pgsql-general

Tree view

As a prelude to where I really want to go, please
consider the following SELECT statement.

  SELECT close_price FROM stockprices A
  WHERE price_date =
     (SELECT MAX(price_date) FROM stockprices B
      WHERE A.stock_id = B.stock_id AND A.stock_id =
id);

stockprices has a primary key comprised of stock_id
and price_date, and I tried the same query with an
extra inex on price_date (but that index made no
difference in apparent performance as seen on the
clock on the wall).

I have been advised (on the MySQL board), to use the
following (with the claim, unsupported as far as I can
tell, that it is both correct and much faster - it
appears to be correct, but it is certainly no faster):

  SELECT A.`close` AS close_price
  FROM stockprices A LEFT JOIN stockprices B
    ON A.stock_id = B.stock_id
      AND B.price_date > A.price_date
      WHERE B.price_date IS NULL
          AND A.stock_id = id;

It appears to do the right thing.  I certainly get the
right answer, but I am not seeing a significant
difference in performance.  Worse, when I invoke
something like it for a suite of about two dozen
stocks, it takes about ten minutes to complete.  (I
may try a variant in which the last clause used in
WHERE is replaced by IN followed by a trivial select
that gets the same two dozen stock_ids, to see if that
helps.)

Now, I am concerned with performance because, and this
is where I really want to go, I want to adapt this
logic to create new time series of closing prices, but
at the granularity of a week, a month or quarter, and
there is no predicting a priori how long the series
is.  IBM's data goes back decades while I have data
for other stocks that go back only a couple years.

Now, a junior programmer here had suggested just doing
a simple select, at least for weekly granularity, by
selecting a value if it's day of the week computes to
Friday.  That can't work correctly because in some
weeks, there are statutory holidays that land on
Fridays, resulting in the last actual trading day for
that week being Thursday.  His simple approach
guarantees that many records that ought to be included
will be ignored.  I need a more dynamic and flexible
approach which allows me to work on the basis that I
have prices for all trading days for a given stock
from the time my data for it begins.  So I need a more
complex select statement that will just select the
most recent price for a given stock for each week (or
month or quarter or year).

Now, I can get the full time series for two dozen
stocks, as slow and brain dead as doing a select for
each stock ID, AND have my Java code construct and
display a chart, in less than 20 seconds (and Java
does not have a reputation for being fast).  I need
whatever solution I use to be that quick.

Any thoughts about how best to attack this in order to
get the correct results as fast as is possible?  What
options would you consider, WRT defining the SQL
statements you would benchmark, in order to design
your benchmark testing?

Thanks,

Ted

pgsql-general by date:

From: Bill Moran
Date: 08 November 2007, 13:02:37
Subject: Re: Calculation for Max_FSM_pages : Any rules of thumb?

From: Tomáš Vondra
Date: 08 November 2007, 13:19:14
Subject: Re: INSERT performance deteriorates quickly during a large import

Optimal time series sampling. - Mailing list pgsql-general

Previous

Next