Home > mailing lists

Re: Optimal time series sampling. - Mailing list pgsql-general

From	Ted Byers
Subject	Re: Optimal time series sampling.
Date	November 9, 2007 15:47:32
Msg-id	659898.58604.qm@web88315.mail.re4.yahoo.com Whole thread Raw
In response to	Re: Optimal time series sampling. (Gregory Stark <stark@enterprisedb.com>)
Responses	Re: Optimal time series sampling. Re: Optimal time series sampling.
List	pgsql-general

Tree view

--- Gregory Stark <stark@enterprisedb.com> wrote:

> "Ted Byers" <r.ted.byers@rogers.com> writes:
>
> > As a prelude to where I really want to go, please
> > consider the following SELECT statement.
> >
> >   SELECT close_price FROM stockprices A
> >   WHERE price_date =
> >      (SELECT MAX(price_date) FROM stockprices B
> >       WHERE A.stock_id = B.stock_id AND A.stock_id
> = id);
>
> I assume you're missing another "stock_id = id" on
> the outer query?
>
Right.

> I think you'll have to post the actual explain
> analyze output you're getting
> and the precise schema you have.

OK, it is challenging to present it in plain text, but
here is the HTML exported by MySQL Query Browser.  If
you cut between the  "====" lines and paste the
content into a file with an html extension, it will
look fine and be easy to read.
======================================
<html>
<head>
<title>Query EXPLAIN SELECT price_date,`close` AS
close_price FROM stockprices A   WHERE A.stock_id = 1
AND price_date IN      (SELECT MAX(price_date) FROM
stockprices B       WHERE A.stock_id = B.stock_id AND
A.stock_id = 1 GROUP BY
YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov
09 11:12:46 2007
</title>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
</head>
<body><h1>Query EXPLAIN SELECT price_date,`close` AS
close_price FROM stockprices A   WHERE A.stock_id = 1
AND price_date IN      (SELECT MAX(price_date) FROM
stockprices B       WHERE A.stock_id = B.stock_id AND
A.stock_id = 1 GROUP BY
YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov
09 11:12:46 2007
</h1>
<table border=1 cellspacing=1 cellpadding=0><tr>

<th>id</th><th>select_type</th><th>table</th><th>type</th><th>possible_keys</th><th>key</th><th>key_len</th><th>ref</th><th>rows</th><th>Extra</th></tr>
<tr>

<td>1</td><td>PRIMARY</td><td>A</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using
where</td></tr>
<tr>
<td>2</td><td>DEPENDENT
SUBQUERY</td><td>B</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using
where; Using index; Using temporary; Using
filesort</td></tr>
</table>
</body></html>
=======================================================

> You might need an
> index on
> <stock_id,price_date>.
>
That is the definition of the primary key.  That is
why both the main query and the subquery are shown
using the primary key.

> That would be an entirely different ball of wax than
> trying to pull out a
> single stock's closing price. I suspect you're going
> to want to use Postgres's
> "DISTINCT ON" SQL extension. Something like:
>
> SELECT DISTINCT ON (stock_id,price_date) *
>   FROM stockprices
>  ORDER BY stock_id, price_date DESC
>
> And you may want an index on < stock_id, price_date
> DESC >
>
That pair, as I mentioned, formed the primary key for
the stockprices table.

Here is my SQL for subsampling a time series:

SELECT price_date,close_price FROM stockprices A
  WHERE A.stock_id = 1 AND price_date IN
     (SELECT MAX(price_date) FROM stockprices B
      WHERE A.stock_id = B.stock_id AND A.stock_id = 1
GROUP BY YEAR(B.price_date),WEEKOFYEAR(B.price_date));

This performs better than the outer join algorithm for
getting only the last price for a couple dozen stocks.
 This particular statement crawls to completion in
about 4 or 5 minutes, as compared to over ten to get
just the last price for a couple dozen stocks.

Not too surprisingly, Explain gives identical results
for this query as it did for the simpler SELECT above.

> Do you care what happens if there were no trades for
> a given stock in the time
> period? The query you give above using MAX would
> still work but the query I
> described using DISTINCT ON would not emit a record
> for the stock at all.
>
No.  But then I haven't yet analyzed the data to learn
what the prices for a given period really mean if
there haven't been any trades within the period.  I
have yet to see a series of prices for which the
volume is 0.  That may be an artifact of how my
colleagues selected stocks and etfs to use to test our
algorithm.  I do not yet know if it will be a
significant issue for us since I don't see how a stock
or etf that can go for a while without any trades at
all would be of interest given the kind of information
we will be producing for our clients.  I am presently
focussed on developing ways of looking at the data, to
help my colleagues better understand the data and what
our algorithm does with it.

One of the properties of interest, and relevance to
our algorithm is the common property that the series
seem to be self affine (as described most notably by
B. Mandlebrot): hence the need to sample with
different degrees of granularity.  My colleagues have
worked primarily with finance data (esp. commodities),
while my background is more focussed on risk
management in environmental science.

Thanks

Ted

pgsql-general by date:

From: "Krasimir Hristozov \(InterMedia Ltd\)"
Date: 09 November 2007, 15:24:11
Subject: Re: INSERT performance deteriorates quickly during a large import

From: Tom Hart
Date: 09 November 2007, 15:55:31
Subject: pg_dumpall and authentication

Re: Optimal time series sampling. - Mailing list pgsql-general

Previous

Next