Re: Optimal time series sampling. - Mailing list pgsql-general
From | Ted Byers |
---|---|
Subject | Re: Optimal time series sampling. |
Date | |
Msg-id | 659898.58604.qm@web88315.mail.re4.yahoo.com Whole thread Raw |
In response to | Re: Optimal time series sampling. (Gregory Stark <stark@enterprisedb.com>) |
Responses |
Re: Optimal time series sampling.
Re: Optimal time series sampling. |
List | pgsql-general |
--- Gregory Stark <stark@enterprisedb.com> wrote: > "Ted Byers" <r.ted.byers@rogers.com> writes: > > > As a prelude to where I really want to go, please > > consider the following SELECT statement. > > > > SELECT close_price FROM stockprices A > > WHERE price_date = > > (SELECT MAX(price_date) FROM stockprices B > > WHERE A.stock_id = B.stock_id AND A.stock_id > = id); > > I assume you're missing another "stock_id = id" on > the outer query? > Right. > I think you'll have to post the actual explain > analyze output you're getting > and the precise schema you have. OK, it is challenging to present it in plain text, but here is the HTML exported by MySQL Query Browser. If you cut between the "====" lines and paste the content into a file with an html extension, it will look fine and be easy to read. ====================================== <html> <head> <title>Query EXPLAIN SELECT price_date,`close` AS close_price FROM stockprices A WHERE A.stock_id = 1 AND price_date IN (SELECT MAX(price_date) FROM stockprices B WHERE A.stock_id = B.stock_id AND A.stock_id = 1 GROUP BY YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov 09 11:12:46 2007 </title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body><h1>Query EXPLAIN SELECT price_date,`close` AS close_price FROM stockprices A WHERE A.stock_id = 1 AND price_date IN (SELECT MAX(price_date) FROM stockprices B WHERE A.stock_id = B.stock_id AND A.stock_id = 1 GROUP BY YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov 09 11:12:46 2007 </h1> <table border=1 cellspacing=1 cellpadding=0><tr> <th>id</th><th>select_type</th><th>table</th><th>type</th><th>possible_keys</th><th>key</th><th>key_len</th><th>ref</th><th>rows</th><th>Extra</th></tr> <tr> <td>1</td><td>PRIMARY</td><td>A</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using where</td></tr> <tr> <td>2</td><td>DEPENDENT SUBQUERY</td><td>B</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using where; Using index; Using temporary; Using filesort</td></tr> </table> </body></html> ======================================================= > You might need an > index on > <stock_id,price_date>. > That is the definition of the primary key. That is why both the main query and the subquery are shown using the primary key. > That would be an entirely different ball of wax than > trying to pull out a > single stock's closing price. I suspect you're going > to want to use Postgres's > "DISTINCT ON" SQL extension. Something like: > > SELECT DISTINCT ON (stock_id,price_date) * > FROM stockprices > ORDER BY stock_id, price_date DESC > > And you may want an index on < stock_id, price_date > DESC > > That pair, as I mentioned, formed the primary key for the stockprices table. Here is my SQL for subsampling a time series: SELECT price_date,close_price FROM stockprices A WHERE A.stock_id = 1 AND price_date IN (SELECT MAX(price_date) FROM stockprices B WHERE A.stock_id = B.stock_id AND A.stock_id = 1 GROUP BY YEAR(B.price_date),WEEKOFYEAR(B.price_date)); This performs better than the outer join algorithm for getting only the last price for a couple dozen stocks. This particular statement crawls to completion in about 4 or 5 minutes, as compared to over ten to get just the last price for a couple dozen stocks. Not too surprisingly, Explain gives identical results for this query as it did for the simpler SELECT above. > Do you care what happens if there were no trades for > a given stock in the time > period? The query you give above using MAX would > still work but the query I > described using DISTINCT ON would not emit a record > for the stock at all. > No. But then I haven't yet analyzed the data to learn what the prices for a given period really mean if there haven't been any trades within the period. I have yet to see a series of prices for which the volume is 0. That may be an artifact of how my colleagues selected stocks and etfs to use to test our algorithm. I do not yet know if it will be a significant issue for us since I don't see how a stock or etf that can go for a while without any trades at all would be of interest given the kind of information we will be producing for our clients. I am presently focussed on developing ways of looking at the data, to help my colleagues better understand the data and what our algorithm does with it. One of the properties of interest, and relevance to our algorithm is the common property that the series seem to be self affine (as described most notably by B. Mandlebrot): hence the need to sample with different degrees of granularity. My colleagues have worked primarily with finance data (esp. commodities), while my background is more focussed on risk management in environmental science. Thanks Ted
pgsql-general by date: