Thread: Optimal time series sampling.

Optimal time series sampling.

From
Ted Byers
Date:
As a prelude to where I really want to go, please
consider the following SELECT statement.

  SELECT close_price FROM stockprices A
  WHERE price_date =
     (SELECT MAX(price_date) FROM stockprices B
      WHERE A.stock_id = B.stock_id AND A.stock_id = id);

stockprices has a primary key composed of stock_id and
price_date.  I tried the same query with an extra index
on price_date, but that index made no difference in
apparent performance, as seen on the clock on the wall.

I have been advised (on the MySQL board) to use the
following, with the claim, unsupported as far as I can
tell, that it is both correct and much faster.  It
appears to be correct, but it is certainly no faster:

  SELECT A.`close` AS close_price
  FROM stockprices A LEFT JOIN stockprices B
    ON A.stock_id = B.stock_id
      AND B.price_date > A.price_date
  WHERE B.price_date IS NULL
    AND A.stock_id = id;

It appears to do the right thing: as I understand it, a
row in A is the most recent for its stock precisely when
the left join finds no B row with a later date, so
B.price_date comes back NULL.  I certainly get the right
answer, but I am not seeing a significant difference in
performance.  Worse, when I invoke something like it for
a suite of about two dozen stocks, it takes about ten
minutes to complete.  (I may try a variant in which the
final equality in the WHERE clause is replaced by IN
followed by a trivial select that gets the same two dozen
stock_ids, to see if that helps; a sketch follows.)
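
For concreteness, the variant I have in mind looks
roughly like this ("watchlist" is just a stand-in for
wherever the two dozen stock_ids end up living):

  SELECT A.`close` AS close_price
  FROM stockprices A LEFT JOIN stockprices B
    ON A.stock_id = B.stock_id
      AND B.price_date > A.price_date
  WHERE B.price_date IS NULL
    -- "watchlist" is hypothetical; any trivial select
    -- returning the two dozen stock_ids would do here
    AND A.stock_id IN (SELECT stock_id FROM watchlist);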

Now, I am concerned with performance because, and this is
where I really want to go, I want to adapt this logic to
create new time series of closing prices, but at the
granularity of a week, a month, or a quarter, and there
is no predicting a priori how long the series is.  IBM's
data goes back decades, while I have data for other
stocks that goes back only a couple of years.

Now, a junior programmer here suggested just doing a
simple select, at least for weekly granularity, by
selecting a value if its day of the week is Friday.  That
can't work correctly because some weeks contain statutory
holidays that land on a Friday, making the last actual
trading day for that week a Thursday.  His simple
approach guarantees that many records that ought to be
included will be ignored (see the sketch below).  I need
a more dynamic and flexible approach that allows me to
work on the basis that I have prices for all trading days
for a given stock from the time my data for it begins.
So I need a more complex select statement that will just
select the most recent price for a given stock for each
week (or month or quarter or year).
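
For reference, his suggestion amounts to something like
the following (in MySQL, DAYOFWEEK() returns 6 for
Friday), which silently drops every holiday-shortened
week:

  SELECT price_date, close_price FROM stockprices
  -- DAYOFWEEK(): 1 = Sunday ... 7 = Saturday, so 6 is
  -- Friday; weeks whose last trading day is a Thursday
  -- simply vanish from the result
  WHERE stock_id = 1 AND DAYOFWEEK(price_date) = 6;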

Now, even doing something as slow and brain-dead as a
separate select for each stock ID, I can get the full
time series for two dozen stocks AND have my Java code
construct and display a chart in less than 20 seconds
(and Java does not have a reputation for being fast).  I
need whatever solution I use to be that quick.

Any thoughts about how best to attack this in order to
get the correct results as fast as possible?  Which SQL
statements would you consider benchmarking, and how would
you design the benchmark tests?

Thanks,

Ted

Re: Optimal time series sampling.

From
Gregory Stark
Date:
"Ted Byers" <r.ted.byers@rogers.com> writes:

> As a prelude to where I really want to go, please
> consider the following SELECT statement.
>
>   SELECT close_price FROM stockprices A
>   WHERE price_date =
>      (SELECT MAX(price_date) FROM stockprices B
>       WHERE A.stock_id = B.stock_id AND A.stock_id = id);

I assume you're missing another "stock_id = id" on the outer query?

I think you'll have to post the actual explain analyze output you're getting
and the precise schema you have. You might need an index on
<stock_id,price_date>.

> It appears to do the right thing.  I certainly get the
> right answer, but I am not seeing a significant
> difference in performance.  Worse, when I invoke
> something like it for a suite of about two dozen
> stocks, it takes about ten minutes to complete.

That would be an entirely different ball of wax than trying to pull out a
single stock's closing price. I suspect you're going to want to use Postgres's
"DISTINCT ON" SQL extension. Something like:

SELECT DISTINCT ON (stock_id) *
  FROM stockprices
 ORDER BY stock_id, price_date DESC

And you may want an index on <stock_id, price_date DESC>
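
That is, something along these lines (the index name is
arbitrary, and DESC columns in indexes are a newer
Postgres feature):

-- the index name here is arbitrary
CREATE INDEX stockprices_stock_date_idx
    ON stockprices (stock_id, price_date DESC);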

I believe MySQL does have a similar extension where you can use GROUP BY and
have columns listed in the select target list which aren't included in the
grouping sets.

> So I need a more complex select statement that will just select the most
> recent price for a given stock for each week (or month or quarter or year).

Do you care what happens if there were no trades for a given stock in the time
period? The query you give above using MAX would still work but the query I
described using DISTINCT ON would not emit a record for the stock at all.
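
For the per-period version, the DISTINCT ON approach
would look something like this (untested sketch):

SELECT DISTINCT ON (stock_id, date_trunc('week', price_date)) *
  FROM stockprices
  -- swap 'week' for 'month' or 'quarter' as needed
 ORDER BY stock_id, date_trunc('week', price_date),
          price_date DESC;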

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Re: Optimal time series sampling.

From
Ted Byers
Date:
--- Gregory Stark <stark@enterprisedb.com> wrote:

> "Ted Byers" <r.ted.byers@rogers.com> writes:
>
> > As a prelude to where I really want to go, please
> > consider the following SELECT statement.
> >
> >   SELECT close_price FROM stockprices A
> >   WHERE price_date =
> >      (SELECT MAX(price_date) FROM stockprices B
> >       WHERE A.stock_id = B.stock_id AND A.stock_id = id);
>
> I assume you're missing another "stock_id = id" on
> the outer query?
>
Right.

> I think you'll have to post the actual explain analyze
> output you're getting and the precise schema you have.

OK, it is challenging to present it in plain text, but
here is the HTML exported by MySQL Query Browser.  If you
cut between the "====" lines and paste the content into a
file with an .html extension, it will look fine and be
easy to read.
======================================
<html>
<head>
<title>Query EXPLAIN SELECT price_date,`close` AS
close_price FROM stockprices A   WHERE A.stock_id = 1
AND price_date IN      (SELECT MAX(price_date) FROM
stockprices B       WHERE A.stock_id = B.stock_id AND
A.stock_id = 1 GROUP BY
YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov
09 11:12:46 2007
</title>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
</head>
<body><h1>Query EXPLAIN SELECT price_date,`close` AS
close_price FROM stockprices A   WHERE A.stock_id = 1
AND price_date IN      (SELECT MAX(price_date) FROM
stockprices B       WHERE A.stock_id = B.stock_id AND
A.stock_id = 1 GROUP BY
YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov
09 11:12:46 2007
</h1>
<table border=1 cellspacing=1 cellpadding=0><tr>

<th>id</th><th>select_type</th><th>table</th><th>type</th><th>possible_keys</th><th>key</th><th>key_len</th><th>ref</th><th>rows</th><th>Extra</th></tr>
<tr>

<td>1</td><td>PRIMARY</td><td>A</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using
where</td></tr>
<tr>
<td>2</td><td>DEPENDENT
SUBQUERY</td><td>B</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using
where; Using index; Using temporary; Using
filesort</td></tr>
</table>
</body></html>
=======================================================

> You might need an index on <stock_id,price_date>.
>
That is the definition of the primary key.  That is
why both the main query and the subquery are shown
using the primary key.

> That would be an entirely different ball of wax than
> trying to pull out a single stock's closing price. I
> suspect you're going to want to use Postgres's
> "DISTINCT ON" SQL extension. Something like:
>
> SELECT DISTINCT ON (stock_id) *
>   FROM stockprices
>  ORDER BY stock_id, price_date DESC
>
> And you may want an index on <stock_id, price_date DESC>
>
That pair, as I mentioned, forms the primary key of the
stockprices table.

Here is my SQL for subsampling a time series:

SELECT price_date, close_price FROM stockprices A
  WHERE A.stock_id = 1 AND price_date IN
     (SELECT MAX(price_date) FROM stockprices B
      WHERE A.stock_id = B.stock_id AND A.stock_id = 1
      GROUP BY YEAR(B.price_date), WEEKOFYEAR(B.price_date));

This performs better than the outer join approach did for
getting only the last price for a couple dozen stocks:
this statement crawls to completion in about four or five
minutes, as compared to over ten minutes to get just the
last price for a couple dozen stocks.

Not too surprisingly, EXPLAIN gives the same results for
this query as it did for the simpler SELECT above.
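
For the month and quarter variants, I expect to just swap
the grouping, along these lines (untested):

SELECT price_date, close_price FROM stockprices A
  WHERE A.stock_id = 1 AND price_date IN
     (SELECT MAX(B.price_date) FROM stockprices B
      WHERE A.stock_id = B.stock_id AND A.stock_id = 1
      -- MONTH(B.price_date) would give the monthly series
      GROUP BY YEAR(B.price_date), QUARTER(B.price_date));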

> Do you care what happens if there were no trades for a
> given stock in the time period? The query you give
> above using MAX would still work but the query I
> described using DISTINCT ON would not emit a record for
> the stock at all.
>
No.  But then I haven't yet analyzed the data to learn
what the prices for a given period really mean if there
haven't been any trades within the period.  I have yet to
see a series of prices for which the volume is 0.  That
may be an artifact of how my colleagues selected stocks
and ETFs to use to test our algorithm.  I do not yet know
if it will be a significant issue for us, since I don't
see how a stock or ETF that can go for a while without
any trades at all would be of interest, given the kind of
information we will be producing for our clients.  I am
presently focussed on developing ways of looking at the
data, to help my colleagues better understand the data
and what our algorithm does with it.

One of the properties of interest, and of relevance to
our algorithm, is that these series seem to be
self-affine (as described most notably by B. Mandelbrot):
hence the need to sample with different degrees of
granularity.  My colleagues have worked primarily with
finance data (esp. commodities), while my background is
more focussed on risk management in environmental
science.

Thanks

Ted

Re: Optimal time series sampling.

From
brian
Date:
Ted Byers wrote:
> --- Gregory Stark <stark@enterprisedb.com> wrote:
>
>>I think you'll have to post the actual explain analyze
>>output you're getting and the precise schema you have.
>
>
> OK, it is challenging to present it in plain text, but
> here is the HTML exported by MySQL Query Browser.  If
> you cut between the "====" lines and paste the content
> into a file with an .html extension, it will look fine
> and be easy to read.

How about *you* cut & paste into a file, save that, then open it in a
browser, then cut & paste the resulting text into a mail?

And if you happen to be using an operating system that refuses to let go
of formatting instructions when copying plaintext, there's always Notepad.

brian

Re: Optimal time series sampling.

From
Tom Lane
Date:
Ted Byers <r.ted.byers@rogers.com> writes:
> OK, it is challenging to present it in plain text, but
> here is the HTML exported by MySQL Query Browser.

Why are you asking this list for help with a MySQL performance problem?

            regards, tom lane

Re: Optimal time series sampling.

From
Ted Byers
Date:
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ted Byers <r.ted.byers@rogers.com> writes:
> > OK, it is challenging to present it in plain text,
> > but here is the HTML exported by MySQL Query Browser.
>
> Why are you asking this list for help with a MySQL
> performance problem?
>
because my question isn't really about MySQL, but
rather about how best to construct the SQL required to
get the job done, regardless of what database is used.
 I have seen some claims that it is better to use
joins instead of correlated subqueries and others that
say the opposite.  And I do not, at this stage, know
if there are other options in SQL that may or may not
be better.  At this time, the database in use is
irrelevant (I want to stick as close to the ANSI
standard as practicable so the rewriting required will
be minimal should we decide to change the database
later, for whatever reason).

Thanks,

Ted

Re: Optimal time series sampling.

From
Tom Lane
Date:
Ted Byers <r.ted.byers@rogers.com> writes:
> --- Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Why are you asking this list for help with a MySQL
>> performance problem?
>>
> because my question isn't really about MySQL, but
> rather about how best to construct the SQL required to
> get the job done, regardless of what database is used.
>  I have seen some claims that it is better to use
> joins instead of correlated subqueries and others that
> say the opposite.

The problem is that the answer may well be different for different
databases.  You'd get better answers about MySQL on a MySQL list
than you'll get here.

            regards, tom lane

Re: Optimal time series sampling.

From
"Scott Marlowe"
Date:
On Nov 9, 2007 11:47 AM, Ted Byers <r.ted.byers@rogers.com> wrote:
>
> --- Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Ted Byers <r.ted.byers@rogers.com> writes:
> > > OK, it is challenging to present it in plain text,
> > > but here is the HTML exported by MySQL Query Browser.
> >
> > Why are you asking this list for help with a MySQL
> > performance problem?
> >
> because my question isn't really about MySQL, but
> rather about how best to construct the SQL required to
> get the job done, regardless of what database is used.
>  I have seen some claims that it is better to use
> joins instead of correlated subqueries and others that
> say the opposite.  And I do not, at this stage, know
> if there are other options in SQL that may or may not
> be better.

Which is better depends largely on how your database is built.  MySQL
still uses loops for all subselects, so with large numbers of tuples
in the subselect method, it will be slow.  But they might fix this in
a later release.

Fairly recent versions of PostgreSQL could make some bad
choices when doing joins for certain datasets that would
be much faster with a correlated subquery (specifically,
the old left join where righttable.field is null trick
made some pgsql versions choose an inefficient join
method).

So, the "right" way is a question of which db, and even which version
of that DB you're on.


>  At this time, the database in use is
> irrelevant (I want to stick as close to the ANSI
> standard as practicable so the rewriting required will
> be minimal should we decide to change the database
> later, for whatever reason).

If you want to stick with ANSI, MySQL tends to be more
divergent from the spec than pgsql and other RDBMSes.

Most people would consider the correlated subquery the
better method.  But it's also likely to be the slowest on
MySQL.

Re: Optimal time series sampling.

From
Ted Byers
Date:
--- Scott Marlowe <scott.marlowe@gmail.com> wrote:

> On Nov 9, 2007 11:47 AM, Ted Byers
> <r.ted.byers@rogers.com> wrote:
> >
> > --- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > > Ted Byers <r.ted.byers@rogers.com> writes:
> [snip]
> Which is better depends largely on how your database is
> built.  MySQL still uses loops for all subselects, so
> with large numbers of tuples in the subselect method, it
> will be slow.  But they might fix this in a later
> release.
>
> Fairly recent versions of PostgreSQL could make some bad
> choices when doing joins for certain datasets that would
> be much faster with a correlated subquery (specifically,
> the old left join where righttable.field is null trick
> made some pgsql versions choose an inefficient join
> method).
>
> So, the "right" way is a question of which db, and even
> which version of that DB you're on.
>
My life just got soooooooo much more complicated.  Oh
well, I guess that will be useful when providing
advice to management when they start seeing
performance issues.  Thanks ;-)

Have you looked at version 5.0.45?  I keep seeing the
claim that the left join trick is so much faster than the
correlated subquery, especially if a function like MAX()
is used, but the numbers I am seeing with real stock
price data have the correlated subquery running, on
average, about three times faster than the left join.  So
I assumed I was doing something wrong, in a manner that
would get me the right answer the slowest way possible.


> > At this time, the database in use is irrelevant (I
> > want to stick as close to the ANSI standard as
> > practicable so the rewriting required will be minimal
> > should we decide to change the database later, for
> > whatever reason).
>
> If you want to stick with ANSI, MySQL tends to be more
> divergent from the spec than pgsql and other RDBMSes.
>
The books I am using, which describe the SQL language,
don't seem to mention or illustrate much difference among
any of the RDBMSes (including my references that cover
Oracle and MS SQL Server).  The SQL I try from those
books seems to work reasonably well in all of them (I
can't check against Oracle, though, since I don't have
it), and I try most of my SQL against MySQL, Postgres and
MS SQL Server (the biggest divergences seem to be in how
bulk loading of data happens).  Maybe I haven't explored
enough of the SQL language, with large enough datasets,
to see the differences you mention; or perhaps things are
improving with all of them.

> Most people would consider the correlated subquery the
> better method.  But it's also likely to be the slowest
> on MySQL.
>
Right now, with this particular query, the correlated
subquery gets me the right answers about three times
faster than any other method I have tried.  But it still
takes several minutes to get the results for only a few
dozen stocks.  And yet I can get several megabytes of
data from the following query in about a quarter of the
time.

SELECT price_date, close_price FROM stockprices A
  WHERE A.stock_id = 1 AND price_date IN
     (SELECT MAX(B.price_date) FROM stockprices B
      WHERE A.stock_id = B.stock_id AND A.stock_id = 1
      GROUP BY YEAR(B.price_date), WEEKOFYEAR(B.price_date));

I had expected this to take many times longer than the
"simple" select that gets only the last price for a
given stock, but I was surprised to see it so much
faster than the query that gets just the last prices
for only a couple dozen stocks.

Thanks a lot.  I learned a lot from your reply.

Ted
