Thread: Strategy for doing number-crunching

Strategy for doing number-crunching

From: Matthew Foster
We have a database with approximately 130M rows, and we need to produce statistics (e.g. mean, standard deviation, etc.) on the data.  Right now, we're generating these stats via a single SELECT, and it is extremely slow...like it can take hours to return results.  The queries are variable, so there is no way I can think of to "pre-calculate" the stats.

Here is an example query...

SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999') AS bias,
to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS mae_min,
to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999') AS bias_min,
to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)), 'FM990.999') AS rms_min,
to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
to_char(stddev(mae), 'FM990.999') AS mae_stddev,
to_char(stddev(bias), 'FM990.999') AS bias_stddev
FROM verify_data.verification_data
WHERE model_name='foo'
AND...several other conditions...

As one might imagine, this query pounds the CPU relentlessly for hours.

I'm thinking there must be a better approach for doing large amounts of calculations on the data.  Any pointers would be greatly appreciated!

Matt

Re: Strategy for doing number-crunching

From: Tom Lane
Matthew Foster <matthew.foster@noaa.gov> writes:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results.

What datatype are the columns being averaged?  If "numeric", consider
casting to float8 before applying the aggregates.  You'll lose some
precision but it'll likely be orders of magnitude faster.
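
As a rough sketch of that suggestion against the column names in the
example query (worth doing only if those columns really are numeric):

-- Sketch only: cast each presumed-numeric column to float8 once, so the
-- aggregates run in float arithmetic instead of numeric arithmetic.
SELECT count(mae) AS count,
       to_char(avg(mae::float8), 'FM990.999')       AS mae,
       to_char(stddev(mae::float8), 'FM990.999')    AS mae_stddev,
       to_char(avg(sqrt(rms::float8)), 'FM990.999') AS rms
FROM verify_data.verification_data
WHERE model_name='foo';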

            regards, tom lane

Re: Strategy for doing number-crunching

From: Sean Davis
On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results.  The queries are variable,
> so there is no way I can think of to "pre-calculate" the stats.
>
> Here is an example query...
>
> SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
> to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999')
> AS bias,
> to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS
> mae_min,
> to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999')
> AS bias_min,
> to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)),
> 'FM990.999') AS rms_min,
> to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
> to_char(stddev(mae), 'FM990.999') AS mae_stddev,
> to_char(stddev(bias), 'FM990.999') AS bias_stddev
> FROM verify_data.verification_data
> WHERE model_name='foo'
> AND...several other conditions...
>
> As one might imagine, this query pounds the CPU relentlessly for hours.
>
> I'm thinking there must be a better approach for doing large amounts of
> calculations on the data.  Any pointers would be greatly appreciated!

I'm assuming that someone will point out a quicker way with postgres,
so I'll mention another solution.

It may be quicker to pull the data out of the database and calculate
in-memory using something like R (take a look at the ff package if
memory is an issue).  In R, a numeric vector of length 130M uses about
1Gb (a little less).  Calculating both a mean and stddev on this
vector takes a total of 750ms on my Mac laptop.
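
One way to hand the data to an external tool such as R is a server-side
COPY (or psql's \copy, which writes the file on the client instead); a
rough sketch, with a hypothetical output path:

-- Export just the columns and rows needed, then aggregate them outside
-- the database.
COPY (SELECT mae, rms, bias
      FROM verify_data.verification_data
      WHERE model_name='foo')
TO '/tmp/verification_foo.csv' WITH (FORMAT csv, HEADER);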

Sean

Re: Strategy for doing number-crunching

From: ktm@rice.edu
On Wed, Jan 04, 2012 at 11:55:38AM -0500, Sean Davis wrote:
> On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote:
> > We have a database with approximately 130M rows, and we need to produce
> > statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
> > we're generating these stats via a single SELECT, and it is extremely
> > slow...like it can take hours to return results.  The queries are variable,
> > so there is no way I can think of to "pre-calculate" the stats.
> >
> > Here is an example query...
> >
> > SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
> > to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999')
> > AS bias,
> > to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS
> > mae_min,
> > to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999')
> > AS bias_min,
> > to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)),
> > 'FM990.999') AS rms_min,
> > to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
> > to_char(stddev(mae), 'FM990.999') AS mae_stddev,
> > to_char(stddev(bias), 'FM990.999') AS bias_stddev
> > FROM verify_data.verification_data
> > WHERE model_name='foo'
> > AND...several other conditions...
> >
> > As one might imagine, this query pounds the CPU relentlessly for hours.
> >
> > I'm thinking there must be a better approach for doing large amounts of
> > calculations on the data.  Any pointers would be greatly appreciated!
>
> I'm assuming that someone will point out a quicker way with postgres,
> so I'll mention another solution.
>
> It may be quicker to pull the data out of the database and calculate
> in-memory using something like R (take a look at the ff package if
> memory is an issue).  In R, a numeric vector of length 130M uses about
> 1Gb (a little less).  Calculating both a mean and stddev on this
> vector takes a total of 750ms on my Mac laptop.
>
> Sean
>

Maybe pl/R would be an option as well.

Ken

Re: Strategy for doing number-crunching

From: Sean Davis
On Wed, Jan 4, 2012 at 11:58 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> On Wed, Jan 04, 2012 at 11:55:38AM -0500, Sean Davis wrote:
>> On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote:
>> > We have a database with approximately 130M rows, and we need to produce
>> > statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
>> > we're generating these stats via a single SELECT, and it is extremely
>> > slow...like it can take hours to return results.  The queries are variable,
>> > so there is no way I can think of to "pre-calculate" the stats.
>> >
>> > Here is an example query...
>> >
>> > SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
>> > to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999')
>> > AS bias,
>> > to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS
>> > mae_min,
>> > to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999')
>> > AS bias_min,
>> > to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)),
>> > 'FM990.999') AS rms_min,
>> > to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
>> > to_char(stddev(mae), 'FM990.999') AS mae_stddev,
>> > to_char(stddev(bias), 'FM990.999') AS bias_stddev
>> > FROM verify_data.verification_data
>> > WHERE model_name='foo'
>> > AND...several other conditions...
>> >
>> > As one might imagine, this query pounds the CPU relentlessly for hours.
>> >
>> > I'm thinking there must be a better approach for doing large amounts of
>> > calculations on the data.  Any pointers would be greatly appreciated!
>>
>> I'm assuming that someone will point out a quicker way with postgres,
>> so I'll mention another solution.
>>
>> It may be quicker to pull the data out of the database and calculate
>> in-memory using something like R (take a look at the ff package if
>> memory is an issue).  In R, a numeric vector of length 130M uses about
>> 1Gb (a little less).  Calculating both a mean and stddev on this
>> vector takes a total of 750ms on my Mac laptop.
>>
>> Sean
>>
>
> Maybe pl/R would be an option as well.

Good point.  I do not know what the performance is for these types of
aggregates using Pl/R.  Perhaps Joe Conway or others could comment.

Sean

Re: Strategy for doing number-crunching

From: Jean-Yves F. Barbier
On Wed, 4 Jan 2012 10:36:16 -0600
Matthew Foster <matthew.foster@noaa.gov> wrote:

Hi Matt,

I'm not a PG guru nor a stats one, so wait for more skilled answers too :)

You're using a lot of nested functions that take a lot of CPU, so IMHO
the timing won't be easy to reduce without adding some columns that
"pre-chew" part of the work.

i.e.: sqrt() appears often enough in your query that it's worth storing
    its result in its own column; it'll cost a little time at insertion,
    but will save a lot while querying.
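
A sketch of that idea, with a hypothetical rms_sqrt column that would be
filled at insert time by the loading application or a trigger (on newer
PostgreSQL releases a generated column could do the same job):

-- One-time backfill of a precomputed sqrt(rms) value (column name is
-- hypothetical).  Afterwards the aggregates can use avg(rms_sqrt),
-- stddev(rms_sqrt), min(rms_sqrt) and max(rms_sqrt) instead of
-- recomputing sqrt() for every row.
ALTER TABLE verify_data.verification_data
    ADD COLUMN rms_sqrt double precision;
UPDATE verify_data.verification_data
    SET rms_sqrt = sqrt(rms);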

An EXPLAIN ANALYSE would also be helpful, as there's a good chance it
will show that some calculations are done for *each* row.
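
A minimal sketch of what to capture (the BUFFERS option also reports how
much I/O the scan does):

-- Actual plan and per-node timings for the slow aggregate.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(mae), avg(mae), stddev(mae)
FROM verify_data.verification_data
WHERE model_name='foo';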

For the values that can't be pre-calculated, maybe computing and storing
some intermediate data first would help (I'm thinking of min(), max(),
avg()), depending on what EXPLAIN ANALYSE shows.
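
If the variable conditions happen to line up with a small set of grouping
columns, one way to store those pre-calculated values is a periodically
refreshed summary table; a sketch, with a hypothetical table name:

-- Hypothetical per-model summary, rebuilt on whatever schedule the data
-- allows.  Queries that only filter on model_name can read this instead
-- of scanning the full 130M-row table.
CREATE TABLE verify_data.model_stats AS
SELECT model_name,
       count(mae)  AS mae_count,
       avg(mae)    AS mae_avg,
       stddev(mae) AS mae_stddev,
       min(mae)    AS mae_min,
       max(mae)    AS mae_max
FROM verify_data.verification_data
GROUP BY model_name;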

JY

Re: Strategy for doing number-crunching

From: Matthew Foster
On Wed, Jan 4, 2012 at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Matthew Foster <matthew.foster@noaa.gov> writes:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results.

What datatype are the columns being averaged?  If "numeric", consider
casting to float8 before applying the aggregates.  You'll lose some
precision but it'll likely be orders of magnitude faster.

                       regards, tom lane

Tom,

The data are type double.

Matt

Re: Strategy for doing number-crunching

From: Tom Lane
Matthew Foster <matthew.foster@noaa.gov> writes:
> On Wed, Jan 4, 2012 at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Matthew Foster <matthew.foster@noaa.gov> writes:
>>> We have a database with approximately 130M rows, and we need to produce
>>> statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
>>> we're generating these stats via a single SELECT, and it is extremely
>>> slow...like it can take hours to return results.

>> What datatype are the columns being averaged?  If "numeric", consider
>> casting to float8 before applying the aggregates.  You'll lose some
>> precision but it'll likely be orders of magnitude faster.

> The data are type double.

Hmm.  In that case I think you have some other problem that's hidden in
details you didn't show us.  It should not take "hours" to process only
130M rows.  This would best be taken up on pgsql-performance; please see
http://wiki.postgresql.org/wiki/Slow_Query_Questions
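
That page asks for things like the server version, the table definition,
the relevant settings, and the actual plan; a minimal sketch of gathering
them from psql:

-- Basic facts the pgsql-performance list will want to see.
SELECT version();
SHOW shared_buffers;
SHOW work_mem;
-- plus \d verify_data.verification_data for the table definition and the
-- EXPLAIN (ANALYZE, BUFFERS) output of the slow query.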

            regards, tom lane

Re: Strategy for doing number-crunching

From: Joe Conway
On 01/04/2012 09:04 AM, Sean Davis wrote:
> Good point.  I do not know what the performance is for these types of
> aggregates using Pl/R.  Perhaps Joe Conway or others could comment.

Depending on your approach, performance should be similar to native R.
For example, you could use the spi interface to pull all the data at
once into a dataframe, and then process as Sean mentioned.
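
A very rough sketch of that approach, assuming the plr language is
installed; pg.spi.exec() runs a query through SPI and returns the result
as an R data frame, and the returned R vector comes back as a float8
array (the function name and the single-column query are only for
illustration):

CREATE OR REPLACE FUNCTION verify_mae_stats()
RETURNS float8[] AS $$
    # Pull one column into an R data frame via PL/R's SPI interface;
    # the WHERE clause is hard-coded here purely for the sketch.
    df <- pg.spi.exec(
        "SELECT mae FROM verify_data.verification_data WHERE model_name = 'foo'")
    # Aggregate in R; the two-element numeric vector maps to float8[].
    c(mean(df$mae), sd(df$mae))
$$ LANGUAGE plr;

It could then be called as SELECT verify_mae_stats(); like any other
function.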

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support

Re: Strategy for doing number-crunching

From: János Löbb
On Jan 4, 2012, at 11:36 AM, Matthew Foster wrote:

> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results.  The queries are variable,
> so there is no way I can think of to "pre-calculate" the stats.
>
> Here is an example query...
>
> SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
> to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999') AS bias,
> to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS mae_min,
> to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999') AS bias_min,
> to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)), 'FM990.999') AS rms_min,
> to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
> to_char(stddev(mae), 'FM990.999') AS mae_stddev,
> to_char(stddev(bias), 'FM990.999') AS bias_stddev
> FROM verify_data.verification_data
> WHERE model_name='foo'
> AND...several other conditions...
>
> As one might imagine, this query pounds the CPU relentlessly for hours.
>
> I'm thinking there must be a better approach for doing large amounts of
> calculations on the data.  Any pointers would be greatly appreciated!
>
> Matt
>

Make sure you have the right indexes.
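
For the example query, an index matching the WHERE clause would be the
first thing to check; a sketch, assuming model_name (together with the
other, unspecified conditions) is selective enough for the planner to use
it:

-- Hypothetical index for the equality filter in the example query;
-- extend the column list with the other WHERE conditions if they are
-- selective.  If 'foo' matches most of the table, a sequential scan is
-- expected regardless.
CREATE INDEX verification_data_model_name_idx
    ON verify_data.verification_data (model_name);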

Re: Strategy for doing number-crunching

From: Matthew Foster


On Wed, Jan 4, 2012 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Matthew Foster <matthew.foster@noaa.gov> writes:
> On Wed, Jan 4, 2012 at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Matthew Foster <matthew.foster@noaa.gov> writes:
>>> We have a database with approximately 130M rows, and we need to produce
>>> statistics (e.g. mean, standard deviation, etc.) on the data.  Right now,
>>> we're generating these stats via a single SELECT, and it is extremely
>>> slow...like it can take hours to return results.

>> What datatype are the columns being averaged?  If "numeric", consider
>> casting to float8 before applying the aggregates.  You'll lose some
>> precision but it'll likely be orders of magnitude faster.

> The data are type double.

Hmm.  In that case I think you have some other problem that's hidden in
details you didn't show us.  It should not take "hours" to process only
130M rows.  This would best be taken up on pgsql-performance; please see
http://wiki.postgresql.org/wiki/Slow_Query_Questions

                       regards, tom lane

Tom,

I think you are absolutely right.  Some additional testing, with the arithmetic removed from the queries, still shows very slow performance.

I'll do some more digging, and perhaps take this to the performance list.  Thanks for your advice!

Matt