Thread: Strategy for doing number-crunching
We have a database with approximately 130M rows, and we need to produce statistics (e.g. mean, standard deviation, etc.) on the data. Right now, we're generating these stats via a single SELECT, and it is extremely slow...like it can take hours to return results. The queries are variable, so there is no way I can think of to "pre-calculate" the stats.
Here is an example query...
SELECT count(mae) AS count, to_char(avg(mae), 'FM990.999') AS mae,
to_char(avg(sqrt(rms)), 'FM990.999') AS rms, to_char(avg(bias), 'FM990.999') AS bias,
to_char(max(mae), 'FM990.999') AS mae_max, to_char(min(mae), 'FM990.999') AS mae_min,
to_char(max(bias), 'FM990.999') AS bias_max, to_char(min(bias), 'FM990.999') AS bias_min,
to_char(max(sqrt(rms)), 'FM990.999') AS rms_max, to_char(min(sqrt(rms)), 'FM990.999') AS rms_min,
to_char(stddev(sqrt(rms)), 'FM990.999') AS rms_stddev,
to_char(stddev(mae), 'FM990.999') AS mae_stddev,
to_char(stddev(bias), 'FM990.999') AS bias_stddev
FROM verify_data.verification_data
WHERE model_name='foo'
AND...several other conditions...
As one might imagine, this query pounds the CPU relentlessly for hours.
I'm thinking there must be a better approach for doing large amounts of calculations on the data. Any pointers would be greatly appreciated!
Matt
Matthew Foster <matthew.foster@noaa.gov> writes:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results.

What datatype are the columns being averaged? If "numeric", consider
casting to float8 before applying the aggregates. You'll lose some
precision but it'll likely be orders of magnitude faster.

regards, tom lane
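A minimal sketch of the cast Tom suggests, using column names from the example query; it only applies if the columns were numeric (Matt confirms below that they are already double precision):

    -- Illustrative only: cast numeric columns to float8 before aggregating.
    SELECT avg(mae::float8)       AS mae,
           stddev(mae::float8)    AS mae_stddev,
           avg(sqrt(rms::float8)) AS rms
    FROM   verify_data.verification_data
    WHERE  model_name = 'foo';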
On Wed, Jan 4, 2012 at 11:36 AM, Matthew Foster <matthew.foster@noaa.gov> wrote:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results. The queries are variable,
> so there is no way I can think of to "pre-calculate" the stats.
>
> I'm thinking there must be a better approach for doing large amounts of
> calculations on the data. Any pointers would be greatly appreciated!

I'm assuming that someone will point out a quicker way with postgres,
so I'll mention another solution.

It may be quicker to pull the data out of the database and calculate
in-memory using something like R (take a look at the ff package if
memory is an issue). In R, a numeric vector of length 130M uses about
1Gb (a little less). Calculating both a mean and stddev on this
vector takes a total of 750ms on my Mac laptop.

Sean
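A hedged sketch of the export step Sean describes, done from the database side; the file path and column list are illustrative, not from the thread:

    -- Illustrative only: dump just the needed columns once, then compute the
    -- statistics outside the database (e.g. in R, as Sean suggests).
    COPY (
        SELECT mae, sqrt(rms) AS rms_sqrt, bias
        FROM   verify_data.verification_data
        WHERE  model_name = 'foo'
    ) TO '/tmp/verification_foo.csv' WITH CSV HEADER;
    -- (Server-side COPY TO a file needs superuser; psql's \copy does the
    -- same thing client-side.)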
On Wed, Jan 04, 2012 at 11:55:38AM -0500, Sean Davis wrote:
> I'm assuming that someone will point out a quicker way with postgres,
> so I'll mention another solution.
>
> It may be quicker to pull the data out of the database and calculate
> in-memory using something like R (take a look at the ff package if
> memory is an issue). In R, a numeric vector of length 130M uses about
> 1Gb (a little less). Calculating both a mean and stddev on this
> vector takes a total of 750ms on my Mac laptop.
>
> Sean

Maybe pl/R would be an option as well.

Ken
On Wed, Jan 4, 2012 at 11:58 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> Maybe pl/R would be an option as well.

Good point. I do not know what the performance is for these types of
aggregates using Pl/R. Perhaps Joe Conway or others could comment.

Sean
On Wed, 4 Jan 2012 10:36:16 -0600 Matthew Foster <matthew.foster@noaa.gov> wrote:

Hi Matt,

I'm not a PG guru nor a stats one, so also wait for more skilled answers. :)

You're using a lot of nested functions that take a lot of CPU, so IMHO the
timing won't be easy to reduce without adding some columns that "pre-chew"
part of the work. For example, sqrt() appears often enough in your query
that its result could be stored in its own column; that costs a little time
at insertion, but saves a lot while querying.

An EXPLAIN ANALYSE would also be welcome, as there's a real risk it will
show that some calculations are done for *each* row. For the values that
can't be pre-calculated, maybe calculating and storing some data first
would help (I'm thinking of min(), max(), avg()), depending on what
EXPLAIN ANALYSE shows.

JY

--
Q: What's the difference between hard and dark?
A: It stays dark all night.
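A minimal sketch of JY's two suggestions; the rms_sqrt column name is hypothetical, and on 130M rows the one-off backfill would itself be a long-running statement:

    -- Hypothetical: store sqrt(rms) once instead of recomputing it per row.
    ALTER TABLE verify_data.verification_data
        ADD COLUMN rms_sqrt double precision;

    UPDATE verify_data.verification_data
       SET rms_sqrt = sqrt(rms);   -- one-off backfill; batch it in practice

    -- The aggregates can then read the precomputed column directly:
    --   avg(rms_sqrt), stddev(rms_sqrt), min(rms_sqrt), max(rms_sqrt)

    -- And the plan check JY asks for:
    EXPLAIN ANALYZE
    SELECT avg(mae), stddev(mae)
      FROM verify_data.verification_data
     WHERE model_name = 'foo';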
On Wed, Jan 4, 2012 at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Matthew Foster <matthew.foster@noaa.gov> writes:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results.

What datatype are the columns being averaged? If "numeric", consider
casting to float8 before applying the aggregates. You'll lose some
precision but it'll likely be orders of magnitude faster.

regards, tom lane
Tom,
The data are type double.
Matt
Matthew Foster <matthew.foster@noaa.gov> writes:
> On Wed, Jan 4, 2012 at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Matthew Foster <matthew.foster@noaa.gov> writes:
>>> We have a database with approximately 130M rows, and we need to produce
>>> statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
>>> we're generating these stats via a single SELECT, and it is extremely
>>> slow...like it can take hours to return results.
>> What datatype are the columns being averaged? If "numeric", consider
>> casting to float8 before applying the aggregates. You'll lose some
>> precision but it'll likely be orders of magnitude faster.
> The data are type double.

Hmm. In that case I think you have some other problem that's hidden in
details you didn't show us. It should not take "hours" to process only
130M rows. This would best be taken up on pgsql-performance; please see
http://wiki.postgresql.org/wiki/Slow_Query_Questions

regards, tom lane
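Roughly the kind of detail the Slow Query Questions page asks for; the specific settings shown are common examples, not a complete checklist:

    -- Gathered with psql before posting to pgsql-performance:
    SELECT version();          -- server version
    SHOW shared_buffers;       -- memory-related settings of interest
    SHOW work_mem;

    \d verify_data.verification_data   -- table and index definitions (psql)

    -- Actual plan with timing and I/O counts (BUFFERS needs 9.0+;
    -- use plain EXPLAIN ANALYZE on older versions):
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(mae), avg(mae), stddev(mae)
      FROM verify_data.verification_data
     WHERE model_name = 'foo';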
On 01/04/2012 09:04 AM, Sean Davis wrote:
> Good point. I do not know what the performance is for these types of
> aggregates using Pl/R. Perhaps Joe Conway or others could comment.

Depending on your approach it should be similar to native R performance.
For example, you could use the spi interface to pull all the data at once
into a dataframe, and then process as Sean mentioned.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
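A hedged sketch of what Joe describes; the function name and the single-column pull are illustrative assumptions, and it presumes the PL/R extension is installed:

    -- Hypothetical PL/R function: pg.spi.exec() runs the query through SPI
    -- and returns an R data frame, so R does the aggregation in one pass.
    CREATE OR REPLACE FUNCTION verify_data.mae_mean_stddev()
    RETURNS float8[] AS $body$
        df <- pg.spi.exec("SELECT mae
                             FROM verify_data.verification_data
                            WHERE model_name = 'foo'")
        c(mean(df$mae), sd(df$mae))   -- returned to SQL as a two-element float8[]
    $body$ LANGUAGE plr;

    -- Example call:
    --   SELECT verify_data.mae_mean_stddev();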
On Jan 4, 2012, at 11:36 AM, Matthew Foster wrote:
> We have a database with approximately 130M rows, and we need to produce
> statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
> we're generating these stats via a single SELECT, and it is extremely
> slow...like it can take hours to return results. The queries are variable,
> so there is no way I can think of to "pre-calculate" the stats.
>
> I'm thinking there must be a better approach for doing large amounts of
> calculations on the data. Any pointers would be greatly appreciated!

Make sure you have the right indexes.
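A sketch of the kind of index that advice points at; the column choice is a guess based on the one filter visible in the example query:

    -- Hypothetical index on the visible filter column.  It only helps if the
    -- WHERE clause is selective; if most of the 130M rows match, the
    -- aggregates still have to read nearly everything.
    CREATE INDEX verification_data_model_name_idx
        ON verify_data.verification_data (model_name);

    -- If the "several other conditions" always accompany model_name, a
    -- multicolumn or partial index covering those columns may help more.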
On Wed, Jan 4, 2012 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Matthew Foster <matthew.foster@noaa.gov> writes:
> On Wed, Jan 4, 2012 at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Matthew Foster <matthew.foster@noaa.gov> writes:
>>> We have a database with approximately 130M rows, and we need to produce
>>> statistics (e.g. mean, standard deviation, etc.) on the data. Right now,
>>> we're generating these stats via a single SELECT, and it is extremely
>>> slow...like it can take hours to return results.
>> What datatype are the columns being averaged? If "numeric", consider
>> casting to float8 before applying the aggregates. You'll lose some
>> precision but it'll likely be orders of magnitude faster.
> The data are type double.
Hmm. In that case I think you have some other problem that's hidden in
details you didn't show us. It should not take "hours" to process only
130M rows. This would best be taken up on pgsql-performance; please see
http://wiki.postgresql.org/wiki/Slow_Query_Questions
regards, tom lane
Tom,
I think you are absolutely right. Some additional testing, with the arithmetic removed from the queries, still shows very slow performance.
I'll do some more digging, and perhaps take this to the performance list. Thanks for your advice!
Matt
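For reference, a stripped-down timing test of the sort Matt describes (arithmetic removed) might look roughly like this; the exact query he used isn't shown in the thread:

    -- Hypothetical minimal test: if even this takes hours, the bottleneck is
    -- the scan and filter, not the math.
    \timing on                  -- psql timing
    SELECT count(*)
      FROM verify_data.verification_data
     WHERE model_name = 'foo';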