Re: Q:Aggregrating Weekly Production Data. How do you do it? - Mailing list pgsql-general

From: Scott Marlowe
Subject: Re: Q:Aggregrating Weekly Production Data. How do you do it?
Date:
Msg-id: dcc563d10709180755u2fd8ce2cub060a5b992cf043f@mail.gmail.com
In response to: Q:Aggregrating Weekly Production Data. How do you do it?  (Ow Mun Heng <Ow.Mun.Heng@wdc.com>)
Responses: Re: Q:Aggregrating Weekly Production Data. How do you do it?
List: pgsql-general
On 9/17/07, Ow Mun Heng <Ow.Mun.Heng@wdc.com> wrote:

> Just wondering how everyone is doing aggregration of production data.

Where I work, we aggregate by the minute in the application, then dump
those stats entries into the database en masse from each machine.  We
wind up with rows something like this (simplified):

id | parentid | host    | timestamp           | request   | total | success | fail1 | fail2 | fail3 | totalresptime | maxresptime
 1 | NULL     | jboss01 | 2007-09-17 12:02:03 | authcc    |   120 |     112 |     4 |     2 |     2 |        480000 |       12039
 2 | 1        | jboss01 | 2007-09-17 12:02:03 | ccconnect |   120 |     118 |     0 |     0 |     2 |        423900 |       10394

where id comes from a sequence, and parentid ties child stats to
their parents, i.e. in this example authcc called ccconnect, and so
on.  fail1/2/3 are types of failures, totalresptime is the response
times of all the successful requests added together (so that
totalresptime/success = the average for that minute), and maxresptime
is the longest running request ending in that minute.
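
For reference, the table looks roughly like this (untested sketch; the
types are guesses from the sample rows above, and ours has more columns):

CREATE TABLE stats (
    id            bigserial PRIMARY KEY,       -- from a sequence
    parentid      bigint REFERENCES stats(id), -- ties child stats to parent
    host          text NOT NULL,
    "timestamp"   timestamp NOT NULL,          -- quoted since it's also a type name
    request       text NOT NULL,
    total         integer NOT NULL,            -- requests in that minute
    success       integer NOT NULL,
    fail1         integer NOT NULL,            -- failure counts by type
    fail2         integer NOT NULL,
    fail3         integer NOT NULL,
    totalresptime bigint NOT NULL,             -- sum of successful response times
    maxresptime   integer NOT NULL             -- slowest request ending that minute
);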

Then we can aggregate those minutes together, monitor individual
machine performance, etc.  Ours is actually more complex than this,
but you get the idea.
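
An hourly per-host rollup, for instance, is just something like this
(again a sketch against the table above):

SELECT host,
       date_trunc('hour', "timestamp") AS hour,
       sum(total)   AS total,
       sum(success) AS success,
       sum(totalresptime)::numeric / nullif(sum(success), 0) AS avg_resp,
       max(maxresptime) AS max_resp
FROM stats
GROUP BY host, date_trunc('hour', "timestamp")
ORDER BY host, hour;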

We have a cron job that checks the statistics every x minutes for high
failure rates and generates an alert email if any of our requests go
over a preset threshold.  Most of the time this catches problems long
before anything interesting shows up in the logs.
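
The check itself is a query along these lines (the 5 minute window and
10% threshold here are made up for the example; ours are tunable):

SELECT request,
       sum(fail1 + fail2 + fail3)::numeric / nullif(sum(total), 0) AS fail_rate
FROM stats
WHERE "timestamp" >= now() - interval '5 minutes'
GROUP BY request
HAVING sum(fail1 + fail2 + fail3)::numeric / nullif(sum(total), 0) > 0.10;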

> (getting min/max/count isn't much of an issue. Stdev is the main issue I
> believe)
>
> One such instance I've read about is..

Isn't stddev() working for you?

What I do is aggregate the minute time slices by grouping by
date_trunc('xxx', timestamp), then use that as a subselect in a query
that does the stddev() on the outside.  Works pretty well for us, and
makes it easy to identify trends.
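
For example, grouping by hour (a sketch, using the table from above):

SELECT request,
       avg(avg_resp)    AS mean_resp,
       stddev(avg_resp) AS stddev_resp
FROM (
    SELECT request,
           date_trunc('hour', "timestamp") AS period,
           sum(totalresptime)::numeric / nullif(sum(success), 0) AS avg_resp
    FROM stats
    GROUP BY request, date_trunc('hour', "timestamp")
) AS hourly
GROUP BY request;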

One of the things we started doing is aggregating the last six weeks'
data by day of week and hour, then comparing the last 24 hours' worth
of data against those six weeks to see trends for each hour of each
day.  The queries are huge and ugly, but they work, and they run in
about 4 minutes on a fairly good-sized chunk of data.  We put about
150k to 300k entries a day into this db.
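
Stripped way down, the shape of that comparison is something like this
(the real thing is much hairier):

SELECT r.request, r.hod,
       r.avg_resp AS recent_resp,
       b.avg_resp AS baseline_resp
FROM (
    -- last 24 hours, averaged per request / day-of-week / hour-of-day
    SELECT request,
           extract(dow  FROM "timestamp") AS dow,
           extract(hour FROM "timestamp") AS hod,
           sum(totalresptime)::numeric / nullif(sum(success), 0) AS avg_resp
    FROM stats
    WHERE "timestamp" >= now() - interval '1 day'
    GROUP BY 1, 2, 3
) r
JOIN (
    -- prior six weeks, same buckets
    SELECT request,
           extract(dow  FROM "timestamp") AS dow,
           extract(hour FROM "timestamp") AS hod,
           sum(totalresptime)::numeric / nullif(sum(success), 0) AS avg_resp
    FROM stats
    WHERE "timestamp" >= now() - interval '6 weeks'
      AND "timestamp" <  now() - interval '1 day'
    GROUP BY 1, 2, 3
) b ON (r.request = b.request AND r.dow = b.dow AND r.hod = b.hod)
ORDER BY r.avg_resp / nullif(b.avg_resp, 0) DESC;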
