Re: Approach to Data Summary and Analysis - Mailing list pgsql-general

From Roxanne Reid-Bennett
Subject Re: Approach to Data Summary and Analysis
Date
Msg-id 534EA4CA.5070401@tara-lu.com
Whole thread Raw
In response to Re: Approach to Data Summary and Analysis  (Robert DiFalco <robert.difalco@gmail.com>)
Responses Re: Approach to Data Summary and Analysis
List pgsql-general
On 4/15/2014 9:10 PM, Robert DiFalco wrote:
> 1. >500K rows per day into the calls table.
> 2. Very rarely. The only common query is gathering users that have not
> been called "today" (along with some other qualifying criteria). More
> analytical queries/reports are done for internal use and it is not
> essential that they be lickity-split.
> a. Usually just one connection at a time executes read queries.
> b. the users not called today query will be done once a day.
> c. Daily
> d. All users for the last year (if you are asking about retention). We
> will also rarely have to run for all time.
> e. Not that I'm aware of (or seen) today.
> f. For the simple queries we cannot afford latency between calls and
> querying who was already called.
>
> While I don't seem to be getting much support for it here :D my write
> performance (which is most essential) has been much better since I
> further normalized the tables and made it so that NULL is never used
> and data is never updated (i.e. it is immutable once it is written).

Based on the above you are primarily capturing data and feeding back
essentially one easy to find result set [who has NOT been successfully
called] on an ongoing single threaded basis [once per day?].  So you are
absolutely correct  - tune for writing speed.

> The summary table was really a separate point from whether or not
> people liked my schema or not -- I mean whether I de-normalize as
> people are asking or not, there would still be the question of a
> summary table for MAX and COUNT queries or to not have a summary table
> for those. I probably made the original question too open ended.
>
Do you know your answer?
you said : "Occasionally I will want to know things like "
you answered to frequency on queries as "the users not called today
query will be done once a day." as was c) [I'm assuming once?]
and d) appears to be "ad-hoc" and you said your users can deal with
latency in response for those.

So finding Min/Max/Count quickly really *don't* matter for tuning.

So the only reason I can see to add a summary table is to  ... simplify
maintenance [note I did NOT say "development"] and then only IF it
doesn't impact the write speeds beyond an acceptable level.  Proper
internal / external documentation can mitigate maintenance nightmares.
If your developer(s) can't figure out how to get the data they need from
the schema - then give them the queries to run. [you are likely better
at tuning those anyway]

Last consideration - business consumption of data does change over
time.  Disk space is cheap [but getting and keeping speed sometimes
isn't].  You might consider including ongoing partial archival of the
operational data during slow usage (write) periods.

Roxanne


pgsql-general by date:

Previous
From: Dev Kumkar
Date:
Subject: Re: Heartbleed Impact
Next
From: Dev Kumkar
Date:
Subject: Re: Heartbleed Impact