Re: Approach to Data Summary and Analysis - Mailing list pgsql-general

From Robert DiFalco
Subject Re: Approach to Data Summary and Analysis
Date
Msg-id CAAXGW-ySnmq587HQvxGCH0PkixhvqRE+NmEZJH0=inZCt9_iPg@mail.gmail.com
Whole thread Raw
In response to Re: Approach to Data Summary and Analysis  (Roxanne Reid-Bennett <rox@tara-lu.com>)
Responses Re: Approach to Data Summary and Analysis  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Re: Approach to Data Summary and Analysis  (Roxanne Reid-Bennett <rox@tara-lu.com>)
List pgsql-general
1. >500K rows per day into the calls table.
2. Very rarely. The only common query is gathering users that have not been called "today" (along with some other qualifying criteria). More analytical queries/reports are done for internal use and it is not essential that they be lickity-split.  
a. Usually just one connection at a time executes read queries.
b. the users not called today query will be done once a day.
c. Daily
d. All users for the last year (if you are asking about retention). We will also rarely have to run for all time.
e. Not that I'm aware of (or seen) today.
f. For the simple queries we cannot afford latency between calls and querying who was already called.

While I don't seem to be getting much support for it here :D my write performance (which is most essential) has been much better since I further normalized the tables and made it so that NULL is never used and data is never updated (i.e. it is immutable once it is written). 

As for wanting to avoid NULLs I don't really know what to say. Obviously some times NULL's are required. For this design I don't really need them and they make the data harder to reason about (because they are kind of open to interpretation).  They can also give you different results than you sometimes expect (for example when looking for a non matching key, you start having to inject some OR IS NULLs and such). Also, the absence of null can make a lot of queries more optimal). That said, I understand where you all are coming from with de-normalization. It's definitely the path of the least resistance. Our instinct is to want to see all related data in a single table when possible. 

The summary table was really a separate point from whether or not people liked my schema or not -- I mean whether I de-normalize as people are asking or not, there would still be the question of a summary table for MAX and COUNT queries or to not have a summary table for those. I probably made the original question too open ended. 
 


On Tue, Apr 15, 2014 at 3:26 PM, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:
On 4/14/2014 12:27 PM, Robert DiFalco wrote:
And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a user answered a call" or "How many times has a user been called".
...

Sometimes I might want to get this data for a whole bunch of users.
...

So the other option is to create a call_summary table that is updated with triggers.
...

My only issue with a summary table is that I don't want a bunch of null fields.
...

But over time all fields would eventually become non-NULL.

So that leads me to a summary table for EACH call state. This approach has the down side that it creates a lot of tables and triggers. It has the upside of being pretty efficient without having to deal with NULL values.  It's also pretty easy to reason about.
...

So for my question -- is the choice between these a personal preference sort of thing or is there a right or wrong approach? Am I missing another approach that would be better?  I'm okay with SQL but I'm not expert so I'm not sure if there is an accepted DESIGN PATTERN for this that I am missing.

There is no right or wrong - there is better, worse, best, and worst for any specific scenario.  In my experience, most people have time/money to get to an 80% "better" design than all the other answers during design and then it gets refined over time.  And yes, personal experience does play a part in how people interpret better/worse [aka religion] ;)

I didn't see anybody ask these questions - and to identify "better" -  they have to be asked.
1. How much data are you feeding into your system how fast?
       this directly affects your choices on distribution, parallel processing...  writes vs updates vs triggers for copying vs all reads
       [and if on bare metal - potentially where you place your logs, indexes, core lookup tables, etc]
2. How much data are you reading out of your system - how fast?
       you have given "simple" use cases (how many calls completed within a time frame or to a number)
       you have given very slightly more complex use cases (when was the last time John answered a call)
       you have given a slightly more bulky processing question of (how many times have these users been called)
So...
      a) How many users executing read queries do you have?
      b) What is the expected load for simple queries (per week/day/hour/minute - depending upon your resolution on speed)
      c) What is the expected load for your mid-line complex queries
      d) What is the "maximum" volume you expect a bulk query to go after (like all users in the last 18 years, or this city's users in the last day?)  and how frequently will that kind of query be executed?  How much tolerance for delay do your users have?
      e) do you have any known really complex queries that might bog the system down?
      f) How much lag time can you afford between capture and reporting?

Answers to the above define your performance requirements - which defines the style of schema you need.  Queries can be written to pull data from any schema design - but how fast they can perform or how easily they can be created...

Chris and Vincent both targeted a balance between writes and reads - which adequately answers 80-85% of the usages out there.  But you didn't give us any of the above - so their recommendation (while very likely valid) may not actually fit your case at all.

As to design patterns -
"Generally" a database schema is more normalized for an operational system because normalization results in fewer writes/updates and lowers the risk of corruption if a failure takes place. It also isolates updates for any specific value to one location minimizing internally caused data corruption.
Reporting systems are generally less normalized because writes are more one-time and reads are where the load occurs.
Sometimes you have to use data replication to have a system that appropriately supports both.

you have shown you are already aware of normalization.
If you weren't aware of approaches to Data Warehousing... you can review information about how it is accomplished
- see the blogs on kimballgroup DOT com  they cover a lot of high(er) level concepts with enough specificity to be of some direct use.
[that website also covers some ideas for "Big Data" which aren't necessarily limited to RDBMS']

Specify your performance requirements, then figure out your schema design.

FWIW I don't understand your (or any other person's) hesitancy for "lots of" "NULL" values.  They provide meaning in a number of different ways... not the least of which is that you don't know (yet) - which is knowledge in and of itself.

Roxanne




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: [GENARAL] round() bug
Next
From: Gavin Flower
Date:
Subject: Re: Approach to Data Summary and Analysis