Thread: Approach to Data Summary and Analysis

Approach to Data Summary and Analysis

From
Robert DiFalco
Date:
I have several related tables that represent a call state. Let's think of these as phone calls to simplify things. Sometimes I need to determine the last time a user was called, the last time a user answered a call, or the last time a user completed a call. 

The basic schema is something like this:

CREATE TABLE calls (
  id      BIGINT NOT NULL, -- sequence generator
  user_id BIGINT NOT NULL,
  called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE TABLE calls_answered (
  id        BIGINT NOT NULL,
  answered  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

  PRIMARY KEY (id),
  FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
);

And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a user answered a call" or "How many times has a user been called". 

I can do these queries using a combination of MAX or COUNT. But I'm concerned about the performance. 

SELECT MAX(a.answered)
FROM calls_answered a JOIN calls c ON c.id = a.id 
WHERE c.user_id = ?;

Or the number of answered calls:

SELECT COUNT(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id 
WHERE c.user_id = ?;
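
Both of these presumably want a supporting index on the filter column; something like this, I assume:

CREATE INDEX calls_user_id_idx ON calls (user_id);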

Sometimes I might want to get this data for a whole bunch of users. For example, "give me all users who have not answered a call in the last 5 days." Or even "what percentage of users called actually answered a call." This approach could become a performance issue. So the other option is to create a call_summary table that is updated with triggers.
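
The five-day question, I assume, would look something like this against the current tables:

-- users with no answered call in the last 5 days (a sketch)
SELECT u.id
FROM my_users u
WHERE NOT EXISTS (
  SELECT 1
  FROM calls c
  JOIN calls_answered a ON a.id = c.id
  WHERE c.user_id = u.id
    AND a.answered >= now() - interval '5 days'
);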

The summary table would need fields like "user_id", "last_call_id", "call_count", "last_answered_id", "answered_count", "last_completed_id", "completed_count", etc.
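
Roughly this shape, I imagine:

-- sketch of the single summary table (column names assumed)
CREATE TABLE call_summary (
  user_id           BIGINT NOT NULL PRIMARY KEY REFERENCES my_users(id) ON DELETE CASCADE,
  last_call_id      BIGINT REFERENCES calls(id),
  call_count        BIGINT NOT NULL DEFAULT 0,
  last_answered_id  BIGINT REFERENCES calls_answered(id),
  answered_count    BIGINT NOT NULL DEFAULT 0,
  last_completed_id BIGINT,
  completed_count   BIGINT NOT NULL DEFAULT 0
);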

My only issue with a summary table is that I don't want a bunch of NULL fields. For example, if the user was called but has never answered a call, then the last_call_id and call_count fields on the summary table would be non-NULL, but the last_answered_id and answered_count fields would be NULL. Over time, though, all fields would eventually become non-NULL.

So that leads me to a summary table for EACH call state. Each summary table would have a user id, a ref_id, and a count -- one summary table for each state, e.g. call_summary, call_answered_summary, etc.
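
Each per-state summary would be tiny. For answered calls it might look like this (a sketch; the upsert syntax assumes a PostgreSQL version with INSERT ... ON CONFLICT):

CREATE TABLE call_answered_summary (
  user_id        BIGINT NOT NULL PRIMARY KEY REFERENCES my_users(id) ON DELETE CASCADE,
  ref_id         BIGINT NOT NULL REFERENCES calls_answered(id),
  answered_count BIGINT NOT NULL DEFAULT 1
);

CREATE FUNCTION bump_answered_summary() RETURNS trigger AS $$
BEGIN
  -- look up the owning user via calls, then upsert the summary row
  INSERT INTO call_answered_summary (user_id, ref_id)
  SELECT c.user_id, NEW.id FROM calls c WHERE c.id = NEW.id
  ON CONFLICT (user_id) DO UPDATE
    SET ref_id = EXCLUDED.ref_id,
        answered_count = call_answered_summary.answered_count + 1;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER call_answered_summary_trg
AFTER INSERT ON calls_answered
FOR EACH ROW EXECUTE PROCEDURE bump_answered_summary();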

This approach has the downside that it creates a lot of tables and triggers. It has the upside of being pretty efficient without having to deal with NULL values. It's also pretty easy to reason about.

So for my question -- is the choice between these a personal preference sort of thing, or is there a right or wrong approach? Am I missing another approach that would be better? I'm okay with SQL but I'm no expert, so I'm not sure if there is an accepted DESIGN PATTERN for this that I am missing.

Thanks!


Re: Approach to Data Summary and Analysis

From
Vincent Veyron
Date:

On Mon, 14 Apr 2014 09:27:29 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

> I have several related tables that represent a call state.
>
> And so on for calls_connected, calls_completed, call_errors, etc.
>
> So for my question -- is the choice between these a personal preference
> sort of thing or is there a right or wrong approach? Am I missing another
> approach that would be better?

Hi Robert,

I guess a call state is subject to change, in which case you would have to shuffle records between tables when that
happens?

ISTM you should consider using only a 'calls' table, and add an 'id_call_state' field to it that references the list of possible states. This would make your queries simpler.

create table call_state(
id_call_state text PRIMARY KEY,
libelle text);

INSERT INTO call_state (id_call_state, libelle) VALUES ('calls_connected', 'Connected'), ('calls_completed', 'Completed'), ('call_errors', 'Error');

> CREATE TABLE calls (
>   id      BIGINT NOT NULL, // sequence generator

id_call_state TEXT NOT NULL REFERENCES call_state,

>   user_id BIGINT NOT NULL,
>   called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
>
>   PRIMARY KEY (id),
>   FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
> );
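
Your per-user lookups then hit a single table; something like this, I suppose:

-- last time this user's call reached the 'completed' state (a sketch)
SELECT MAX(called)
FROM calls
WHERE user_id = ?
  AND id_call_state = 'calls_completed';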


--

                    Salutations, Vincent Veyron

http://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance


Re: Approach to Data Summary and Analysis

From
Robert DiFalco
Date:
But then I lose a bunch of data, like the TIMESTAMPTZ of the call, answer, connection, etc. BTW, these tables currently never need to be UPDATEd; they are immutable in the current design. And in the end I'm not sure how the proposal of one table with an updatable state changes the basic thrust of the question -- for example, getting last call, last answered, total called, total answered. If the state of a call transitions from called to answered, then making the state a field loses all the data associated with the previous state. Make sense?



Re: Approach to Data Summary and Analysis

From
Rob Sargent
Date:
I wonder if you really need to place the parts of the call into the various tables. Ring time, answer time, close time and all the values associated with those parts of a call are all fundamental to a single call, though perhaps collected incrementally. Easy queries, for sure. (Sorry, I haven't gone back to see your orig. schema. If it's clear there why these are in separate tables, say no more.)


Re: Approach to Data Summary and Analysis

From
Robert DiFalco
Date:
Things like this: AVG ring time before answer, average connected call duration, % of calls never answered, % of answered calls that end up connected, number of times John has answered a call versus how many times we've called him. That sort of stuff.




Re: Approach to Data Summary and Analysis

From
Vincent Veyron
Date:
On Mon, 14 Apr 2014 15:22:13 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

Hi Robert,

> But then I lose a bunch of data like the TIMESTAMPTZ of the call, answer,
> connection, etc. Btw, currently these tables never need to be UPDATEd. They
> are immutable in the current design.

Yes, but you wrote :

>Occasionally I will want to know things like "When was the last time a user
>answered a call" or "How many times has a user been called".

With your schema, you would have to query several tables each time. Queries will get messy in a hurry, and so will updates and DDL changes.


> And in the end I'm not sure how the
> proposal of one table and a state that is updatable changes the basic
> thrust of the question. For example, getting last call, last answered,
> total called, total answered. If the state of a call transitions from
> called to answered then making it a field loses all the data with the
> previous state, make sense?
>

If you need this, you can use a schema that accommodates it.

(off the top of my head: insert a new record instead of altering an existing one, and create a hierarchy with parent records that point to the original call, maybe with a 'parent' field and recursive queries. You probably have many requirements that I'm not aware of, but this one can be met)
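
A rough sketch of that idea, with hypothetical names (the originating call would carry a NULL parent):

-- one immutable row per state transition, linked back to the original call
CREATE TABLE call_events (
  id            BIGINT PRIMARY KEY,
  parent_id     BIGINT REFERENCES call_events(id), -- NULL for the original call
  user_id       BIGINT NOT NULL,
  id_call_state TEXT NOT NULL REFERENCES call_state,
  happened      TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- "last time a user answered" then needs no recursion at all
-- (assumes an 'answered' row exists in call_state)
SELECT MAX(happened)
FROM call_events
WHERE user_id = ? AND id_call_state = 'answered';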


--
                    Salutations, Vincent Veyron

http://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance


Re: Approach to Data Summary and Analysis

From
Robert DiFalco
Date:
I'm sorry Vincent, I'm not exactly sure what you are proposing. Are you proposing that I add another table, in addition to what I already have, that all the other tables JOIN to, with a state field in that parent table? How is that different from what I have, except that now I have a new table with an updatable state field? Maybe you can show a query or two to make what you are suggesting more concrete?

Right now this seems like a simple way to get the last time John answered a call:

    -- last answered call for John
    SELECT MAX(a.answered)
    FROM calls_answered a JOIN calls c ON c.id = a.id
    WHERE c.user_id = ?; -- John's user id

If I don't have a calls_answered table, I'm not sure how I would get this information from a single table with a mutable state -- unless you are suggesting denormalizing all the tables into one table that would have a lot of NULL fields. For example, answered_date would be NULL if the call was never answered.
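
I suppose in that case the query would be something like the following (MAX simply ignores the NULLs):

-- sketch against a hypothetical single flattened table
SELECT MAX(answered_date)
FROM calls
WHERE user_id = ?;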






On Tue, Apr 15, 2014 at 5:37 AM, Vincent Veyron <vv.lists@wanadoo.fr> wrote:
On Mon, 14 Apr 2014 15:22:13 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

Hi Robert,

> But then I lose a bunch of data like the TIMESTAMPTZ of the call, answer,
> connection, etc. Btw, currently these tables never need to be UPDATEd. They
> are immutable in the current design.

Yes, but you wrote :

>Occasionally I will want to know things like "When was the last time a user
>answered a call" or "How many times has a user been called".

With your schema, you would have to query several tables each time. Queries will get messy in a hurry, updates and DDL changes too.


> And in the end I'm not sure how the
> proposal of one table and a state that is updatable changes the basic
> thrust of the question. For example, getting last call, last answered,
> total called, total answered. If the state of a call transitions from
> called to answered then making it a field loses all the data with the
> previous state, make sense?
>

If you need this, you can use a schema that accommodates it.

(off the top of my head, insert a new record instead of altering an existing one, and create a hierarchy with parent records that point to an original call, may be with a 'parent' field and recursive queries. You probably have many requirements that I'm not aware of, but this one can be met)


--
                                        Salutations, Vincent Veyron

http://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance


Re: Approach to Data Summary and Analysis

From
Chris Curvey
Date:

My initial thought is: that design is over-normalized. The thing you are trying to model is the call, and it has several attributes, some of which may be unknown or not applicable (which is what NULL is for). So my thought would be to do something like this:

CREATE TABLE calls (
  id        BIGINT NOT NULL, -- sequence generator
  user_id   BIGINT NOT NULL,
  called    TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
  answered  TIMESTAMPTZ,

  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);


--
I asked the Internet how to train my cat, and the Internet told me to get a dog.

Re: Approach to Data Summary and Analysis

From
Chris Curvey
Date:



(Sorry, fat-fingered and hit "send too early"...)

CREATE TABLE calls (
  id        BIGINT NOT NULL, -- sequence generator
  user_id   BIGINT NOT NULL,
  called    TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
  answered  TIMESTAMPTZ NULL,
  connected TIMESTAMPTZ NULL,
  completed TIMESTAMPTZ NULL,

  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

Then your queries end up looking like this:

-- last time john answered
SELECT MAX(answered)
FROM calls
WHERE answered IS NOT NULL
AND user_id = ?;

-- users that have not answered a call in the last five days (I can think of a few ways to interpret that phrase)
select my_users.*
from my_users
where not exists
(  select *
  from calls
  where calls.user_id = my_users.id
  and answered >= <five days ago>);

-- average ring time
select avg(extract(epoch from answered - called))
from calls
where answered is not null;
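
And the percentage question from your original post might be something like (a sketch):

-- percentage of users called who answered at least once
select 100.0 * count(distinct case when answered is not null then user_id end)
       / nullif(count(distinct user_id), 0) as pct_answered
from calls;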



--
I asked the Internet how to train my cat, and the Internet told me to get a dog.

Re: Approach to Data Summary and Analysis

From
Vincent Veyron
Date:
On Tue, 15 Apr 2014 07:21:58 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

> I'm sorry Vincent I'm not exactly sure what you are proposing. Are you
> proposing that I add another table in addition to what I already have that
> all the other tables JOIN to and add a state field in that parent table?

No: keep the 'calls' table with an additional 'status' field, and drop the other tables.

> How is that different than what I have except now I have a new table with
> an updatable state field? Maybe you can show a query or two to more
> specifically show what you are suggesting?
>
> Right now this seems like a simple way to get the last time John was called:
>
>     // last answered called for John
>     SELECT MAX(a.answered)
>     FROM calls_answered a JOIN calls c ON c.id = a.id
>     WHERE c.user = John;
>

I probably misunderstood something from your first post. I'm not sure what you mean by calls_completed or call_errors, for instance, but I had the impression your records would go into a different table according to their status. Do calls_answered rows move to calls_completed at some point?

In that case, how do you know whether to query calls_answered instead of calls_completed? Or, if John did not answer, do you query call_errors?

> If I don't have a calls_answered table I'm not sure how I would get this
> information if I had a single table with a mutable state.

Checking an appropriate boolean field (call_answered boolean not null default false) would be one way. Again, this needs studying, and it would take more details to go on.

I stand by my earlier comment though; see Chris's answer, which is along the same lines.

>Unless you are
> suggesting denormalizing all the tables into one table that would have a
> lot of null fields. For example answered_date would be null if the call was
> never answered.
>

Not a big problem I should say, unless you deal with really huge data.

--
                    Salutations, Vincent Veyron

http://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance


Re: Approach to Data Summary and Analysis

From
Robert DiFalco
Date:
Actually that was exactly the initial table design. There were more fields, because for my use case there were a lot more states, and certain states have additional data (for example, when a call goes from answered to connected it also gets the user_id of the person being connected to). So that one table started getting a LOT of columns, which started making it hard to reason about.

The more normalized version has a couple of things going for it. COUNT, MIN, MAX, etc. are very fast because I don't have to conditionally add NULL checks. Everything is inserted, so for the millions of calls that get made the normalized schema was much more efficient for writing. It was also easier to understand. The answered table only has calls that were answered, the error table only has calls that resulted in an error after being connected, etc.

I know this kind of gets into a religious area when discussing NULLs and what level of normalization is appropriate so I don't want to spark any of that on this thread. But only doing inserts and never doing updates or deletes performed very well for large data sets.  

That said, I could explore a compromise between the monolithic table approach and the completely normalized set of tables approach. Thanks for your input!



Re: Approach to Data Summary and Analysis

From
Rob Sargent
Date:
On 04/15/2014 09:53 AM, Robert DiFalco wrote:
> Actually that was exactly the initial table design. There were more
> fields because for my use case there were a lot more states and
> certain states have additional data (for example when a call goes from
> answered to connected it also gets the user_id of the person being
> connected to). So that one table started getting a LOT of columns
> which starting making it hard to reason about.
>
I wonder if the "LOT of columns" are the bits that need to be parcelled
off as specific to one condition of a call?


Re: Approach to Data Summary and Analysis

From
Roxanne Reid-Bennett
Date:
On 4/14/2014 12:27 PM, Robert DiFalco wrote:
> And so on for calls_connected, calls_completed, call_errors, etc.
>
> Occasionally I will want to know things like "When was the last time a
> user answered a call" or "How many times has a user been called".
> ...
> Sometimes I might want to get this data for a whole bunch of users.
> ...
> So the other option is to create a call_summary table that is updated
> with triggers.
> ...
> My only issue with a summary table is that I don't want a bunch of
> null fields.
> ...
> But over time all fields would eventually become non-NULL.
>
> So that leads me to a summary table for EACH call state. This approach
> has the down side that it creates a lot of tables and triggers. It has
> the upside of being pretty efficient without having to deal with NULL
> values.  It's also pretty easy to reason about.
> ...
> So for my question -- is the choice between these a personal
> preference sort of thing or is there a right or wrong approach? Am I
> missing another approach that would be better?  I'm okay with SQL but
> I'm not expert so I'm not sure if there is an accepted DESIGN PATTERN
> for this that I am missing.
>
There is no right or wrong - there is better, worse, best, and worst for
any specific scenario.  In my experience, most people have time/money to
get to an 80% "better" design than all the other answers during design
and then it gets refined over time.  And yes, personal experience does
play a part in how people interpret better/worse [aka religion] ;)

I didn't see anybody ask these questions - and to identify "better" -
they have to be asked.
1. How much data are you feeding into your system how fast?
        this directly affects your choices on distribution, parallel
processing...  writes vs updates vs triggers for copying vs all reads
        [and if on bare metal - potentially where you place your logs,
indexes, core lookup tables, etc]
2. How much data are you reading out of your system - how fast?
        you have given "simple" use cases (how many calls completed
within a time frame or to a number)
        you have given very slightly more complex use cases (when was
the last time John answered a call)
        you have given a slightly more bulky processing question of (how
many times have these users been called)
So...
       a) How many users executing read queries do you have?
       b) What is the expected load for simple queries (per
week/day/hour/minute - depending upon your resolution on speed)
       c) What is the expected load for your mid-line complex queries
       d) What is the "maximum" volume you expect a bulk query to go
after (like all users in the last 18 years, or this city's users in the
last day?)  and how frequently will that kind of query be executed?  How
much tolerance for delay do your users have?
       e) do you have any known really complex queries that might bog
the system down?
       f) How much lag time can you afford between capture and reporting?

Answers to the above define your performance requirements - which
defines the style of schema you need.  Queries can be written to pull
data from any schema design - but how fast they can perform or how
easily they can be created...

Chris and Vincent both targeted a balance between writes and reads -
which adequately answers 80-85% of the usages out there.  But you didn't
give us any of the above - so their recommendation (while very likely
valid) may not actually fit your case at all.

As to design patterns -
"Generally" a database schema is more normalized for an operational
system because normalization results in fewer writes/updates and lowers
the risk of corruption if a failure takes place. It also isolates
updates for any specific value to one location minimizing internally
caused data corruption.
Reporting systems are generally less normalized because writes are more
one-time and reads are where the load occurs.
Sometimes you have to use data replication to have a system that
appropriately supports both.

you have shown you are already aware of normalization.
If you weren't aware of approaches to Data Warehousing... you can review
information about how it is accomplished
- see the blogs on kimballgroup DOT com  they cover a lot of high(er)
level concepts with enough specificity to be of some direct use.
[that website also covers some ideas for "Big Data" which aren't
necessarily limited to RDBMS']

Specify your performance requirements, then figure out your schema design.

FWIW I don't understand your (or any other person's) hesitancy for "lots
of" "NULL" values.  They provide meaning in a number of different
ways... not the least of which is that you don't know (yet) - which is
knowledge in and of itself.

Roxanne



Re: Approach to Data Summary and Analysis

From
Robert DiFalco
Date:
1. >500K rows per day into the calls table.
2. Very rarely. The only common query is gathering users that have not been called "today" (along with some other qualifying criteria). More analytical queries/reports are done for internal use and it is not essential that they be lickity-split.  
a. Usually just one connection at a time executes read queries.
b. the users not called today query will be done once a day.
c. Daily
d. All users for the last year (if you are asking about retention). We will also rarely have to run for all time.
e. Not that I'm aware of (or seen) today.
f. For the simple queries we cannot afford latency between calls and querying who was already called.
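
For reference, the daily query is essentially this (a sketch; it assumes an index on calls (user_id, called)):

-- users with no call yet today
SELECT u.id
FROM my_users u
WHERE NOT EXISTS (
  SELECT 1
  FROM calls c
  WHERE c.user_id = u.id
    AND c.called >= date_trunc('day', now())
);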

While I don't seem to be getting much support for it here :D, my write performance (which is most essential) has been much better since I further normalized the tables and made it so that NULL is never used and data is never updated (i.e. it is immutable once written).

As for wanting to avoid NULLs, I don't really know what to say. Obviously sometimes NULLs are required. For this design I don't really need them, and they make the data harder to reason about (because they are kind of open to interpretation). They can also give you different results than you sometimes expect (for example, when looking for a non-matching key you start having to inject some OR IS NULLs and such). Also, the absence of NULL can make a lot of queries more optimal. That said, I understand where you all are coming from with de-normalization. It's definitely the path of least resistance. Our instinct is to want to see all related data in a single table when possible.

The summary table was really a separate point from whether or not people liked my schema -- whether I de-normalize as people are suggesting or not, there would still be the question of having a summary table for MAX and COUNT queries or not having one. I probably made the original question too open ended.
 



Re: Approach to Data Summary and Analysis

From
Gavin Flower
Date:
Have you considered partial indexes, using the WHERE predicate option of CREATE INDEX?

This can be useful if the rows you look for are only a small subset of the keys. For example, a partial index on sex would be useful for nurses, only indexing those that are male, as they are in a very small minority.
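
For instance, against the flattened calls table Chris sketched, it might look like this (names assumed):

-- index only the answered rows, so "last answered" lookups stay small
CREATE INDEX calls_answered_by_user
ON calls (user_id, answered DESC)
WHERE answered IS NOT NULL;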


Cheers,
Gavin

Re: Approach to Data Summary and Analysis

From
Roxanne Reid-Bennett
Date:
On 4/15/2014 9:10 PM, Robert DiFalco wrote:
> 1. >500K rows per day into the calls table.
> 2. Very rarely. The only common query is gathering users that have not
> been called "today" (along with some other qualifying criteria). More
> analytical queries/reports are done for internal use and it is not
> essential that they be lickity-split.
> a. Usually just one connection at a time executes read queries.
> b. the users not called today query will be done once a day.
> c. Daily
> d. All users for the last year (if you are asking about retention). We
> will also rarely have to run for all time.
> e. Not that I'm aware of (or seen) today.
> f. For the simple queries we cannot afford latency between calls and
> querying who was already called.
>
> While I don't seem to be getting much support for it here :D my write
> performance (which is most essential) has been much better since I
> further normalized the tables and made it so that NULL is never used
> and data is never updated (i.e. it is immutable once it is written).

Based on the above you are primarily capturing data and feeding back
essentially one easy to find result set [who has NOT been successfully
called] on an ongoing single threaded basis [once per day?].  So you are
absolutely correct  - tune for writing speed.

> The summary table was really a separate point from whether or not
> people liked my schema or not -- I mean whether I de-normalize as
> people are asking or not, there would still be the question of a
> summary table for MAX and COUNT queries or to not have a summary table
> for those. I probably made the original question too open ended.
>
Do you know your answer?
you said : "Occasionally I will want to know things like "
you answered to frequency on queries as "the users not called today
query will be done once a day." as was c) [I'm assuming once?]
and d) appears to be "ad-hoc" and you said your users can deal with
latency in response for those.

So finding Min/Max/Count quickly really *don't* matter for tuning.

So the only reason I can see to add a summary table is to  ... simplify
maintenance [note I did NOT say "development"] and then only IF it
doesn't impact the write speeds beyond an acceptable level.  Proper
internal / external documentation can mitigate maintenance nightmares.
If your developer(s) can't figure out how to get the data they need from
the schema - then give them the queries to run. [you are likely better
at tuning those anyway]

Last consideration - business consumption of data does change over
time.  Disk space is cheap [but getting and keeping speed sometimes
isn't].  You might consider including ongoing partial archival of the
operational data during slow usage (write) periods.

Roxanne


Re: Approach to Data Summary and Analysis

From
Robert DiFalco
Date:
Thanks Roxanne. I suppose when it comes down to it -- for the current use cases and data size -- my only concern is the "calling" query that needs to determine whether a user has already had a call today. For a large user set, for each user I would either have to MAX on the answered timestamp and compare it against today, or do an EXISTS query to see if any timestamp for that user is greater than or equal to "today".

But I suppose I just need to construct a huge dataset and see. I was thinking that by keeping a summary, so I always knew the last answer or call time for each user, I could mitigate this becoming an issue. Over time a single user could have answered a call thousands of times, so that would make a ">=" timestamp query touch just # of users rows instead of # of users x 1000 (or however many calls they have answered over the non-archived time period).
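
The EXISTS form would be something like this, I suppose (it can stop at the first qualifying row, given an index on the answered timestamps):

-- has this user already answered a call today?
SELECT EXISTS (
  SELECT 1
  FROM calls_answered a
  JOIN calls c ON c.id = a.id
  WHERE c.user_id = ?
    AND a.answered >= date_trunc('day', now())
);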








Re: Approach to Data Summary and Analysis

From
Roxanne Reid-Bennett
Date:
On 4/16/2014 2:40 PM, Robert DiFalco wrote:
> Thanks Roxanne, I suppose when it comes down to it -- for the current
> use cases and data size -- my only concern is the "calling" query that
> will need to use max to determine if a user has already had a call
> today. For a large user set, for each user I would either have to MAX
> on the answered timestamp to compare it against today or do an exist
> query to see if any timestamp for that user is greater or equal than
> "today".

I didn't go back to look at your original schema - but... if your 500K records are coming in time ordered, you may be able to track "max" as an attribute on an "SCD" based on the caller/callee table [or the caller/ee table itself, if that table is only used by your app] with an update from a post-insert trigger on the appropriate table. Even if they aren't time ordered, you only add the overhead of a single comparison in the trigger. Downside is that you fire a trigger and an update for every insert. [or just an update, depending on what is driving your load of the 500K records]

Again - the proof of "value" for this overhead is a comparison of the cost of the updates vs the cost of the query to find max(). I suspect your once-a-day query can afford all sorts of other optimizations that are "better" than a trigger fired on every insert [such as the partial index that was already mentioned]. I really suspect you just don't have enough load on the query side (complex queries * # of users) to justify the extra load on the write side (+1 trigger, +1 update per insert) to avoid a (potentially) heavy query load 1x/day.

Another option... if you are only worried about "today", then keep only "today's" data in your query table, and migrate historical data nightly to a pseudo-archive table for those "every once in a while" questions. I haven't played with table inheritance in Postgres - but that's a capability I might look at if I were doing a pseudo-archive table.
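
A minimal sketch of that nightly migration, assuming a calls_archive table with the same columns as calls:

-- move everything older than today into the archive in one statement
WITH moved AS (
  DELETE FROM calls
  WHERE called < date_trunc('day', now())
  RETURNING *
)
INSERT INTO calls_archive SELECT * FROM moved;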

Roxanne