Thread: Approach to Data Summary and Analysis
I have several related tables that represent a call state. Let's think of these as phone calls to simplify things. Sometimes I need to determine the last time a user was called, the last time a user answered a call, or the last time a user completed a call.
The basic schema is something like this:
CREATE TABLE calls (
    id      BIGINT NOT NULL,  -- sequence generator
    user_id BIGINT NOT NULL,
    called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE TABLE calls_answered (
    id       BIGINT NOT NULL,
    answered TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
);
And so on for calls_connected, calls_completed, call_errors, etc.
Occasionally I will want to know things like "When was the last time a user answered a call" or "How many times has a user been called".
I can do these queries using a combination of MAX or COUNT. But I'm concerned about the performance.
SELECT MAX(a.id)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user_id = ?;
Or the number of answered calls:
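Both lookups (the MAX above, and the COUNT that got cut off in the message) share the same join shape. A runnable sketch, with SQLite standing in for Postgres (so BIGINT/TIMESTAMPTZ become INTEGER/TEXT; the data and the COUNT query are my reconstruction, not from the thread):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE calls (
  id      INTEGER PRIMARY KEY,   -- BIGINT + sequence in Postgres
  user_id INTEGER NOT NULL,
  called  TEXT NOT NULL          -- TIMESTAMPTZ in Postgres
);
CREATE TABLE calls_answered (
  id       INTEGER PRIMARY KEY REFERENCES calls(id),
  answered TEXT NOT NULL
);
""")

# Three calls to user 1; only two were answered.
db.executemany("INSERT INTO calls VALUES (?, ?, ?)",
               [(1, 1, "2014-04-14 09:00"),
                (2, 1, "2014-04-14 10:00"),
                (3, 1, "2014-04-14 11:00")])
db.executemany("INSERT INTO calls_answered VALUES (?, ?)",
               [(1, "2014-04-14 09:01"), (3, "2014-04-14 11:02")])

# "When was the last time user 1 answered a call?"
last = db.execute("""
    SELECT MAX(a.answered)
    FROM calls_answered a JOIN calls c ON c.id = a.id
    WHERE c.user_id = ?""", (1,)).fetchone()[0]

# "How many answered calls?" -- presumably the elided query
count = db.execute("""
    SELECT COUNT(*)
    FROM calls_answered a JOIN calls c ON c.id = a.id
    WHERE c.user_id = ?""", (1,)).fetchone()[0]

print(last, count)   # 2014-04-14 11:02  2
```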
Sometimes I might want to get this data for a whole bunch of users. For example, "give me all users who have not answered a call in the last 5 days." Or even "what percentage of users called actually answered a call." This approach could become a performance issue. So the other option is to create a call_summary table that is updated with triggers.
The summary table would need fields like "user_id", "last_call_id", "call_count", "last_answered_id", "answered_count", "last_completed_id", "last_completed_count", etc.
My only issue with a summary table is that I don't want a bunch of NULL fields. For example, if the user was called but has never answered a call, then the last_call_id and call_count fields on the summary table would be non-NULL, but the last_answered_id and answered_count fields would be NULL. Over time, though, all fields would eventually become non-NULL.
So that leads me to a summary table for EACH call state. Each summary table would have a user id, a ref_id, and a count -- one summary table for each state e.g. call_summary, call_answered_summary, etc.
This approach has the downside that it creates a lot of tables and triggers. It has the upside of being pretty efficient without having to deal with NULL values. It's also pretty easy to reason about.
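The per-state summary idea can be sketched concretely: one summary table per event table, kept current by an insert trigger. SQLite syntax below (Postgres would use a trigger function instead), and user_id is denormalized onto the event table here purely to keep the trigger short, which differs from the schema above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE calls_answered (
  id       INTEGER PRIMARY KEY,
  user_id  INTEGER NOT NULL,     -- denormalized for this sketch
  answered TEXT NOT NULL
);
-- one row per user: last answered event plus a running count, never NULL
CREATE TABLE call_answered_summary (
  user_id        INTEGER PRIMARY KEY,
  ref_id         INTEGER NOT NULL,
  answered_count INTEGER NOT NULL
);
CREATE TRIGGER trg_answered AFTER INSERT ON calls_answered
BEGIN
  -- create the row on first sight of the user, then bump it
  INSERT OR IGNORE INTO call_answered_summary VALUES (NEW.user_id, NEW.id, 0);
  UPDATE call_answered_summary
     SET ref_id = NEW.id, answered_count = answered_count + 1
   WHERE user_id = NEW.user_id;
END;
""")

for i in (1, 2, 3):
    db.execute("INSERT INTO calls_answered VALUES (?, 7, ?)",
               (i, f"2014-04-14 0{i}:00"))

row = db.execute(
    "SELECT ref_id, answered_count FROM call_answered_summary WHERE user_id = 7"
).fetchone()
print(row)   # (3, 3)
```

The summary row only exists once the user has at least one event of that state, which is exactly how this design avoids NULL fields.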
So for my question -- is the choice between these a personal preference sort of thing, or is there a right or wrong approach? Am I missing another approach that would be better? I'm okay with SQL but I'm not an expert, so I'm not sure if there is an accepted DESIGN PATTERN for this that I am missing.
Thanks!
On Mon, 14 Apr 2014 09:27:29 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

> I have several related tables that represent a call state.
>
> And so on for calls_connected, calls_completed, call_errors, etc.
>
> So for my question -- is the choice between these a personal preference
> sort of thing or is there a right or wrong approach? Am I missing another
> approach that would be better?

Hi Robert,

I guess a call state is subject to change, in which case you would have to shuffle records between tables when that happens?

ISTM you should consider using only a 'calls' table, and add an 'id_call_state' field to it that references the list of possible states. This would make your queries simpler.

create table call_state(
id_call_state text PRIMARY KEY,
libelle text);

INSERT INTO call_state (id_call_state, libelle) VALUES ('calls_connected', 'Connected'), ('calls_completed', 'Completed'), ('call_errors', 'Error');

CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator
id_call_state TEXT NOT NULL REFERENCES call_state,
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

--
Salutations, Vincent Veyron
http://marica.fr
Legal case, contract and insurance claim management software
But then I lose a bunch of data, like the TIMESTAMPTZ of the call, answer, connection, etc. Btw, currently these tables never need to be UPDATEd; they are immutable in the current design. And in the end I'm not sure how the proposal of one table with an updatable state changes the basic thrust of the question -- for example, getting last call, last answered, total called, total answered. If the state of a call transitions from called to answered, then making it a field loses all the data about the previous state. Make sense?
On Mon, Apr 14, 2014 at 2:43 PM, Vincent Veyron <vv.lists@wanadoo.fr> wrote:
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 04/14/2014 04:22 PM, Robert DiFalco wrote:
> But then I lose a bunch of data like the TIMESTAMPTZ of the call, answer, connection, etc.

I wonder if you really need to place the parts of the call into the various tables. ringtime, answertime, closetime and all the values associated with those parts of a call are all fundamental to a single call, though perhaps collected incrementally. Easy queries, for sure. (Sorry, I haven't gone back to see your orig. schema. If it's clear there why these are in separate tables, say no more.)
Things like this: AVG ring time before answer, average connected call duration, % of calls never answered, % of answered calls that get connected, number of times John has answered a call versus how many times we've called him. That sort of stuff.
On Mon, Apr 14, 2014 at 3:34 PM, Rob Sargent <robjsargent@gmail.com> wrote:
On Mon, 14 Apr 2014 15:22:13 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

Hi Robert,

> But then I lose a bunch of data like the TIMESTAMPTZ of the call, answer,
> connection, etc. Btw, currently these tables never need to be UPDATEd. They
> are immutable in the current design.

Yes, but you wrote:

> Occasionally I will want to know things like "When was the last time a user
> answered a call" or "How many times has a user been called".

With your schema, you would have to query several tables each time. Queries will get messy in a hurry, and updates and DDL changes too.

> And in the end I'm not sure how the
> proposal of one table and a state that is updatable changes the basic
> thrust of the question. For example, getting last call, last answered,
> total called, total answered. If the state of a call transitions from
> called to answered then making it a field loses all the data with the
> previous state, make sense?

If you need this, you can use a schema that accommodates it. (Off the top of my head: insert a new record instead of altering an existing one, and create a hierarchy with parent records that point to an original call, maybe with a 'parent' field and recursive queries. You probably have many requirements that I'm not aware of, but this one can be met.)

--
Salutations, Vincent Veyron
http://marica.fr
Legal case, contract and insurance claim management software
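Vincent's parenthetical idea -- insert a new record per state change, with each record pointing back at its parent, then walk the chain with a recursive query -- might look roughly like the sketch below. The 'parent' column is from his message; the state names and the rest are my guesses, and SQLite stands in for Postgres:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE calls (
  id     INTEGER PRIMARY KEY,
  parent INTEGER REFERENCES calls(id),  -- NULL for the original call event
  state  TEXT NOT NULL,                 -- 'called', 'answered', ...
  at     TEXT NOT NULL
);
""")
# one call that was placed, then answered: two immutable rows
db.execute("INSERT INTO calls VALUES (1, NULL, 'called',   '09:00')")
db.execute("INSERT INTO calls VALUES (2, 1,    'answered', '09:01')")

# walk from the latest event back to the original call
chain = db.execute("""
WITH RECURSIVE history(id, parent, state, at) AS (
  SELECT id, parent, state, at FROM calls WHERE id = 2
  UNION ALL
  SELECT c.id, c.parent, c.state, c.at
  FROM calls c JOIN history h ON c.id = h.parent
)
SELECT state FROM history
""").fetchall()
print([s for (s,) in chain])   # ['answered', 'called']
```

Nothing is ever updated, so no state's timestamp is lost when a call transitions -- which is the property Robert was worried about losing.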
I'm sorry Vincent I'm not exactly sure what you are proposing. Are you proposing that I add another table in addition to what I already have that all the other tables JOIN to and add a state field in that parent table? How is that different than what I have except now I have a new table with an updatable state field? Maybe you can show a query or two to more specifically show what you are suggesting?
Right now this seems like a simple way to get the last time John answered a call:

-- last answered call for John
SELECT MAX(a.answered)
FROM calls_answered a JOIN calls c ON c.id = a.id
WHERE c.user = John;
If I don't have a calls_answered table I'm not sure how I would get this information if I had a single table with a mutable state. Unless you are suggesting denormalizing all the tables into one table that would have a lot of null fields. For example answered_date would be null if the call was never answered.
On Tue, Apr 15, 2014 at 5:37 AM, Vincent Veyron <vv.lists@wanadoo.fr> wrote:
On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco <robert.difalco@gmail.com> wrote:
CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered TIMESTAMPTZ,
PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);
I asked the Internet how to train my cat, and the Internet told me to get a dog.
On Tue, Apr 15, 2014 at 10:56 AM, Chris Curvey <chris@chriscurvey.com> wrote:
(Sorry, fat-fingered and hit "send too early"...)
CREATE TABLE calls (
id BIGINT NOT NULL, -- sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered TIMESTAMPTZ NULL,
connected TIMESTAMPTZ NULL,
completed TIMESTAMPTZ NULL,
PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);
Then your queries end up looking like this:
-- last time john answered
select max(answered)
from calls
where answered is not null
and user_id = ?
-- users that have not answered a call in the last five days (I can think of a few ways to interpret that phrase)
select myusers.*
from myusers
where not exists
( select *
from calls
where calls.user_id = myusers.user_id
and answered >= <five days ago>)
-- average ring time
select avg(extract(epoch from answered - called))
from calls
where answered is not null
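Chris's single-table queries can be run end to end. SQLite stands in for Postgres below, so the interval arithmetic becomes `strftime('%s', ...)` seconds (an adaptation of mine, not what Chris wrote); the sample rows are also invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE calls (
  id       INTEGER PRIMARY KEY,
  user_id  INTEGER NOT NULL,
  called   TEXT NOT NULL,
  answered TEXT      -- NULL when never answered
);
""")
db.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", [
    (1, 1, "2014-04-14 09:00:00", "2014-04-14 09:00:10"),
    (2, 1, "2014-04-14 10:00:00", None),                  # rang out
    (3, 2, "2014-04-14 11:00:00", "2014-04-14 11:00:30"),
])

# last time user 1 answered
last = db.execute("""
    SELECT MAX(answered) FROM calls
    WHERE answered IS NOT NULL AND user_id = ?""", (1,)).fetchone()[0]

# average ring time in seconds, over answered calls only
avg_ring = db.execute("""
    SELECT AVG(strftime('%s', answered) - strftime('%s', called))
    FROM calls WHERE answered IS NOT NULL""").fetchone()[0]

print(last, avg_ring)   # 2014-04-14 09:00:10  20.0
```

The `answered IS NOT NULL` predicate is doing the work that the separate calls_answered table did in Robert's schema.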
On Tue, 15 Apr 2014 07:21:58 -0700
Robert DiFalco <robert.difalco@gmail.com> wrote:

> I'm sorry Vincent I'm not exactly sure what you are proposing. Are you
> proposing that I add another table in addition to what I already have that
> all the other tables JOIN to and add a state field in that parent table?

No: keep table 'calls' with an additional 'status' field, and drop the other tables.

> How is that different than what I have except now I have a new table with
> an updatable state field? Maybe you can show a query or two to more
> specifically show what you are suggesting?
>
> Right now this seems like a simple way to get the last time John was called:
>
> -- last answered call for John
> SELECT MAX(a.answered)
> FROM calls_answered a JOIN calls c ON c.id = a.id
> WHERE c.user = John;

I probably misunderstood something from your first post. Not sure what you call calls_completed, call_errors for instance, but I had the impression your records would go into a different table according to their status. Do calls_answered move to calls_completed at some point? In that case, how do you know that you should query calls_answered instead of calls_completed? Or, if John did not answer, do you query call_errors?

> If I don't have a calls_answered table I'm not sure how I would get this
> information if I had a single table with a mutable state.

Checking an appropriate boolean field (call_answered boolean not null default false) would be a way. Again, this needs studying, and it would take more details to go on. I stand by my earlier comment though; see Chris's answer, which is on the same line.

> Unless you are
> suggesting denormalizing all the tables into one table that would have a
> lot of null fields. For example answered_date would be null if the call was
> never answered.

Not a big problem I should say, unless you deal with really huge data.

--
Salutations, Vincent Veyron
http://marica.fr
Legal case, contract and insurance claim management software
Actually that was exactly the initial table design. There were more fields because for my use case there were a lot more states and certain states have additional data (for example when a call goes from answered to connected it also gets the user_id of the person being connected to). So that one table started getting a LOT of columns which starting making it hard to reason about.
The more normalized version has a couple of things going for it. COUNT, MIN, MAX, etc. are very fast because I don't have to conditionally add NULL checks. Everything is inserted, so for the millions of calls that get made the normalized schema was much more efficient for writing. It was also easier to understand. The answer table only has calls that were answered, the error table only has calls that resulted in an error after being connected, etc.
I know this kind of gets into a religious area when discussing NULLs and what level of normalization is appropriate so I don't want to spark any of that on this thread. But only doing inserts and never doing updates or deletes performed very well for large data sets.
That said, I could explore a compromise between the monolithic table approach and the completely normalized set of tables approach. Thanks for your input!
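One shape that compromise could take (my illustration, not something proposed verbatim in the thread): keep the insert-only property, but collapse the per-state tables into a single append-only events table keyed by (call_id, state). One table, one trigger target, no NULL columns, no UPDATEs. A sketch with SQLite standing in for Postgres:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE calls (
  id      INTEGER PRIMARY KEY,
  user_id INTEGER NOT NULL
);
-- append-only: one row per state transition, no NULL columns
CREATE TABLE call_events (
  call_id INTEGER NOT NULL REFERENCES calls(id),
  state   TEXT NOT NULL,      -- 'called' | 'answered' | 'completed' | ...
  at      TEXT NOT NULL,
  PRIMARY KEY (call_id, state)
);
""")
db.execute("INSERT INTO calls VALUES (1, 42)")
db.executemany("INSERT INTO call_events VALUES (1, ?, ?)",
               [("called", "09:00"), ("answered", "09:01")])

# "last time user 42 answered" without a calls_answered table
last = db.execute("""
    SELECT MAX(e.at)
    FROM call_events e JOIN calls c ON c.id = e.call_id
    WHERE e.state = 'answered' AND c.user_id = ?""", (42,)).fetchone()[0]
print(last)   # 09:01
```

States that carry extra data (like the connected-to user_id Robert mentions) would still need their own columns or side tables, so this is only a partial answer to the wide-table problem.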
On Tue, Apr 15, 2014 at 8:12 AM, Chris Curvey <chris@chriscurvey.com> wrote:
On 04/15/2014 09:53 AM, Robert DiFalco wrote:
> Actually that was exactly the initial table design. There were more
> fields because for my use case there were a lot more states and
> certain states have additional data. So that one table started getting
> a LOT of columns which started making it hard to reason about.

I wonder if the "LOT of columns" are the bits that need to be parcelled off as specific to one condition of a call?
On 4/14/2014 12:27 PM, Robert DiFalco wrote:
> And so on for calls_connected, calls_completed, call_errors, etc.
>
> Occasionally I will want to know things like "When was the last time a
> user answered a call" or "How many times has a user been called".
> ...
> Sometimes I might want to get this data for a whole bunch of users.
> ...
> So the other option is to create a call_summary table that is updated
> with triggers.
> ...
> My only issue with a summary table is that I don't want a bunch of
> null fields.
> ...
> But over time all fields would eventually become non-NULL.
>
> So that leads me to a summary table for EACH call state. This approach
> has the downside that it creates a lot of tables and triggers. It has
> the upside of being pretty efficient without having to deal with NULL
> values. It's also pretty easy to reason about.
> ...
> So for my question -- is the choice between these a personal
> preference sort of thing or is there a right or wrong approach? Am I
> missing another approach that would be better? I'm okay with SQL but
> I'm not an expert so I'm not sure if there is an accepted DESIGN
> PATTERN for this that I am missing.

There is no right or wrong - there is better, worse, best, and worst for any specific scenario. In my experience, most people have time/money to get to an 80% "better" design than all the other answers during design, and then it gets refined over time. And yes, personal experience does play a part in how people interpret better/worse [aka religion] ;)

I didn't see anybody ask these questions - and to identify "better", they have to be asked.

1. How much data are you feeding into your system, and how fast?
   This directly affects your choices on distribution, parallel processing... writes vs. updates vs. triggers for copying vs. all reads [and if on bare metal - potentially where you place your logs, indexes, core lookup tables, etc.]

2. How much data are you reading out of your system, and how fast?
You have given "simple" use cases (how many calls completed within a time frame or to a number), very slightly more complex use cases (when was the last time John answered a call), and a slightly more bulky processing question (how many times have these users been called).

So...

a) How many users executing read queries do you have?
b) What is the expected load for simple queries (per week/day/hour/minute - depending upon your resolution on speed)?
c) What is the expected load for your mid-line complex queries?
d) What is the "maximum" volume you expect a bulk query to go after (like all users in the last 18 years, or this city's users in the last day?), how frequently will that kind of query be executed, and how much tolerance for delay do your users have?
e) Do you have any known really complex queries that might bog the system down?
f) How much lag time can you afford between capture and reporting?

Answers to the above define your performance requirements - which define the style of schema you need. Queries can be written to pull data from any schema design - but how fast they perform, and how easily they can be created, varies.

Chris and Vincent both targeted a balance between writes and reads - which adequately answers 80-85% of the usages out there. But you didn't give us any of the above - so their recommendation (while very likely valid) may not actually fit your case at all.

As to design patterns - "generally" a database schema is more normalized for an operational system, because normalization results in fewer writes/updates and lowers the risk of corruption if a failure takes place. It also isolates updates for any specific value to one location, minimizing internally caused data corruption. Reporting systems are generally less normalized, because writes are more one-time and reads are where the load occurs. Sometimes you have to use data replication to have a system that appropriately supports both.
You have shown you are already aware of normalization. If you weren't aware of approaches to Data Warehousing, you can review information about how it is accomplished - see the blogs on kimballgroup DOT com; they cover a lot of high(er)-level concepts with enough specificity to be of some direct use. [That website also covers some ideas for "Big Data" which aren't necessarily limited to RDBMS's.]

Specify your performance requirements, then figure out your schema design.

FWIW, I don't understand your (or any other person's) hesitancy about "lots of" NULL values. They provide meaning in a number of different ways... not the least of which is that you don't know (yet) - which is knowledge in and of itself.

Roxanne
1. >500K rows per day into the calls table.
2. Very rarely. The only common query is gathering users that have not been called "today" (along with some other qualifying criteria). More analytical queries/reports are done for internal use and it is not essential that they be lickity-split.
a. Usually just one connection at a time executes read queries.
b. The "users not called today" query will be done once a day.
c. Daily
d. All users for the last year (if you are asking about retention). We will also rarely have to run for all time.
e. Not that I'm aware of (or seen) today.
f. For the simple queries we cannot afford latency between calls and querying who was already called.
While I don't seem to be getting much support for it here :D my write performance (which is most essential) has been much better since I further normalized the tables and made it so that NULL is never used and data is never updated (i.e. it is immutable once it is written).
As for wanting to avoid NULLs, I don't really know what to say. Obviously sometimes NULLs are required. For this design I don't really need them, and they make the data harder to reason about (because they are kind of open to interpretation). They can also give you different results than you sometimes expect (for example, when looking for a non-matching key you start having to inject some OR IS NULLs and such). Also, the absence of NULL can make a lot of queries more optimal. That said, I understand where you all are coming from with de-normalization. It's definitely the path of least resistance. Our instinct is to want to see all related data in a single table when possible.
The summary table was really a separate point from whether or not people liked my schema -- whether I de-normalize as people are suggesting or not, there would still be the question of whether to have a summary table for the MAX and COUNT queries. I probably made the original question too open-ended.
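To make that trade-off concrete, a per-state summary maintained by a trigger might look like the sketch below. This is only an illustration of the approach, not the poster's actual code: the table, function, and trigger names are made up, and the upsert syntax assumes PostgreSQL 9.5+ (INSERT ... ON CONFLICT). Since calls_answered carries only the call id, the trigger joins back to calls for the user_id.

```sql
-- Hypothetical per-state summary table (one of several, e.g. call_answered_summary)
CREATE TABLE call_answered_summary (
    user_id          BIGINT NOT NULL,
    last_answered_id BIGINT NOT NULL,
    answered_count   BIGINT NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id),
    FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE FUNCTION bump_answered_summary() RETURNS trigger AS $$
BEGIN
    -- First answer for this user inserts a row (count defaults to 1);
    -- subsequent answers update the last id and increment the count.
    INSERT INTO call_answered_summary (user_id, last_answered_id)
    SELECT c.user_id, NEW.id
    FROM calls c
    WHERE c.id = NEW.id
    ON CONFLICT (user_id) DO UPDATE
        SET last_answered_id = EXCLUDED.last_answered_id,
            answered_count   = call_answered_summary.answered_count + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER calls_answered_summary_trg
    AFTER INSERT ON calls_answered
    FOR EACH ROW EXECUTE PROCEDURE bump_answered_summary();
```

With one such table per state there are no NULL fields, at the cost of one extra upsert per state transition.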
On Tue, Apr 15, 2014 at 3:26 PM, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 16/04/14 13:10, Robert DiFalco wrote:
Have you considered partial indexes, using the WHERE predicate option of CREATE INDEX?
This can be useful if you often look for things that are only a small subset of keys. For example, a partial index on sex would be useful for nurses, indexing only those that are male, as they are in a very small minority.
Cheers,
Gavin
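Gavin's suggestion, applied to the single wide calls table proposed earlier in the thread, might look like the sketch below (the index name is made up):

```sql
-- Index only the rows where answered was ever set; the many
-- never-answered rows add nothing to the index.
CREATE INDEX calls_user_answered_idx
    ON calls (user_id, answered DESC)
    WHERE answered IS NOT NULL;

-- "Last time this user answered" can then be satisfied from the index:
SELECT MAX(answered)
FROM calls
WHERE user_id = ?
  AND answered IS NOT NULL;
```

Note that the query's `answered IS NOT NULL` predicate must imply the index predicate, or the planner will not consider the partial index.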
On 4/15/2014 9:10 PM, Robert DiFalco wrote:
> 1. >500K rows per day into the calls table.
> 2. Very rarely. The only common query is gathering users that have not
> been called "today" (along with some other qualifying criteria). More
> analytical queries/reports are done for internal use and it is not
> essential that they be lickity-split.
> a. Usually just one connection at a time executes read queries.
> b. The users-not-called-today query will be done once a day.
> c. Daily
> d. All users for the last year (if you are asking about retention). We
> will also rarely have to run for all time.
> e. Not that I'm aware of (or seen) today.
> f. For the simple queries we cannot afford latency between calls and
> querying who was already called.
>
> While I don't seem to be getting much support for it here :D my write
> performance (which is most essential) has been much better since I
> further normalized the tables and made it so that NULL is never used
> and data is never updated (i.e. it is immutable once it is written).

Based on the above, you are primarily capturing data and feeding back essentially one easy-to-find result set [who has NOT been successfully called] on an ongoing single-threaded basis [once per day?]. So you are absolutely correct - tune for writing speed.

> The summary table was really a separate point from whether or not
> people liked my schema -- whether I de-normalize as people are
> suggesting or not, there would still be the question of a summary
> table for MAX and COUNT queries or not. I probably made the original
> question too open ended.

Do you know your answer? You said "Occasionally I will want to know things like...". You answered on query frequency that "the users not called today query will be done once a day," as was c) [I'm assuming once?], and d) appears to be "ad-hoc" - and you said your users can deal with latency in response for those.
So finding MIN/MAX/COUNT quickly really *doesn't* matter for tuning. So the only reason I can see to add a summary table is to... simplify maintenance [note I did NOT say "development"] - and then only IF it doesn't impact the write speeds beyond an acceptable level. Proper internal/external documentation can mitigate maintenance nightmares. If your developer(s) can't figure out how to get the data they need from the schema, then give them the queries to run. [You are likely better at tuning those anyway.]

Last consideration - business consumption of data does change over time. Disk space is cheap [but getting and keeping speed sometimes isn't]. You might consider including ongoing partial archival of the operational data during slow usage (write) periods.

Roxanne
Thanks Roxanne. I suppose when it comes down to it -- for the current use cases and data size -- my only concern is the "calling" query that will need to use MAX to determine whether a user has already had a call today. For a large user set, for each user I would either have to MAX on the answered timestamp and compare it against today, or do an exists query to see if any timestamp for that user is greater than or equal to "today".
But I suppose I just need to construct a huge dataset and see. I was thinking that by keeping a summary, so I always knew the last answer or call time for each user, I could mitigate this becoming an issue. Over time a single user could have answered a call thousands of times, so the summary would make the timestamp-comparison query scan just # of users instead of # of users X 1000 (or however many calls they have answered over the non-archived time period).
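For what it's worth, the specific "already called today" check doesn't need MAX at all; a NOT EXISTS anti-join with a range predicate lets the planner stop at the first qualifying row per user, so with an index on (user_id, called) the cost tracks the number of users rather than the depth of each user's call history. A sketch against the calls table from the first post:

```sql
-- Users with no call row stamped today; roughly one index probe per
-- user, regardless of how many historical calls each user has.
SELECT u.*
FROM my_users u
WHERE NOT EXISTS (
    SELECT 1
    FROM calls c
    WHERE c.user_id = u.id
      AND c.called >= date_trunc('day', now())  -- midnight today, server time zone
);
```

Whether this beats a summary table is exactly the write-cost vs. read-cost comparison discussed in this thread, but it is worth measuring before adding triggers.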
On Wed, Apr 16, 2014 at 8:42 AM, Roxanne Reid-Bennett <rox@tara-lu.com> wrote:
On 4/16/2014 2:40 PM, Robert DiFalco wrote:
> Thanks Roxanne, I suppose when it comes down to it -- for the current
> use cases and data size -- my only concern is the "calling" query that
> will need to use max to determine if a user has already had a call
> today. For a large user set, for each user I would either have to MAX
> on the answered timestamp to compare it against today or do an exists
> query to see if any timestamp for that user is greater than or equal
> to "today".

I didn't go back to look at your original schema, but... if your 500K records are coming in time-ordered, you may be able to track "max" as an attribute on an "SCD" based on the caller/callee table [or the caller/ee table itself, if that table is only used by your app], with an update from a post-insert trigger on the appropriate table. Even if they aren't time-ordered, you only add the overhead of a single comparative in the trigger. The downside is that you fire a trigger and an update for every insert [or just an update, depending on what is driving your load of the 500K records]. Again, the proof of "value" of this overhead is a comparison of the cost of the updates vs. the cost of the query to find max().

I suspect your once-a-day query can afford all sorts of other optimizations that are "better" than a trigger fired on every insert [such as the partial index that was already mentioned]. I really suspect you just don't have enough load on the query side (complex queries * # of users) to justify the extra load on the write side (+1 trigger, +1 update per insert) to avoid a (potentially) heavy query load 1x/day.

Another option... if you are only worried about "today", then keep only "today's" data in your query table and migrate historical data nightly to a pseudo-archive table for those "every once in a while" questions. I haven't played with table inheritance in Postgres, but that's a capability I might look at if I were doing a pseudo-archive table.

Roxanne
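Roxanne's post-insert-trigger idea might be sketched as below. The column, function, and trigger names are hypothetical; the "single comparative" she mentions is the guard on last_called, which also makes the trigger safe when rows arrive out of time order.

```sql
-- Hypothetical: carry the "max" on the user row itself.
ALTER TABLE my_users ADD COLUMN last_called TIMESTAMPTZ;

CREATE FUNCTION track_last_called() RETURNS trigger AS $$
BEGIN
    -- Single comparative: only ever move last_called forward.
    UPDATE my_users
       SET last_called = NEW.called
     WHERE id = NEW.user_id
       AND (last_called IS NULL OR last_called < NEW.called);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER calls_last_called_trg
    AFTER INSERT ON calls
    FOR EACH ROW EXECUTE PROCEDURE track_last_called();
```

As she notes, this trades one extra update per insert for a constant-time read, so it only pays off if the read side is hot enough to justify the write overhead.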