Re: [GENERAL] Approach to Data Summary and Analysis - Mailing list pgsql-performance
From | Chris Curvey |
---|---|
Subject | Re: [GENERAL] Approach to Data Summary and Analysis |
Date | |
Msg-id | CADfwSsAabdU0ix3e08daxACy8=gv1t=KGjHHgQGm6+A6Vq6ngg@mail.gmail.com Whole thread Raw |
In response to | Approach to Data Summary and Analysis (Robert DiFalco <robert.difalco@gmail.com>) |
Responses |
Re: [GENERAL] Approach to Data Summary and Analysis
|
List | pgsql-performance |
On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco <robert.difalco@gmail.com> wrote:
I have several related tables that represent a call state. Let's think of these as phone calls to simplify things. Sometimes I need to determine the last time a user was called, the last time a user answered a call, or the last time a user completed a call.The basic schema is something like this:CREATE TABLE calls (id BIGINT NOT NULL, // sequence generatoruser_id BIGINT NOT NULL,called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,PRIMARY KEY (id),FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE);CREATE TABLE calls_answered (id BIGINT NOT NULL,answered TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,PRIMARY KEY (id),FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE);And so on for calls_connected, calls_completed, call_errors, etc.Occasionally I will want to know things like "When was the last time a user answered a call" or "How many times has a user been called".I can do these queries using a combination of MAX or COUNT. But I'm concerned about the performance.SELECT MAX(a.id)WHERE c.user_id = ?;Or the number of answered calls:Sometimes I might want to get this data for a whole bunch of users. For example, "give me all users whose have not answered a call in the last 5 days." Or even "what percentage of users called actually answered a call." This approach could become a performance issue. So the other option is to create a call_summary table that is updated with triggers.The summary table would need fields like "user_id", "last_call_id", "call_count", "last_answered_id", "answered_count", "last_completed_id", "last_completed_count", etc.My only issue with a summary table is that I don't want a bunch of null fields. For example, if the user was called but they have never answered at call then the last_call_id and call_count fields on the summary table would be non-NULL but the last_answer_id and answer_count fields WOULD be NULL. But over time all fields would eventually become non-NULL.So that leads me to a summary table for EACH call state. Each summary table would have a user id, a ref_id, and a count -- one summary table for each state e.g. call_summary, call_answered_summary, etc.This approach has the down side that it creates a lot of tables and triggers. It has the upside of being pretty efficient without having to deal with NULL values. It's also pretty easy to reason about.So for my question -- is the choice between these a personal preference sort of thing or is there a right or wrong approach? Am I missing another approach that would be better? I'm okay with SQL but I'm not expert so I'm not sure if there is an accepted DESIGN PATTERN for this that I am missing.Thanks!
CREATE TABLE calls (
id BIGINT NOT NULL, // sequence generator
user_id BIGINT NOT NULL,
called TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered TIMESTAMPTZ
PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);
I asked the Internet how to train my cat, and the Internet told me to get a dog.
pgsql-performance by date: