Thread: Best practices for aggregate table design

Best practices for aggregate table design

From
droberts
Date:
Hi,
I'm trying to construct an agg table to capture phone call data and group by
state, city and time but also want just general measures by month. I'm
thinking to have this:

month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls

2015-01 12 2 54 2 56
2015-01 10 4 147 15 162
2015-01 null null 201 17 218


-----------------------

and a dimension table to easily convert city, state into their string
versions and also provide other attributes (e.g. GPS coordinates).

My questions are:
1. I'm including 'total_calls' in the schema even thought it could easily be
calculated from inbound + outbound. I did this for simplicity in a REST
call, is that a bad idea?
2. I'm adding a 'null' row to show all the calls for a given month
regardless of city or state, again to simplify the client side. It adds a
row and is somewhat sparse but preferrable by the developer. Acceptable
practice?

Thanks!



--
View this message in context: http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Best practices for aggregate table design

From
"David G. Johnston"
Date:
On Tue, Oct 6, 2015 at 11:59 AM, droberts <david.roberts@riverbed.com> wrote:
Hi,
I'm trying to construct an agg table to capture phone call data and group by
state, city and time but also want just general measures by month. I'm
thinking to have this:

month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls

2015-01 12 2 54 2 56
2015-01 10 4 147 15 162
2015-01 null null 201 17 218


-----------------------

and a dimension table to easily convert city, state into their string
versions and also provide other attributes (e.g. GPS coordinates).

My questions are:
1. I'm including 'total_calls' in the schema even thought it could easily be
calculated from inbound + outbound. I did this for simplicity in a REST
call, is that a bad idea?

​Hard to say​ given the limited insight into the use case.  More concerned about writing since on the read side you can easily wrap the table in a view that provides the derived value as a column.  You are also trading space for processing power.  You only end up processing the small subset actively being queried presently while you end up storing the derived data for every single record even if it is likely never to be queried again - or at least queried in a highly time-sensitive environment.

2. I'm adding a 'null' row to show all the calls for a given month
regardless of city or state, again to simplify the client side. It adds a
row and is somewhat sparse but preferrable by the developer. Acceptable
practice?

​I would make up a city_id and state_id representing "ALL" and use that in place of NULL.

Storing derived information is a matter of making calculated trade-offs in risking data anomalies in exchange for performance benefits.  Choosing to go this route is likely worthwhile if you can execute it correctly.

Neither choice is flat-out wrong.  Beyond that it takes more information than provided to pass judgement.

David J.

Re: Best practices for aggregate table design

From
droberts
Date:
Thanks for your response.  One more follow-up question.

Is there a best practice when to create a measure that includes a
property/dimension?

Let me give an example, say in my example where I have outbound and inbound
calls.

Is is best to have measures:
 -total_inbound
 -total_outbound

OR a dimension 'type' with values outbound/inbound and a single measure
column 'total' ?

I think i would have more rows but fewer columns.  Perhaps it depends upon
the number of possible options there would be (e.g. is it always just
outbound/inbound or would they ever grow further)



--
View this message in context:
http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5868958.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Best practices for aggregate table design

From
John R Pierce
Date:
On 10/6/2015 11:24 AM, droberts wrote:
> OR a dimension 'type' with values outbound/inbound and a single measure
> column 'total' ?

that smells a bit too much like an "EAV" (entity-attribute-value) which
is considered an antipattern in relational circles

--
john r pierce, recycling bits in santa cruz



Re: Best practices for aggregate table design

From
"David G. Johnston"
Date:
On Tue, Oct 6, 2015 at 2:34 PM, John R Pierce <pierce@hogranch.com> wrote:
On 10/6/2015 11:24 AM, droberts wrote:
OR a dimension 'type' with values outbound/inbound and a single measure
column 'total' ?

that smells a bit too much like an "EAV" (entity-attribute-value) which is considered an antipattern in relational circles

​​All models are wrong - but some are useful.

In this example what if you care to distinguish between external and internal variations of inbound/outbound. i.e., Inter-office calls.
​While this could degrade to EAV used in moderation its called normalization.  Think about the best way to record phone numbers for a person.  This is similar.

The more arbitrary the distinction the more trouble you can get into with using columns instead of rows.  Typically if its simply "one, two, three" you are asking to get hosed.  When it is more attribute based you are generally safe in-so-far as useful questions that your model is expected to answer can be answered with a minimum of difficulty.  And if you end up having to answer different questions than designed for the choice of row-vs-column probably will be the smallest of the changes that need to be made.

In a data warehouse situation the choice should also be informed but how the source system models the data in question.  Which itself is based upon the real-world applicability of the model as it pertains to the users of the system.

David J.

 

Re: Best practices for aggregate table design

From
droberts
Date:
Okay, so is it safe to say I should use loosely use these guidelines when
deciding whether to model an attribute as a dimension
(type=[inbound,outbound]) vs. bundling with a measure (total_inbound) ?

If you know the number of values for a dimension are fixed (e.g. boolean),
then creating a measure will have benefits of:
  - reduced number of rows/storage
  - better performance since less indexing/vacuuming

the drawbacks are:
  -rigid structure, not very extensible over time (e.g. later realize I need
to also track 'internal' calls).

In my case, I'm now needing to add another measure 'encrypted=true/false',
so my table is starting to look like

month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls_inbound_encr | total_calls_outbound_encr |

getting a bit hairy but the alternative seems like it would start growing
too quickly in rows and more I/O for inserts.

month | city_id | state_id | encrypted  | type  |  total_calls

2015-01 12 2 true, false, 56
2015-01 10 4  true, true, 147
2015-01 null null 201 17 218




--
View this message in context:
http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5868967.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Best practices for aggregate table design

From
"David G. Johnston"
Date:
On Tue, Oct 6, 2015 at 2:53 PM, droberts <david.roberts@riverbed.com> wrote:

month | city_id | state_id | total_calls_inbound | total_calls_outbound |
total_calls_inbound_encr | total_calls_outbound_encr |

getting a bit hairy but the alternative seems like it would start growing
too quickly in rows and more I/O for inserts.

month | city_id | state_id | encrypted  | type  |  total_calls

2015-01 12 2 true, false, 56
2015-01 10 4  true, true, 147
2015-01 null null 201 17 218

​A bit beyond my experience but my gut tells me I would choose the later over the former.  It even has a name - star schema.

David J.

Re: Best practices for aggregate table design

From
Marc Mamin
Date:
>2. I'm adding a 'null' row to show all the calls for a given month
>regardless of city or state, again to simplify the client side. It adds a
>row and is somewhat sparse but preferrable by the developer. Acceptable
>practice?

do you see any advantage with this model?
I would store your monthly data within a separate table.

regards,
Marc Mamin

Re: Best practices for aggregate table design

From
droberts
Date:
I see the advantage is for the developer.  We right one REST API call that
leverages this single table regardless whether he wants groups by city for a
month or total for a month.  Creating a separate table would make the
backend a bit more complex is all and wouldn't save on space I don't think.



--
View this message in context:
http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5869195.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Best practices for aggregate table design

From
Thomas Kellerer
Date:
droberts schrieb am 06.10.2015 um 20:53:
> Okay, so is it safe to say I should use loosely use these guidelines when
> deciding whether to model an attribute as a dimension
> (type=[inbound,outbound]) vs. bundling with a measure (total_inbound) ?
>
> If you know the number of values for a dimension are fixed (e.g. boolean),
> then creating a measure will have benefits of:
>   - reduced number of rows/storage
>   - better performance since less indexing/vacuuming
>
> the drawbacks are:
>   -rigid structure, not very extensible over time (e.g. later realize I need
> to also track 'internal' calls).
>
> In my case, I'm now needing to add another measure 'encrypted=true/false',
> so my table is starting to look like

Have you considered using a hstore column to store the attributes you don't know yet?

Which makes this extensible, flexible and fast.



Re: Best practices for aggregate table design

From
hari.fuchs@gmail.com
Date:
Thomas Kellerer <spam_eater@gmx.net> writes:

> droberts schrieb am 06.10.2015 um 20:53:
>> Okay, so is it safe to say I should use loosely use these guidelines when
>> deciding whether to model an attribute as a dimension
>> (type=[inbound,outbound]) vs. bundling with a measure (total_inbound) ?
>>
>> If you know the number of values for a dimension are fixed (e.g. boolean),
>> then creating a measure will have benefits of:
>>   - reduced number of rows/storage
>>   - better performance since less indexing/vacuuming
>>
>> the drawbacks are:
>>   -rigid structure, not very extensible over time (e.g. later realize I need
>> to also track 'internal' calls).
>>
>> In my case, I'm now needing to add another measure 'encrypted=true/false',
>> so my table is starting to look like
>
> Have you considered using a hstore column to store the attributes you
> don't know yet?
>
> Which makes this extensible, flexible and fast.

Is there an advantage of hstore vs. json/jsonb?

Re: Best practices for aggregate table design

From
Vick Khera
Date:

On Thu, Oct 8, 2015 at 3:49 AM, <hari.fuchs@gmail.com> wrote:
Is there an advantage of hstore vs. json/jsonb?

Definitely over json because that is not indexable. I'm not seeing an advantage over jsonb unless you want to prevent storing complex data structures.

Re: Best practices for aggregate table design

From
droberts
Date:
I haven't but wouldn't it be better to wait and just add new columns if/when
I need to?



--
View this message in context:
http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5869372.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Best practices for aggregate table design

From
"David G. Johnston"
Date:
Nabble has a "quote" feature - please use it.

On Thu, Oct 8, 2015 at 5:00 PM, droberts <david.roberts@riverbed.com> wrote:
I haven't but wouldn't it be better to wait and just add new columns if/when
I need to?

​Its worth keeping in the back of your mind but I tend to think that choosing hstore in order to "be flexible" is a solution looking for a problem.  The dynamics between your system and the source system will play a large role in determining whether to use a mini-table-within-a-table in your model.

David J.

Re: Best practices for aggregate table design

From
droberts
Date:
David G Johnston wrote
> Nabble has a "quote" feature - please use it.
>
> On Thu, Oct 8, 2015 at 5:00 PM, droberts <

> david.roberts@

> > wrote:
>
>> I haven't but wouldn't it be better to wait and just add new columns
>> if/when
>> I need to?
>>
>
> ​Its worth keeping in the back of your mind but I tend to think that
> choosing hstore in order to "be flexible" is a solution looking for a
> problem.  The dynamics between your system and the source system will play
> a large role in determining whether to use a mini-table-within-a-table in
> your model.
>
> David J.
> ​


Thanks for everyone's help.  Can anyone recommend a good book on database
modeling, around these 'cube' or aggregate concepts in particular?  I'm
using Postgres but shouldn't matter too much I assume.

Thanks



--
View this message in context:
http://postgresql.nabble.com/Best-practices-for-aggregate-table-design-tp5868940p5869500.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Best practices for aggregate table design

From
Roxanne Reid-Bennett
Date:
On 10/9/2015 3:16 PM, droberts wrote:
> Thanks for everyone's help. Can anyone recommend a good book on
> database modeling, around these 'cube' or aggregate concepts in
> particular? I'm using Postgres but shouldn't matter too much I assume.

Given the shift towards NoSQL for BI, and the age of the consultants, I
don't know how much of a driver of design patterns the Kimball Group
still is in the BI/Data warehouse space - but their blogs and books have
been good fodder for thought for the modeling I've done.

Roxanne


Re: Best practices for aggregate table design

From
John R Pierce
Date:
On 10/9/2015 6:07 PM, Roxanne Reid-Bennett wrote:
> On 10/9/2015 3:16 PM, droberts wrote:
>> Thanks for everyone's help. Can anyone recommend a good book on
>> database modeling, around these 'cube' or aggregate concepts in
>> particular? I'm using Postgres but shouldn't matter too much I assume.
>
> Given the shift towards NoSQL for BI, and the age of the consultants,
> I don't know how much of a driver of design patterns the Kimball Group
> still is in the BI/Data warehouse space - but their blogs and books
> have been good fodder for thought for the modeling I've done.

actually, there've been numerous cases where NoSQL deployments never
worked right, and people reverted to SQL for reporting.

--
john r pierce, recycling bits in santa cruz