Re: When to store data that could be derived - Mailing list pgsql-general

From Ron
Subject Re: When to store data that could be derived
Date
Msg-id 835132f4-e1c6-d362-1b13-ad48671afe39@gmail.com
Whole thread Raw
In response to When to store data that could be derived  (Frank <frank@chagford.com>)
Responses Re: When to store data that could be derived  (Frank <frank@chagford.com>)
List pgsql-general
On 3/24/19 1:42 AM, Frank wrote:
> Hi all
>
> As I understand it, a  general rule of thumb is that you should never 
> create a physical column if the data could be derived from existing 
> columns. A possible reason for breaking this rule is for performance reasons.
>
> I have a situation where I am considering breaking the rule, but I am not 
> experienced enough in SQL to know if my reason is valid. I would 
> appreciate it if someone could glance at my 'before' and 'after' scenarios 
> and see if, from a 'gut-feel' point of view, I should proceed.
>
> I have a VIEW constructed as follows -
>
> CREATE VIEW view_name AS
> [select statement 1]
> UNION ALL
> [select statement 2]
> etc.
>
> This is one of the select statements. I will give the 'after' scenario 
> first -
>
> SELECT
>     'arec' AS tran_type, a.row_id AS tran_row_id,
>     a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_date
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_date
>     END AS tran_date,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.text
>         WHEN a.tran_type = 'cb_rec' THEN w.text
>     END AS text,
>     0 - a.arec_cust AS amount_cust,
>     0 - a.arec_local AS amount_local
> FROM ar_rec_subtran a
> LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
> LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
> LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
> LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
> WHERE
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.posted
>         WHEN a.tran_type = 'cb_rec' THEN w.posted
>     END = '1'
>
> The two columns a.arec_cust and a.arec_local *can* be derived from other 
> columns, and in fact that is how it is working at the moment, so here is 
> the 'before' scenario -
>
> SELECT
>     'arec' AS tran_type, a.row_id AS tran_row_id,
>     a.tran_number AS tran_number, a.cust_row_id AS cust_row_id,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_date
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_date
>     END AS tran_date,
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.text
>         WHEN a.tran_type = 'cb_rec' THEN w.text
>     END AS text,
>     ROUND(0 - (ROUND(a.arec_amount / CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
>     END * a.cust_exch_rate, u.scale)), 2) AS amount_cust,
>     ROUND(0 - (ROUND(a.arec_amount / CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.tran_exch_rate
>         WHEN a.tran_type = 'cb_rec' THEN w.tran_exch_rate
>     END, s.scale)), 2) AS amount_local
> FROM ar_rec_subtran a
> LEFT JOIN ar_tran_rec_det z ON z.row_id = a.tran_det_row_id
> LEFT JOIN ar_tran_rec y ON y.row_id = z.tran_row_id
> LEFT JOIN cb_tran_rec_det x ON x.row_id = a.tran_det_row_id
> LEFT JOIN cb_tran_rec w ON w.row_id = x.tran_row_id
> LEFT JOIN ar_customers v ON v.row_id = a.cust_row_id
> LEFT JOIN adm_currencies u ON u.row_id = v.currency_id
> LEFT JOIN adm_params t ON t.row_id = 1
> LEFT JOIN adm_currencies s ON s.row_id = t.local_curr_id
> WHERE
>     CASE
>         WHEN a.tran_type = 'ar_rec' THEN y.posted
>         WHEN a.tran_type = 'cb_rec' THEN w.posted
>     END = '1'
>
> As you can see, complexity has increased and there are four additional JOINs.
>
> I am expecting the VIEW to be used extensively for query purposes, and my 
> gut-feel says that the second one is likely to lead to performance 
> problems in a system with a lot of data and a lot of users.

Generate an artificial load and test it?

> I am not looking for an answer - I know that I should create dummy data 
> and run some timing tests. I was just wondering if someone more 
> experienced would wince when they look at the second SELECT, or if they 
> would shrug and think that it looks fine.

Sure the second query joins a lot of tables, but is pretty straightforward.

What REALLY worries me is whether or not the query optimiser would look at 
the WHERE CASE, run away screaming and then make it use sequential scans.  
Thus, even query #1 would be slow.

>
> Any input will be appreciated.


Is this a historical data set that's never updated, or current data that's 
constantly added to?

>
> Frank Millman




-- 
Angular momentum makes the world go 'round.



pgsql-general by date:

Previous
From: Frank
Date:
Subject: When to store data that could be derived
Next
From: Steve Baldwin
Date:
Subject: Re: regr_slope returning NULL