Thread: Design ? table vs. view?

Design ? table vs. view?

From: John McKown
I have a table which has some "raw" data in it. By "raw", I mean it is minimally processed from a log file. Every week, I update this table by processing the weekly log using awk to create a "psql script" file which looks similar to:

COPY rawdata FROM STDIN;
.... lines created by awk script
\.

The table schema is:
        Table "jobrun.rawdata"
  Column  |     Type      |
----------+---------------+
 lpar     | character(4)  |
 yyddd    | character(5)  |
 timedata | character(11) |
 jobid    | character(8)  |
 msgid    | character(7)  |
 jobname  | character(8)  |

Now, this data is not really very useful in its raw form. So I "process" it via a view:

                          View "jobrun.rundata"
 Column  |           Type           |
---------+--------------------------+
 lpar    | character(4)             |
 msgid   | character(7)             |
 jobname | character(8)             |
 jobid   | character(8)             |
 msgtime | timestamp with time zone |
View definition:
 SELECT rawdata.lpar,
    rawdata.msgid,
    rawdata.jobname,
    rawdata.jobid,
    to_timestamp((rawdata.yyddd::text || ' '::text) || rawdata.timedata::text, 'YYDDD HH24:MI:SS.MS'::text) AS msgtime
   FROM rawdata;

My question is this: I do a number of SELECTs against "rundata", so would it be worthwhile to make it a table in itself? The plus of a view is that I don't need to worry about updates, and I still have the "raw" data around. In reality, this is just the first VIEW; I create three other views. Two views "subset" the data based on the contents of the "msgid" value (there are only 2 possibilities at present: 'START' and 'END'). The final view, which holds my actual information, is a FULL OUTER JOIN of the START and END subsets on lpar, jobname, and jobid:

                          View "jobrun.runinfo"
  Column  |           Type           |
----------+--------------------------+
 lpar     | character(4)             |
 jobname  | character(8)             |
 jobid    | character(8)             |
 runstart | timestamp with time zone |
 runend   | timestamp with time zone |
View definition:
 SELECT COALESCE(a.lpar, b.lpar) AS lpar,
    COALESCE(a.jobname, b.jobname) AS jobname,
    COALESCE(a.jobid, b.jobid) AS jobid,
    a.msgtime AS runstart,
    b.msgtime AS runend
   FROM runstart a
   FULL JOIN runend b ON a.lpar = b.lpar AND a.jobname = b.jobname AND a.jobid = b.jobid;
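The two subset views ("runstart" and "runend") are not shown above; assuming they simply filter "rundata" on the msgid value, their definitions would be along these lines (a reconstruction, not the actual DDL):

```sql
-- Reconstructed for illustration; only the msgid filter differs between them.
CREATE VIEW jobrun.runstart AS
    SELECT lpar, jobname, jobid, msgtime
      FROM rundata
     WHERE msgid = 'START';

CREATE VIEW jobrun.runend AS
    SELECT lpar, jobname, jobid, msgtime
      FROM rundata
     WHERE msgid = 'END';
```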

So the overhead may be quite high, because to SELECT from runinfo, PostgreSQL must expand all four views.

I'd appreciate your thoughts on whether this is OK, given that performance is currently acceptable. Mainly because this work is basically only done once a week, on Sundays. And I don't do it myself; it is done via a scheduler (not cron, but similar) which runs some scripts.

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

Re: Design ? table vs. view?

From: David G Johnston
John McKown wrote
> I have a table which has some "raw" data in it. By "raw", I mean it is
> minimally processed from a log file. Every week, I update this table by
> processing the weekly log using awk to create a "psql script" file which
> looks similar to:
>
> COPY rawdata FROM STDIN;
> .... lines created by awk script
> \.
>
> The table schema is:
>         Table "jobrun.rawdata"
>   Column  |     Type      |
> ----------+---------------+
>  lpar     | character(4)  |
>  yyddd    | character(5)  |
>  timedata | character(11) |
>  jobid    | character(8)  |
>  msgid    | character(7)  |
>  jobname  | character(8)  |
>
> Now, this data is not really very useful in its raw form. So I "process"
> it
> via a view:
>
>                           View "jobrun.rundata"
>  Column  |           Type           |
> ---------+--------------------------+
>  lpar    | character(4)             |
>  msgid   | character(7)             |
>  jobname | character(8)             |
>  jobid   | character(8)             |
>  msgtime | timestamp with time zone |
> View definition:
>  SELECT rawdata.lpar,
>     rawdata.msgid,
>     rawdata.jobname,
>     rawdata.jobid,
>     to_timestamp((rawdata.yyddd::text || ' '::text) ||
> rawdata.timedata::text, 'YYDDD HH24:MI:SS.MS'::text) AS msgtime
>    FROM rawdata;
>
> My question is this: If I do a number of SELECTs on the "rundata" table.
> So, would it be worth while to make this a table in itself? The plus of a
> view is that I don't need to worry about updates. And I still have the
> "raw" data around. In reality, this is just the first VIEW. I create three
> other views. Two views are to "subset" the data based on the contents of
> the "msgid" value (there are only 2 possibilities at present: 'START' and
> 'END'). The final view, which is my actual information is a FULL OUTER
> JOIN
> of the START and END subset, based on lpar,jobname, and jobid:
>
>                           View "jobrun.runinfo"
>   Column  |           Type           |
> ----------+--------------------------+
>  lpar     | character(4)             |
>  jobname  | character(8)             |
>  jobid    | character(8)             |
>  runstart | timestamp with time zone |
>  runend   | timestamp with time zone |
> View definition:
>  SELECT COALESCE(a.lpar, b.lpar) AS lpar,
>     COALESCE(a.jobname, b.jobname) AS jobname,
>     COALESCE(a.jobid, b.jobid) AS jobid,
>     a.msgtime AS runstart,
>     b.msgtime AS runend
>    FROM runstart a
>    FULL JOIN runend b ON a.lpar = b.lpar AND a.jobname = b.jobname AND
> a.jobid = b.jobid;
>
> So the overhead may be quite high, because to SELECT from RUNINFO,
> PostgreSQL must realize all four views.
>
> I appreciate your thoughts on if this is OK, given that performance is
> currently acceptable. Mainly because this work is basically only done once
> a week, on Sundays. And I don't do it myself, it is done via a scheduler
> (not
> cron, but similar) which runs some scripts.

I would likely make "jobrun.runinfo" into a table while leaving
"jobrun.rawdata" as-is.  I would have a function that populates "runinfo"
from "rawdata" that I would call after performing the copy to "rawdata".
There would be no views - unless you desire a view interface over "runinfo"
for API or permission reasons.
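A function along those lines might look like this -- a sketch only, assuming "runinfo" has been recreated as a plain table with the view's columns, and with the timestamp conversion and START/END split inlined since the intermediate views are gone:

```sql
-- Illustrative only; name and details are assumptions, not tested code.
CREATE OR REPLACE FUNCTION jobrun.refresh_runinfo() RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    TRUNCATE jobrun.runinfo;

    INSERT INTO jobrun.runinfo (lpar, jobname, jobid, runstart, runend)
    WITH rundata AS (
        SELECT lpar, msgid, jobname, jobid,
               to_timestamp(yyddd || ' ' || timedata,
                            'YYDDD HH24:MI:SS.MS') AS msgtime
          FROM jobrun.rawdata
    )
    SELECT COALESCE(a.lpar, b.lpar),
           COALESCE(a.jobname, b.jobname),
           COALESCE(a.jobid, b.jobid),
           a.msgtime,
           b.msgtime
      FROM (SELECT * FROM rundata WHERE msgid = 'START') a
      FULL JOIN (SELECT * FROM rundata WHERE msgid = 'END') b
        ON a.lpar = b.lpar AND a.jobname = b.jobname AND a.jobid = b.jobid;
END;
$$;
```

The weekly script would then just run `SELECT jobrun.refresh_runinfo();` after the COPY.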

Since 9.3 you can make "runinfo" an explicit MATERIALIZED VIEW and run the
REFRESH MATERIALIZED VIEW command to accomplish the same thing (9.4 adds
REFRESH ... CONCURRENTLY) - though I am not particularly familiar with the
mechanics of that feature.
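For example (a sketch; the name runinfo_mat is illustrative):

```sql
-- Define once, on top of the existing view chain:
CREATE MATERIALIZED VIEW jobrun.runinfo_mat AS
    SELECT * FROM jobrun.runinfo;

-- Then, after each weekly COPY into rawdata:
REFRESH MATERIALIZED VIEW jobrun.runinfo_mat;
```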

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Design-table-vs-view-tp5811577p5811589.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Design ? table vs. view?

From: John McKown
On Tue, Jul 15, 2014 at 8:46 AM, David G Johnston <david.g.johnston@gmail.com> wrote:
John McKown wrote
> I have a table which has some "raw" data in it. By "raw", I mean it is
> minimally processed from a log file. Every week, I update this table by
> processing the weekly log using awk to create a "psql script" file which
> looks similar to:
>
<snip>
> So the overhead may be quite high, because to SELECT from RUNINFO,
> PostgreSQL must realize all four views.
>
> I appreciate your thoughts on if this is OK, given that performance is
> currently acceptable. Mainly because this work is basically only done once
> a week, on Sundays. And I don't do it myself, it is done via a scheduler
> (not
> cron, but similar) which runs some scripts.

I would likely make "jobrun.runinfo" into a table while leaving
"jobrun.rawdata" as-is.  I would have a function that populates "runinfo"
from "rawdata" that I would call after performing the copy to "rawdata".
There would be no views - unless you desire a view interface over "runinfo"
for API or permission reasons.

In 9.4 you can (probably) make "runinfo" an explicit MATERIALIZED VIEW and
perform REFRESH command to accomplish the same thing - though I am not
particularly familiar with the mechanics of that feature.

David J.


Being the indecisive nut that I am, I am going to do both <grin/>. I will keep the current view. But when I update the rawdata, what I will then do is:

drop table if exists runinfo_table;
create table runinfo_table as select distinct * from runinfo;

I am fairly confident that there cannot be any duplicates in runinfo. But, being paranoid as well, I will do the DISTINCT just to be sure. I may change the VIEW to do that in the future, and remove it from the preceding. Since the process which updates the rawdata table is automated and runs on a Sunday, the time needed to recreate runinfo_table is not relevant to me. So I get what I want, unless I update rawdata off schedule. I cannot imagine why I would do that, since the logs from which I create it are generally only available after 17:00 local time on Sunday. Getting the logs-to-date information for the time since the last dump is basically a PITA, and my current use is not critical.

Actually, it is more a "skunkworks" project of my own to produce a set of nice graphs, using R, which _might_ turn out to be interesting to management, but the production of which _will_ help me learn PostgreSQL and R better (hopefully).
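Since a SELECT could in principle land in the window between the DROP and the CREATE, the rebuild can also be wrapped in a transaction so a reader never catches the table missing (a sketch of the same two statements):

```sql
BEGIN;
DROP TABLE IF EXISTS runinfo_table;   -- IF EXISTS keeps the very first run from failing
CREATE TABLE runinfo_table AS SELECT DISTINCT * FROM runinfo;
COMMIT;
```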

Many thanks.

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

Re: Design ? table vs. view?

From: Rémi Cura
Hey,
I guess you know all about PL/R, the R language extension for Postgres. It is very convenient, though be careful, as it has sometimes crashed my server.
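A minimal example of the kind of thing PL/R allows, assuming the extension is installed (the function name and body here are illustrative, not from the thread):

```sql
CREATE EXTENSION IF NOT EXISTS plr;

-- An aggregate helper written in R; by default PL/R exposes the SQL
-- arguments to the R body as arg1, arg2, ...
CREATE OR REPLACE FUNCTION r_median(double precision[])
RETURNS double precision AS $$
    median(arg1)
$$ LANGUAGE plr;

-- e.g. median job duration in seconds:
--   SELECT r_median(array_agg(extract(epoch FROM runend - runstart)))
--     FROM runinfo;
```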

Cheers,
Rémi-C


2014-07-16 3:42 GMT+02:00 John McKown <john.archie.mckown@gmail.com>:
On Tue, Jul 15, 2014 at 8:46 AM, David G Johnston <david.g.johnston@gmail.com> wrote:
John McKown wrote
> I have a table which has some "raw" data in it. By "raw", I mean it is
> minimally processed from a log file. Every week, I update this table by
> processing the weekly log using awk to create a "psql script" file which
> looks similar to:
>
<snip>
> So the overhead may be quite high, because to SELECT from RUNINFO,
> PostgreSQL must realize all four views.
>
> I appreciate your thoughts on if this is OK, given that performance is
> currently acceptable. Mainly because this work is basically only done once
> a week, on Sundays. And I don't do it myself, it is done via a scheduler
> (not
> cron, but similar) which runs some scripts.

I would likely make "jobrun.runinfo" into a table while leaving
"jobrun.rawdata" as-is.  I would have a function that populates "runinfo"
from "rawdata" that I would call after performing the copy to "rawdata".
There would be no views - unless you desire a view interface over "runinfo"
for API or permission reasons.

In 9.4 you can (probably) make "runinfo" an explicit MATERIALIZED VIEW and
perform REFRESH command to accomplish the same thing - though I am not
particularly familiar with the mechanics of that feature.

David J.


Being the indecisive nut that I am, I am going to do both <grin/>. I will keep the current view. But when I update the rawdata, what I will then do is:

drop table if exists runinfo_table;
create table runinfo_table as select distinct * from runinfo;

I am fairly confident that there cannot be any duplicates in runinfo. But, being paranoid as well, I will do the DISTINCT just to be sure. I may change the VIEW to do that in the future, and remove it from the preceding. Since the process which updates the rawdata table is automated and runs on a Sunday, the time needed to recreate runinfo_table is not relevant to me. So I get what I want, unless I update rawdata off schedule. I cannot imagine why I would do that, since the logs from which I create it are generally only available after 17:00 local time on Sunday. Getting the logs-to-date information for the time since the last dump is basically a PITA, and my current use is not critical.

Actually, it is more a "skunkworks" project of my own to produce a set of nice graphs, using R, which _might_ turn out to be interesting to management, but the production of which _will_ help me learn PostgreSQL and R better (hopefully).

Many thanks.

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown