On 8/9/06, Stefano B. <stefano.bonnin@comai.to> wrote:
>
>
> hi,
> I'm working on the implementation of a data warehouse on Postgres.
>
> For analysis of aggregated data I'd like to use an OLAP tool such as
> Mondrian, OR use materialized views (if better).
>
> My questions:
> is there any documentation about data warehousing (with OLAP or materialized
> views) on Postgres?
> Which is better: OLAP tools (I think) or materialized views?
>
> Any information is welcome!
> Thanks
You'll obviously have to identify all possible changes to your
normalized data that affect the consistency of the data warehouse.
As far as I know, pgsql doesn't directly support materialized views,
so you'll have to write one or more update_warehouse functions and do
one of the following:
1.) set up a collection of triggers to keep track of all changes since
the last synchronization, and set up a periodic task (probably using
pgAgent) which will invoke the update_warehouse functions, which will
then update the affected records (a rough sketch follows below)
2.) set up a collection of triggers directly calling your
update_warehouse functions (a synchronous sketch also follows below)
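In case a concrete example helps, here's a rough, untested sketch of
the first approach. All the names (sales, sales_by_day, sales_changes,
log_sales_change, update_warehouse) are made up - just a toy fact
table with a single daily aggregate:

-- "sales" is the normalized fact table, "sales_by_day" the warehouse
-- aggregate, "sales_changes" the change log filled by the trigger
CREATE TABLE sales_changes (
    sale_date  date NOT NULL,
    changed_at timestamp NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_sales_change() RETURNS trigger AS $$
BEGIN
    -- remember which days were touched; the periodic job recomputes only those
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        INSERT INTO sales_changes (sale_date) VALUES (OLD.sale_date);
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        INSERT INTO sales_changes (sale_date) VALUES (NEW.sale_date);
    END IF;
    RETURN NULL;   -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sales_change_log
    AFTER INSERT OR UPDATE OR DELETE ON sales
    FOR EACH ROW EXECUTE PROCEDURE log_sales_change();

-- invoked periodically, e.g. from a pgAgent job or cron + psql
CREATE OR REPLACE FUNCTION update_warehouse() RETURNS void AS $$
BEGIN
    DELETE FROM sales_by_day
     WHERE sale_date IN (SELECT sale_date FROM sales_changes);

    INSERT INTO sales_by_day (sale_date, total_amount)
    SELECT sale_date, sum(amount)
      FROM sales
     WHERE sale_date IN (SELECT sale_date FROM sales_changes)
     GROUP BY sale_date;

    -- in a busy system you'd only delete log rows processed by this run
    DELETE FROM sales_changes;
END;
$$ LANGUAGE plpgsql;

You'd then schedule "SELECT update_warehouse();" from pgAgent (or plain
cron + psql) at whatever interval your analysts can live with.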
The first option is what probably 99% of users need because of the
nature of the queries typically run against warehouse data. An hour,
a day or even a week of lag in the latest data very often makes no
difference when analyzing an OLAP cube, but the exact tolerance level
obviously depends on the queries the analysts actually run, the
resources available/needed to refresh the warehouse, the nature of the
data, etc.
The second option is nice in that it keeps the relational data in sync
with the warehouse, but it can only be implemented in systems where
the update load is tolerable, changes are limited in scope, and the
update triggers are highly focused on the scope of each change. I can
imagine circumstances where you'd need such a setup, but most of the
time it's just a theoretical possibility.
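For completeness, a synchronous variant along the lines of option 2
could look roughly like this (same made-up tables as above; note it
glosses over the race between two concurrent inserts for a brand-new
day):

-- the trigger adjusts the aggregate row for the affected day
-- immediately, so the cube is always up to date
CREATE OR REPLACE FUNCTION sync_sales_by_day() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        UPDATE sales_by_day
           SET total_amount = total_amount - OLD.amount
         WHERE sale_date = OLD.sale_date;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE sales_by_day
           SET total_amount = total_amount + NEW.amount
         WHERE sale_date = NEW.sale_date;
        IF NOT FOUND THEN
            -- first sale on that day: create the aggregate row
            INSERT INTO sales_by_day (sale_date, total_amount)
            VALUES (NEW.sale_date, NEW.amount);
        END IF;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sales_sync
    AFTER INSERT OR UPDATE OR DELETE ON sales
    FOR EACH ROW EXECUTE PROCEDURE sync_sales_by_day();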
I have implemented the second approach, using the very good Mondrian
OLAP server and the stunning JRubik analysis interface, in one of my
systems (small database, several dozen MB, less than a million records
total) at the price of a ~2 sec delay when updating a row - an action
that happens about 50-60 times per day. The reason was very
non-technical: we needed to be able to say "you can analyze the data
in real time".
Hope this helps.
Cheers,
t.n.a.