Thread: Postgres and data warehouses

Postgres and data warehouses

From
Nigel Gardiner
Date:
I'm looking at making a data warehouse to address our rapidly spiralling report query times against the OLTP. I'm looking first at what it would take to make this a real-time data warehouse, as opposed to batch-driven.

One approach I've seen used to achieve real time data warehousing is to have middleware that is intercepting all database writes and echoing them to a program that rolls up the data warehouse values and updates the facts, dimensions and so on on the fly. Another approach was to use triggers on the tables of interest to write to tables to journal the changes, which then get processed by a batch job to achieve the same thing.

One of the problems of the trigger-on-the-transactional-database approach is that if there is ever a problem with the trigger, the main transaction is affected. I'm not sure if that is avoidable with proper exception handling in the trigger code? It does mean a lot of trigger code to maintain, and another schema to maintain (the journalled changes), so there are several drawbacks.

I've had a quick search and haven't seen this approach used yet, but I was thinking the asynchronous replication of Postgres databases could be used as a streaming journal of changes to be processed by a data warehouse. The other approach that suggests itself is WAL file shipping. I've not dug into the async rep protocol yet; before I do so I just wanted to get some brief feedback on whether I'm on the wrong track or not, and if there's some better approach I should be looking at first instead.

Any feedback much appreciated.

Regards,
Nigel

Re: Postgres and data warehouses

From
Stephen Frost
Date:
* Nigel Gardiner (nigelgardiner@gmail.com) wrote:
> I've had a quick search and haven't seen this approach used yet, but I was
> thinking, the asynchronous replication of Postgres databases could be used
> as a streaming journal of changes to be processed by a data warehouse. The
> other approach that suggests itself is WAL file shipping. I've not dug into
> the async rep protocol yet, before I do so I just wanted to get some brief
> feedback on whether I'm on the wrong track or not, and if there's some
> better approach I should be looking at first instead.

Using a streaming replica for data warehouse queries is quite common.
The issue there is if you want to change the data structure or store
data on the replica, because a streaming replica based on WAL shipping
is read-only.  You can create FDW tables which you can write to (even
on the replica), or you can create an independent database which has
FDW tables to the replica.  They have their own pros and cons, of course.
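The second option Stephen describes can be sketched roughly as follows. This is an illustrative sketch only: the server, table, and column names (oltp_replica, orders_src, daily_order_totals) are made up, and postgres_fdw has supported writable foreign tables since 9.3.

```sql
-- In a separate, writable warehouse database: pull data from the
-- read-only streaming replica through postgres_fdw.
CREATE EXTENSION postgres_fdw;

-- Hypothetical connection details for the replica.
CREATE SERVER oltp_replica FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica.example.com', dbname 'oltp');

CREATE USER MAPPING FOR CURRENT_USER SERVER oltp_replica
    OPTIONS (user 'report', password 'secret');

-- Foreign table mirroring a (hypothetical) source table on the replica.
CREATE FOREIGN TABLE orders_src (
    order_id   bigint,
    total      numeric,
    created_at timestamptz
) SERVER oltp_replica
  OPTIONS (schema_name 'public', table_name 'orders');

-- Local, writable rollup table populated by reading through the FDW.
CREATE TABLE daily_order_totals AS
    SELECT created_at::date AS day, sum(total) AS total
    FROM orders_src
    GROUP BY 1;
```

The warehouse database stays fully writable (its own schemas, indexes, retention), while the replica continues to serve as a near-real-time read source.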

    Thanks!

        Stephen


Re: Postgres and data warehouses

From
Bill Moran
Date:
On Sun, 8 Mar 2015 11:20:16 -0400
Stephen Frost <sfrost@snowman.net> wrote:

> * Nigel Gardiner (nigelgardiner@gmail.com) wrote:
> > I've had a quick search and haven't seen this approach used yet, but I was
> > thinking, the asynchronous replication of Postgres databases could be used
> > as a streaming journal of changes to be processed by a data warehouse. The
> > other approach that suggests itself is WAL file shipping. I've not dug into
> > the async rep protocol yet, before I do so I just wanted to get some brief
> > feedback on whether I'm on the wrong track or not, and if there's some
> > better approach I should be looking at first instead.
>
> Using a streaming replica for data warehouse queries is quite common..
> The issue there is if you want to change the data structure or store
> data on the replica because a streaming replica based on WAL shipping is
> read-only.  You can create FDW tables which you can write to (even on
> the replica) or you can create an independent database which has FDW
> tables to the replica.  They have their own pros and cons, of course.

Another option is to replicate using Slony instead of streaming, which allows
you to create additional tables on the replica that are read/write in
addition to triggers that only fire on the replica. It's complicated, but
pretty damn powerful.

--
Bill Moran


Re: Postgres and data warehouses

From
John R Pierce
Date:
On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
> I'm looking at making a data warehouse to address our rapidly
> spiralling report query times against the OLTP. I'm looking first at
> what it would take to make this a real-time data warehouse, as opposed
> to batch-driven.

we use a hybrid architecture.    we have a 'republisher' process that
repeatedly slurps new data from the OLTP database and sends it to the
back end databases, using a 'publish/subscribe' messaging bus.
several back end databases subscribe to this data, and their subscriber
processes insert the incoming data into those OLAP and various other
reporting databases.   this way the reporting databases can have
completely different schemas optimized for their needs, and have
different retention requirements than the OLTP database.

this republisher is usually within a few seconds of live new data. in
our case its made fairly easy to track 'new' because all our OLTP
transactions are event-oriented.
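The polling side of a republisher like John describes might look something like this, assuming (hypothetically) an event table with a monotonically increasing id; the table and column names are illustrative, not from his system:

```sql
-- Hypothetical republisher polling query: the process remembers the
-- last event_id it has already published, and repeatedly fetches
-- anything newer in bounded batches to push onto the message bus.
SELECT event_id, event_type, payload
FROM   oltp_events
WHERE  event_id > $1        -- last id already published
ORDER  BY event_id
LIMIT  1000;                -- ship in bounded batches
```

Event-oriented (append-only) OLTP data makes "what's new?" a cheap indexed range scan, which is why this pattern stays within seconds of live data.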


--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Postgres and data warehouses

From
Rob Sargent
Date:
I strongly suspect many would like further details on your implementation. A user conference session well recorded
perhaps?

> On Mar 8, 2015, at 4:57 PM, John R Pierce <pierce@hogranch.com> wrote:
>
>> On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
>> I'm looking at making a data warehouse to address our rapidly spiralling report query times against the OLTP. I'm looking first at what it would take to make this a real-time data warehouse, as opposed to batch-driven.
>
> we use a hybrid architecture.    we have a 'republisher' process that repeatedly slurps new data from the OLTP database and sends it to the back end databases, using a 'publish/subscribe' messaging bus.    several back end databases subscribe to this data, and their subscriber processes insert the incoming data into those OLAP and various other reporting databases.   this way the reporting databases can have completely different schemas optimized for their needs, and have different retention requirements than the OLTP database.
>
> this republisher is usually within a few seconds of live new data. in our case its made fairly easy to track 'new' because all our OLTP transactions are event-oriented.
>
>
> --
> john r pierce                                      37N 122W
> somewhere on the middle of the left coast
>
>
>


Re: Postgres and data warehouses

From
John R Pierce
Date:
On 3/8/2015 8:24 PM, Rob Sargent wrote:
> I strongly suspect many would like further details on your implementation. A user conference session well recorded perhaps?

the details are proprietary and quite specific to our workload and
requirements.



--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Postgres and data warehouses

From
Rob Sargent
Date:
I expected the latter but … some proprietors like to brag :)

Standing down.

> On Mar 8, 2015, at 9:56 PM, John R Pierce <pierce@hogranch.com> wrote:
>
> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>> I strongly suspect many would like further details on your implementation. A user conference session well recorded perhaps?
>
> the details are proprietary and quite specific to our workload and requirements.
>
>
>
> --
> john r pierce                                      37N 122W
> somewhere on the middle of the left coast
>
>
>



Re: Postgres and data warehouses

From
Jerry Sievers
Date:
Nigel Gardiner <nigelgardiner@gmail.com> writes:

> I'm looking at making a data warehouse to address our rapidly spiralling report query times against the OLTP. I'm looking first at what it would take to make this a real-time data warehouse, as opposed to batch-driven.
>
> One approach I've seen used to achieve real time data warehousing is to have middleware that is intercepting all database writes and echoing them to a program that rolls up the data warehouse values and updates the facts, dimensions and so on on the fly. Another approach was to use triggers on the tables of interest to write to tables to journal the changes, which then get processed by a batch job to achieve the same thing.
>
> One of the problems of the trigger on the transactional database
> approach is that if there is ever a problem with the trigger, the main
> transaction is affected. I'm not sure if that is avoidable with proper
> exception handling in the trigger code? It does mean a lot of trigger
> code to maintain, and another schema to maintain (the journalled
> changes), so there were several drawbacks.

Firing a trigger on INS, UPD, DEL that simply loads a journal table
with the PK value and the event type is so utterly trivial as to be a
non-issue anywhere but the most high-impact environments.
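For the archives, the journal-trigger pattern Jerry describes can be sketched like this. Table and column names (orders, orders_journal, order_id) are hypothetical, for illustration only:

```sql
-- Journal table: one row per INSERT/UPDATE/DELETE on the source table.
CREATE TABLE orders_journal (
    journal_id bigserial   PRIMARY KEY,
    order_id   bigint      NOT NULL,          -- PK of the changed row
    op         char(1)     NOT NULL,          -- 'I', 'U', or 'D'
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION orders_journal_fn() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO orders_journal (order_id, op)
        VALUES (OLD.order_id, 'D');
        RETURN OLD;
    ELSE
        -- left(TG_OP, 1) gives 'I' for INSERT, 'U' for UPDATE.
        INSERT INTO orders_journal (order_id, op)
        VALUES (NEW.order_id, left(TG_OP, 1));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_journal_trg
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_journal_fn();
```

A batch job (or the warehouse loader) then consumes orders_journal in journal_id order and deletes or marks rows as it processes them.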

> I've had a quick search and haven't seen this approach used yet, but I was thinking, the asynchronous replication of Postgres databases could be used as a streaming journal of changes to be processed by a data warehouse. The other approach that suggests itself is WAL file shipping. I've not dug into the async rep protocol yet, before I do so I just wanted to get some brief feedback on whether I'm on the wrong track or not, and if there's some better approach I should be looking at first instead.

Consider whether the new logical decoding (logical change set) features
of 9.4 might apply to your case.  They may accomplish the same as a
trigger-based solution without all the extra supporting structures
hitherto necessary.
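A minimal sketch of 9.4 logical decoding, using the test_decoding output plugin that ships with Postgres (the slot name dw_feed is arbitrary; this assumes wal_level = logical and a free replication slot in postgresql.conf):

```sql
-- Create a logical replication slot; from this point the server retains
-- WAL and can decode row-level changes through the plugin.
SELECT * FROM pg_create_logical_replication_slot('dw_feed', 'test_decoding');

-- ...run some INSERT/UPDATE/DELETE traffic on the database...

-- Consume the change stream: each row is a decoded change, which a
-- warehouse loader could parse and apply instead of trigger journals.
SELECT * FROM pg_logical_slot_get_changes('dw_feed', NULL, NULL);

-- Drop the slot when finished so WAL is no longer retained for it.
SELECT pg_drop_replication_slot('dw_feed');
```

Unlike the trigger approach, this adds no write overhead inside the OLTP transaction; the decoding happens asynchronously from the WAL.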

> Any feedback much appreciated.
>
> Regards,
> Nigel
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: Postgres and data warehouses

From
Jim Nasby
Date:
On 3/8/15 11:02 PM, Rob Sargent wrote:
> I expected the latter but … some proprietors like to brag :)
>
> Standing down.
>
>> On Mar 8, 2015, at 9:56 PM, John R Pierce <pierce@hogranch.com> wrote:
>>
>> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>>> I strongly suspect many would like further details on your implementation. A user conference session well recorded
perhaps?
>>
>> the details are proprietary and quite specific to our workload and requirements.

You could build something similar with Skytools; it's what Skype
actually created it for. They ran many "real time" datamarts/warehouses
using it.

The big difference between it and Slony is that Skytools is modular, so
it's designed to allow for this kind of flexibility.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: Postgres and data warehouses

From
Brent Tubbs
Date:
The logical replication features in 9.4 look very promising as an alternative to the binary replication that requires an identical schema on both ends. There's no reason the thing you're replicating to couldn't be something like RabbitMQ or Storm.

I've been playing with a JSON-based decoder plugin so Postgres can emit changes to non-SQL systems like this.  See https://bitbucket.org/btubbs/decoder_json/src.  It should still be considered experimental and only used by people unafraid of hacking on C at this point.  More eyeballs would be welcome.

I blogged about more options on streaming updates from Postgres at http://btubbs.com/streaming-updates-from-postgres.html.

On Tue, Mar 10, 2015 at 8:11 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 3/8/15 11:02 PM, Rob Sargent wrote:
> > I expected the latter but … some proprietors like to brag :)
> >
> > Standing down.
> >
> >> On Mar 8, 2015, at 9:56 PM, John R Pierce <pierce@hogranch.com> wrote:
> >>
> >> On 3/8/2015 8:24 PM, Rob Sargent wrote:
> >> I strongly suspect many would like further details on your implementation. A user conference session well recorded perhaps?
> >>
> >> the details are proprietary and quite specific to our workload and requirements.
>
> You could build something similar with Skytools; it's what Skype actually created it for. They ran many "real time" datamarts/warehouses using it.
>
> The big difference between it and Slony is that Skytools is modular, so it's designed to allow for this kind of flexibility.
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com


