Thread: Postgres and data warehouses
I'm looking at making a data warehouse to address our rapidly spiralling report query times against the OLTP. I'm looking first at what it would take to make this a real-time data warehouse, as opposed to batch-driven.
One approach I've seen used to achieve real-time data warehousing is to have middleware that intercepts all database writes and echoes them to a program that rolls up the data warehouse values and updates the facts, dimensions and so on on the fly. Another approach is to use triggers on the tables of interest to write the changes to journal tables, which then get processed by a batch job to achieve the same thing.
One problem with the trigger-on-the-transactional-database approach is that if there is ever a problem with the trigger, the main transaction is affected. I'm not sure if that is avoidable with proper exception handling in the trigger code? It also means a lot of trigger code to maintain, and another schema to maintain (the journalled changes), so there are several drawbacks.
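For what it's worth, the exception-handling question can be sketched directly. A minimal journaling trigger that guards the main transaction with its own exception handler might look roughly like this (the orders table and column names are hypothetical; note that a plpgsql exception block opens a subtransaction, which adds some overhead per row):

```sql
-- Hypothetical journal table: which row changed, and how.
CREATE TABLE orders_journal (
    journal_id  bigserial PRIMARY KEY,
    order_id    integer NOT NULL,
    op          char(1) NOT NULL,      -- 'I', 'U' or 'D'
    changed_at  timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION orders_journal_fn() RETURNS trigger AS $$
DECLARE
    v_id integer;
BEGIN
    -- NEW is not assigned for DELETE triggers, so branch explicitly.
    IF TG_OP = 'DELETE' THEN
        v_id := OLD.order_id;
    ELSE
        v_id := NEW.order_id;
    END IF;

    BEGIN
        INSERT INTO orders_journal (order_id, op)
        VALUES (v_id, left(TG_OP, 1));
    EXCEPTION WHEN OTHERS THEN
        -- Swallow journal failures so the main transaction survives;
        -- this trades journal completeness for availability.
        RAISE WARNING 'journal trigger failed: %', SQLERRM;
    END;

    RETURN NULL;  -- AFTER trigger; return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_journal_trg
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_journal_fn();
```

Whether swallowing the error is actually desirable is a policy question: the main transaction commits, but the corresponding journal row may be silently missing.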
I've had a quick search and haven't seen this approach used yet, but I was thinking the asynchronous replication of Postgres databases could be used as a streaming journal of changes to be processed by a data warehouse. The other approach that suggests itself is WAL file shipping. I've not dug into the async rep protocol yet; before I do so, I just wanted to get some brief feedback on whether I'm on the wrong track or not, and whether there's some better approach I should be looking at first instead.
Any feedback much appreciated.
Regards,
Nigel
* Nigel Gardiner (nigelgardiner@gmail.com) wrote:
> I've had a quick search and haven't seen this approach used yet, but I was
> thinking, the asynchronous replication of Postgres databases could be used
> as a streaming journal of changes to be processed by a data warehouse. The
> other approach that suggests itself is WAL file shipping. I've not dug into
> the async rep protocol yet, before I do so I just wanted to get some brief
> feedback on whether I'm on the wrong track or not, and if there's some
> better approach I should be looking at first instead.

Using a streaming replica for data warehouse queries is quite common. The
issue there is if you want to change the data structure or store data on
the replica, because a streaming replica based on WAL shipping is
read-only. You can create FDW tables which you can write to (even on the
replica), or you can create an independent database which has FDW tables
to the replica. They have their own pros and cons, of course.

Thanks!

Stephen
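The second variant Stephen describes — an independent reporting database that reaches the replica through foreign tables — might be sketched roughly like this, assuming postgres_fdw (9.3+); the server name, credentials, and the orders table are all hypothetical:

```sql
-- In the independent warehouse database: attach the replica's tables
-- via postgres_fdw, keeping local, writable warehouse tables alongside.
CREATE EXTENSION postgres_fdw;

CREATE SERVER replica_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica.example.com', dbname 'oltp', port '5432');

CREATE USER MAPPING FOR CURRENT_USER SERVER replica_srv
    OPTIONS (user 'report_reader', password 'secret');

-- Hypothetical OLTP source table exposed as a foreign table.
CREATE FOREIGN TABLE orders_remote (
    order_id   integer,
    amount     numeric,
    created_at timestamptz
) SERVER replica_srv
  OPTIONS (schema_name 'public', table_name 'orders');

-- A local, fully writable fact table rolled up from the replica's data.
CREATE TABLE daily_sales AS
    SELECT created_at::date AS sale_date, sum(amount) AS total
    FROM orders_remote
    GROUP BY 1;
```

The warehouse schema is then free to diverge from the OLTP schema, at the cost of the FDW round-trips when refreshing the rollups.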
On Sun, 8 Mar 2015 11:20:16 -0400
Stephen Frost <sfrost@snowman.net> wrote:

> * Nigel Gardiner (nigelgardiner@gmail.com) wrote:
> > I've had a quick search and haven't seen this approach used yet, but I
> > was thinking, the asynchronous replication of Postgres databases could
> > be used as a streaming journal of changes to be processed by a data
> > warehouse. The other approach that suggests itself is WAL file
> > shipping. I've not dug into the async rep protocol yet, before I do so
> > I just wanted to get some brief feedback on whether I'm on the wrong
> > track or not, and if there's some better approach I should be looking
> > at first instead.
>
> Using a streaming replica for data warehouse queries is quite common.
> The issue there is if you want to change the data structure or store
> data on the replica because a streaming replica based on WAL shipping is
> read-only. You can create FDW tables which you can write to (even on
> the replica) or you can create an independent database which has FDW
> tables to the replica. They have their own pros and cons, of course.

Another option is to replicate using Slony instead of streaming, which
allows you to create additional tables on the replica that are read/write,
in addition to triggers that only fire on the replica. It's complicated,
but pretty damn powerful.

--
Bill Moran
On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
> I'm looking at making a data warehouse to address our rapidly
> spiralling report query times against the OLTP. I'm looking first at
> what it would take to make this a real-time data warehouse, as opposed
> to batch-driven.

We use a hybrid architecture. We have a 'republisher' process that
repeatedly slurps new data from the OLTP database and sends it to the
back end databases, using a 'publish/subscribe' messaging bus. Several
back end databases subscribe to this data, and their subscriber processes
insert the incoming data into those OLAP and various other reporting
databases. This way the reporting databases can have completely different
schemas, optimized for their needs, and have different retention
requirements than the OLTP database.

This republisher is usually within a few seconds of live new data. In our
case it's made fairly easy to track 'new' because all our OLTP
transactions are event-oriented.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast
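The "track 'new' by event" part of a republisher like the one described above reduces to a high-water-mark poll; a rough sketch, with a hypothetical events table and a `:last_published_id` placeholder held by the republisher between polls:

```sql
-- Republisher loop body: fetch events newer than the last one published.
-- :last_published_id is client-side state, not a server feature.
SELECT event_id, event_type, payload
FROM oltp_events
WHERE event_id > :last_published_id
ORDER BY event_id
LIMIT 1000;

-- The republisher then publishes each row to the message bus and
-- advances last_published_id to the max(event_id) of the batch.
```

This only works cleanly when, as John notes, the OLTP workload is event-oriented: a monotonically increasing event id (or timestamp) must exist to poll against.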
I strongly suspect many would like further details on your implementation.
A user conference session, well recorded, perhaps?

> On Mar 8, 2015, at 4:57 PM, John R Pierce <pierce@hogranch.com> wrote:
>
>> On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
>> I'm looking at making a data warehouse to address our rapidly
>> spiralling report query times against the OLTP. I'm looking first at
>> what it would take to make this a real-time data warehouse, as opposed
>> to batch-driven.
>
> we use a hybrid architecture. we have a 'republisher' process that
> repeatedly slurps new data from the OLTP database and sends it to the
> back end databases, using a 'publish/subscribe' messaging bus. several
> back end databases subscribe to this data, and their subscriber
> processes insert the incoming data into those OLAP and various other
> reporting databases. this way the reporting databases can have
> completely different schemas optimized for their needs, and have
> different retention requirements than the OLTP database.
>
> this republisher is usually within a few seconds of live new data. in
> our case it's made fairly easy to track 'new' because all our OLTP
> transactions are event-oriented.
>
> --
> john r pierce                                      37N 122W
> somewhere on the middle of the left coast
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
On 3/8/2015 8:24 PM, Rob Sargent wrote:
> I strongly suspect many would like further details on your
> implementation. A user conference session well recorded perhaps?

The details are proprietary and quite specific to our workload and
requirements.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast
I expected the latter, but … some proprietors like to brag :)

Standing down.

> On Mar 8, 2015, at 9:56 PM, John R Pierce <pierce@hogranch.com> wrote:
>
> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>> I strongly suspect many would like further details on your
>> implementation. A user conference session well recorded perhaps?
>
> the details are proprietary and quite specific to our workload and
> requirements.
>
> --
> john r pierce                                      37N 122W
> somewhere on the middle of the left coast
Nigel Gardiner <nigelgardiner@gmail.com> writes:

> I'm looking at making a data warehouse to address our rapidly spiralling
> report query times against the OLTP. I'm looking first at what it would
> take to make this a real-time data warehouse, as opposed to batch-driven.
>
> One approach I've seen used to achieve real time data warehousing is to
> have middleware that is intercepting all database writes and echoing them
> to a program that rolls up the data warehouse values and updates the
> facts, dimensions and so on on the fly. Another approach was to use
> triggers on the tables of interest to write to tables to journal the
> changes, which then get processed by a batch job to achieve the same
> thing.
>
> One of the problems of the trigger on the transactional database
> approach is that if there is ever a problem with the trigger, the main
> transaction is affected. I'm not sure if that is avoidable with proper
> exception handling in the trigger code? It does mean a lot of trigger
> code to maintain, and another schema to maintain (the journalled
> changes), so there were several drawbacks.

Firing a trigger on INS, UPD, DEL that simply loads a journal table with
the PK value and the event type is so utterly trivial as to be a non-issue
anywhere but the most high-impact environments.

> I've had a quick search and haven't seen this approach used yet, but I
> was thinking, the asynchronous replication of Postgres databases could
> be used as a streaming journal of changes to be processed by a data
> warehouse. The other approach that suggests itself is WAL file shipping.
> I've not dug into the async rep protocol yet, before I do so I just
> wanted to get some brief feedback on whether I'm on the wrong track or
> not, and if there's some better approach I should be looking at first
> instead.

Consider whether the new logical change set ("logical decoding") features
of 9.4 might apply to your case. They may accomplish the same as a
trigger-based solution without all the extra supporting structures
hitherto necessary.
> Any feedback much appreciated.
>
> Regards,
> Nigel

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
On 3/8/15 11:02 PM, Rob Sargent wrote:
> I expected the latter but … some proprietors like to brag :)
>
> Standing down.
>
>> On Mar 8, 2015, at 9:56 PM, John R Pierce <pierce@hogranch.com> wrote:
>>
>> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>>> I strongly suspect many would like further details on your
>>> implementation. A user conference session well recorded perhaps?
>>
>> the details are proprietary and quite specific to our workload and
>> requirements.

You could build something similar with Skytools; it's what Skype actually
created it for. They ran many "real time" datamarts/warehouses using it.

The big difference between it and Slony is that Skytools is modular, so
it's designed to allow for this kind of flexibility.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
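The queueing piece of Skytools is PgQ, driven from SQL. A rough sketch of the producer/consumer flow (queue, consumer, and event names are hypothetical; a PgQ ticker daemon must be running for batches to become available):

```sql
-- Producer side (in the OLTP database, with the pgq schema installed):
SELECT pgq.create_queue('dw_events');
SELECT pgq.insert_event('dw_events', 'order_insert',
                        '{"order_id": 42, "amount": "9.99"}');

-- Consumer side (the warehouse loader):
SELECT pgq.register_consumer('dw_events', 'warehouse_loader');

-- Returns a batch id, or NULL when no new batch is ready yet.
SELECT pgq.next_batch('dw_events', 'warehouse_loader');

-- Fetch and process the batch's events (batch id 1 is illustrative):
SELECT ev_id, ev_type, ev_data
FROM pgq.get_batch_events(1);

-- Acknowledge the batch so PgQ can eventually rotate it out:
SELECT pgq.finish_batch(1);
```

In practice the consumer loop lives in a script that polls `next_batch`, applies the events to the warehouse tables, and only then calls `finish_batch`, giving at-least-once delivery.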
The logical replication features in 9.4 look very promising as an alternative to the binary replication that requires an identical schema on both ends. There's no reason that the thing you're replicating to couldn't be something like RabbitMQ or Storm.
I've been playing with a JSON-based decoder plugin so Postgres could emit changes to non-SQL systems like this. See https://bitbucket.org/btubbs/decoder_json/src. It should still be considered experimental, and only used by people unafraid of hacking on C at this point. More eyeballs would be welcome.
I blogged about more options on streaming updates from Postgres at http://btubbs.com/streaming-updates-from-postgres.html.
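The moving parts on the 9.4 side are small enough to show. A sketch using the built-in `test_decoding` plugin (a plugin like decoder_json would emit JSON instead of its text format); it assumes `wal_level = logical` and `max_replication_slots > 0` in postgresql.conf:

```sql
-- Create a logical replication slot attached to a decoder plugin.
SELECT * FROM pg_create_logical_replication_slot('dw_slot', 'test_decoding');

-- ...after some writes on the master, consume the decoded changes:
SELECT * FROM pg_logical_slot_get_changes('dw_slot', NULL, NULL);
-- Each row describes one change, along the lines of:
--   table public.orders: INSERT: order_id[integer]:42 amount[numeric]:9.99

-- Drop the slot when done; an abandoned slot pins WAL on the master.
SELECT pg_drop_replication_slot('dw_slot');
```

A consumer that feeds RabbitMQ or Storm would poll `pg_logical_slot_get_changes` (or use the streaming replication protocol via pg_recvlogical) and forward each decoded row to the bus.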
On Tue, Mar 10, 2015 at 8:11 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 3/8/15 11:02 PM, Rob Sargent wrote:
>> I expected the latter but … some proprietors like to brag :)
>>
>> Standing down.
>>
>>> On Mar 8, 2015, at 9:56 PM, John R Pierce <pierce@hogranch.com> wrote:
>>>
>>> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>>>> I strongly suspect many would like further details on your
>>>> implementation. A user conference session well recorded perhaps?
>>>
>>> the details are proprietary and quite specific to our workload and
>>> requirements.
>
> You could build something similar with Skytools; it's what Skype
> actually created it for. They ran many "real time" datamarts/warehouses
> using it.
>
> The big difference between it and Slony is that Skytools is modular, so
> it's designed to allow for this kind of flexibility.
>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com