Thread: Very high latency, low bandwidth replication
Hi
I have been grappling with a problem for some time and would appreciate some advice. We have a web-based public health application with a PostgreSQL backing store, designed for use by public sector ministries of health in a significant number of African, Asian and other countries (http://dhis2.org). "Traditionally" it is hosted as a national data warehouse application, with users dispersed among district offices and sometimes health facilities around the country.
In many African countries in particular, the public sector typically has limited data centre infrastructure to reliably host the application in-house, so a good number have opted to use a global cloud service (infrastructure as a service) to ensure maximum availability of the application. Others have managed to make use of in-country resources such as national ISPs and mobile companies. There are many cost-benefit and governance considerations behind these decisions which I don't need to go into here.
While ministries have been prepared to do this, there are important reasons to ensure that a backup of the database is also maintained within the ministry. So each night we attempt to pull a snapshot backup of the database back to the ministry. In the past I have attempted this somewhat simplistically with rsync over ssh, but it is a very inefficient approach, particularly over weak internet connections.
What are people's thoughts about a more optimal solution? I would like to use a more incremental approach to replication. This does not have to be "live" replication .. asynchronously triggering once every 24 hours is sufficient. Also, only a subset of tables is required (the rest consist of generated data).
Appreciate any advice.
Regards
Bob
Hi Bob.

On Mon, Jun 30, 2014 at 10:05 AM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
> What are people's thoughts about a more optimal solution? I would like to
> use a more incremental approach to replication. This does not have to be
> "live" replication .. asynchronously triggering once every 24 hours is
> sufficient. Also, only a subset of tables is required (the rest consist
> of generated data).

If you only need to replicate once every 24 hours, which means you can tolerate lag, you could try log shipping. Instead of sending the WAL records from master to standby directly, just spool them, compress them as much as you can (I would try pglesslog plus an XZ on its output), and send them once a day. That covers the 'incremental' part. For the 'only a subset of tables' part, you could try to set up a local partial mirror using any of the trigger-based replication products and then do log shipping of that.

Also, the logical replication slot stuff added to the latest version seems really promising for this kind of thing, but I'm not familiar enough with it to recommend anything.

Also, depending on your data updating patterns, database sizes and other stuff, a trigger-based replication approach can save a lot of traffic. I mean, if you have records which are heavily updated but only replicated once a day, you can collapse all of a day's changes into a single update. I once did a similar thing to transmit deltas over a 2400bps modem by making daily sorted dumps and sending daily deltas against the previous day's dump (it needed a bit of coding, about a couple of hundred lines, but produced ridiculously small deltas, and with a bit of care applying them was idempotent, which simplified recovery on errors).

Francisco Olarte.
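For what it's worth, a minimal sketch of that spool-and-compress setup might look something like the following. The directory names, compression level and scheduling are purely illustrative, and the pglesslog step is left out since its exact invocation depends on the version installed:

    # postgresql.conf on the master: archive each completed WAL segment
    # into a local spool directory rather than shipping it immediately
    wal_level = archive
    archive_mode = on
    archive_command = 'test ! -f /var/spool/pg_wal/%f && cp %p /var/spool/pg_wal/%f'

    # daily cron job on the master: compress the day's segments and push
    # them to the ministry's server in a single batch
    cd /var/spool/pg_wal
    xz -9 ./*                                # compress every spooled segment in place
    rsync -av --remove-source-files ./*.xz backup-host:/srv/pg_wal_incoming/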
On 30 June 2014 15:05, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
> What are people's thoughts about a more optimal solution? I would like to
> use a more incremental approach to replication. This does not have to be
> "live" replication .. asynchronously triggering once every 24 hours is
> sufficient. Also, only a subset of tables is required (the rest consist
> of generated data).

WAL shipping is probably best here. Configure an archive_command on the master to compress and push logs to cloud storage, and configure a hot standby on site to pull and decompress the logs. The wal-e tool may make things simpler for pushing to cloud storage, or just follow the PostgreSQL documentation to archive the WAL files to a filesystem.

If that isn't good enough, you can look at more esoteric approaches (eg. nightly plaintext dumps to a git repository, pushing changes to disk on site).

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/
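To illustrate the receiving side Stuart describes, a rough sketch assuming the 9.x-style recovery.conf; the paths are invented, and the wal-e lines are shown only as the alternative push-to-cloud route following its envdir pattern:

    # recovery.conf on the on-site hot standby: decompress each requested
    # segment from the incoming directory; segments that have not arrived
    # yet simply make restore_command fail and get retried later
    standby_mode = 'on'
    restore_command = 'xz -dc /srv/pg_wal_incoming/%f.xz > %p'

    # or, if pushing straight to cloud storage with wal-e on the master:
    # archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
    # and on the standby:
    # restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'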
Thanks Francisco for these inputs. I hadn't considered log shipping, as I knew I didn't want to track changes to all tables (and databases). Setting up a local partial mirror is an interesting thought which hadn't crossed my mind .. I'll give that some consideration.
Though currently I am thinking of addressing the problem of generating deltas at the application level, rather than using PostgreSQL features which are largely optimized for a slightly different set of circumstances and requirements.
Impressive what can be done with a 2400 baud modem when you set your mind to it. Fortunately those days are mostly behind us :-)
Thanks Stuart. I'll do some measurements on plaintext dump to git.
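In case it helps with those measurements, a rough sketch of what the nightly dump-to-git job might look like; the database and table names here are just placeholders, and only the tables actually needed are dumped:

    #!/bin/sh
    # nightly: plain-text dump of the required tables, committed to a git
    # repository; git's delta compression means the push over the weak
    # link only carries roughly what changed since the previous day
    set -e
    cd /srv/dhis2-backup
    pg_dump --format=plain --table=datavalue --table=organisationunit \
        dhis2 > dump.sql
    git add dump.sql
    git commit -m "nightly dump $(date +%F)" || true   # no-op when nothing changed
    git push origin master

One caveat worth measuring: pg_dump does not guarantee a stable row order between runs, so if rows move around the day-to-day diffs (and hence the git deltas) can be larger than the real change.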
Hi Bob:

On Sat, Jul 5, 2014 at 3:34 PM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
> Though currently I am thinking of addressing the problem of generating
> deltas at the application level, rather than using PostgreSQL features
> which are largely optimized for a slightly different set of circumstances
> and requirements.

Yeah, special purpose can always be more efficient, especially when you can tune for your special circumstances, like non-full replication and replicating only once a day, and you can probably minimize the changesets knowing the data dependencies.

> Impressive what can be done with a 2400 baud modem when you set your mind
> to it. Fortunately those days are mostly behind us :-)

Not so far behind. Back then you did that kind of thing for 100Mb data sets; now you can use the same algorithms for 1Tb data sets. Line speeds have gone up more than data set sizes, but that advantage has been eaten by customer delay expectations (in my case, they now want at most a 30 minute delay, not 1440). Not knowing your exact data structure or size I cannot recommend anything, but I've been able to make really good deltas using the same algorithms that were used on tapes: dump the table, sort it, diff it against the previous day's. The throughput of current machines is amazing; you can easily achieve full disk speed.

Regards.
Francisco Olarte.
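A bare-bones version of that sorted-dump-and-diff scheme, for the sake of illustration; the table, path and host names are invented, and the receiving side would apply the delta with patch and reload the table:

    #!/bin/sh
    # daily delta for one table: dump it in a stable sorted order, diff
    # against yesterday's dump, and ship only the compressed delta
    cd /srv/deltas
    [ -f yesterday.tsv ] || touch yesterday.tsv
    psql -d dhis2 -c "COPY (SELECT * FROM datavalue ORDER BY 1) TO STDOUT" > today.tsv
    diff yesterday.tsv today.tsv | xz -9 > delta-$(date +%F).xz
    scp delta-$(date +%F).xz backup-host:/srv/deltas/incoming/
    mv today.tsv yesterday.tsv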