Thread: Streaming Replication: Observations, Questions and Comments

Streaming Replication: Observations, Questions and Comments

From
Samba
Date:
Hi all,

We have a PostgreSQL 9.0 streaming replication setup where we keep WAL segments on the master amounting to 10 GB, so that we can survive longer periods of disconnection between master and slave. We do not use any shared storage for archiving WAL logs (a shared disk server could turn out to be another point of failure, which we want to avoid).

Here are our basic configuration parameters:

postgresql.conf on master:
wal_keep_segments = 640         # previously 32; in logfile segments, min 1, 16MB each
archive_mode = on               # allows archiving to be done (change requires restart)
archive_command = 'cp -v %p /archives/data/pgsql/pg_xlog/%f'
 

postgresql.conf on slave:
wal_level = hot_standby
hot_standby = on

recovery.conf on slave:
standby_mode = 'on'
primary_conninfo = 'host=ip.add.ress port=5432 user=repman'
trigger_file = '/var/lib/pgsql/data/stop.replication'
restore_command = 'cp -i /archives/data/pgsql/pg_xlog/%f "%p"'

The master and slave servers are separated by thousands of miles, and the network link is just an ordinary 1 Mbps DSL line. Both servers have the /archives partition mounted and kept in sync between them with csync2. I'm not sure if this is the correct way of configuring streaming replication, but I will explain what worked for us and what we are still left wanting:

Under heavy inserts/updates/deletes on the master (load generated by stored procedures), we noticed that the slave fell far behind the master, which resulted in breakage of replication. Hence we changed from 32 logfile segments to 640, which corresponds to 10 GB, so that we can survive either very heavy spikes of load or even a week's disconnection of the slave (although alarms would be raised appropriately for the same effect).
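(For anyone checking our arithmetic, the 10 GB figure follows directly from the segment count and the 16 MB segment size:)

```shell
# 640 WAL segments at 16 MB each: total space retained in pg_xlog, in MB.
echo "$(( 640 * 16 )) MB retained"   # prints "10240 MB retained"
```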

One strange thing I noticed is that the pg_xlog data on the master has outgrown the actual data stored in the database by at least 3-4 times, which was quite surprising. I'm not sure if 'restore_command' has anything to do with it. I do not understand why transaction logs would need to be so many times larger than the actual size of the database; have I done something wrong somewhere?

Another interesting fact we noticed is that once replication had been broken for some longer time (the walsender and walreceiver processes have died by this point), we had to restart not only the slave server but also the master server, which was quite strange. Shouldn't the master start a walsender process (if none is running) the moment it receives a request for streaming? Similarly, why should the slave be restarted just to start replication again? Why can't these two processes be independently started and stopped by the postmaster as and when there is a need to replicate?

Another thing I noticed was that the slave server logged that it was out of sync and hence closing replication, but the master did not say anything about this breakage of replication.

So summing up the above, I would like to have some pointers to understand the following, which I think will benefit many others as well:
  • Do I need to 'archive' at all, given that I'm storing enough logfile segments to sustain a disconnect of almost a week?
  • Why did the xlog data accumulate to 3-4 times the actual growth of the database during the same period?
  • Why should we restart the postgres service on the slave and/or master to rejoin a slave to replication after a long disconnect?
  • Why does the master not complain about the loss of replication? (The slave, of course, did complain about the disconnect and its inability to continue accepting data from the master on account of the difference in xlog location.)
Some of the above might be due to wrong configuration, while some may give hints for future enhancements. I hope this will start a healthy discussion on the areas where streaming replication needs to be improved and strengthened.

Thanks and Regards,
Samba

Re: Streaming Replication: Observations, Questions and Comments

From
Greg Smith
Date:
On 08/24/2011 11:33 AM, Samba wrote:
One strange thing I noticed is that the pg_xlogs on the master have outsized the actual data stored in the database by at least 3-4 times, which was quite surprising. I'm not sure if 'restore_command' has anything to do with it. I did not understand why transaction logs would need to be so many times larger than the actual size of the database, have I done something wrong somewhere?

That's common to see.  Systems that regularly UPDATE the same rows often can easily end up with a WAL stream much larger than the database.  The WAL data contains enough information to replay every point in time from the base backup until the current time.  That can be significantly larger than the database, which just holds the latest copy of the data.

One of the biggest things that makes your WAL large is the full page writes that protect against incomplete writes.  See the "question regarding full_page_writes" thread happening on this list recently for details.  The first time you touch a page after each checkpoint, another full copy of that page is written out.

What I have to do in a lot of cases is significantly decrease the number of checkpoints in order to keep this overhead under control.  The default config has a checkpoint every checkpoint_segments of work, and every checkpoint_timeout of time.  That makes for a checkpoint every 5 minutes, and even more often under heavy load.

If you increase checkpoint_segments a whole lot, all of your checkpoints will be based on the timeout instead.  Then you can see how WAL load decreases as you increase checkpoint_timeout.  I've had to set checkpoint_timeout as high as 30 minutes before on busy systems, to lower the WAL overhead.
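Something along these lines in postgresql.conf, for example; the exact values are illustrative and very much workload-dependent:

```
checkpoint_segments = 64            # default 3; push segment-driven checkpoints out
checkpoint_timeout = 30min          # default 5min; checkpoints become mostly time-driven
checkpoint_completion_target = 0.9  # spread checkpoint I/O across the interval
```

With settings in that ballpark, each data page is written out in full to WAL at most once per 30 minutes rather than once per 5, at the cost of a longer crash recovery time.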

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: Streaming Replication: Observations, Questions and Comments

From
Alan Hodgson
Date:
On August 24, 2011 08:33:17 AM Samba wrote:
> One strange thing I noticed is that the pg_xlogs on the master have
> outsized the actual data stored in the database by at least 3-4 times,
> which was quite surprising. I'm not sure if 'restore_command' has anything
> to do with it. I did not understand why transaction logs would need to be
> so many times larger than the actual size of the database, have I done
> something wrong somewhere?

If you archive them instead of keeping them in pg_xlog, you can gzip them.
They compress reasonably well.
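For concreteness, a compressing archive/restore pair might look like the following; the paths are illustrative, and %p/%f are expanded by the server itself:

```shell
# In postgresql.conf on the master:
#   archive_command = 'gzip -c %p > /archives/data/pgsql/pg_xlog/%f.gz'
# In recovery.conf on the slave:
#   restore_command = 'gunzip -c /archives/data/pgsql/pg_xlog/%f.gz > "%p"'
#
# Round-trip demonstration with a stand-in segment file:
dir=$(mktemp -d)
printf 'fake WAL contents' > "$dir/000000010000000000000001"
gzip -c "$dir/000000010000000000000001" > "$dir/000000010000000000000001.gz"
gunzip -c "$dir/000000010000000000000001.gz" > "$dir/restored"
cmp -s "$dir/000000010000000000000001" "$dir/restored" && echo "round-trip OK"
```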

Re: Streaming Replication: Observations, Questions and Comments

From
Samba
Date:
The problem with maintaining a separate archive is that one needs to write additional scripts to periodically remove older log files from the archive, and that gets complicated in a setup with one master and multiple slaves.
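(For anyone who does go the external-script route anyway, the cleanup can be as small as a nightly cron entry; the path and the 7-day retention window here are assumptions, and deleting a segment a slave still needs will of course break its recovery:)

```shell
# Remove archived WAL segments older than 7 days (illustrative retention window).
find /archives/data/pgsql/pg_xlog -type f -mtime +7 -delete
```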

I think it would be a better idea to build compression and cleanup into the core itself, maybe in a later release. A better approach to cleanup would be for the walsender process to decide when to clean up a particular logfile based on feedback from all the registered slaves. If a slave is not reachable or falls behind for too long, then that slave should be banned from the setup (and the event logged in pg_replication.log???). The replication status for each slave could be maintained in something like a pg_slave_replica_status catalog table.

When it comes to compression, walsender could compress each chunk of data that it streams (increasing the streaming delay may improve the compression ratio, so a balance has to be struck between compression and sustainable-data-loss-in-case-of-failure).

Although I can visualise that this design would be much better than leaving it to external utilities, I'm not that good at the C language and hence am only proposing a design, not a patch. I hope my suggestion will be received in good spirit.

Thanks and Regards,
Samba

PS:
I wrongly stated that the master server had to be restarted in case of long disconnects; sorry, that was not true. But I still feel that requiring a restart of the standby server to resume replication should be avoided, if possible.

And I strongly feel that a breakage in replication must be logged by both the master server and the concerned slave servers.

On Wed, Aug 24, 2011 at 11:03 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
On August 24, 2011 08:33:17 AM Samba wrote:
> One strange thing I noticed is that the pg_xlogs on the master have
> outsized the actual data stored in the database by at least 3-4 times,
> which was quite surprising. I'm not sure if 'restore_command' has anything
> to do with it. I did not understand why transaction logs would need to be
> so many times larger than the actual size of the database, have I done
> something wrong somewhere?

If you archive them instead of keeping them in pg_xlog, you can gzip them.
They compress reasonably well.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general