Thread: Synchronous Log Shipping Replication
Hi,

At PGCon 2008, I proposed synchronous log shipping replication. Sorry for the late posting, but I'd like to start the discussion about its implementation now. http://www.pgcon.org/2008/schedule/track/Horizontal%20Scaling/76.en.html

First of all, I'm not planning to put the prototype which I demoed at PGCon into core directly, because of:
- portability issues (it uses message queues and multiple threads)
- too much dependency on Heartbeat

Since the prototype is a useful reference for the implementation, I plan to open it ASAP. But, I'm sorry - it will still take a month to open it.

Pavan re-designed the sync replication based on the prototype and I posted that design doc on the wiki. Please check it if you are interested in it. http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

This design is too huge to implement in one piece. In order to enhance the extensibility of postgres, I'd like to divide the sync replication into minimum hooks and some plugins, and to develop them separately. Plugins for the sync replication are planned to be available at the time of the 8.4 release.

In my design, WAL sending is achieved as follows by WALSender, a new process which I introduce:

1) On COMMIT, the backend requests WALSender to send WAL.
2) WALSender reads WAL from the WAL buffers and sends it to the slave.
3) WALSender waits for the response from the slave and replies to the backend.

I propose two hooks for WAL sending.

WAL-writing hook
----------------
This hook is for the backend to communicate with WALSender. The WAL-writing hook intercepts the write system call in XLogWrite; that is, the backend requests WAL sending whenever write is called.

The WAL-writing hook is also available for other uses, e.g. software RAID (writing WAL into two files for durability).

Hook for WALSender
------------------
This hook is for introducing WALSender. There are the following three ideas for how to introduce WALSender. The required hook differs by which idea is adopted.

a) Use WALWriter as WALSender

This idea needs a WALWriter hook which intercepts WALWriter literally. WALWriter stops the local WAL write and focuses on WAL sending. This idea is very simple, but I don't see a use for the WALWriter hook other than WAL sending.

b) Use a new background process as WALSender

This idea needs a background-process hook which enables users to define new background processes. I think the design of this hook resembles that of the rmgr hook proposed by Simon. I define a table like RmgrTable for registering some functions (e.g. a main function, an exit function, ...) for operating a background process. The postmaster calls the functions from the table as appropriate, and manages the start and end of the background process. ISTM that there are many uses for this hook, e.g. a performance monitoring process like statspack.

c) Use one backend as WALSender

In this idea, the slave calls a user-defined function which takes charge of WAL sending via SQL, e.g. "SELECT pg_walsender()". Compared with the other ideas, it's easy to implement WALSender because the postmaster handles the establishment and authentication of the connection. But this SQL causes a long transaction which prevents vacuum, so this idea needs an idle-state hook which executes the plugin before a transaction starts. I don't see a use for this hook other than WAL sending either.

Which idea should we adopt? Comments welcome.

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
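For illustration, here is a rough sketch of what the proposed WAL-writing hook could look like around the write call in XLogWrite. The hook type and all the names below are assumptions made up for this sketch, not an existing PostgreSQL API, and the fragment assumes it lives inside the backend (xlog.c):

/* Hypothetical hook type; all names here are illustrative only. */
typedef void (*WALWriteHook_type) (const char *data, Size nbytes,
                                   XLogRecPtr startptr);

WALWriteHook_type WALWriteHook = NULL;   /* a plugin sets this in _PG_init() */

/* Sketch of the place in XLogWrite() where a run of WAL buffer pages
 * is written out to the current WAL segment file. */
static void
write_wal_range(int fd, const char *from, Size nbytes, XLogRecPtr startptr)
{
    if (write(fd, from, nbytes) != (ssize_t) nbytes)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not write to WAL file")));

    /*
     * Let the plugin see the same byte range, e.g. to hand it to a
     * WALSender process, or to write a second copy of the WAL for the
     * software-RAID-like use mentioned above.
     */
    if (WALWriteHook != NULL)
        WALWriteHook(from, nbytes, startptr);
}

A plugin could either ship the bytes synchronously from the hook or merely record the new write position and wake a separate sender process.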
Hi, Fujii Masao wrote: > Pavan re-designed the sync replication based on the prototype > and I posted that design doc on wiki. Please check it if you > are interested in it. > http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

I've read that wiki page and allow myself to comment from a Postgres-R developer's perspective ;-)

R1: "without ... any negative performance overhead"? For fully synchronous replication, that's clearly not possible. I guess that applies only to async WAL shipping.

NR3: who is supposed to do failure detection and manage automatic failover? How does integration with such an additional tool work?

I got distracted by the SBY and ACT abbreviations. Why abbreviate standby or active at all? It's not like we don't already have enough three-letter acronyms, but those stand for rather more complex terms than single words.

Standby Bootstrap: "stopping the archiving at the ACT" doesn't prevent overwriting WAL files in pg_xlog. It just stops archiving a WAL file before it gets overwritten - which clearly doesn't solve the problem here.

How is communication done? "Serialization of WAL shipping" had better not mean serialization on the network, i.e. the WAL Sender Process should be able to await acknowledgment of multiple WAL packets in parallel, otherwise the interconnect latency might turn into a bottleneck.

How is communication done? What happens if the link between the active and standby goes down? Or if it's temporarily unavailable for some time?

The IPC mechanism reminds me a lot of what I did for Postgres-R, which also has a central "replication manager" process which receives changesets from multiple backends. I've implemented an internal messaging mechanism based on shared memory and signals, using only Postgres methods. It allows arbitrary processes to send messages to each other by process id.

Moving the WAL Sender and WAL Receiver processes under the control of the postmaster certainly sounds like a good thing. After all, those are fiddling with Postgres internals.

> This design is too huge. In order to enhance the extensibility > of postgres, I'd like to divide the sync replication into > minimum hooks and some plugins and to develop it, respectively. > Plugins for the sync replication plan to be available at the > time of 8.4 release.

Hooks again? I bet you all know by now that my excitement for hooks has always been pretty narrow. ;-)

> In my design, WAL sending is achieved as follow by WALSender. > WALSender is a new process which I introduce. > > 1) On COMMIT, backend requests WALSender to send WAL. > 2) WALSender reads WAL from walbuffers and send it to slave. > 3) WALSender waits for the response from slave and replies > backend. > > I propose two hooks for WAL sending. > > WAL-writing hook > ---------------- > This hook is for backend to communicate with WALSender. > WAL-writing hook intercepts write system call in XLogWrite. > That is, backend requests WAL sending whenever write is called. > > WAL-writing hook is available also for other uses e.g. > Software RAID (writes WAL into two files for durability). > > Hook for WALSender > ------------------ > This hook is for introducing WALSender. There are the following > three ideas of how to introduce WALSender. A required hook > differs by which idea is adopted. > > a) Use WALWriter as WALSender > > This idea needs WALWriter hook which intercepts WALWriter > literally. WALWriter stops the local WAL write and focuses on > WAL sending. This idea is very simple, but I don't think of > the use of WALWriter hook other than WAL sending. > > b) Use new background process as WALSender > > This idea needs background-process hook which enables users > to define new background processes. I think the design of this > hook resembles that of rmgr hook proposed by Simon. I define > the table like RmgrTable. It's for registering some functions > (e.g. main function and exit...) for operating a background > process. Postmaster calls the function from the table suitably, > and manages a start and end of background process. ISTM that > there are many uses in this hook, e.g. performance monitoring > process like statspack. > > c) Use one backend as WALSender > > In this idea, slave calls the user-defined function which > takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()". > Compared with other ideas, it's easy to implement WALSender > because the postmaster handles the establishment and authentication > of connection. But, this SQL causes a long transaction which > prevents vacuum. So, this idea needs idle-state hook which > executes plugin before transaction starts. I don't think of > the use of this hook other than WAL sending either.

The above cited wiki page sounds like you've already decided for b).

I'm unclear on what you want hooks for. If additional processes get integrated into Postgres, those certainly need to get integrated very much like we integrated other auxiliary processes. I wouldn't call that 'hooking', but YMMV.

Regards Markus Wanner
On Fri, 2008-09-05 at 23:21 +0900, Fujii Masao wrote: > Pavan re-designed the sync replication based on the prototype > and I posted that design doc on wiki. Please check it if you > are interested in it. > http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects

It's good to see the detailed design, many thanks. I will begin looking at technical details next week.

> This design is too huge. In order to enhance the extensibility > of postgres, I'd like to divide the sync replication into > minimum hooks and some plugins and to develop it, respectively. > Plugins for the sync replication plan to be available at the > time of 8.4 release.

What is Core's commentary on this plan?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Markus Wanner wrote: > > Hook for WALSender > > ------------------ > > This hook is for introducing WALSender. There are the following > > three ideas of how to introduce WALSender. A required hook > > differs by which idea is adopted. > > > > a) Use WALWriter as WALSender > > > > This idea needs WALWriter hook which intercepts WALWriter > > literally. WALWriter stops the local WAL write and focuses on > > WAL sending. This idea is very simple, but I don't think of > > the use of WALWriter hook other than WAL sending.

The problem with this approach is that you are not writing WAL to the disk _while_ you are, in parallel, sending WAL to the slave; I think this is useful for performance reasons in synchronous replication.

> > b) Use new background process as WALSender > > > > This idea needs background-process hook which enables users > > to define new background processes. I think the design of this > > hook resembles that of rmgr hook proposed by Simon. I define > > the table like RmgrTable. It's for registering some functions > > (e.g. main function and exit...) for operating a background > > process. Postmaster calls the function from the table suitably, > > and manages a start and end of background process. ISTM that > > there are many uses in this hook, e.g. performance monitoring > > process like statspack.

I think starting/stopping a process for each WAL send is too much overhead.

> > c) Use one backend as WALSender > > > > In this idea, slave calls the user-defined function which > > takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()". > > Compared with other ideas, it's easy to implement WALSender > > because the postmaster handles the establishment and authentication > > of connection. But, this SQL causes a long transaction which > > prevents vacuum. So, this idea needs idle-state hook which > > executes plugin before transaction starts. I don't think of > > the use of this hook other than WAL sending either. > > The above cited wiki page sounds like you've already decided for b).

I assumed that there would be a background process like bgwriter that would be notified during a commit and send the appropriate WAL files to the slave.

> I'm unclear on what you want hooks for. If additional processes get > integrated into Postgres, those certainly need to get integrated very > much like we integrated other auxiliary processes. I wouldn't call that > 'hooking', but YMMV.

Yea, I am unclear how this is going to work using simple hooks.

It sounds like Fujii-san is basically saying they can only get the hooks done for 8.4, not the actual solution. But, as I said above, I am unclear how a hook solution would even work long-term; I am afraid it would be thrown away once an integrated solution was developed.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sat, 2008-09-06 at 22:09 -0400, Bruce Momjian wrote: > Markus Wanner wrote: > > > Hook for WALSender > > > ------------------ > > > This hook is for introducing WALSender. There are the following > > > three ideas of how to introduce WALSender. A required hook > > > differs by which idea is adopted. > > > > > > a) Use WALWriter as WALSender > > > > > > This idea needs WALWriter hook which intercepts WALWriter > > > literally. WALWriter stops the local WAL write and focuses on > > > WAL sending. This idea is very simple, but I don't think of > > > the use of WALWriter hook other than WAL sending. > > The problem with this approach is that you are not writing WAL to the > disk _while_ you are, in parallel, sending WAL to the slave; I think > this is useful for performance reasons in synchronous replication.

Agreed

> > > b) Use new background process as WALSender > > > > > > This idea needs background-process hook which enables users > > > to define new background processes. I think the design of this > > > hook resembles that of rmgr hook proposed by Simon. I define > > > the table like RmgrTable. It's for registering some functions > > > (e.g. main function and exit...) for operating a background > > > process. Postmaster calls the function from the table suitably, > > > and manages a start and end of background process. ISTM that > > > there are many uses in this hook, e.g. performance monitoring > > > process like statspack. > > I think starting/stopping a process for each WAL send is too much > overhead.

I would agree with that, but I don't think that was being suggested, was it? See later.

> > > c) Use one backend as WALSender > > > > > > In this idea, slave calls the user-defined function which > > > takes charge of WAL sending via SQL e.g. "SELECT pg_walsender()". > > > Compared with other ideas, it's easy to implement WALSender > > > because the postmaster handles the establishment and authentication > > > of connection. But, this SQL causes a long transaction which > > > prevents vacuum. So, this idea needs idle-state hook which > > > executes plugin before transaction starts. I don't think of > > > the use of this hook other than WAL sending either. > > > > The above cited wiki page sounds like you've already decided for b). > > I assumed that there would be a background process like bgwriter that > would be notified during a commit and send the appropriate WAL files to > the slave.

ISTM that this last paragraph is actually what was meant by option b). I think it would work the other way around though: the WALSender would send continuously, and backends may choose to wait for it to reach a certain LSN, or not. WALWriter really should work this way too.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-05 at 23:21 +0900, Fujii Masao wrote: > b) Use new background process as WALSender > > This idea needs background-process hook which enables users > to define new background processes. I think the design of this > hook resembles that of rmgr hook proposed by Simon. I define > the table like RmgrTable. It's for registering some functions > (e.g. main function and exit...) for operating a background > process. Postmaster calls the function from the table suitably, > and manages a start and end of background process. ISTM that > there are many uses in this hook, e.g. performance monitoring > process like statspack.

Sorry, but the comparison with the rmgr hook is mistaken. The rmgr hook exists only within the Startup process and I go to some lengths to ensure it is never called in normal backends. So it has got absolutely nothing to do with generating WAL messages (existing/new/modified) or sending them since it doesn't even exist during normal processing.

The intention of the rmgr hook is to allow WAL messages to be manipulated in new ways in recovery mode. It isn't a sufficient change to implement replication, and the functionality is orthogonal to streaming WAL replication.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Sat, 2008-09-06 at 22:09 -0400, Bruce Momjian wrote: > > I'm unclear on what you want hooks for. If additional processes get > > integrated into Postgres, those certainly need to get integrated very > > much like we integrated other auxiliary processes. I wouldn't call that > > 'hooking', but YMMV. > > Yea, I am unclear how this is going to work using simple hooks. > > It sounds like Fujii-san is basically saying they can only get the hooks > done for 8.4, not the actual solution. But, as I said above, I am > unclear how a hook solution would even work long-term; I am afraid it > would be thrown away once an integrated solution was developed.

It will be interesting to have various hooks in streaming WAL code to implement various additional features for enterprise integration. But that doesn't mean I support hooks in every/all places.

For me, the proposed hook amounts to "we've only got time to implement 2/3 of the required features, so we'd like to circumvent the release cycle by putting in a hook and providing the code later". For me, hooks are for adding additional features, not for making up for the lack of completed code. It's kinda hard to say "we now have WAL streaming" without the streaming bit.

We need either a fully working WAL streaming feature, or we wait until next release. We probably need to ask if there is anybody willing to complete the middle part of this feature so we can get it into 8.4. It would be sensible to share the code we have now, so we can see what remains to be implemented. I just committed to delivering Hot Standby for 8.4, so I can't now get involved to deliver this code.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Bruce Momjian <bruce@momjian.us> wrote: > > > b) Use new background process as WALSender > > > > > > This idea needs background-process hook which enables users > > > to define new background processes > I think starting/stopping a process for each WAL send is too much > overhead.

Yes, of course slow. But I guess it is the only way to share one socket in all backends. Postgres is not a multi-threaded architecture, so each backend should use dedicated connections to send WAL buffers. 300 backends require 300 connections for each slave... it's not good at all.

> It sounds like Fujii-san is basically saying they can only get the hooks > done for 8.4, not the actual solution.

No! He has an actual solution in his prototype ;-) It is very similar to b) and the overhead was not so bad. It's not clean enough to be part of postgres, though.

Are there any better ideas to share one socket connection between backends (and bgwriter)? The connections could be established after fork() from postmaster, and the number of them could be two or more. This is one of the most complicated parts of synchronous log shipping. A process-switching approach like b) is just one idea for it.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hi, ITAGAKI Takahiro wrote: > Are there any better ideas to share one socket connection between > backends (and bgwriter)? The connections could be established after > fork() from postmaster, and the number of them could be two or more. > This is one of the most complicated parts of synchronous log shipping. > A process-switching approach like b) is just one idea for it.

I fear I'm repeating myself, but I've had the same problem for Postgres-R and solved it with an internal message passing infrastructure which I've simply called imessages. It requires only standard Postgres shared memory, signals and locking and should thus be pretty portable.

In simple benchmarks, it's not quite as efficient as unix pipes, but it doesn't require as many file descriptors, is independent of the parent-child relations of processes, maintains message borders, and it is more portable (I hope). It could certainly be improved WRT efficiency and could theoretically even beat Unix pipes, because it involves less copying of data and fewer syscalls.

It has not been reviewed or commented on much. I'd still appreciate that.

Regards Markus Wanner
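For readers who haven't seen the imessages patch, its interface is roughly of the following shape; the exact names and signatures in the actual patch may differ, so treat this purely as an illustrative sketch of the described design (shared memory messages addressed by process id, delivery via signal):

/* A message lives in shared memory; the payload follows the header. */
typedef struct IMessage
{
    pid_t       sender;         /* backend that created the message */
    int         type;           /* application-defined message type */
    Size        size;           /* payload size in bytes */
} IMessage;

/* Allocate a message (plus payload space) in shared memory. */
extern IMessage *IMessageCreate(int type, Size payload_size);

/* Queue the message for the recipient process and signal it. */
extern void IMessageActivate(IMessage *msg, pid_t recipient);

/* Called by the recipient, typically after being signalled, to fetch
 * the next message addressed to it; returns NULL if there is none. */
extern IMessage *IMessageCheck(void);

With something like this, a backend that has just written its commit record could create a "ship up to this LSN" request, activate it towards the WAL sender's pid, and sleep until the sender answers with a confirmation message.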
Markus Wanner <markus@bluegap.ch> wrote: > ITAGAKI Takahiro wrote: > > Are there any better ideas to share one socket connection between > > backends (and bgwriter)? > > I fear I'm repeating myself, but I've had the same problem for > Postgres-R and solved it with an internal message passing infrastructure > which I've simply called imessages. It requires only standard Postgres > shared memory, signals and locking and should thus be pretty portable.

Imessages serve as a useful reference, but they are only one of the details of the issue. I can break down the issue into three parts:

1. Is the process-switching approach the best way to share one socket? Both Postgres-R and the log-shipping prototype use the approach now. Can I assume there is no objection here?

2. If 1 is reasonable, how should we add a new WAL sender process? Just add a new process using a core patch? Merge it into the WAL writer? Or consider a framework for adding any user-defined auxiliary process?

3. If 1 is reasonable, what should we use for the process-switching primitive? Postgres-R uses signals and locking, and the log-shipping prototype uses multi-threads and POSIX message queues now.

Signals and locking are a possible choice for 3, but I want to use a better approach if there is one. Faster is always better. I guess we could invent a new semaphore-like primitive at the same layer as LWLocks using spinlock and PGPROC directly...

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hi, ITAGAKI Takahiro wrote: > 1. Is the process-switching approach the best way to share one socket? > Both Postgres-R and the log-shipping prototype use the approach now. > Can I assume there is no objection here?

I don't see any appealing alternative. The postmaster certainly shouldn't need to worry about any such socket for replication. Threading falls pretty flat for Postgres. So the socket must be held by one of the child processes of the Postmaster.

> 2. If 1 is reasonable, how should we add a new WAL sender process? > Just add a new process using a core patch?

Seems feasible to me, yes.

> Merge it into the WAL writer?

Uh.. that would mean you'd lose parallelism between WAL writing to disk and WAL shipping via network. That does not sound appealing to me.

> Or consider a framework for adding any user-defined auxiliary process?

What for? What do you miss in the existing framework?

> 3. If 1 is reasonable, what should we use for the process-switching > primitive? > Postgres-R uses signals and locking, and the log-shipping prototype > uses multi-threads and POSIX message queues now.

AFAIK message queues are problematic WRT portability. At least Postgres doesn't currently use them, and introducing dependencies on those might lead to problems, but I'm not sure. Others certainly know more about the issues involved. A multi-threaded approach is certainly out of bounds, at least within the Postgres core code.

> Signals and locking are a possible choice for 3, but I want to use a better > approach if there is one. Faster is always better.

I think the approach can reach better throughput than POSIX message queues or unix pipes, because of the mentioned savings in copying around between system and application memory. However, that hasn't been proven yet.

> I guess we could invent a new semaphore-like primitive at the same layer > as LWLocks using spinlock and PGPROC directly...

Sure, but in what way would that differ from what I do with imessages?

Regards Markus Wanner
On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote: >> Merge into WAL writer? > > Uh.. that would mean you'd lose parallelism between WAL writing to disk and > WAL shipping via network. That does not sound appealing to me.

That depends on the order of WAL writing and WAL shipping. How about the following order?

1. A backend writes WAL to disk.
2. The backend wakes up WAL sender process and sleeps.
3. WAL sender process does WAL shipping and wakes up the backend.
4. The backend issues sync command.

>> I guess we could invent a new semaphore-like primitive at the same layer >> as LWLocks using spinlock and PGPROC directly... > > Sure, but in what way would that differ from what I do with imessages?

Performance ;) The timing of a process receiving a signal depends on the kernel's scheduler. The scheduler does not always handle a signal immediately.

Regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
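Spelled out as code, the ordering above would make the backend-side commit path look roughly like the sketch below. The signalling functions are hypothetical placeholders invented for this illustration, not existing routines:

/* Sketch of a backend committing under the write -> ship -> fsync order. */
static void
CommitAndReplicate(XLogRecPtr commitRecEnd)
{
    /* 1. Write (but do not yet fsync) WAL up to the commit record. */
    XLogWriteUpTo(commitRecEnd);        /* placeholder for the write step */

    /* 2. Wake the WAL sender and sleep until it has shipped the WAL. */
    WALSenderWakeup(commitRecEnd);      /* hypothetical */
    WaitForWALSenderAck(commitRecEnd);  /* hypothetical: step 3 happens
                                         * inside the WAL sender meanwhile */

    /* 4. Only now make the local write durable. */
    issue_xlog_fsync();                 /* existing routine inside xlog.c */
}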
Hi, Fujii Masao wrote: > 1. A backend writes WAL to disk. > 2. The backend wakes up WAL sender process and sleeps. > 3. WAL sender process does WAL shipping and wakes up the backend. > 4. The backend issues sync command. Right, that would work. But still, the WAL writer process would block during writing WAL blocks. Are there compelling reasons for using the existing WAL writer process, as opposed to introducing a new process? > The timing of the process's receiving a signal is dependent on the scheduler > of kernel. Sure, so are pipes or shmem queues. > The scheduler does not always handle a signal immediately. What exactly are you proposing to use instead of signals? Semaphores are pretty inconvenient when trying to wake up arbitrary processes or in conjunction with listening on sockets via select(), for example. See src/backend/replication/manager.c from Postgres-R for a working implementation of such a process using select() and signaling. Regards Markus Wanner
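As an aside, the usual way to let one process wait on both a network socket and signal wakeups from other backends is select() plus the classic self-pipe trick. The sketch below is a generic, self-contained illustration of that pattern; it is not claimed to be what Postgres-R's manager.c actually does:

/* Wait on a socket and on signal wakeups at the same time. */
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

static int selfpipe[2];

static void
wakeup_handler(int signo)
{
    char        c = 0;

    (void) write(selfpipe[1], &c, 1);   /* async-signal-safe */
}

void
manager_loop(int sock)
{
    pipe(selfpipe);
    signal(SIGUSR1, wakeup_handler);

    for (;;)
    {
        fd_set      readfds;
        int         maxfd;
        char        buf[64];

        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);
        FD_SET(selfpipe[0], &readfds);
        maxfd = (sock > selfpipe[0] ? sock : selfpipe[0]) + 1;

        if (select(maxfd, &readfds, NULL, NULL, NULL) < 0)
        {
            if (errno == EINTR)
                continue;               /* a signal interrupted us; retry */
            break;
        }
        if (FD_ISSET(selfpipe[0], &readfds))
        {
            (void) read(selfpipe[0], buf, sizeof(buf));
            /* a backend poked us: check shared memory / message queue */
        }
        if (FD_ISSET(sock, &readfds))
        {
            /* data (e.g. an acknowledgment from the peer) is available */
        }
    }
}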
On Mon, 2008-09-08 at 19:19 +0900, ITAGAKI Takahiro wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > > > > b) Use new background process as WALSender > > > > > > > > This idea needs background-process hook which enables users > > > > to define new background processes > > > I think starting/stopping a process for each WAL send is too much > > overhead. > > Yes, of course slow. But I guess it is the only way to share one socket > in all backends. Postgres is not a multi-threaded architecture, > so each backend should use dedicated connections to send WAL buffers. > 300 backends require 300 connections for each slave... it's not good at all.

So... don't have individual backends do the sending. Have them wait while somebody else does it for them.

> > It sounds like Fujii-san is basically saying they can only get the hooks > > done for 8.4, not the actual solution. > > No! He has an actual solution in his prototype ;-)

The usual thing if you have a WIP patch you're not sure of is to post the patch for feedback. If you guys aren't going to post any code to the project then I'm not clear why it's being discussed here. Is this a community project or a private project?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Fujii Masao wrote: > On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote: > >> Merge into WAL writer? > > > > Uh.. that would mean you'd lose parallelism between WAL writing to disk and > > WAL shipping via network. That does not sound appealing to me. > > That depends on the order of WAL writing and WAL shipping. > How about the following order? > > 1. A backend writes WAL to disk. > 2. The backend wakes up WAL sender process and sleeps. > 3. WAL sender process does WAL shipping and wakes up the backend. > 4. The backend issues sync command.

I am confused why this is considered so complicated. Having individual backends doing the wal transfer to the slave is never going to work well.

I figured we would have a single WAL streamer that continues advancing forward in the WAL file, streaming to the standby. Backends would update a shared memory variable specifying how far they want the wal streamer to advance and send a signal to the wal streamer if necessary. Backends would monitor another shared memory variable that specifies how far the wal streamer has advanced.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
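As a sketch, the shared state Bruce describes could be as small as the following. The struct, field and function names are invented for illustration; XLogRecPtr, the spinlock primitives and pg_usleep are the existing PostgreSQL ones, and a real patch would sleep on a semaphore rather than poll:

/* Shared between all backends and the single WAL streamer process. */
typedef struct WalStreamerShmem
{
    slock_t     mutex;          /* protects the fields below */
    XLogRecPtr  streamRqst;     /* furthest LSN any backend wants streamed */
    XLogRecPtr  streamResult;   /* furthest LSN the streamer has sent */
    pid_t       streamerPid;    /* whom the backends should signal */
} WalStreamerShmem;

/* Backend side: ask for WAL up to 'upto' to be streamed, then wait. */
static void
WaitForWALStream(volatile WalStreamerShmem *ws, XLogRecPtr upto)
{
    bool        done = false;

    SpinLockAcquire(&ws->mutex);
    if (XLByteLT(ws->streamRqst, upto))
        ws->streamRqst = upto;          /* advance the request pointer */
    SpinLockRelease(&ws->mutex);

    kill(ws->streamerPid, SIGUSR1);     /* nudge the streamer if sleeping */

    while (!done)
    {
        SpinLockAcquire(&ws->mutex);
        done = !XLByteLT(ws->streamResult, upto);
        SpinLockRelease(&ws->mutex);
        if (!done)
            pg_usleep(1000L);           /* naive poll, for illustration only */
    }
}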
Hi, Bruce Momjian wrote: > Backends would > update a shared memory variable specifying how far they want the wal > streamer to advance and send a signal to the wal streamer if necessary. > Backends would monitor another shared memory variable that specifies how > far the wal streamer has advanced. That sounds like WAL needs to be written to disk, before it can be sent to the standby. Except maybe with some sort of mmap'ing the WAL. Regards Markus Wanner
Markus Wanner wrote: > Hi, > > Bruce Momjian wrote: > > Backends would > > update a shared memory variable specifying how far they want the wal > > streamer to advance and send a signal to the wal streamer if necessary. > > Backends would monitor another shared memory variable that specifies how > > far the wal streamer has advanced. > > That sounds like WAL needs to be written to disk, before it can be sent > to the standby. Except maybe with some sort of mmap'ing the WAL. Well, WAL is either on disk or in the wal_buffers in shared memory --- in either case, a WAL streamer can get to it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Hi,

I looked at the comments on the synchronous replication and understood that the consensus of the community is that the sync replication should be added using not hooks and plugins but core patches. If my understanding is right, I will change my development plan so that the sync replication can be put into core.

But I don't think every feature should be put into core. Of course, the high-availability features (like clustering, automatic failover, ...etc) are outside of postgres. The user who wants a whole HA solution using the sync replication must integrate postgres with clustering software like heartbeat.

WAL sending should be put into core. But I'd like to separate WAL receiving from core and provide it as a new contrib tool, because there are some users who use the sync replication only as WAL streaming. They don't want to start postgres on the slave. Of course, the slave can replay WAL by using pg_standby and the WAL receiver tool which I'd like to provide as a new contrib tool. I think a patch against the recovery code is not necessary.

I arrange the development items below:

1) Patch around XLogWrite. It enables a backend to wake up the WAL sender process at the timing of COMMIT.

2) Patch for the communication between a backend and the WAL sender process. There were some discussions about this topic. Now, I have decided to adopt imessages proposed by Markus.

3) Patch introducing the new background process which I've called WALSender. It takes charge of sending WAL to the slave.

Now, I assume that WALSender also listens for the connection from the slave, i.e. only one sender process manages multiple slaves. The relation between WALSender and a backend is 1:1, so the communication mechanism between them can be simple. As another idea, I could introduce a new listener process and fork a new WALSender for every slave. Which architecture is better? Or, should the postmaster also listen for the connection from the slave?

4) New contrib tool which I've called WALReceiver. It takes charge of receiving WAL from the master and writing it to disk on the slave.

I will submit these patches and the tool by the November commit fest at the latest.

Any comment welcome!

best regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
"Fujii Masao" <masao.fujii@gmail.com> wrote: > 3) Patch of introducing new background process which I've called > WALSender. It takes charge of sending WAL to the slave. > > Now, I assume that WALSender also listens the connection from > the slave, i.e. only one sender process manages multiple slaves. > The relation between WALSender and backend is 1:1. So, > the communication mechanism between them can be simple. I assume that he says only one backend communicates with WAL sender at a time. The communication is done during WALWriteLock is held, so other backends wait for the communicating backend on WALWriteLock. WAL sender only needs to send one signal for each time it sends WAL buffers to slave. We could be split the LWLock to WALWriterLock and WALSenderLock, but the essential point is same. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Mon, 2008-09-08 at 17:40 -0400, Bruce Momjian wrote: > Fujii Masao wrote: > > On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <markus@bluegap.ch> wrote: > > >> Merge into WAL writer? > > > > > > Uh.. that would mean you'd lose parallelism between WAL writing to disk and > > > WAL shipping via network. That does not sound appealing to me. > > > > That depends on the order of WAL writing and WAL shipping. > > How about the following order? > > > > 1. A backend writes WAL to disk. > > 2. The backend wakes up WAL sender process and sleeps. > > 3. WAL sender process does WAL shipping and wakes up the backend. > > 4. The backend issues sync command. > > I am confused why this is considered so complicated. Having individual > backends doing the wal transfer to the slave is never going to work > well.

Agreed.

> I figured we would have a single WAL streamer that continues advancing > forward in the WAL file, streaming to the standby. Backends would > update a shared memory variable specifying how far they want the wal > streamer to advance and send a signal to the wal streamer if necessary. > Backends would monitor another shared memory variable that specifies how > far the wal streamer has advanced.

Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for the send operation. The Write and Send operations can then continue independently of one another. XLogInsert() cannot advance to a new page while we are waiting to send or write. Notice that the Send process might be the bottleneck - that is the price of synchronous replication.

Backends then wait
* not at all for asynch commit
* just for Write for local synch commit
* for both Write and Send for remote synch commit (various additional options for what happens to confirm Send)

So normal backends neither write nor send. We have two dedicated processes, one for write, one for send. We need to put an extra test into the WALWriter loop so that it will continue immediately (with no wait) if there is an outstanding request for synchronous operation.

This gives us the Group Commit feature also, even if we are not using replication. So we can drop the commit_delay stuff.

XLogBackgroundFlush() processes a data page at a time if it can. That may not be the correct batch size for XLogBackgroundSend(), so we may need a tunable for the MTU.

Under heavy load we need the Write and Send to act in a way that maximises throughput rather than minimises response time, as we do now. If wal_buffers overflows, we continue to hold WALInsertLock while we wait for WALWriter and WALSender to complete. We should increase the default wal_buffers to 64.

After (or during) XLogInsert backends will sleep in a proc queue, similar to LWLocks and protected by a spinlock. When preparing to write/send, the WAL process should read the proc at the *tail* of the queue to see what the next LogwrtRqst should be. Then it performs its action and wakes procs up starting with the head of the queue. We would add an LSN into PGPROC, so WAL processes can check whether the backend should be woken. The LSN field can be accessed without spinlocks since it is only ever set by the backend itself and only read while a backend is sleeping. So we access the spinlock, find the tail, drop the spinlock, then read the LSN of the backend that (was) the tail.

Another thought occurs that we might measure the time a Send takes and specify a limit on how long we are prepared to wait for confirmation. Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit. This would give better user behaviour across a highly variable network connection.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
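A rough sketch of the proc-queue idea above, with invented names (PGPROC would gain a waitLSN field and a queue link; a real patch would of course differ in the details):

typedef struct XLogWaitQueue
{
    slock_t     mutex;
    PGPROC     *head;           /* oldest waiter, woken first */
    PGPROC     *tail;           /* newest waiter, carries the furthest LSN */
} XLogWaitQueue;

/* Backend: record the LSN we need, join the queue, and sleep. */
static void
XLogQueueAndWait(XLogWaitQueue *q, XLogRecPtr lsn)
{
    MyProc->waitLSN = lsn;              /* assumed new PGPROC field */

    SpinLockAcquire(&q->mutex);
    /* ... append MyProc at q->tail ... */
    SpinLockRelease(&q->mutex);

    PGSemaphoreLock(&MyProc->sem, true);    /* sleep until woken */
}

/* WAL writer/sender: look only at the tail to learn the next request.
 * Reading waitLSN without the spinlock is safe because only the owning
 * backend sets it (before sleeping), and the WAL process reads it only
 * while that backend sleeps. */
static XLogRecPtr
XLogNextRequest(XLogWaitQueue *q)
{
    PGPROC     *tail;

    SpinLockAcquire(&q->mutex);
    tail = q->tail;
    SpinLockRelease(&q->mutex);

    return tail->waitLSN;       /* caller handles an empty queue */
}

After performing its write or send, the WAL process would then walk the queue from the head, waking every proc whose waitLSN has been satisfied.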
Simon Riggs wrote: > This gives us the Group Commit feature also, even if we are not using > replication. So we can drop the commit_delay stuff. Huh? How does that give us group commit? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 12:24 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > This gives us the Group Commit feature also, even if we are not using > > replication. So we can drop the commit_delay stuff. > > Huh? How does that give us group commit?

Multiple backends waiting while we perform a write. Commits then happen as a group (to WAL at least), hence Group Commit.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > Multiple backends waiting while we perform a write. Commits then happen > as a group (to WAL at least), hence Group Commit.

The problem with our current commit protocol is this:

1. Backend A inserts commit record A
2. Backend A starts to flush commit record A
3. Backend B inserts commit record B
4. Backend B waits until 2. finishes
5. Backend B starts to flush commit record B

Note that we already have the logic to flush all pending commit records at once. If there's also a backend C that inserts its commit record after step 2, B and C will be flushed at once:

1. Backend A inserts commit record A
2. Backend A starts to flush commit record A
3. Backend B inserts commit record B
4. Backend B waits until 2. finishes
5. Backend C inserts commit record C
6. Backend C waits until 2. finishes
7. Flush A finishes. Backend B starts to flush commit records B + C

The idea of group commit is to insert a small delay in backend A between steps 1 and 2, so that we can flush both commit records in one fsync:

1. Backend A inserts commit record A
2. Backend A waits
3. Backend B inserts commit record B
4. Backend B starts to flush commit records A + B

The tricky part is, how does A know if it should wait, and for how long? commit_delay sure isn't ideal, but AFAICS the log shipping proposal doesn't provide any solution to that.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
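For reference, the commit_delay heuristic being criticised here amounts to roughly the following, paraphrased from the commit path rather than quoted verbatim:

/* Before flushing its own commit record, a backend optionally naps,
 * hoping that other backends insert their commit records in the
 * meantime so that one fsync covers them all. */
if (CommitDelay > 0 && enableFsync &&
    CountActiveBackends() >= CommitSiblings)
    pg_usleep(CommitDelay);

XLogFlush(XactLastRecEnd);

The weakness Heikki points out is exactly the fixed, unconditional nap: the backend cannot tell whether anyone will actually show up with another commit record during the delay.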
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > The tricky part is, how does A know if it should wait, and for how long? > commit_delay sure isn't ideal, but AFAICS the log shipping proposal > doesn't provide any solution to that.

They have no direct relation to each other, but they need similar synchronization modules. In log shipping, backends need to wait for the WAL Sender's job, and should wake up as fast as possible after the job is done. This is similar to the requirement of group commit.

Signals and locking, borrowed from Postgres-R, are now being studied for this purpose in the log shipping, but I'm not sure they can also be used for the group commit.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Tue, Sep 9, 2008 at 5:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for > the send operation. The Write and Send operations can then continue > independently of one another. XLogInsert() cannot advance to a new page > while we are waiting to send or write.

Agreed. For realizing the various synchronous options, the Write and Send operations should be treated separately. So, I'll introduce an XLogCtlSend structure which holds shared state data for WAL sending. XLogCtlInsert might need a new field LogsndResult which indicates the byte position that we have already sent. As you say, AdvanceXLInsertBuffer() must check both the position that we have already written (fsynced) and the position that we have already sent. I'm doing the detailed design of this now :)

> Notice that the Send process > might be the bottleneck - that is the price of synchronous replication.

Really? In the benchmark results of my prototype, the bottleneck is still disk I/O. The communication latency (between the master and the slave) is smaller than the WAL write (fsync) latency. Of course, I assume that we use a decent network like 1000BASE-T. What makes the sender process the bottleneck?

> Backends then wait > * not at all for asynch commit > * just for Write for local synch commit > * for both Write and Send for remote synch commit > (various additional options for what happens to confirm Send)

I'd like to introduce a new parameter "synchronous_replication" which specifies whether backends wait for the response from the WAL sender process. By combining synchronous_commit and synchronous_replication, users can choose various options.

> After (or during) XLogInsert backends will sleep in a proc queue, > similar to LWLocks and protected by a spinlock. When preparing to > write/send, the WAL process should read the proc at the *tail* of the > queue to see what the next LogwrtRqst should be. Then it performs its > action and wakes procs up starting with the head of the queue. We would > add an LSN into PGPROC, so WAL processes can check whether the backend > should be woken. The LSN field can be accessed without spinlocks since > it is only ever set by the backend itself and only read while a backend > is sleeping. So we access the spinlock, find the tail, drop the spinlock, > then read the LSN of the backend that (was) the tail.

You mean only XLogInsert for commit records, or every XLogInsert? Anyway, ISTM that the response time gets worse :(

> Another thought occurs that we might measure the time a Send takes and > specify a limit on how long we are prepared to wait for confirmation. > Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit. > This would give better user behaviour across a highly variable network > connection.

From the viewpoint of detecting a network failure, this feature is necessary. When the network goes down, the WAL sender can be blocked until it detects the network failure, i.e. the WAL sender keeps waiting for a response which never comes. A timeout notification is necessary in order to detect a network failure soon.

regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
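As a sketch, the new shared state could mirror the existing write/flush bookkeeping in XLogCtl. The layout below is an assumption about the forthcoming patch, not its actual contents:

/* Shared state for WAL sending, analogous to the write/flush pointers. */
typedef struct XLogCtlSend
{
    XLogRecPtr  SendRqst;       /* how far backends have asked to send */
    XLogRecPtr  SendResult;     /* how far the WAL sender has sent (and,
                                 * for sync replication, seen acked) */
    slock_t     info_lck;       /* protects the fields above */
} XLogCtlSend;

/* AdvanceXLInsertBuffer() would then treat a buffer page as busy until
 * both the local write and the send have passed its end (sketch only;
 * SendResult is assumed to be a local copy of the shared value): */
static bool
WALPageIsRecyclable(XLogRecPtr pageEnd)
{
    return !XLByteLT(LogwrtResult.Write, pageEnd) &&
           !XLByteLT(SendResult, pageEnd);
}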
Hi, ITAGAKI Takahiro wrote: > Signals and locking, borrowed from Postgres-R, are now being studied > for this purpose in the log shipping, but I'm not sure they can > also be used for the group commit.

Yeah. As Heikki points out, there is a completely orthogonal question WRT group commit: how does transaction A know if or how long it should wait for other transactions to file their WAL?

If we decide to do all of the WAL writing from a separate WAL writer process and let the backends communicate with it, then imessages might help again. But I currently don't think that's feasible.

Apart from possibly having similar IPC requirements, group commit and log shipping don't have much in common and should be considered separate features.

Regards Markus Wanner
Hi, Fujii Masao wrote: > Really? In the benchmark results of my prototype, the bottleneck is > still disk I/O. > The communication latency (between the master and the slave) is smaller than > the WAL write (fsync) latency. Of course, I assume that we use a decent network > like 1000BASE-T.

Sure. If you do WAL sending to the standby and WAL writing to disk in parallel, only the slower one is relevant (in case you want to wait for both). If that happens to be the disk, you won't see any performance degradation compared to standalone operation. If you want the standby to confirm having written (and flushed) the WAL to disk as well, that can't possibly be faster than the active node's local disk (assuming equally fast and busy disk subsystems).

> I'd like to introduce a new parameter "synchronous_replication" which specifies > whether backends wait for the response from the WAL sender process. By > combining synchronous_commit and synchronous_replication, users can > choose various options.

Various config options have already been proposed. I personally don't think that helps us much. Instead, I'd prefer to see prototype code or at least concepts. We can juggle with the GUC variable names or other config options later on.

> From the viewpoint of detecting a network failure, this feature is necessary. > When the network goes down, the WAL sender can be blocked until it detects > the network failure, i.e. the WAL sender keeps waiting for a response which > never comes. A timeout notification is necessary in order to detect a > network failure soon.

That's one of the areas I'm missing from the overall concept. I'm glad it comes up. You certainly realize that such a timeout must be set high enough so as not to trigger "false negatives" every now and then? Or do you expect some sort of retry loop in case the link to the standby comes up again?

How about multiple standby servers?

Regards Markus Wanner
On Tue, 2008-09-09 at 20:12 +0900, Fujii Masao wrote: > What makes the sender process the bottleneck?

In my experience, the Atlantic. But I guess the Pacific does it too. :-)

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Fujii Masao wrote: > What makes the sender process the bottleneck?

The keyword here is "might". There are many possibilities, like:

- Slow network.
- Ridiculously fast disk. Like a RAM disk. If you have a synchronous slave you can fail over to, putting WAL on a RAM disk isn't that crazy.
- Slower WAL disk on the slave.

etc.

>> Backends then wait >> * not at all for asynch commit >> * just for Write for local synch commit >> * for both Write and Send for remote synch commit >> (various additional options for what happens to confirm Send) > > I'd like to introduce a new parameter "synchronous_replication" which specifies > whether backends wait for the response from the WAL sender process. By > combining synchronous_commit and synchronous_replication, users can > choose various options.

There's one thing I haven't figured out in this discussion. Does the write to the disk happen before or after the write to the slave? Can you guarantee that if a transaction is committed in the master, it's also committed in the slave, or vice versa?

>> Another thought occurs that we might measure the time a Send takes and >> specify a limit on how long we are prepared to wait for confirmation. >> Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit. >> This would give better user behaviour across a highly variable network >> connection. > From the viewpoint of detecting a network failure, this feature is necessary. > When the network goes down, the WAL sender can be blocked until it detects > the network failure, i.e. the WAL sender keeps waiting for a response which > never comes. A timeout notification is necessary in order to detect a > network failure soon.

Agreed. But what happens if you hit that timeout? Should we enforce that timeout within the server, or should we leave that to the external heartbeat system?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 12:54 +0300, Heikki Linnakangas wrote: > Note that we already have the logic to flush all pending commit > records at once.

But only when you can grab WALInsertLock when flushing. If you look at the way I suggested, it does not rely upon that lock being available. So it is both responsive in low write rate conditions and yet efficient in high write rate conditions, and does not require us to specify a wait time.

IMHO the idea of a wait time is a confusion that comes from using a simple example (with respect). If we imagine the example slightly differently you'll see a different answer:

High write rate: A stream of commits comes so fast that by the time a write completes there are always > 1 backends waiting to commit again. In that case, there is never any need to wait because the arrival pattern requires us to issue writes as quickly as we can.

Medium write rate: Commits occur relatively frequently, so that the mean commits/flush is in the range 0.5 - 1. In this case, we can get better I/O efficiency by introducing waits. But note that a wait is risky, and at some point we may wait without another commit arriving. In this case, if the disk can keep up with the write rate, why would we want to improve I/O efficiency? There's no a priori way of calculating a useful wait time, so waiting is always a risk. Why would we risk damage to our response times when the disk can keep up with the write rate?

So for me, introducing a wait is something you might want to consider in medium rate conditions. Anything more or less than that and a wait is useless. So optimising for the case where the arrival rate is within a certain fairly tight range seems not worthwhile.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 2008-09-09 at 20:12 +0900, Fujii Masao wrote: > I'd like to introduce a new parameter "synchronous_replication" which specifies > whether backends wait for the response from the WAL sender process. By > combining synchronous_commit and synchronous_replication, users can > choose various options.

We already discussed that on -hackers. See "Transaction Controlled Robustness". But yes, something like that.

Please note the design mentions fsyncing after applying WAL. I'm sure you're aware we don't fsync after *applying* WAL now, and I hope we never do. You might want to fsync data to WAL files on the standby, but that is a slightly different thing.

> > After (or during) XLogInsert backends will sleep in a proc queue, > > similar to LWLocks and protected by a spinlock. When preparing to > > write/send, the WAL process should read the proc at the *tail* of the > > queue to see what the next LogwrtRqst should be. Then it performs its > > action and wakes procs up starting with the head of the queue. We would > > add an LSN into PGPROC, so WAL processes can check whether the backend > > should be woken. The LSN field can be accessed without spinlocks since > > it is only ever set by the backend itself and only read while a backend > > is sleeping. So we access the spinlock, find the tail, drop the spinlock, > > then read the LSN of the backend that (was) the tail. > > You mean only XLogInsert for commit records, or every XLogInsert?

Just the commit records, when synchronous_commit = on.

> Anyway, ISTM that the response time gets worse :(

No, because it would have had to wait in the queue for the WALWriteLock while prior writes occur. If the WALWriter sleeps on a semaphore, it too can be nudged into action at the appropriate time, so no need for a delay between the backend beginning to wait and WALWriter beginning to act. (Well, IPC delay between two processes, so some, but that is balanced against the efficiency of Send).

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, 2008-09-09 at 13:42 +0200, Markus Wanner wrote: > How about multiple standby servers?

There are various ways for getting things to work with multiple servers. I hope we can make this work with just a single standby before we try to make it work on more. There are various options for synchronous and asynchronous relay that will burden us if we try to consider all of that in the remaining 7 weeks we have.

So yes please, just not yet.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
"Fujii Masao" <masao.fujii@gmail.com> writes: > On Tue, Sep 9, 2008 at 5:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for >> the send operation. The Write and Send operations can then continue >> independently of one another. XLogInsert() cannot advance to a new page >> while we are waiting to send or write. > Agreed. "Agreed"? That last restriction is a deal-breaker. regards, tom lane
On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: > "Fujii Masao" <masao.fujii@gmail.com> writes: > > On Tue, Sep 9, 2008 at 5:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> > >> Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for > >> the send operation. The Write and Send operations can then continue > >> independently of one another. XLogInsert() cannot advance to a new page > >> while we are waiting to send or write. > > > Agreed. > > "Agreed"? That last restriction is a deal-breaker.

OK, I should have said *if wal_buffers are full* XLogInsert() cannot advance to a new page while we are waiting to send or write. So I don't think it's a deal-breaker.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: >> "Agreed"? That last restriction is a deal-breaker. > OK, I should have said *if wal_buffers are full* XLogInsert() cannot > advance to a new page while we are waiting to send or write. So I don't > think it's a deal-breaker.

Oh, OK, that's obvious --- there's no place to put more data.

regards, tom lane
Hi, On Tuesday, September 9, 2008, Heikki Linnakangas wrote: > The tricky part is, how does A know if it should wait, and for how long? > commit_delay sure isn't ideal, but AFAICS the log shipping proposal > doesn't provide any solution to that.

It might just be that I'm not understanding what it's all about, but it seems to me that with a WALSender, process A will wait, whatever happens, either until the WAL is sent to the slave or written to disk on the slave. I naively read Simon's proposal as considering Group Commit done with this new feature.

A is already waiting (for some external event to complete), so why can't we use this to include some other transactions' commits into the local deal?

Regards, -- dim
Hi, Dimitri Fontaine wrote: > It might just be that I'm not understanding what it's all about, but it seems to me > that with a WALSender, process A will wait, whatever happens, either until the WAL is > sent to the slave or written to disk on the slave.

..and it will still have to wait until WAL is written to disk on the local node, as we do now. These are two different things to wait for. One is a network socket operation, the other is an fsync(). As these don't work together too well (blocking), you better run that in two different processes.

Regards Markus Wanner
On Tuesday, September 9, 2008, Markus Wanner wrote: > ..and it will still have to wait until WAL is written to disk on the > local node, as we do now. These are two different things to wait for. > One is a network socket operation, the other is an fsync(). As these > don't work together too well (blocking), you better run that in two > different processes.

Exactly the point. The process is now already waiting in all cases, so maybe we could just force waiting for some WALSender signal before issuing the fsync(), and we would then have Group Commit. I'm not sure this is a good idea at all; it's just the way I understand how adding a WALSender process to the mix could give us the Group Commit feature for free.

Regards, -- dim
On Tue, 2008-09-09 at 15:32 +0200, Dimitri Fontaine wrote: > The process is now already waiting in all cases

If the WALWriter|Sender is available, it can begin the task immediately. There is no need for it to wait if you want synchronous behaviour.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, ITAGAKI Takahiro wrote: > Signals and locking, borrowed from Postgres-R, are now being studied > for this purpose in the log shipping,

Cool. Let me know if you have any questions WRT this imessages stuff.

Regards Markus Wanner
Hi, Dimitri Fontaine wrote: > Exactly the point. The process is now already waiting in all cases, so maybe > we could just force waiting some WALSender signal before sending the fsync() > order, so we now have Group Commit. A single process can only wait on either fsync() or on select(), but not on both concurrently, because both syscalls are blocking. So mixing these into a single process is an inherently bad idea due to lack of parallelism. I fail to see how log shipping would ease or have any other impact on a Group Commit feature, which should clearly also work for stand alone servers, i.e. where there is no WAL sender process. Regards Markus Wanner
On Tuesday, September 9, 2008, Simon Riggs wrote: > If the WALWriter|Sender is available, it can begin the task immediately. > There is no need for it to wait if you want synchronous behaviour.

Ok. Now I'm as lost as anyone with respect to how you get Group Commit :)

-- dim
On Tue, 2008-09-09 at 16:05 +0200, Dimitri Fontaine wrote: > On Tuesday, September 9, 2008, Simon Riggs wrote: > > If the WALWriter|Sender is available, it can begin the task immediately. > > There is no need for it to wait if you want synchronous behaviour. > > Ok. Now I'm as lost as anyone with respect to how you get Group Commit :)

OK, sorry. Please read my reply to Heikki on a different subthread of this topic; he had the same question.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: >>> "Agreed"? That last restriction is a deal-breaker. > >> OK, I should have said *if wal_buffers are full* XLogInsert() cannot >> advance to a new page while we are waiting to send or write. So I don't >> think its a deal breaker. > > Oh, OK, that's obvious --- there's no place to put more data. Each WAL sender can keep at most one page locked at a time, right? So, that should never happen if wal_buffers > 1 + n_wal_senders. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 17:17 +0300, Heikki Linnakangas wrote: > Tom Lane wrote: > > Simon Riggs <simon@2ndQuadrant.com> writes: > >> On Tue, 2008-09-09 at 08:24 -0400, Tom Lane wrote: > >>> "Agreed"? That last restriction is a deal-breaker. > > > >> OK, I should have said *if wal_buffers are full* XLogInsert() cannot > >> advance to a new page while we are waiting to send or write. So I don't > >> think its a deal breaker. > > > > Oh, OK, that's obvious --- there's no place to put more data. > > Each WAL sender can keep at most one page locked at a time, right? So, > that should never happen if wal_buffers > 1 + n_wal_senders. Don't understand. I am referring to the logic at the top of AdvanceXLInsertBuffer(). We would need to wait for all processes reading the contents of wal_buffers. Currently, there is no page locking on the WAL buffers, though I have suggested some for increasing XLogInsert() performance. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > Don't understand. I am referring to the logic at the top of > AdvanceXLInsertBuffer(). We would need to wait for all people reading > the contents of wal_buffers. Oh, I see. If a slave falls behind, how does it catch up? I guess you're saying that it can't fall behind, because the master will block before that happens. Also in asynchronous replication? And what about when the slave is first set up, and needs to catch up with the master? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 18:26 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Don't understand. I am referring to the logic at the top of > > AdvanceXLInsertBuffer(). We would need to wait for all people reading > > the contents of wal_buffers. > > Oh, I see. > > If a slave falls behind, how does it catch up? That is the right question. > I guess you're saying > that it can't fall behind, because the master will block before that > happens. Also in asynchronous replication? Yes, it can fall behind in async mode. The sysadmin must not let it. > And what about when the slave > is first set up, and needs to catch up with the master? We need an initial joining mode while they "match speed". We must allow for the case where the standby has been recycled, or the network has been down for a medium-long period of time. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
> > Don't understand. I am referring to the logic at the top of > > AdvanceXLInsertBuffer(). We would need to wait for all people reading > > the contents of wal_buffers. > > Oh, I see. > > If a slave falls behind, how does it catch up? I guess you're saying > that it can't fall behind, because the master will block before that > happens. Also in asynchronous replication? And what about > when the slave > is first set up, and needs to catch up with the master? I think the WAL Sender needs the ability to read the WAL files directly. In cases where it falls behind, or just started, it needs to be able to catch up. So, it seems we either need to copy the WAL buffer into local memory before sending, or "lock" the WAL buffer until the send has finished. Useful network timeouts are in the >= 5-10 sec range (even for GbE LAN), so I don't think locking WAL buffers is feasible. Thus the WAL sender needs to copy (the needed portion of the current WAL buffer) before sending (or use an async send that immediately returns when the buffer is copied into the network stack). When the WAL sender is ready to continue it either still finds the next WAL buffer (or the rest of the current buffer) or it needs to fall back to Plan B and read the WAL files again. A sync client could still wait for the replica, even if local WAL has already advanced massively. The checkpointer would need the LSN info from WAL senders to not reuse any still-needed WAL files, although in that case it might be time to declare the replica broken. Ideally the WAL sender also knows whether the client waits, so it can decide to send a part of a buffer. The WAL sender should wake and act whenever a "network packet" full of WAL buffer is ready, regardless of commits. Whatever size of send seems appropriate here (might be one WAL page). The WAL Sender should only need to expect a response when it has sent a commit record, ideally only if a client is waiting (and once in a while at least for every log switch). All in all a useful streamer seems like a lot of work. Andreas
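A rough sketch of the copy-before-send idea (names and the lock functions are invented stand-ins, e.g. for an LWLock; this is not actual backend code):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define XLOG_BLCKSZ 8192

    /* Stand-ins for one shared WAL buffer page and its lock. */
    static char shared_wal_page[XLOG_BLCKSZ];
    static void lock_wal_page(void)   { /* e.g. LWLockAcquire(..., LW_SHARED) */ }
    static void unlock_wal_page(void) { /* e.g. LWLockRelease(...) */ }

    /*
     * Copy the used part of the page into sender-local memory while the lock
     * is held only for the memcpy(), then do the (possibly slow, blocking)
     * network send with no lock held at all.
     */
    static ssize_t send_wal_page(int sock_fd, size_t used)
    {
        char local[XLOG_BLCKSZ];

        lock_wal_page();
        memcpy(local, shared_wal_page, used);
        unlock_wal_page();

        return send(sock_fd, local, used, 0);   /* WAL buffers stay unlocked */
    }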
On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > There's one thing I haven't figured out in this discussion. Does the write > to the disk happen before or after the write to the slave? Can you guarantee > that if a transaction is committed in the master, it's also committed in the > slave, or vice versa? We can guarantee that a transaction is committed in both the master and the slave if we wait until one has fsynced the WAL to disk and the other holds it in memory or on disk. Even if one fails, the other can continue service. Even if both fail, the node which wrote the WAL can continue service. No transaction is lost in either case. > Agreed. But what happens if you hit that timeout? The stand-alone master continues service when it hits that timeout. On the other hand, the slave waits for an order from the sysadmin or the clustering software, and then either exits or becomes the master. > Should we enforce that > timeout within the server, or should we leave that to the external heartbeat > system? Within the server. Not all users run such an external system. It's not simple for an external system to leave the master running stand-alone. regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Sep 9, 2008 at 8:42 PM, Markus Wanner <markus@bluegap.ch> wrote: >> In the viewpoint of detection of a network failure, this feature is >> necessary. >> When the network goes down, WAL sender can be blocked until it detects >> the network failure, i.e. WAL sender keeps waiting for the response which >> never comes. A timeout notification is necessary in order to detect a >> network failure soon. > > That's one of the areas I'm missing from the overall concept. I'm glad it > comes up. You certainly realize, that such a timeout must be set high enough > so as not to trigger "false negatives" every now and then? Yes. And, as you know, there is a trade-off between false detection of a network failure and how long the WAL sender is blocked. I'll provide not only that timeout but also a keepalive for the network between the master and the slave. I expect that the keepalive eases that trade-off. regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
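Part of that can be delegated to the TCP layer. A sketch of turning on keepalive for the replication socket (the values are arbitrary, TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific, and an application-level ack timeout is still needed on top of this):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Detect a dead peer even while the connection is idle; the WAL sender
     * still needs its own timeout for the "ack never arrives" case. */
    static int enable_keepalive(int sock_fd)
    {
        int on = 1, idle = 30, interval = 10, count = 3;

        if (setsockopt(sock_fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
    #ifdef TCP_KEEPIDLE
        setsockopt(sock_fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(sock_fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
        setsockopt(sock_fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
    #endif
        return 0;
    }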
On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > If a slave falls behind, how does it catch up? I guess you're saying that it > can't fall behind, because the master will block before that happens. Also > in asynchronous replication? And what about when the slave is first set up, > and needs to catch up with the master? The mechanism for the slave to catch up with the master should be provided outside of postgres. I think that postgres should provide only WAL streaming, i.e. the master always sends *current* WAL data to the slave. Of course, the master also has to send the current WAL *file* in the initial transfer, just after the slave starts and connects to it. Because, at that time, the current WAL position might be in the middle of a WAL file. Even if the master sends only current WAL data, a slave which doesn't have the corresponding WAL file cannot handle it. regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote: > On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > If a slave falls behind, how does it catch up? I guess you're saying that it > > can't fall behind, because the master will block before that happens. Also > > in asynchronous replication? And what about when the slave is first set up, > > and needs to catch up with the master? > > The mechanism for the slave to catch up with the master should be > provided on the outside of postgres. So you mean that we still need to do initial setup (copy backup files and ship and replay WAL segments generated during copy) by external WAL-shipping tools, like walmgr.py, and then at some point switch over to internal WAL-shipping, when we are sure that we are within the same WAL file on both master and slave? > I think that postgres should provide > only WAL streaming, i.e. the master always sends *current* WAL data > to the slave. > > Of course, the master has to send also the current WAL *file* in the > initial sending just after the slave starts and connects with it. I think that it needs to send all WAL files which the slave does not yet have, or else the slave will have gaps. On a busy system you will generate several new WAL files in the time it takes to make the master copy, transfer it to the slave and apply the WAL files generated during the initial setup. > Because, at the time, current WAL position might be in the middle of > WAL file. Even if the master sends only current WAL data, the slave > which don't have the corresponding WAL file can not handle it. I agree that making the initial copy may be outside the scope of Synchronous Log Shipping Replication, but the slave catching up by requesting all missing WAL files and applying them up to a point where it can switch to Sync mode should be in. Else we gain very little from this patch. --------------- Hannu
On Wed, Sep 10, 2008 at 12:05 PM, Hannu Krosing <hannu@krosing.net> wrote: > > >> Because, at the time, current WAL position might be in the middle of >> WAL file. Even if the master sends only current WAL data, the slave >> which don't have the corresponding WAL file can not handle it. > > I agree, that making initial copy may be outside the scope of > Synchronous Log Shipping Replication, but slave catching up by > requesting all missing WAL files and applying these up to a point when > it can switch to Sync mode should be in. Else we gain very little from > this patch. > I agree. We should leave the initial backup acquisition out of the scope at least for the first phase, but provide a mechanism to do the initial catch-up, as it may get messy to do it completely outside of the core. The slave will need to be able to buffer the *current* WAL until it gets the missing WAL files and then continue. Also we may not want the master to be stuck while the slave is doing the catchup. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-09-10 at 09:35 +0300, Hannu Krosing wrote: > On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote: > > On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas > > <heikki.linnakangas@enterprisedb.com> wrote: > > > If a slave falls behind, how does it catch up? I guess you're saying that it > > > can't fall behind, because the master will block before that happens. Also > > > in asynchronous replication? And what about when the slave is first set up, > > > and needs to catch up with the master? > > > > The mechanism for the slave to catch up with the master should be > > provided on the outside of postgres. > > So you mean that we still need to do initial setup (copy backup files > and ship and replay WAL segments generated during copy) by external > WAL-shipping tools, like walmgr.py, and then at some point switch over > to internal WAL-shipping, when we are sure that we are within same WAL > file on both master and slave ? > > > I think that postgres should provide > > only WAL streaming, i.e. the master always sends *current* WAL data > > to the slave. > > > > Of course, the master has to send also the current WAL *file* in the > > initial sending just after the slave starts and connects with it. > > I think that it needs to send all WAL files which slave does not yet > have, as else the slave will have gaps. On busy system you will generate > several new WAL files in the time it takes to make master copy, transfer > it to slave and apply WAL files generated during initial setup. > > > Because, at the time, current WAL position might be in the middle of > > WAL file. Even if the master sends only current WAL data, the slave > > which don't have the corresponding WAL file can not handle it. > > I agree, that making initial copy may be outside the scope of > Synchronous Log Shipping Replication, but slave catching up by > requesting all missing WAL files and applying these up to a point when > it can switch to Sync mode should be in. Else we gain very little from > this patch. I agree with Hannu. Any working solution needs to work for all required phases. If you did it this way, you'd never catch up at all. When you first make the copy, it will be made at time X. The point of consistency will be sometime later and requires WAL data to make it consistent. So you would need to do a PITR to get it to the point of consistency. While you've been doing that, the primary server has moved on and now there is a gap between primary and standby. You *must* provide a facility to allow the standby to catch up with the primary. Only sending *current* WAL is not a solution, and not acceptable. So there must be mechanisms for sending past *and* current WAL data to the standby, and an exact and careful mechanism for switching between the two modes when the time is right. Replication is only synchronous *after* the change in mode. So the protocol needs to be something like: 1. Standby contacts primary and says it would like to catch up, but is currently at point X (which is a point at, or after the first consistent stopping point in WAL after standby has performed its own crash recovery, if any was required). 2. primary initiates data transfer of old data to standby, starting at point X 3. standby tells primary where it has got to periodically 4. at some point primary decides primary and standby are close enough that it can now begin streaming "current WAL" (which is always the WAL up to wal_buffers behind the current WAL insertion point). 
Bear in mind that unless wal_buffers > 16MB the final catchup will *always* be less than one WAL file, so external file-based mechanisms alone could never be enough. So you would need wal_buffers >= 2000 to make an external catch-up facility even work at all. This also probably means that receipt of WAL data on the standby cannot be achieved by placing it in wal_buffers. So we probably need to write it directly to the WAL files, then rely on the filesystem cache on the standby to buffer the data for use by ReadRecord. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
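A rough sketch of the mode switch described in the protocol above, using invented names and a simplified 64-bit WAL position instead of the real XLogRecPtr; the only point is that the primary keeps shipping old WAL until the standby's acknowledged position is at or beyond the oldest position still present in wal_buffers:

    #include <stdint.h>

    typedef uint64_t WalPos;        /* simplified stand-in for XLogRecPtr */

    typedef enum { WALSND_CATCHUP, WALSND_STREAMING } WalSndMode;

    /*
     * standby_acked:   highest WAL position the standby has confirmed
     * oldest_buffered: lowest WAL position still available in wal_buffers
     */
    static WalSndMode choose_mode(WalPos standby_acked, WalPos oldest_buffered)
    {
        if (standby_acked >= oldest_buffered)
            return WALSND_STREAMING;    /* no gap left: live streaming can begin */
        return WALSND_CATCHUP;          /* keep shipping old WAL from files */
    }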
On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote: > On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > There's one thing I haven't figured out in this discussion. Does the write > > to the disk happen before or after the write to the slave? Can you guarantee > > that if a transaction is committed in the master, it's also committed in the > > slave, or vice versa? The write happens concurrently and independently on both. Yes, you wait for the write *and* send pointer to be "flushed" before you allow a synch commit with synch replication. (Definition of flushed is changeable by parameters). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-09-10 at 12:24 +0530, Pavan Deolasee wrote: > Also we may not want the master to be stuck while slave is doing the catchup. No, since it may take hours, not seconds. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > 1. Standby contacts primary and says it would like to catch up, but is > currently at point X (which is a point at, or after the first consistent > stopping point in WAL after standby has performed its own crash > recovery, if any was required). > 2. primary initiates data transfer of old data to standby, starting at > point X > 3. standby tells primary where it has got to periodically > 4. at some point primary decides primary and standby are close enough > that it can now begin streaming "current WAL" (which is always the WAL > up to wal_buffers behind the the current WAL insertion point). Hm.. wouldn't it be simpler to start streaming right away and "cache" that on the standby until it can be applied? I.e. a protocol like: 1. - same as above - 2. primary starts streaming from live or hot data from its current position Y in the WAL stream, which is certainly after (or probably equal to) X. 3. standby receives the hot stream from point Y on. It now knows it misses 'cold' portions of the WAL from X to Y and requests them. 4. primary serves remaining 'cold' WAL chunks from its xlog / archive from between X and Y. 5. standby applies 'cold' WAL, until done. Then proceeds with the cached WAL segments from 'hot' streaming. > Bear in mind that unless wal_buffers > 16MB the final catchup will > *always* be less than one WAL file, so external file based mechanisms > alone could never be enough. Agreed. > This also probably means that receipt of WAL data on the standby cannot > be achieved by placing it in wal_buffers. So we probably need to write > it directly to the WAL files, then rely on the filesystem cache on the > standby to buffer the data for use by ReadRecord. Makes sense, especially in the case of cached WAL as outlined above. Is this a problem in any way? Regards Markus Wanner
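The standby-side ordering in this variant might look roughly like the following (invented helper names; in reality the 'cold' fetch and the buffering of the 'hot' stream would be considerably more involved):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t WalPos;        /* simplified WAL position */

    /* Hypothetical helpers, defined elsewhere in a real implementation. */
    extern bool fetch_and_apply_cold_range(WalPos from, WalPos to);
    extern void apply_buffered_hot_stream(WalPos from);

    /*
     * X = standby's own consistent position, Y = first position seen on the
     * live ('hot') stream.  The hot stream is cached from Y on while the
     * missing 'cold' range [X, Y) is requested and applied first.
     */
    static void standby_catchup(WalPos X, WalPos Y)
    {
        if (X < Y && !fetch_and_apply_cold_range(X, Y))
            return;     /* cold WAL no longer available: a new base backup is needed */

        apply_buffered_hot_stream(Y);   /* then continue with the cached hot WAL */
    }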
Simon Riggs wrote: > On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote: >> On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> There's one thing I haven't figured out in this discussion. Does the write >>> to the disk happen before or after the write to the slave? Can you guarantee >>> that if a transaction is committed in the master, it's also committed in the >>> slave, or vice versa? > > The write happens concurrently and independently on both. > > Yes, you wait for the write *and* send pointer to be "flushed" before > you allow a synch commit with synch replication. (Definition of flushed > is changeable by parameters). The thing that bothers me is the behavior when the synchronous slave doesn't respond. A timeout has been discussed, after which the master just gives up on sending, and starts acting as if there's no slave. How's that different from asynchronous mode where WAL is sent to the server concurrently when it's flushed to disk, but we don't wait for the send to finish? ISTM that in both cases the only guarantee we can give is that when a transaction is acknowledged as committed, it's committed in the master but not necessarily in the slave. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 10, 2008 at 1:40 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > > > The thing that bothers me is the behavior when the synchronous slave doesn't > respond. A timeout has been discussed, after which the master just gives up > on sending, and starts acting as if there's no slave. How's that different > from asynchronous mode where WAL is sent to the server concurrently when > it's flushed to disk, but we don't wait for the send to finish? ISTM that in > both cases the only guarantee we can give is that when a transaction is > acknowledged as committed, it's committed in the master but not necessarily > in the slave. > I think there is one difference. Assuming that the timeouts happen infrequently, most of the time the slave is in sync with the master and that can be reported to the user. Whereas in async mode, the slave will *always* be out of sync. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-09-10 at 11:10 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote: > >> On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas > >> <heikki.linnakangas@enterprisedb.com> wrote: > >>> There's one thing I haven't figured out in this discussion. Does the write > >>> to the disk happen before or after the write to the slave? Can you guarantee > >>> that if a transaction is committed in the master, it's also committed in the > >>> slave, or vice versa? > > > > The write happens concurrently and independently on both. > > > > Yes, you wait for the write *and* send pointer to be "flushed" before > > you allow a synch commit with synch replication. (Definition of flushed > > is changeable by parameters). > > The thing that bothers me is the behavior when the synchronous slave > doesn't respond. A timeout has been discussed, after which the master > just gives up on sending, and starts acting as if there's no slave. > How's that different from asynchronous mode where WAL is sent to the > server concurrently when it's flushed to disk, but we don't wait for the > send to finish? ISTM that in both cases the only guarantee we can give > is that when a transaction is acknowledged as committed, it's committed > in the master but not necessarily in the slave. We should differentiate between what the WALsender does and what the user does in response to a network timeout. Saying "I want to wait for a synchronous commit and I am willing to wait forever to ensure it" leads to long hangs in some cases. I was suggesting that some users may wish to wait up to time X before responding to the commit. The WALSender may keep retrying long after that point, but that doesn't mean all current users need to do that also. The user would need to say whether the response to the timeout should be an error, or whether to just accept it and get on with things. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > Saying "I want to wait for a synchronous commit and I am willing to wait > for ever to ensure it" leads to long hangs in some cases. Sure. That's the fundamental problem with synchronous replication. That's why many people choose asynchronous replication instead. Clearly at some point you'll want to give up and continue without the slave, or kill the master and fail over to the slave. I'm wondering how that's different than the lag between master and server in asynchronous replication from the client's point of view. > I was suggesting that some users may wish to wait up to time X before > responding to the commit. The WALSender may keep retrying long after > that point, but that doesn't mean all current users need to do that > also. The user would need to say whether the response to the timeout was > an error, or just accept and get on with it. I'm not sure I understand that paragraph. Who's the user? Do we need to expose some new information to the client so that it can do something? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-09-09 at 20:59 +0200, Zeugswetter Andreas OSB sIT wrote: > All in all a useful streamer seems like a lot of work. I mentioned some time ago an alternative idea of having the slave connect through a normal SQL connection and call a function which streams the WAL file from the point requested by the slave... wouldn't that be feasible? All the connection handling would already be there; only the streaming function would need to be implemented. It could even use SSL connections if needed. Then you would have one normal backend per slave, and they should access either the files directly or possibly some shared area where the WAL is buffered for this purpose... the streaming function could also take care of signaling the "up-to-dateness" of the slaves in case of synchronous replication. There could also be some system table infrastructure to track the slaves. There could also be some functions to stream the files of the DB through normal backends, so a slave could be bootstrapped all the way from copying the files through a simple postgres backend connection... that would make for the easiest possible setup of a slave: configure a connection to the master, and hit "run"... and last but not least the same interface could be used by a PITR repository client for archiving the WAL stream and occasional file system snapshots. Cheers, Csaba.
On Wed, 2008-09-10 at 10:06 +0200, Markus Wanner wrote: > Hi, > > Simon Riggs wrote: > > 1. Standby contacts primary and says it would like to catch up, but is > > currently at point X (which is a point at, or after the first consistent > > stopping point in WAL after standby has performed its own crash > > recovery, if any was required). > > 2. primary initiates data transfer of old data to standby, starting at > > point X > > 3. standby tells primary where it has got to periodically > > 4. at some point primary decides primary and standby are close enough > > that it can now begin streaming "current WAL" (which is always the WAL > > up to wal_buffers behind the the current WAL insertion point). > > Hm.. wouldn't it be simpler, to start streaming right away and "cache" Good idea! This makes everything simpler, as the user has to do only 4 things: 1. start slave in "receive WAL, don't apply" mode 2. start walshipping on master 3. copy files from master to slave. 4. restart slave in "receive WAL" mode all else will happen automatically. --------------- Hannu
On Wed, 2008-09-10 at 08:15 +0100, Simon Riggs wrote: > Any working solution needs to work for all required phases. If you did > it this way, you'd never catch up at all. > > When you first make the copy, it will be made at time X. The point of > consistency will be sometime later and requires WAL data to make it > consistent. So you would need to do a PITR to get it to the point of > consistency. While you've been doing that, the primary server has moved > on and now there is a gap between primary and standby. You *must* > provide a facility to allow the standby to catch up with the primary. > Only sending *current* WAL is not a solution, and not acceptable. > > So there must be mechanisms for sending past *and* current WAL data to > the standby, and an exact and careful mechanism for switching between > the two modes when the time is right. Replication is only synchronous > *after* the change in mode. > > So the protocol needs to be something like: > > 1. Standby contacts primary and says it would like to catch up, but is > currently at point X (which is a point at, or after the first consistent > stopping point in WAL after standby has performed its own crash > recovery, if any was required). > 2. primary initiates data transfer of old data to standby, starting at > point X > 3. standby tells primary where it has got to periodically > 4. at some point primary decides primary and standby are close enough > that it can now begin streaming "current WAL" (which is always the WAL > up to wal_buffers behind the the current WAL insertion point). > > Bear in mind that unless wal_buffers > 16MB the final catchup will > *always* be less than one WAL file, so external file based mechanisms > alone could never be enough. So you would need wal_buffers >= 2000 to > make an external catch up facility even work at all. > > This also probably means that receipt of WAL data on the standby cannot > be achieved by placing it in wal_buffers. So we probably need to write > it directly to the WAL files, then rely on the filesystem cache on the > standby to buffer the data for use by ReadRecord. And this catchup may need to be done repeatedly, in case of network failure. I don't think that the slave automatically becoming a master if it detects network failure (as suggested elsewhere in this thread) is an acceptable solution, as it will more often than not result in two masters. A better solution would be: 1. Slave just keeps waiting for new WAL records and confirming receipt, storage to disk, and application. 2. Master is in one of at least two states: 2.1 - Catchup - Async mode where it is sending old logs and wal records to the slave 2.2 - Sync Replication - Sync mode, where COMMIT does not return before confirmation from WALSender. The initial mode is Catchup, which is promoted to Sync Replication when the delay of WAL application is reasonably small. When the Master detects a network outage (== delay bigger than acceptable) it will either just send a NOTICE to all clients and fall back to Catchup, or raise an ERROR (and still fall back to Catchup). This is the point where external HA / Heartbeat etc. solutions would intervene and decide what to do. ----------------- Hannu
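A sketch of the master-side state handling Hannu outlines (invented names; report_notice/report_error stand in for ereport(NOTICE)/ereport(ERROR), and whether a timeout is a NOTICE or an ERROR would presumably be policy, e.g. a GUC):

    typedef enum { REPL_CATCHUP, REPL_SYNC } ReplMode;

    static ReplMode repl_mode = REPL_CATCHUP;

    /* Stand-ins for ereport(NOTICE, ...) / ereport(ERROR, ...). */
    extern void report_notice(const char *msg);
    extern void report_error(const char *msg);

    /* Called when the standby's applied position is close enough to current. */
    static void promote_to_sync(void)
    {
        repl_mode = REPL_SYNC;
    }

    /* Called from the commit path when waiting for the ack exceeds the timeout:
     * fall back to catch-up (async) mode and let the admin / HA layer decide. */
    static void handle_ack_timeout(int error_on_timeout)
    {
        repl_mode = REPL_CATCHUP;
        if (error_on_timeout)
            report_error("standby did not acknowledge the commit in time");
        else
            report_notice("falling back to catch-up (asynchronous) replication");
    }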
On Wed, Sep 10, 2008 at 4:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2008-09-10 at 09:35 +0300, Hannu Krosing wrote: >> On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote: >> > On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas >> > <heikki.linnakangas@enterprisedb.com> wrote: >> > > If a slave falls behind, how does it catch up? I guess you're saying that it >> > > can't fall behind, because the master will block before that happens. Also >> > > in asynchronous replication? And what about when the slave is first set up, >> > > and needs to catch up with the master? >> > >> > The mechanism for the slave to catch up with the master should be >> > provided on the outside of postgres. >> >> So you mean that we still need to do initial setup (copy backup files >> and ship and replay WAL segments generated during copy) by external >> WAL-shipping tools, like walmgr.py, and then at some point switch over >> to internal WAL-shipping, when we are sure that we are within same WAL >> file on both master and slave ? >> >> > I think that postgres should provide >> > only WAL streaming, i.e. the master always sends *current* WAL data >> > to the slave. >> > >> > Of course, the master has to send also the current WAL *file* in the >> > initial sending just after the slave starts and connects with it. >> >> I think that it needs to send all WAL files which slave does not yet >> have, as else the slave will have gaps. On busy system you will generate >> several new WAL files in the time it takes to make master copy, transfer >> it to slave and apply WAL files generated during initial setup. >> >> > Because, at the time, current WAL position might be in the middle of >> > WAL file. Even if the master sends only current WAL data, the slave >> > which don't have the corresponding WAL file can not handle it. >> >> I agree, that making initial copy may be outside the scope of >> Synchronous Log Shipping Replication, but slave catching up by >> requesting all missing WAL files and applying these up to a point when >> it can switch to Sync mode should be in. Else we gain very little from >> this patch. > > I agree with Hannu. > > Any working solution needs to work for all required phases. If you did > it this way, you'd never catch up at all. > > When you first make the copy, it will be made at time X. The point of > consistency will be sometime later and requires WAL data to make it > consistent. So you would need to do a PITR to get it to the point of > consistency. While you've been doing that, the primary server has moved > on and now there is a gap between primary and standby. You *must* > provide a facility to allow the standby to catch up with the primary. > Only sending *current* WAL is not a solution, and not acceptable. > > So there must be mechanisms for sending past *and* current WAL data to > the standby, and an exact and careful mechanism for switching between > the two modes when the time is right. Replication is only synchronous > *after* the change in mode. > > So the protocol needs to be something like: > > 1. Standby contacts primary and says it would like to catch up, but is > currently at point X (which is a point at, or after the first consistent > stopping point in WAL after standby has performed its own crash > recovery, if any was required). > 2. primary initiates data transfer of old data to standby, starting at > point X > 3. standby tells primary where it has got to periodically > 4. 
at some point primary decides primary and standby are close enough > that it can now begin streaming "current WAL" (which is always the WAL > up to wal_buffers behind the the current WAL insertion point). > > Bear in mind that unless wal_buffers > 16MB the final catchup will > *always* be less than one WAL file, so external file based mechanisms > alone could never be enough. So you would need wal_buffers >= 2000 to > make an external catch up facility even work at all. > > This also probably means that receipt of WAL data on the standby cannot > be achieved by placing it in wal_buffers. So we probably need to write > it directly to the WAL files, then rely on the filesystem cache on the > standby to buffer the data for use by ReadRecord. > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > Umm.. I disagree with you ;) Here is my initial setup sequence. 1) Start WAL receiver. The current WAL file and subsequent ones will be transmitted by WAL sender and WAL receiver. This transmission will not block the following operation for initial setup, and vice versa. That is, the slave can catch up with the master without blocking the master. I cannot accept that WAL sender is blocked for initial setup. 2) Copy the missing history files from the master to the slave. 3) Prepare recovery.conf on the slave. You have to configure pg_standby and set recovery_target_timeline to 'latest' or the current TLI of the master. 4) Start postgres. The startup process and pg_standby start archive recovery. If there are missing WAL files, pg_standby waits for them and WAL replay is suspended. 5) Copy the missing WAL files from the master to the slave. Of course, we don't need to copy the WAL files which are transmitted by WAL sender and WAL receiver. Then, the recovery is resumed. My sequence covers several cases: * There is no missing WAL file. * There are a lot of missing WAL files. * There are missing history files. Failover always generates a gap in the history file because the TLI is incremented when archive recovery is completed. ... In your design, doesn't the initial setup block the master? Does your design cover the above-mentioned cases? regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, 2008-09-10 at 10:06 +0200, Markus Wanner wrote: > Hi, > > Simon Riggs wrote: > > 1. Standby contacts primary and says it would like to catch up, but is > > currently at point X (which is a point at, or after the first consistent > > stopping point in WAL after standby has performed its own crash > > recovery, if any was required). > > 2. primary initiates data transfer of old data to standby, starting at > > point X > > 3. standby tells primary where it has got to periodically > > 4. at some point primary decides primary and standby are close enough > > that it can now begin streaming "current WAL" (which is always the WAL > > up to wal_buffers behind the the current WAL insertion point). > > Hm.. wouldn't it be simpler, to start streaming right away and "cache" The standby server won't come up until you have: * copied the base backup * sent it to the standby server * brought up the standby, had it realise it is a replication partner and begun requesting WAL from the primary (in some way) There will be a gap (probably) between the initial WAL files and the current tail of wal_buffers by the time all of the above has happened. We will then need to copy more WAL across until we get to a point where the most recent WAL record available on standby is ahead of the tail of wal_buffers on primary so that streaming can start. If we start caching WAL right away we would need to have two receivers. One to receive the missing WAL data and one to receive the current WAL data. We can't apply the WAL until we have the earlier missing WAL data, so caching it seems difficult. On a large server this might be GBs of data. Seems easier to not cache current WAL and to have just a single WALReceiver process that performs a mode change once it has caught up. (And I should say "if it catches up", since it is possible that it never actually will catch up, in practical terms, as this depends upon the relative power of the servers involved.). So there's no need to store more WAL on standby than is required to restart recovery from last restartpoint. i.e. we stream WAL at all times, not just in normal running mode. Seems easiest to have: * Startup process only reads next WAL record when the ReceivedLogPtr > ReadRecPtr, so it knows nothing of how WAL is received. Startup process reads directly from WAL files in *all* cases. ReceivedLogPtr is in shared memory and accessed via spinlock. Startup process only ever reads this pointer. (Notice that Startup process is modeless). * WALReceiver reads data from primary and writes it to WAL files, fsyncing (if ever requested to do so). WALReceiver updates ReceivedLogPtr. That is much simpler and more modular. Buffering of the WAL files is handled by filesystem buffering. If standby crashes, all data is safely written to WAL files and we restart from the correct place. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
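A minimal sketch of the shared hand-off Simon describes, written as a backend fragment (the struct and field names are invented; a plain uint64 stands in for the real XLogRecPtr of that era):

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical shared-memory area set up at postmaster start. */
    typedef struct ReceivedWalShmem
    {
        slock_t mutex;          /* protects receivedUpTo */
        uint64  receivedUpTo;   /* WAL up to here is safely in the standby's files */
    } ReceivedWalShmem;

    static ReceivedWalShmem *RcvShmem;

    /* WALReceiver: after writing (and, if requested, fsyncing) the WAL files. */
    static void advance_received_ptr(uint64 newPos)
    {
        SpinLockAcquire(&RcvShmem->mutex);
        if (newPos > RcvShmem->receivedUpTo)
            RcvShmem->receivedUpTo = newPos;
        SpinLockRelease(&RcvShmem->mutex);
    }

    /* Startup process: may ReadRecord() proceed past readPos yet? */
    static bool may_read_past(uint64 readPos)
    {
        uint64 received;

        SpinLockAcquire(&RcvShmem->mutex);
        received = RcvShmem->receivedUpTo;
        SpinLockRelease(&RcvShmem->mutex);

        return readPos < received;
    }

The Startup process stays modeless because it only ever reads the shared pointer and the WAL files; only the WALReceiver advances the pointer.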
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > I cannot accept that WAL sender is blocked for initial setup. Yes, very important point. We definitely agree on that. The primary must be able to continue working while all this setup is happening. No tables are locked, all commits are allowed etc. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, On Wednesday, September 10, 2008, Heikki Linnakangas wrote: > Sure. That's the fundamental problem with synchronous replication. > That's why many people choose asynchronous replication instead. Clearly > at some point you'll want to give up and continue without the slave, or > kill the master and fail over to the slave. I'm wondering how that's > different than the lag between master and server in asynchronous > replication from the client's point of view. As a future user of these new facilities, the difference from the client's POV is simple: in normal mode of operation, we want a strong guarantee that any COMMIT has made it to both the master and the slave at commit time. No lag whatsoever. You're considering lag as an option in case of failure, but I don't see this as acceptable when you need sync commit. In case of a network timeout, the cluster is down. So you want to either continue servicing in degraded mode or take the service down while you repair the cluster, but neither of those choices can be transparent to the admins, I'd argue. Of course, the main use case is high availability, which tends to say you do not have the option to stop service, and seems to dictate continuing to service in degraded mode: slave can't keep up (whatever the error domain), master is alone, "advertise" to monitoring solutions and continue servicing. And provide some way for the slave to "rejoin", maybe, too. > I'm not sure I understand that paragraph. Who's the user? Do we need to > expose some new information to the client so that it can do something? Maybe with some GUCs to set the acceptable "timeout" for the WAL sync process, and whether reaching the timeout is a warning or an error. With a userset GUC we could even have replication-error-level transactions concurrent with non-critical ones... Now what to do exactly in case of error remains to be decided... HTH, Regards, -- dim
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > Umm.. I disagree with you ;) That's no problem and I respect your knowledge. If we disagree, it is very likely because we have misunderstood each other. Much has been written, so I will wait for it to all be read and understood by you and others, and for me to read other posts and replies also. I feel sure that after some thought a clear consensus will emerge, and I feel hopeful that the feature can be done in the time available with simple code changes. So I will stop replying for a few hours to give everyone time (incl. me). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Hi, Simon Riggs wrote: > The standby server won't come up until you have: > * copied the base backup > * sent it to standby server > * bring up standby, have it realise it is a replication partner and > begin requesting WAL from primary (in some way) Right. That was your assumption as well. Required before step 1 in both cases. > If we start caching WAL right away we would need to have two receivers. > One to receive the missing WAL data and one to receive the current WAL > data. We can't apply the WAL until we have the earlier missing WAL data, > so cacheing it seems difficult. You could use the same receiver process and just handle different packets differently. I see no need for two separate receiver processes here. > On a large server this might be GBs of > data. ..if served from a log archive, correct. Without archiving, we are limited to xlog anyway. > Seems easier to not cache current WAL and to have just a single > WALReceiver process that performs a mode change once it has caught up. > (And I should say "if it catches up", since it is possible that it never > actually will catch up, in practical terms, since this depends upon the > relative power of the servers involved.). So there's no need to store > more WAL on standby than is required to restart recovery from last > restartpoint. i.e. we stream WAL at all times, not just in normal > running mode. Switching between streaming from files and 'live' streaming on the active node seems difficult to me, because you need to make sure there's no gap. That problem could be circumvented by handling this on the standby. If you think switching on the active is simple enough, that's fine. > Seems easiest to have: > * Startup process only reads next WAL record when the ReceivedLogPtr > > ReadRecPtr, so it knows nothing of how WAL is received. Startup process > reads directly from WAL files in *all* cases. ReceivedLogPtr is in > shared memory and accessed via spinlock. Startup process only ever reads > this pointer. (Notice that Startup process is modeless). Well, that's certainly easier for the standby, but requires mode switching on the active. Regards Markus Wanner
On Wed, 2008-09-10 at 11:07 +0200, Dimitri Fontaine wrote: > Hi, > > On Wednesday, September 10, 2008, Heikki Linnakangas wrote: > > Sure. That's the fundamental problem with synchronous replication. > > That's why many people choose asynchronous replication instead. Clearly > > at some point you'll want to give up and continue without the slave, or > > kill the master and fail over to the slave. I'm wondering how that's > > different than the lag between master and server in asynchronous > > replication from the client's point of view. > > As a future user of this new facilities, the difference from client's POV is > simple : in normal mode of operation, we want a strong guarantee that any > COMMIT has made it to both the master and the slave at commit time. No lag > whatsoever. Agreed. > You're considering lag as an option in case of failure, but I don't see this > as acceptable when you need sync commit. In case of network timeout, cluster > is down. So you want to either continue servicing in degraged mode or get the > service down while you repair the cluster, but neither of those choice can be > transparent to the admins, I'd argue. > > Of course, main use case is high availability, which tends to say you do not > have the option to stop service, We have a number of choices, at the point of failure: * Does the whole primary server stay up (probably)? * Do we continue to allow new transactions in degraded mode? (which increases the risk of transaction loss if we continue at that time). (The answer sounds like it will be "of course, stupid" but this cluster may be part of an even higher level HA mechanism, so the answer isn't always clear). * For each transaction that is trying to commit: do we want to wait forever? If not, how long? If we stop waiting, do we throw ERROR, or do we say, let's get on with another transaction. If the server is up, yet all connections in a session pool are stuck waiting for their last commits to complete, then most sysadmins would agree that the server is actually "down". Since no useful work is happening, or can be initiated - even read only. We don't need to address that issue in the same way for all transactions, is all I'm saying. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
* Simon Riggs <simon@2ndQuadrant.com> [080910 06:18]: > We have a number of choices, at the point of failure: > * Does the whole primary server stay up (probably)? The only sane choice is the one the admin makes. Any "predetermined" choice PG makes can (and will) be wrong in some situations. > * Do we continue to allow new transactions in degraded mode? (which > increases the risk of transaction loss if we continue at that time). > (The answer sounds like it will be "of course, stupid" but this cluster > may be part of an even higher level HA mechanism, so the answer isn't > always clear). The only sane choice is the one the admin makes. Any "predetermined" choice PG makes can (and will) be wrong in some situations. > * For each transaction that is trying to commit: do we want to wait > forever? If not, how long? If we stop waiting, do we throw ERROR, or do > we say, lets get on with another transaction. The only sane choice is the one the admin makes. Any "predetermined" choice PG makes can (and will) be wrong in some situations. > If the server is up, yet all connections in a session pool are stuck > waiting for their last commits to complete then most sysadmins would > agree that the server is actually "down". Since no useful work is > happening, or can be initiated - even read only. We don't need to > address that issue in the same way for all transactions, is all I'm > saying. Sorry to sound like a broken record here, but the whole point is to guarantee data safety. You can only start trading ACID for HA if you have the ACID guarantees in the first place (and for replication, this means across the cluster, including slaves). So in that light, I think it's pretty obvious that if a slave is considered part of an active synchronous replication cluster, in the face of "network lag", or even network failure, the master *must* pretty much halt all new commits in their tracks until that slave acknowledges the commit. Yes, that's going to cause a backup. That's the cost of synchronous replication. But that means the admin has to be able to control whether a slave is part of an active synchronous replication cluster or not. I hope that control eventually is a lot more than a GUC that says "when a slave is X seconds behind, abandon him"). I'd dream of a "replication" interface where I could add new slaves on the fly (and a nice tool that does pg_start_backup()/sync/apply WAL to sync and then subscribe), get slave status (maybe syncing/active/abandoned), and some average latency (i.e. something like svctm of iostat on your WAL disk) and some way to control the slave degradation from active to abandoned (like the above GUC, or maybe a callout/hook/script that runs when latency > X, etc, or both). And for async replication, you just have a "proxy" slave which does nothing but subscribe to your master, always acknowledge WAL right away so the master doesn't wait, and keep a local backlog of WAL it's sending out to many clients. This proxy slave doesn't slow down the master, but can feed clients across slow WAN links (that may not have the burst bandwidth to keep up with bursty master writes, but have aggregate bandwidth to keep pretty close to the master), or networks that drop out for a period, etc. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Wed, 2008-09-10 at 09:36 -0400, Aidan Van Dyk wrote: > * Simon Riggs <simon@2ndQuadrant.com> [080910 06:18]: > > > We have a number of choices, at the point of failure: > > * Does the whole primary server stay up (probably)? > > The only sane choice is the one the admin makes. Any "predetermined" choice > PG makes can (and will) be wrong in some situations. We are in agreement then. Those questions were listed as arguments in favour of a parameter to let the sysadmin choose. More than that, I was saying this can be selected for individual transactions, not just for the server as a whole (as other vendors do). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Tue, Sep 9, 2008 at 10:55 PM, Markus Wanner <markus@bluegap.ch> wrote: > Hi, > > ITAGAKI Takahiro wrote: >> >> Signals and locking, borrowed from Postgres-R, are now studied >> for the purpose in the log shipping, > > Cool. Let me know if you have any questions WRT this imessages stuff. If you're sure it's all right, I have a trivial question. Which signal should we use for the notification to the backend from the WAL sender? The obvious signals are already in use. Or, since a backend doesn't need to wait on select(), unlike the WAL sender, ISTM that it's not so inconvenient to use a semaphore for that notification. Your thoughts? regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, Fujii Masao wrote: > On Tue, Sep 9, 2008 at 10:55 PM, Markus Wanner <markus@bluegap.ch> wrote: >> Hi, >> >> ITAGAKI Takahiro wrote: >>> Signals and locking, borrowed from Postgres-R, are now studied >>> for the purpose in the log shipping, >> Cool. Let me know if you have any questions WRT this imessages stuff. > > If you're sure it's all right, I have a trivial question. Well, I know it works for me and I think it could work for you, too. That's all I'm saying. > Which signal should we use for the notification to the backend from > WAL sender? The notable signals are already used. I'm using SIGUSR1, see src/backend/storage/ipc/imsg.c from Postgres-R, line 232. That isn't in use for backends or the postmaster, AFAIK. > Or, since a backend don't need to wait on select() unlike WAL sender, > ISTM that it's not so inconvenient to use a semaphore for that notification. They probably could, but not the WAL sender. What's the benefit of semaphores? It seems pretty ugly to set up a semaphore, lock that on the WAL sender, then claim it on the backend to wait for it, and then release it on the WAL sender to notify the backend. If all you want to do is to signal the backend, why not use signals ;-) But maybe I'm missing something? Regards Markus Wanner
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > My sequence covers several cases : > > * There is no missing WAL file. > * There is a lot of missing WAL file. This is the likely case for any medium+ sized database. > * There are missing history files. Failover always generates the gap > of > history file because TLI is incremented when archive recovery is > completed. Yes, but failover doesn't happen while we are configuring replication, it can only happen after we have configured replication. It would be theoretically possible to take a copy from one server and then try to synchronise with a 3rd copy of the same server, but that seems perverse and bug-prone. So I advise that we only allow replication when the timeline of the standby matches the timeline of the master, having it as an explicit check. > In your design, does not initial setup block the master? > Does your design cover above-mentioned case? The way I described it does not block the master. It does defer the point at which we can start using synchronous replication, so perhaps that is your objection. I think it is acceptable: good food takes time to cook. I have thought about the approach you've outlined, though it seems to me now like a performance optimisation rather than something we must have. IMHO it will be confusing to be transferring both old and new data at the same time from master to slave. We will have two different processes sending and two different processes receiving. You'll need to work through about four times as many failure modes, all of which will need testing. Diagnosing problems in it via the log hurts my head just thinking about it. ISTM that will severely impact the initial robustness of the software for this feature. Perhaps in time it is the right way. Anyway, feels like we're getting close to some good designs. There isn't much difference between what we're discussing here. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, Sep 10, 2008 at 11:13 PM, Markus Wanner <markus@bluegap.ch> wrote: >> Which signal should we use for the notification to the backend from >> WAL sender? The notable signals are already used. > > I'm using SIGUSR1, see src/backend/storage/ipc/imsg.c from Postgres-R, line > 232. That isn't is use for backends or the postmaster, AFAIK. Umm... backends already use SIGUSR1. PostgresMain() sets up a signal handler for SIGUSR1 as follows. pqsignal(SIGUSR1, CatchupInterruptHandler); Which signal should the WAL sender send to backends? >> Or, since a backend don't need to wait on select() unlike WAL sender, >> ISTM that it's not so inconvenient to use a semaphore for that >> notification. > > They probably could, but not the WAL sender. Yes, since the WAL sender waits on select(), it's convenient to use a signal for notification *from backends to the WAL sender*, I think, too. Best regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
"Fujii Masao" <masao.fujii@gmail.com> writes: > Which signal should WAL sender send to backends? Sooner or later we shall have to bite the bullet and set up a multiplexing system to transmit multiple event types to backends with just one signal. We already did it for signals to the postmaster. regards, tom lane
Hi, Fujii Masao wrote: > Umm... backends have already used SIGUSR1. PostgresMain() sets up a signal > handler for SIGUSR1 as follows. Uh.. right. Thanks for pointing that out. Maybe just use SIGPIPE for now? > Yes, since WAL sender waits on select(), it's convenient to use signal > for the notification *from backends to WAL sender*, I think too. ..and I'd say you'd better use the same for WAL sender to backend communication, just for the sake of simplicity (and thus maintainability). Regards Markus Wanner
Hi, Tom Lane wrote: > Sooner or later we shall have to bite the bullet and set up a > multiplexing system to transmit multiple event types to backends with > just one signal. We already did it for signals to the postmaster. Agreed. However, it's non-trivial if you want reliable queues (i.e. no message skipped, as can happen with signals) for varying message sizes. My imessages stuff is certainly not perfect yet. But it works to some extent and provides exactly that functionality. However, I'd be happy to work on improving it, if other projects start using it as well. Anybody else interested? Use cases within Postgres itself as of now? Regards Markus Wanner
On Thu, Sep 11, 2008 at 3:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: > >> My sequence covers several cases : >> >> * There is no missing WAL file. >> * There is a lot of missing WAL file. > > This is the likely case for any medium+ sized database. I'm sorry, but I could not understand what you mean. > >> * There are missing history files. Failover always generates the gap >> of >> history file because TLI is incremented when archive recovery is >> completed. > > Yes, but failover doesn't happen while we are configuring replication, > it can only happen after we have configured replication. It would be > theoretically possible to take a copy from one server and then try to > synchronise with a 3rd copy of the same server, but that seems perverse > and bug prone. So I advise that we only allow replication when the > timeline of the standby matches the timeline of the master, having it as > an explicit check. Umm... my explanation seems to have been unclear :( Here is the case which I'm assuming. 1) Replication is configured, i.e. the master and the slave work fine. 2) The master goes down, then failover happens. When the slave becomes the master, the TLI is incremented, and a new history file is generated. 3) In order to catch up with the new master, the server which was originally the master needs the missing history file, because at this point there is a gap in TLI between the two servers. I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, configuration after failover always needs to take a base backup from the new master. That's unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files from the master to the slave. >> In your design, does not initial setup block the master? >> Does your design cover above-mentioned case? > > The way I described it does not block the master. It does defer the > point at which we can start using synchronous replication, so perhaps > that is your objection. I think it is acceptable: good food takes time > to cook. Yes. I understood your design. > IMHO it will be confusing to be transferring both old and new data at > the same time from master to slave. We will have two different processes > sending and two different processes receiving. You'll need to work > through about four times as many failure modes, all of which will need > testing. Diagnosing problems in it via the log hurts my head just > thinking about it. ISTM that will severely impact the initial robustness > of the software for this feature. Perhaps in time it is the right way. In my procedure, old WAL files are copied by the admin using scp, rsync or another external tool. So, I don't think that my procedure makes the problem more difficult. Since there are many setup cases, we should not leave all procedures to postgres, I think. > Anyway, feels like we're getting close to some good designs. There isn't > much difference between what we're discussing here. Yes. Thank you for your great ideas. -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote:
> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.

Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.

I agree that having to get a new base backup to get the old master to catch up with the new master sucks, so I hope someone sees a way around that.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Markus Wanner <markus@bluegap.ch> writes:
> Tom Lane wrote:
>> Sooner or later we shall have to bite the bullet and set up a multiplexing system to transmit multiple event types to backends with just one signal. We already did it for signals to the postmaster.
>
> Agreed. However, it's non-trivial if you want reliable queues (i.e. no messages skipped, as can happen with signals) for varying message sizes.

No, that's not what I had in mind at all, just the ability to deliver one of a specified set of event notifications --- ie, get around the fact that Unix only gives us two user-definable signal types. For signals sent from other backends, it'd be sufficient to put a bitmask field into PGPROC entries, which the sender could OR bits into before sending the one "real" signal event (either SIGUSR1 or SIGUSR2).

I'm not sure what to do if we need signals sent from processes that aren't connected to shared memory; but maybe we need not cross that bridge here. (Also, I gather that the Windows implementation could already support a bunch more signal types without much trouble.)

regards, tom lane
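To make the bitmask idea a bit more concrete, here is a minimal single-process sketch. The PROCSIG_* reason names are invented for illustration, and a plain global stands in for the shared-memory PGPROC field that a real implementation would use:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Illustrative "reason" bits; the names are made up for this sketch. */
    #define PROCSIG_WAL_SEND   (1 << 0)
    #define PROCSIG_CATCHUP    (1 << 1)

    /*
     * Stand-in for a field in the recipient's PGPROC entry.  In a real
     * implementation it would live in shared memory and be written by other
     * processes (with an atomic OR or under a spinlock); a plain global keeps
     * this sketch runnable as a single process.
     */
    static volatile sig_atomic_t pending_reasons = 0;

    static void
    handle_sigusr1(int signo)
    {
        (void) signo;
        /* Real code would do no work here; the process inspects and clears
         * pending_reasons later, at a safe point. */
    }

    /* Sender side: OR in a reason bit, then send the one "real" signal. */
    static void
    send_multiplexed(pid_t pid, int reason)
    {
        pending_reasons |= reason;
        kill(pid, SIGUSR1);
    }

    int
    main(void)
    {
        signal(SIGUSR1, handle_sigusr1);

        /* Pretend some other backend asks us to do WAL-send work. */
        send_multiplexed(getpid(), PROCSIG_WAL_SEND);

        if (pending_reasons & PROCSIG_WAL_SEND)
            printf("woken for a WAL send request\n");
        if (pending_reasons & PROCSIG_CATCHUP)
            printf("woken for catchup\n");
        pending_reasons = 0;        /* clear after acting on the bits */
        return 0;
    }

The only signal that actually travels is SIGUSR1; the bitmask tells the recipient why it was signalled.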
Hi,

Tom Lane wrote:
> No, that's not what I had in mind at all, just the ability to deliver one of a specified set of event notifications --- ie, get around the fact that Unix only gives us two user-definable signal types.

Ah, okay. And I already thought you'd like imessages :-(

> For signals sent from other backends, it'd be sufficient to put a bitmask field into PGPROC entries, which the sender could OR bits into before sending the one "real" signal event (either SIGUSR1 or SIGUSR2).

That might work for expanding the number of available signals, yes.

Regards

Markus Wanner
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I'm not sure what to do if we need signals sent from processes that aren't connected to shared memory; but maybe we need not cross that bridge here.

Such as signals coming from the postmaster? Isn't that where most of them come from though?

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I'm not sure what to do if we need signals sent from processes that aren't connected to shared memory; but maybe we need not cross that bridge here.
>
> Such as signals coming from the postmaster? Isn't that where most of them come from though?

Uh.. no, such as signals *going to* the postmaster. That's where we already have such a multiplexer in place, but not the other way around IIRC.

Regards

Markus Wanner
On Thu, 2008-09-11 at 18:17 +0300, Heikki Linnakangas wrote:
> Fujii Masao wrote:
>> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.
>
> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.
>
> I agree that having to get a new base backup to get the old master to catch up with the new master sucks, so I hope someone sees a way around that.

If we were going to recover from a failed-over standby back to the original master just via WAL logs, we would need all of the WAL files from the point of failover. So you'd need to be storing every WAL file just in case the old master recovers. I can't believe doing that would be the common case, because it's so impractical and most people would run out of disk space and need to delete WAL files.

It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 00:03 +0900, Fujii Masao wrote:
> In my procedure, old WAL files are copied by the admin using scp, rsync or another external tool. So, I don't think that my procedure makes the problem more difficult. Since there are many setup cases, we should not leave all procedures to postgres, I think.

So the procedure is:

1. Startup WALReceiver to begin receiving WAL
2. Do some manual stuff
3. Initiate recovery

So either:

* WALReceiver is not started by the postmaster. I don't think it's acceptable that WALReceiver is not under the postmaster. You haven't reduced the number of failure modes by doing that, you've just swept the problem under the carpet and pretended it's not Postgres' problem.

* Postgres startup requires some form of manual process, as an **intermediate** stage.

I don't think either of those is acceptable. It must just work.

Why not:

1. Same procedure as Warm Standby now
   a) WAL archiving to standby starts
   b) base backup

2. Startup standby, with an additional option to stream WAL. WALReceiver starts and connects to the primary. The primary issues a log switch. The archiver turns itself off after sending that last file. WALSender starts streaming the current WAL immediately after the log switch.

3. The startup process on the standby begins reading WAL from the point mentioned by backup_label. When it gets to the last logfile shipped by the primary's archiver, it switches to reading WAL files written by WALReceiver.

So all automatic. Uses existing code. Synchronous replication starts immediately. Also has the advantage that we do not get WAL bloat on the primary. Configuration is almost identical to the current Warm Standby, so little change for existing Postgres sysadmins.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
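For comparison, today's warm-standby setup is driven entirely by archive_command and restore_command; step 2 above would presumably add one more knob on the standby side. The streaming parameter below is purely hypothetical (only the first two settings exist today), and the hosts and paths are examples:

    # postgresql.conf on the primary (existing warm-standby configuration)
    archive_mode = on
    archive_command = 'scp %p standby:/var/lib/pgsql/walarchive/%f'

    # recovery.conf on the standby (existing)
    restore_command = 'cp /var/lib/pgsql/walarchive/%f %p'

    # hypothetical new setting for step 2 -- no such parameter exists yet
    stream_wal_from = 'host=primary port=5432'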
Simon Riggs wrote:
> If we were going to recover from a failed-over standby back to the original master just via WAL logs, we would need all of the WAL files from the point of failover. So you'd need to be storing every WAL file just in case the old master recovers. I can't believe doing that would be the common case, because it's so impractical and most people would run out of disk space and need to delete WAL files.

Depends on the transaction volume and database size, of course. It's actually not any different from the scenario where the slave goes offline for some reason. You have the same decision there of how long to keep the WAL files on the master, in case the slave wakes up. I think we'll need an option to specify a maximum for the number of WAL files to keep around. The DBA should set that to the size of the WAL drive, minus some safety factor.

> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.

Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>
> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..

We have it for WAL shipping, in the form of the GUC "archive_command" :)

Why not add full_backup_command ?

-------------- Hannu
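To sketch the analogy, such a setting might look like the following. The parameter name and the %d placeholder are hypothetical (only archive_command exists today), and the rsync invocation is just one way to ship only changed blocks, as Simon suggests below:

    # hypothetical, by analogy with archive_command;
    # assume %d expands to the standby's data directory
    full_backup_command = 'rsync -a --delete master:/var/lib/pgsql/data/ %d/'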
On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>
> I think we'll need an option to specify a maximum for the number of WAL files to keep around. The DBA should set that to the size of the WAL drive, minus some safety factor.
>
>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>
> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..

Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 17:24 +0300, Hannu Krosing wrote:
> On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
>> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..
>
> We have it for WAL shipping, in the form of the GUC "archive_command" :)
>
> Why not add full_backup_command ?

I see the current design is all master-push centered, i.e. the master is in control of everything WAL related. That makes it hard to create a slave which is simply pointed to the server and takes all its data from there...

Why not have a design where the slave is in control of its own data? I mean the slave could ask for the base files (possibly through a special function deployed on the master), then ask for the WAL stream and so on. That would easily let a slave cascade too, as it could relay the WAL stream and serve the base backup too... or have a special WAL repository software with the same interface as a normal master, but with a choice of base backups and WAL streams. Plus, a slave-in-control approach would also allow multiple slaves at the same time for a given master...

The way it would work would be something like:

* configure the slave with a postgres connection to the master;
* the slave will connect and set up some metadata on the master identifying itself and telling the master to keep the WAL needed by this slave, and also get some metadata about the master's details if needed;
* the slave will call a special function on the master and ask for the base backup to be streamed (potentially compressed with special knowledge of postgres internals);
* once the base backup is streamed, or possibly in parallel, ask for streaming of the WAL files;
* when the base backup is finished, start applying the WAL stream, which is cached in the meantime, while its streaming continues;
* keep the master updated about the state of the slave, so the master can know if it needs to keep the WAL files which were not yet streamed;
* in case of a network error, the slave connects again and starts to stream the WAL from where it left off;
* in case of an extended network outage, the master could decide to unsubscribe the slave when a certain timeout happens;
* when the slave finds itself unsubscribed after a longer disconnection, it could ask for a new base backup based on differences only... some kind of built-in rsync thingy;

The only downside of this approach is that the slave machine needs a full postgres superuser connection to the master. That could be a security problem in certain scenarios. The master-centric scenario needs a connection in the other direction, which might be seen as more secure, I don't know for sure...

Cheers, Csaba.
Simon Riggs wrote:
> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.

Yes, what I meant is that it would be cool to have that functionality built-in, so that you wouldn't need to configure extra rsync scripts and authentication etc.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.
>
> Yes, what I meant is that it would be cool to have that functionality built-in, so that you wouldn't need to configure extra rsync scripts and authentication etc.

If this were a nice pluggable library I'd agree, but AFAIK it's not, and I don't see great value in reinventing the wheel.

cheers

andrew
Csaba Nagy wrote:
> Why not have a design where the slave is in control of its own data? I mean the slave could ask for the base files (possibly through a special function deployed on the master), then ask for the WAL stream and so on. That would easily let a slave cascade too, as it could relay the WAL stream and serve the base backup too... or have a special WAL repository software with the same interface as a normal master, but with a choice of base backups and WAL streams. Plus, a slave-in-control approach would also allow multiple slaves at the same time for a given master...

I totally agree with that.

> The only downside of this approach is that the slave machine needs a full postgres superuser connection to the master. That could be a security problem in certain scenarios.

I think the master-slave protocol needs to be separate from the normal FE/BE protocol, with commands like "send a new base backup", or "subscribe to new WAL that's generated". A master-slave connection isn't associated with any individual database, for example. We can keep the permissions required for establishing a master-slave connection different from superuser-ness. In particular, while the slave will be able to read all data from the whole cluster, by receiving it in the WAL and base backups, it doesn't need to be able to modify anything on the master.

> The master-centric scenario needs a connection in the other direction, which might be seen as more secure, I don't know for sure...

Which one initiates the connection, the master or the slave, is a different question. I believe we've all assumed that it's the slave that connects to the master, and I think that makes the most sense.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
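Purely to give the idea some shape, a hypothetical set of message types for such a protocol could look like this; none of these names exist anywhere, it is only a sketch of what a non-FE/BE replication protocol might carry:

    /* Hypothetical master-slave protocol messages -- illustration only. */
    #include <stdint.h>

    typedef enum ReplMsgType
    {
        REPL_MSG_IDENTIFY,          /* standby introduces itself */
        REPL_MSG_BASE_BACKUP_REQ,   /* "send a new base backup" */
        REPL_MSG_BASE_BACKUP_DATA,  /* chunk of base backup from the master */
        REPL_MSG_WAL_SUBSCRIBE,     /* "subscribe to new WAL", with a start LSN */
        REPL_MSG_WAL_DATA,          /* streamed WAL records */
        REPL_MSG_WAL_ACK            /* standby acknowledges its flushed LSN */
    } ReplMsgType;

    typedef struct ReplMsgHeader
    {
        ReplMsgType type;
        uint32_t    length;         /* length of the payload that follows */
    } ReplMsgHeader;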
Hi,

Andrew Dunstan wrote:
> If this were a nice pluggable library I'd agree, but AFAIK it's not, and I don't see great value in reinventing the wheel.

I certainly agree. However, I thought of it more like the archive_command, as proposed by Hannu. That way we don't need to reinvent any wheel and still the standby could trigger the base data synchronization itself.

Regards

Markus Wanner
On Fri, 2008-09-12 at 17:11 +0200, Csaba Nagy wrote:
> Why not have a design where the slave is in control of its own data? I mean the slave...

The slave only exists because it is a copy of the master. If you try to "startup" a slave without first having taken a copy, how would you bootstrap the slave? With what? To what?

It sounds cool, but it's not practical. I posted a workable suggestion today on another subthread.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 17:45 +0100, Simon Riggs wrote:
> On Fri, 2008-09-12 at 17:11 +0200, Csaba Nagy wrote:
>> Why not have a design where the slave is in control of its own data? I mean the slave...
>
> The slave only exists because it is a copy of the master. If you try to "startup" a slave without first having taken a copy, how would you bootstrap the slave? With what? To what?

As I understand it, Csaba meant that the slave would "bootstrap itself" by connecting to the master in some early phase of startup, requesting a physical filesystem-level copy of the data, then commencing the startup in Hot Standby mode.

If done that way, all the slave needs is a superuser-level connection to the master database. Of course this can also be done using a little hot-standby startup script on the slave, if shell access to the master is provided.

------------------ Hannu
Hannu Krosing wrote:
> On Fri, 2008-09-12 at 17:45 +0100, Simon Riggs wrote:
>> On Fri, 2008-09-12 at 17:11 +0200, Csaba Nagy wrote:
>>> Why not have a design where the slave is in control of its own data? I mean the slave...
>>
>> The slave only exists because it is a copy of the master. If you try to "startup" a slave without first having taken a copy, how would you bootstrap the slave? With what? To what?
>
> As I understand it, Csaba meant that the slave would "bootstrap itself" by connecting to the master in some early phase of startup, requesting a physical filesystem-level copy of the data, then commencing the startup in Hot Standby mode.

Interesting ... This doesn't seem all that difficult -- all you need is to start one connection to get the WAL stream and save it somewhere; meanwhile a second connection uses a combination of pg_file_read on the master + pg_file_write on the slave to copy the data files over. When this step is complete, recovery of the stored WAL commences.

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Sep 12, 2008 at 7:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2008-09-11 at 18:17 +0300, Heikki Linnakangas wrote:
>> Fujii Masao wrote:
>>> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.
>>
>> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.
>>
>> I agree that having to get a new base backup to get the old master to catch up with the new master sucks, so I hope someone sees a way around that.
>
> If we were going to recover from a failed-over standby back to the original master just via WAL logs, we would need all of the WAL files from the point of failover. So you'd need to be storing every WAL file just in case the old master recovers. I can't believe doing that would be the common case, because it's so impractical and most people would run out of disk space and need to delete WAL files.

No. The original master doesn't need all the WAL files. It only needs the WAL file that its pg_control points to as the latest checkpoint location, plus the subsequent files.

> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.

PITR doesn't always need a base backup. We can do PITR from the data files left just after a crash, if they aren't corrupted (i.e. it was not a media crash). Depending on the situation, most users would like to choose the setup procedure with the smaller impact on the cluster. They would choose the procedure without a base backup if there are only a few WAL files to be replayed. Meanwhile, they would use a base backup if the indispensable WAL files have already been deleted. But in that case, they might not take a new base backup and instead use an old one (e.g. taken two days before).

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Sep 12, 2008 at 12:17 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>> I think that this case would often happen. So, we should establish a solution or procedure for the case where the TLI of the master doesn't match the TLI of the slave. If we only allow the case where the TLI of both servers is the same, the configuration after failover always needs to take a base backup on the new master. That is unacceptable for many users. But I think that it's the role of the admin or external tools to copy history files to the slave from the master.
>
> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.

Yes.

If the master fails after writing WAL to disk and before sending it to the slave, at least the latest WAL file would be inconsistent between the two servers. So, regardless of whether a base backup is used, the setup procedure needs to delete those inconsistent WAL files or overwrite them.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Simon Riggs wrote:
> On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
>>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>>
>> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..
>
> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.

Why rsync? Just compare the LSNs ...

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
> Simon Riggs wrote:
>> On Fri, 2008-09-12 at 17:08 +0300, Heikki Linnakangas wrote:
>>>> It should be clear that to make this work you must run with a base backup that was derived correctly on the current master. You can do that by re-copying everything, or you can do that by just shipping changed blocks (rsync etc). So I don't see a problem in the first place.
>>>
>>> Hmm, built-in rsync capability would be cool. Probably not in the first phase, though..
>>
>> Built-in? Why? I mean make the base backup using rsync. That way only changed data blocks need be migrated, so much faster.
>
> Why rsync? Just compare the LSNs ...

True, that's much better. Only works for data files, though, so we'll still need something else for clog etc. But the volume of the other stuff is much smaller, so I suppose we don't need to bother delta-compressing it.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
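To illustrate the LSN comparison, here is a standalone sketch that scans one relation file and reports which blocks are newer than a given cutoff. It assumes the standard 8 kB block size, that the page LSN is stored as two 32-bit values at the very start of each page header (the 8.3-era layout), and that the file is read on the host that wrote it; a real implementation would use the server's own page macros rather than re-deriving the format like this:

    /*
     * Standalone sketch: report which 8 kB blocks of a relation file have a
     * page LSN newer than a given cutoff.  Assumes the page LSN is the first
     * 8 bytes of the page header (xlogid, xrecoff as two uint32s).
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLCKSZ 8192

    int
    main(int argc, char **argv)
    {
        if (argc != 4)
        {
            fprintf(stderr, "usage: %s datafile cutoff_xlogid cutoff_xrecoff\n", argv[0]);
            return 1;
        }

        FILE       *fp = fopen(argv[1], "rb");
        uint64_t    cutoff = ((uint64_t) strtoul(argv[2], NULL, 0) << 32) |
                             (uint64_t) strtoul(argv[3], NULL, 0);
        unsigned char page[BLCKSZ];
        long        blkno = 0;

        if (fp == NULL)
        {
            perror("fopen");
            return 1;
        }

        while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
        {
            uint32_t    xlogid, xrecoff;

            memcpy(&xlogid, page, 4);       /* pd_lsn.xlogid */
            memcpy(&xrecoff, page + 4, 4);  /* pd_lsn.xrecoff */

            if (((((uint64_t) xlogid) << 32) | xrecoff) > cutoff)
                printf("block %ld changed (LSN %X/%X)\n",
                       blkno, (unsigned) xlogid, (unsigned) xrecoff);
            blkno++;
        }

        fclose(fp);
        return 0;
    }

Shipping would then only need to transfer the reported blocks, plus the non-data files such as clog that don't carry LSNs.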
Fujii Masao wrote:
> On Fri, Sep 12, 2008 at 12:17 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> Hmm. There are more problems than the TLI with that. For the original master to catch up by replaying WAL from the new slave, without restoring from a full backup, the original master must not write to disk *any* WAL that hasn't made it to the slave yet. That is certainly not true for asynchronous replication, but it also throws off the idea of flushing the WAL concurrently to the local disk and to the slave in synchronous mode.
>
> Yes.
>
> If the master fails after writing WAL to disk and before sending it to the slave, at least the latest WAL file would be inconsistent between the two servers. So, regardless of whether a base backup is used, the setup procedure needs to delete those inconsistent WAL files or overwrite them.

And if you're unlucky, the changes in the latest WAL file might already have been flushed to the data files as well.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:
> Why not:
>
> 1. Same procedure as Warm Standby now
>    a) WAL archiving to standby starts
>    b) base backup
>
> 2. Startup standby, with an additional option to stream WAL. WALReceiver starts and connects to the primary. The primary issues a log switch. The archiver turns itself off after sending that last file. WALSender starts streaming the current WAL immediately after the log switch.
>
> 3. The startup process on the standby begins reading WAL from the point mentioned by backup_label. When it gets to the last logfile shipped by the primary's archiver, it switches to reading WAL files written by WALReceiver.
>
> So all automatic. Uses existing code. Synchronous replication starts immediately. Also has the advantage that we do not get WAL bloat on the primary. Configuration is almost identical to the current Warm Standby, so little change for existing Postgres sysadmins.

I totally agree. Requiring the master to be down for a significant time to add a slave isn't going to keep people happy very long.

We have the technology now to allow warm standby slaves by using PITR, and it seems a similar system can be used to set up slaves, and for cases when the slave drops off and has to rejoin. The slave can use the existing 'restore_command' command to pull all the WAL files it needs, and then the slave needs to connect to the master and say it is ready for WAL. The master is going to need to send perhaps everything from the start of the current WAL file so the slave is sure to get all changes during the switch from 'restore_command' to network-passed WAL.

I can imagine the slave going in and out of network connectivity, as long as the required PITR files are still available.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2008-09-09 at 09:11 +0100, Simon Riggs wrote:
> This gives us the Group Commit feature also, even if we are not using replication. So we can drop the commit_delay stuff.
>
> XLogBackgroundFlush() processes a data page at a time if it can. That may not be the correct batch size for XLogBackgroundSend(), so we may need a tunable for the MTU. Under heavy load we need the Write and Send to act in a way that maximises throughput rather than minimises response time, as we do now.
>
> If wal_buffers overflows, we continue to hold WALInsertLock while we wait for WALWriter and WALSender to complete.
>
> We should increase the default wal_buffers to 64.
>
> After (or during) XLogInsert, backends will sleep in a proc queue, similar to LWLocks and protected by a spinlock. When preparing to write/send, the WAL process should read the proc at the *tail* of the queue to see what the next LogwrtRqst should be. Then it performs its action and wakes procs up starting with the head of the queue. We would add an LSN field into PGPROC, so WAL processes can check whether the backend should be woken. The LSN field can be accessed without spinlocks since it is only ever set by the backend itself and only read while the backend is sleeping. So we take the spinlock, find the tail, drop the spinlock, then read the LSN of the backend that (was) the tail.

I left off mentioning one other aspect of "Group Commit" behaviour that is possible with the above design.

If we use a proc queue, then we only wake up the *first* backend on the queue. That lets other WAL processes continue quickly. The reason for doing this is that the first backend can walk the commit queue collecting xids. When we update the ProcArray we can then update multiple backends' entries with a single request, rather than forcing all of the backends to queue up individually for the exclusive lock. When the first backend has updated the ProcArray, all of the updated backends will be released at once. Doing it that way will significantly reduce the number of exclusive lock requests for commits, which is the main source of contention on the ProcArray.

So that puts batch-setting behaviour in place for WALWriteLock and ProcArrayLock. And I'm submitting a patch for batch setting of clog entries around ClogControlLock. So we should get a scalability boost from all of this.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
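To make the queue mechanics a little more concrete, here is a self-contained sketch of the flush-then-release-up-to-LSN queue using plain pthreads. Everything in it is illustrative: the real version would use PGPROC entries, spinlocks and semaphores in shared memory, and the refinement where only the first backend is woken so it can batch the ProcArray updates is left out:

    /*
     * Standalone sketch of the proc-queue idea above, using pthreads in one
     * process instead of PGPROC entries, spinlocks and semaphores in shared
     * memory.  All names are illustrative only.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct WaitProc
    {
        uint64_t         lsn;       /* set only by the waiting backend itself */
        int              done;
        struct WaitProc *next;
    } WaitProc;

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;
    static WaitProc *head = NULL, *tail = NULL;
    static uint64_t  flushed_lsn = 0;
    static volatile int queued = 0;     /* driver bookkeeping only */

    /* Backend side: enqueue myself and sleep until my LSN has been flushed. */
    static void
    wait_for_flush(WaitProc *me, uint64_t my_lsn)
    {
        me->lsn = my_lsn;
        me->done = 0;
        me->next = NULL;
        pthread_mutex_lock(&queue_lock);
        if (tail)
            tail->next = me;
        else
            head = me;
        tail = me;
        queued++;
        while (!me->done)
            pthread_cond_wait(&queue_cond, &queue_lock);
        pthread_mutex_unlock(&queue_lock);
    }

    /* WAL-writer side: flush up to the tail's request, then release from the head. */
    static void
    flush_and_wake(void)
    {
        pthread_mutex_lock(&queue_lock);
        if (tail == NULL)
        {
            pthread_mutex_unlock(&queue_lock);
            return;
        }
        uint64_t target = tail->lsn;        /* the next LogwrtRqst */
        pthread_mutex_unlock(&queue_lock);

        /* ... the actual WAL write/send/fsync up to 'target' goes here ... */
        flushed_lsn = target;

        pthread_mutex_lock(&queue_lock);
        while (head && head->lsn <= flushed_lsn)
        {
            head->done = 1;
            head = head->next;
        }
        if (head == NULL)
            tail = NULL;
        pthread_cond_broadcast(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }

    static void *
    backend_main(void *arg)
    {
        WaitProc me;
        uint64_t lsn = (uint64_t) (uintptr_t) arg;

        wait_for_flush(&me, lsn);
        printf("backend with commit LSN %llu released\n", (unsigned long long) lsn);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t backends[3];

        for (int i = 0; i < 3; i++)
            pthread_create(&backends[i], NULL, backend_main,
                           (void *) (uintptr_t) ((i + 1) * 100));

        /* crude driver: wait until all three have queued, then one group flush */
        while (queued < 3)
            usleep(1000);
        flush_and_wake();

        for (int i = 0; i < 3; i++)
            pthread_join(backends[i], NULL);
        return 0;
    }

One flush call releases all three waiters at once, which is the batching effect described above; the leader-walks-the-queue ProcArray optimisation would sit on top of this.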