Thread: Synchronous replication, reading WAL for sending

Synchronous replication, reading WAL for sending

From: Heikki Linnakangas

As the patch stands, whenever an XLOG segment is switched in XLogInsert, we 
wait for the segment to be sent to the standby server. That's not good. 
Particularly in asynchronous mode, you'd expect the standby to not have 
any significant ill effect on the master. But in case of a flaky network 
connection, or a busy or dead standby, it can take a long time for the 
standby to respond, or the primary to give up. During that time, all WAL 
insertions on the primary are blocked. (How long is the default TCP 
timeout again?)
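
To illustrate the concern, here is a minimal Python sketch of a segment-switch hand-off that never waits on the standby. It is not the patch's code: `SegmentSwitchNotifier`, `on_segment_switch` and `max_pending` are hypothetical names, and it assumes a separate sender can later recover any skipped segment by reading it from disk.

```python
import queue


class SegmentSwitchNotifier:
    """Hypothetical hand-off between the WAL insert path and a sender."""

    def __init__(self, max_pending=4):
        # Bounded queue of segment numbers waiting to be sent.
        self.pending = queue.Queue(maxsize=max_pending)
        # Notifications that did not fit; the sender is assumed to
        # catch up on these by reading the segment files later.
        self.dropped = 0

    def on_segment_switch(self, segno):
        """Called from the insert path; returns immediately in all cases."""
        try:
            self.pending.put_nowait(segno)
            return True
        except queue.Full:
            # Slow network or dead standby: record it and move on
            # instead of blocking WAL insertion.
            self.dropped += 1
            return False
```

With `max_pending=2`, a third switch while the sender is stalled returns `False` right away rather than waiting for a TCP timeout.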

Another point is that in the future, we really shouldn't require setting 
up archiving and file-based log shipping using external scripts, when 
all you want is replication. It should be enough to restore a base 
backup on the standby, and point it to the IP address of the primary, 
and have it catch up. This is very important, IMHO. It's quite a lot of 
work to set up archiving and log-file shipping, for no obvious reason. 
It's really only needed at the moment because we're building this 
feature from spare parts.

For those reasons, we need a way to send arbitrary ranges of WAL from 
primary to standby. The current method where the WAL is read from 
wal_buffers obviously only works for very recent WAL pages that are 
still in wal_buffers. The design should be changed so that instead of 
reading from wal_buffers, the WAL is read from the filesystem.

Sending directly from wal_buffers can be provided as a fastpath when 
sending recent enough WAL range, but I wouldn't bother complicating the 
code for now.
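
A rough sketch of what "send an arbitrary range of WAL from files" involves, in Python for brevity. This is a simplified model, not PostgreSQL's layout: WAL positions are treated as flat byte offsets, `SEGMENT_SIZE` is an assumed 16 MB, and the file naming in `segment_path` is hypothetical.

```python
import os

SEGMENT_SIZE = 16 * 1024 * 1024  # assumed segment size for this sketch


def segment_path(wal_dir, segno):
    # Hypothetical naming scheme: one file per segment, hex-numbered.
    return os.path.join(wal_dir, f"{segno:08X}")


def split_range(start, end, segment_size=SEGMENT_SIZE):
    """Split the byte range [start, end) into (segno, offset, length) pieces."""
    pieces = []
    pos = start
    while pos < end:
        segno, offset = divmod(pos, segment_size)
        # Never read past the end of the current segment file.
        length = min(end - pos, segment_size - offset)
        pieces.append((segno, offset, length))
        pos += length
    return pieces


def read_wal_range(wal_dir, start, end, segment_size=SEGMENT_SIZE):
    """Read [start, end) from segment files; fail if any piece is short."""
    chunks = []
    for segno, offset, length in split_range(start, end, segment_size):
        with open(segment_path(wal_dir, segno), "rb") as f:
            f.seek(offset)
            data = f.read(length)
        if len(data) != length:
            raise IOError(f"segment {segno:08X} is shorter than expected")
        chunks.append(data)
    return b"".join(chunks)
```

A range that crosses a segment boundary simply becomes two reads, so the sender does not care how old the requested WAL is as long as the files still exist.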

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Synchronous replication, reading WAL for sending

From: Simon Riggs

On Tue, 2008-12-23 at 17:42 +0200, Heikki Linnakangas wrote:

> As the patch stands, whenever XLOG segment is switched in XLogInsert, we 
> wait for the segment to be sent to the standby server. That's not good. 
> Particularly in asynchronous mode, you'd expect the standby to not have 
> any significant ill effect on the master. But in case of a flaky network 
> connection, or a busy or dead standby, it can take a long time for the 
> standby to respond, or the primary to give up. During that time, all WAL 
> insertions on the primary are blocked. (How long is the default TCP 
> timeout again?)

Ugh, didn't see that. Get rid of that. We managed to get rid of the
fsync of the control file when we changed WAL file at start of 8.3. That
had a major effect on performance, via reduced response time profiles.
No need to re-introduce a delay in the same place.

> Another point is that in the future, we really shouldn't require setting 
> up archiving and file-based log shipping using external scripts, when 
> all you want is replication. It should be enough to restore a base 
> backup on the standby, and point it to the IP address of the primary, 
> and have it catch up. This is very important, IMHO. It's quite a lot of 
> work to set up archiving and log-file shipping, for no obvious reason. 
> It's really only needed at the moment because we're building this 
> feature from spare parts.

Happy for that to be hidden more from users.

> For those reasons, we need a way to send arbitrary ranges of WAL from 
> primary to standby. The current method where the WAL is read from 
> wal_buffers obviously only works for very recent WAL pages that are 
> still in wal_buffers. The design should be changed so that instead of 
> reading from wal_buffers, the WAL is read from filesystem.

There are two basic ways: from memory and from files. Sure we can hide
the two mechanisms in code better, but they will remain fairly distinct.

> Sending directly from wal_buffers can be provided as a fastpath when 
> sending recent enough WAL range, but I wouldn't bother complicating the 
> code for now.

Sounds like you are saying we should completely replace
write-from-buffers with write-from-file?

Sending from wal_buffers is OK if wal_buffers is large enough. If
streaming replication falls so far behind that we have problems then
there are larger issues to worry about, such as whether the primary is
being driven too hard for the network to cope.

Copying direct from memory means that a disk problem that occurs on the
primary will never cause corruption on the standby. Reading WAL files
can mean that corruptions get propagated.

The current design allows for file based WAL sending, if the connection
is so poor that streaming won't work.


If you are seriously suggesting these things now then I'd like to see
some diagrams, designs and descriptions so we can all understand what is
being suggested and how it will cope with all the current requirements.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Synchronous replication, reading WAL for sending

From: "Fujii Masao"

Hi,

On Wed, Dec 24, 2008 at 1:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Tue, 2008-12-23 at 17:42 +0200, Heikki Linnakangas wrote:
>
>> As the patch stands, whenever XLOG segment is switched in XLogInsert, we
>> wait for the segment to be sent to the standby server. That's not good.
>> Particularly in asynchronous mode, you'd expect the standby to not have
>> any significant ill effect on the master. But in case of a flaky network
>> connection, or a busy or dead standby, it can take a long time for the
>> standby to respond, or the primary to give up. During that time, all WAL
>> insertions on the primary are blocked. (How long is the default TCP
>> timeout again?)
>
> Ugh, didn't see that. Get rid of that. We managed to get rid of the
> fsync of the control file when we changed WAL file at start of 8.3. That
> had a major effect on performance, via reduced response time profiles.
> No need to re-introduce a delay in the same place.

Yes, I will get rid of it. Should that be for the async case only, or
for both async and sync?

>> For those reasons, we need a way to send arbitrary ranges of WAL from
>> primary to standby. The current method where the WAL is read from
>> wal_buffers obviously only works for very recent WAL pages that are
>> still in wal_buffers. The design should be changed so that instead of
>> reading from wal_buffers, the WAL is read from filesystem.

By filesystem, do you mean only pg_xlog? If it also includes the
archive, we might have to execute the restore command in order to read
WAL from the filesystem.

> If you are seriously suggesting these things now then I'd like to see
> some diagrams, designs and descriptions so we can all understand what is
> being suggested, how it will cope with all the current requirements.

I would like to see that too.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication, reading WAL for sending

From: "Pavan Deolasee"

On Tue, Dec 23, 2008 at 9:12 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> As the patch stands, whenever XLOG segment is switched in XLogInsert, we
> wait for the segment to be sent to the standby server. That's not good.
> Particularly in asynchronous mode, you'd expect the standby to not have any
> significant ill effect on the master. But in case of a flaky network
> connection, or a busy or dead standby, it can take a long time for the
> standby to respond, or the primary to give up. During that time, all WAL
> insertions on the primary are blocked. (How long is the default TCP timeout
> again?)
>
> Another point is that in the future, we really shouldn't require setting up
> archiving and file-based log shipping using external scripts, when all you
> want is replication. It should be enough to restore a base backup on the
> standby, and point it to the IP address of the primary, and have it catch
> up. This is very important, IMHO. It's quite a lot of work to set up
> archiving and log-file shipping, for no obvious reason. It's really only
> needed at the moment because we're building this feature from spare parts.
>

I had similar suggestions when I first wrote the high level design doc.
From the wiki page:

- WALSender reads from WAL buffers and/or WAL files and sends the
buffers to WALReceiver. In phase one, we may assume that WALSender can
only read from WAL buffers and WAL files in pg_xlog directory. Later
on, this can be improved so that WALSender can temporarily restore
archived files and read from that too.

I am not so sure about whether we must support archive files or not,
but I agree that at least supporting pg_xlog files will be necessary
if we want to support seamless catchup after restart.

> For those reasons, we need a way to send arbitrary ranges of WAL from
> primary to standby. The current method where the WAL is read from
> wal_buffers obviously only works for very recent WAL pages that are still in
> wal_buffers. The design should be changed so that instead of reading from
> wal_buffers, the WAL is read from filesystem.
>
> Sending directly from wal_buffers can be provided as a fastpath when sending
> recent enough WAL range, but I wouldn't bother complicating the code for
> now.
>

How would that work for sync replication? Or are you suggesting that
the WAL is first written to disk and then read back again to be sent
to the standby? I think that reading from files is additional work in
the sync path when we already have access to the WAL buffers.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Synchronous replication, reading WAL for sending

From: "Fujii Masao"

Hi,

On Wed, Dec 24, 2008 at 2:34 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> On Tue, Dec 23, 2008 at 9:12 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> As the patch stands, whenever XLOG segment is switched in XLogInsert, we
>> wait for the segment to be sent to the standby server. That's not good.
>> Particularly in asynchronous mode, you'd expect the standby to not have any
>> significant ill effect on the master. But in case of a flaky network
>> connection, or a busy or dead standby, it can take a long time for the
>> standby to respond, or the primary to give up. During that time, all WAL
>> insertions on the primary are blocked. (How long is the default TCP timeout
>> again?)
>>
>> Another point is that in the future, we really shouldn't require setting up
>> archiving and file-based log shipping using external scripts, when all you
>> want is replication. It should be enough to restore a base backup on the
>> standby, and point it to the IP address of the primary, and have it catch
>> up. This is very important, IMHO. It's quite a lot of work to set up
>> archiving and log-file shipping, for no obvious reason. It's really only
>> needed at the moment because we're building this feature from spare parts.
>>
>
> I had similar suggestions when I first wrote the high level design doc.
> From the wiki page:
>
> - WALSender reads from WAL buffers and/or WAL files and sends the
> buffers to WALReceiver. In phase one, we may assume that WALSender can
> only read from WAL buffers and WAL files in pg_xlog directory. Later
> on, this can be improved so that WALSender can temporarily restore
> archived files and read from that too.

Do you mean that walsender alone performs both xlog streaming and
copying from pg_xlog, serially? I think that this would degrade
performance. I'm also worried about the case where xlog is generated
on the primary faster than it can be copied to the standby. We might
never be able to start xlog streaming.
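
The worry can be put in numbers. A minimal Python sketch, assuming constant rates (a simplification, since real workloads are bursty); `catchup_seconds` and its parameters are hypothetical names used only for illustration.

```python
def catchup_seconds(backlog_bytes, send_rate, generate_rate):
    """Time for the standby to catch up, or None if it never does.

    backlog_bytes: WAL already generated but not yet copied.
    send_rate:     bytes/second the copy to the standby can sustain.
    generate_rate: bytes/second of new WAL produced meanwhile.
    """
    if backlog_bytes <= 0:
        return 0.0
    if send_rate <= generate_rate:
        # The backlog grows or stays constant: streaming never starts.
        return None
    # The backlog shrinks at the difference between the two rates.
    return backlog_bytes / (send_rate - generate_rate)
```

For example, a backlog of 1000 bytes with a send rate of 100 bytes/s and a generation rate of 50 bytes/s drains in 20 seconds; with equal rates it never drains.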

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication, reading WAL for sending

From: "Pavan Deolasee"

On Wed, Dec 24, 2008 at 1:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>
> And, I'm worried about the situation that the speed to generate xlog
> on the primary is higher than that to copy them to the standby. We
> might not be able to start xlog streaming forever.
>

If that's the case, how do you expect the standby to keep pace with
the primary after initial sync up? Frankly, I very much doubt that
on a relatively high load setup the standby will be able to keep pace
with the primary, for two reasons:

- Lack of read ahead of data blocks (Suzuki-san's work may help this)
- Single threaded recovery

But then these are general problems which may impact any log-based replication.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Synchronous replication, reading WAL for sending

From: "Fujii Masao"

Hi,

On Wed, Dec 24, 2008 at 5:48 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> On Wed, Dec 24, 2008 at 1:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>>
>> And, I'm worried about the situation that the speed to generate xlog
>> on the primary is higher than that to copy them to the standby. We
>> might not be able to start xlog streaming forever.
>>
>
> If that's the case, how do you expect the standby to keep pace with
> the primary after initial sync up ?

Good question. If streaming and copying are performed in parallel,
that situation doesn't happen, because the speed at which xlog is
generated also depends on streaming. This is a price to pay. I think
that the serial operations would need a "pacemaker", and I don't know
of a better pacemaker than concurrent streaming.

> Frankly, I myself have every doubt
> that on a relatively high load setup, the standby will not be able
> keep pace with the primary for two reasons:
>
> - Lack of read ahead of data blocks (Suzuki-san's work may help this)
> - Single threaded recovery
>
> But then these are general problems which may impact any log-based replication.

Right. Supporting a really high load setup is probably impossible. There
is certainly a price to pay. But, in order to reduce that price as much
as possible, I think that we should not concentrate two or more
operations in a single process (walsender), just as single threaded
recovery does.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Synchronous replication, reading WAL for sending

From: Simon Riggs

On Wed, 2008-12-24 at 18:31 +0900, Fujii Masao wrote:

> > Frankly, I myself have every doubt
> > that on a relatively high load setup, the standby will not be able
> > keep pace with the primary for two reasons:
> >
> > - Lack of read ahead of data blocks (Suzuki-san's work may help this)
> > - Single threaded recovery
> >
> > But then these are general problems which may impact any log-based replication.
> 
> Right. Completely high load setup is probably impossible. There is
> certainly a price to pay. But, in order to reduce a price as much as
> possible, I think that we should not focus two or more operations
> on single process (walsender) just like single threaded recovery.

I think we may be pleasantly surprised. 

In 8.3 there were two main sources of wait:

* restartpoints
* waiting for archive files

Restartpoints will now be handled by bgwriter, giving probably 20% gain,
plus the WAL data is streamed directly into memory by walreceiver. So I
think the startup process may achieve a better steady state and perform
very quickly.

Suzuki-san's numbers show that full_page_writes = on does not benefit
significantly from having read ahead, and we already know that
full_page_writes = on is effective in reducing the I/O bottleneck
during recovery.

If we want to speed up recovery more, I think we'll see the need for an
additional process to do WAL CRC checks. 

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Synchronous replication, reading WAL for sending

From: "Pavan Deolasee"

On Wed, Dec 24, 2008 at 3:01 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>
> Good question. If streaming and copying are performed parallelly,
> such situation doesn't happen because the speed to generate xlog
> also depends on streaming. This is a price to pay. I think that the
> serial operations would need a "pace maker". And, I don't know
> better pace maker than concurrent streaming.
>

These operations need not even be parallel. My apologies if this has
been discussed before, but what we are talking about is just a stream
of WAL starting at some LSN. The only difference is that the LSN
itself may be in buffers or in the files. So walsender would send as
much as it can from the files and then switch to read from buffers.
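
As a sketch of that idea in Python: given the range to send and the oldest position still held in buffers, decide which part comes from files and which from buffers. `plan_send` and `oldest_buffered` are hypothetical names, and positions are plain integers standing in for LSNs.

```python
def plan_send(start, end, oldest_buffered):
    """Plan how to send [start, end) as a list of (source, from, to).

    Anything older than oldest_buffered has been evicted from the
    buffers and must be read from files; the rest comes from buffers.
    """
    plan = []
    if start < oldest_buffered:
        file_end = min(end, oldest_buffered)
        plan.append(("files", start, file_end))
        start = file_end
    if start < end:
        plan.append(("buffers", start, end))
    return plan
```

A standby that reconnects at position 100 while the buffers start at 300 gets `[("files", 100, 300), ("buffers", 300, 500)]` for a request up to 500: one stream, two sources, no parallelism needed.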

Also, I think you are underestimating the power of the network for most
practical purposes. Networks are usually not bottlenecks unless we are
talking about slow WAN setups, and I am not sure how common those are
for PG users.
>
> Right. Completely high load setup is probably impossible.

If that's the case, I don't think you need to worry too much about
network or the walsender being a bottleneck for initial sync up (and
note that we are only talking about WAL sync up and not the base
backup).

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Synchronous replication, reading WAL for sending

From: "Pavan Deolasee"

On Wed, Dec 24, 2008 at 3:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>
> If we want to speed up recovery more, I think we'll see the need for an
> additional process to do WAL CRC checks.
>

Yeah, any such helper process along with other optimizations would
certainly help. But I still can't believe that on a high load, high
end setup, a single recovery process without any read-ahead for data
blocks can keep pace with the WAL generated by hundreds of processes
at the primary and shipped over a high speed link to the standby.

BTW, on a completely different note, given that the entire recovery is
based on physical redo, are there any inherent limitations that prevent
parallel recovery, where different recovery processes apply redo logs
to completely independent sets of data blocks? I also sometimes wonder
why we don't have block level recovery when a single block in the
database is corrupted. Can't this be done by just selectively applying
WAL records to that particular block? If it's just because nobody had
time/interest to do this, then it's OK, but I wonder if there are any
design issues.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Synchronous replication, reading WAL for sending

From: Simon Riggs

On Wed, 2008-12-24 at 15:51 +0530, Pavan Deolasee wrote:
> On Wed, Dec 24, 2008 at 3:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> >
> > If we want to speed up recovery more, I think we'll see the need for an
> > additional process to do WAL CRC checks.
> >
> 
> Yeah, any such helper process along with other optimizations would
> certainly help. But I can't still believe that on a high load, high
> end setup, single recovery process without any read-ahead for data
> blocks, can keep pace with the WAL generated by hundreds of processes
> at the primary and shipped over a high speed link to standby.

Suzuki-san has provided measurements. I think we need more. With
bgwriter performing restartpoints, we'll find that more RAM helps much
more than it did previously.

> BTW, on a completely different note, given that the entire recovery is
> based on physical redo, are there any inherent limitations that we
> can't do parallel recovery where different recovery processes apply
> redo logs to completely independent set of data blocks ? 

That's possible, but will significantly complicate the recovery code.
Retaining the ability to do standby queries would be almost impossible
in that case, since you would need to parallelise the WAL stream without
changing the commit order of transactions.
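
For illustration only, a minimal Python sketch of the partitioning being discussed. It assumes each record touches exactly one block and uses hypothetical `(lsn, block_id, payload)` tuples with integer block ids. It keeps per-block order but, as noted above, does nothing to preserve commit order across workers.

```python
def partition_by_block(records, num_workers):
    """Assign records to workers so each block is handled by one worker.

    records must already be in LSN order; appending in that order keeps
    the records for any single block in LSN order within its worker.
    """
    queues = [[] for _ in range(num_workers)]
    for lsn, block_id, payload in records:
        # Same block always maps to the same worker.
        queues[block_id % num_workers].append((lsn, block_id, payload))
    return queues
```

Records for block 10 always land on the same worker in their original order, but a commit seen by worker 0 may be applied before or after an earlier change handled by worker 1, which is exactly the ordering problem for standby queries.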

The main CPU bottleneck is CRC, by a long way. Moving that effort away
from the startup process is the best next action, AFAICS.
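
A small Python sketch of the check that would be moved out of the startup process. `zlib.crc32` stands in for the WAL CRC purely for illustration (not a claim about the algorithm PostgreSQL uses), and the `(payload, stored_crc)` record shape is hypothetical. Only the check itself is shown; running it in a helper process ahead of replay is left out.

```python
import zlib


def make_record(payload):
    """Build a hypothetical record: the payload plus its checksum."""
    return (payload, zlib.crc32(payload))


def count_valid_prefix(records):
    """Count leading records whose checksum matches.

    Replay may only proceed up to the first mismatch, so a helper that
    computes this ahead of time tells the replaying process how far it
    can go without checking CRCs itself.
    """
    valid = 0
    for payload, stored_crc in records:
        if zlib.crc32(payload) != stored_crc:
            break
        valid += 1
    return valid
```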

> I also
> sometimes wonder why we don't have block level recovery when a single
> block in the database is corrupted. Can't this be done by just
> selectively applying WAL records to that particular block ? If it's
> just because nobody had time/interest to do this, then it's OK, but I
> wonder if there are any design issues.

You'll be able to do this with my rmgr patch. Selective application of
WAL records is one of the primary use cases, but there are many others.
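
To make "selective application" concrete, a toy Python sketch. It does not show the rmgr patch; the `(lsn, block_id, offset, data)` record format is hypothetical, and it assumes a good base image of the block (for example from a backup) is available to replay onto.

```python
def replay_block(records, target_block, base_image):
    """Rebuild one block by applying only the records that touch it.

    Returns the rebuilt page and the number of records applied.
    """
    page = bytearray(base_image)
    applied = 0
    # Apply strictly in LSN order, skipping every other block.
    for lsn, block_id, offset, data in sorted(records, key=lambda r: r[0]):
        if block_id != target_block:
            continue
        page[offset:offset + len(data)] = data
        applied += 1
    return bytes(page), applied
```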

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Synchronous replication, reading WAL for sending

From: Mark Mielke

Fujii Masao wrote:
>> - WALSender reads from WAL buffers and/or WAL files and sends the
>> buffers to WALReceiver. In phase one, we may assume that WALSender can
>> only read from WAL buffers and WAL files in pg_xlog directory. Later
>> on, this can be improved so that WALSender can temporarily restore
>> archived files and read from that too.
>>     
> You mean that only walsender performs xlog streaming and copying
> from pg_xlog serially? I think that this would degrade the performance.
> And, I'm worried about the situation that the speed to generate xlog
> on the primary is higher than that to copy them to the standby. We
> might not be able to start xlog streaming forever.
>   

I've seen a few references to this. Somebody else mentioned how a single 
TCP/IP stream might not have the bandwidth to match changes to the database.

TCP/IP streams do have a window size that adjusts with the load, and 
unless one gets into aggressive networking such as BitTorrent, which 
arguably reduces performance of the entire network, why shouldn't one 
TCP/IP stream be enough? And if one TCP/IP stream isn't enough, doesn't 
this point to much larger problems, that won't be solved by streaming it 
some other way over the network? As in, it doesn't matter what you do - 
your network pipe isn't big enough?

Over the Internet from my house to a co-located box, I can reliably get 
1.1+ Mbyte/s using a single TCP/IP connection.  The network connection 
at the co-lo is 10Mbit/s and my Internet connection to my house is also 
10Mbit/s. One TCP/IP connection seems pretty capable to stream data to 
the full potential of the network...

Also, I assume that most database loads have peaks and lows. Especially 
for very large updates, perhaps end of day processing, I see it as a 
guarantee that all of the standbys will fall "more behind" for a period 
(a few seconds to a minute?), but they will catch up shortly after the 
peak is over.
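
A toy Python simulation of that behaviour, with made-up numbers: the standby's lag grows while WAL is generated faster than the link can carry it and drains once the peak passes. `simulate_lag` and its units (arbitrary "units per tick") are assumptions for illustration.

```python
def simulate_lag(generated_per_tick, send_capacity):
    """Return the standby's backlog after each tick.

    generated_per_tick: WAL produced in each tick.
    send_capacity:      WAL the link can ship per tick.
    """
    lag = 0
    history = []
    for generated in generated_per_tick:
        # Backlog grows by what was generated and shrinks by what was
        # shipped; it cannot go below zero.
        lag = max(0, lag + generated - send_capacity)
        history.append(lag)
    return history
```

With a capacity of 10 and a load of `[5, 5, 20, 20, 5, 5, 5]`, the backlog peaks at 20 during the burst and is still draining (5 left) three ticks after it ends.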

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>