Thread: Parallelize stream replication process

Parallelize stream replication process

From
Li Japin
Date:
Hi, hackers

For now, postgres uses a single process to send, receive and replay the WAL when we use streaming replication;
is there any point in parallelizing this process? If so, how do we start?

Any thoughts?

Best regards

Japin Li


Re: Parallelize stream replication process

From
Bharath Rupireddy
Date:
On Tue, Sep 15, 2020 at 9:27 AM Li Japin <japinli@hotmail.com> wrote:
>
> For now, postgres uses a single process to send, receive and replay the WAL when we use streaming replication;
> is there any point in parallelizing this process? If so, how do we start?
>
> Any thoughts?
>

I think we must ask a few questions:

1. What's the major gain we get out of this? Is it that the time to
stream gets reduced or something else?
If the answer to the above point is something solid, then
2. How do we distribute the work to multiple processes?
3. Do we need all of the workers to maintain the order in which they
read the WAL files (on the publisher) and apply the changes (on the
subscriber)?
4. Do we want to map the sender/publisher workers to
receiver/subscriber workers on a one-to-one basis? If not, how do we
do it?
5. How do sender and receiver workers communicate?
6. What if we have multiple subscribers/receivers?

I'm no expert in replication, so I may be wrong as well. Others may have
better thoughts.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallelize stream replication process

From
Fujii Masao
Date:

On 2020/09/15 13:41, Bharath Rupireddy wrote:
> On Tue, Sep 15, 2020 at 9:27 AM Li Japin <japinli@hotmail.com> wrote:
>>
>> For now, postgres uses a single process to send, receive and replay the WAL when we use streaming replication;
>> is there any point in parallelizing this process? If so, how do we start?
>>
>> Any thoughts?

This is probably a different kind of parallelism from what you're thinking of,
but I was thinking of starting up the walwriter process in the standby server
and making it fsync the streamed WAL data. This means that we hand over
part of the walreceiver process's tasks to walwriter: walreceiver
performs WAL receive and write, and walwriter performs WAL flush,
in parallel. I'm just expecting that this change would improve
replication performance, e.g., reduce the time to wait for
synchronous replication.

Without this change (i.e., currently), only walreceiver performs
WAL receive, write and flush. So while walreceiver is fsyncing WAL data,
it cannot receive newly-arrived WAL data. If the WAL flush takes a long time,
the time to wait for synchronous replication on the primary grows accordingly.
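
Just to illustrate the intended division of labor, here is a minimal standalone
sketch (plain pthreads and POSIX I/O, not actual walreceiver/walwriter code):
one thread keeps appending received data while a second thread fsyncs whatever
has been written so far, so a slow flush no longer blocks receiving.

/*
 * Standalone sketch only (NOT PostgreSQL code): a "receiver" thread appends
 * chunks to a WAL-like file while a separate "flusher" thread fsyncs whatever
 * has been written so far, so writing and flushing overlap.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int fd;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;
static long written = 0;        /* bytes written so far (walreceiver's job) */
static long flushed = 0;        /* bytes fsynced so far (walwriter's job)   */
static int  done = 0;

static void *
flusher(void *arg)
{
    (void) arg;
    for (;;)
    {
        long        target;

        pthread_mutex_lock(&lock);
        while (flushed == written && !done)
            pthread_cond_wait(&more, &lock);
        if (done && flushed == written)
        {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        target = written;
        pthread_mutex_unlock(&lock);

        fsync(fd);              /* flush while the receiver keeps writing */

        pthread_mutex_lock(&lock);
        flushed = target;
        pthread_mutex_unlock(&lock);
    }
}

int
main(void)
{
    char        chunk[8192];
    pthread_t   tid;
    int         i;

    memset(chunk, 'x', sizeof(chunk));
    fd = open("fake_wal", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return 1;
    pthread_create(&tid, NULL, flusher, NULL);

    for (i = 0; i < 1000; i++)  /* "receive" and write WAL chunks */
    {
        if (write(fd, chunk, sizeof(chunk)) > 0)
        {
            pthread_mutex_lock(&lock);
            written += sizeof(chunk);
            pthread_cond_signal(&more);     /* wake the flusher */
            pthread_mutex_unlock(&lock);
        }
    }

    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&more);
    pthread_mutex_unlock(&lock);
    pthread_join(tid, NULL);
    close(fd);
    printf("wrote %ld bytes, flushed %ld bytes\n", written, flushed);
    return 0;
}

In the real servers this would of course be two separate processes coordinating
through shared memory and latches rather than two threads, but the pipelining
effect is the same.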

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Parallelize stream replication process

From
Li Japin
Date:
Thanks for clarifying the questions!

> On Sep 15, 2020, at 12:41 PM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Sep 15, 2020 at 9:27 AM Li Japin <japinli@hotmail.com> wrote:
>>
>> For now, postgres uses a single process to send, receive and replay the WAL when we use streaming replication;
>> is there any point in parallelizing this process? If so, how do we start?
>>
>> Any thoughts?
>>
>
> I think we must ask a few questions:
>
> 1. What's the major gain we get out of this? Is it that the time to
> stream gets reduced or something else?

I think that when the database fails over, parallel streaming replication might shorten the recovery time.

> If the answer to the above point is something solid, then
> 2. How do we distribute the work to multiple processes?
> 3. Do we need all of the workers to maintain the order in which they
> read the WAL files (on the publisher) and apply the changes (on the
> subscriber)?
> 4. Do we want to map the sender/publisher workers to
> receiver/subscriber workers on a one-to-one basis? If not, how do we
> do it?
> 5. How do sender and receiver workers communicate?
> 6. What if we have multiple subscribers/receivers?
>
> I'm no expert in replication, so I may be wrong as well. Others may have
> better thoughts.
>

Maybe we can distribute the work to multiple processes according to the WAL record type.

As a first step, I think we can parallelize the replay process. We can classify WAL records by record type or RmgrId,
and then replay those WAL records in parallel where possible.

Then we can think about how to parallelize WalReceiver and WalSender.
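
To make the idea of classifying records concrete, here is a very rough,
hypothetical sketch (the record struct, worker count and queue function are
simplified placeholders, not the real xlog reader or background-worker APIs):

/* Simplified sketch: dispatch decoded WAL records to per-worker queues based
 * on resource manager id.  NOT PostgreSQL's real XLogReaderState/RmgrId API. */
#include <stdint.h>
#include <stdio.h>

#define NUM_WORKERS 4           /* assumption: number of parallel replay workers */

typedef struct FakeRecord
{
    uint8_t     rmid;           /* resource manager id of the record */
    uint64_t    lsn;            /* location of the record            */
} FakeRecord;

/* stub standing in for handing the record to a background worker's queue */
static void
enqueue_for_worker(int worker, const FakeRecord *rec)
{
    printf("record at lsn %llu (rmgr %u) -> worker %d\n",
           (unsigned long long) rec->lsn, (unsigned) rec->rmid, worker);
}

static void
dispatch_record(const FakeRecord *rec)
{
    /* simplest policy: all records of one resource manager go to the same
     * worker, so ordering within each rmgr is preserved; commit records and
     * cross-rmgr dependencies would still need explicit barriers */
    enqueue_for_worker(rec->rmid % NUM_WORKERS, rec);
}

int
main(void)
{
    FakeRecord  recs[] = {{10, 100}, {11, 116}, {10, 140}, {1, 172}};
    int         i;

    for (i = 0; i < 4; i++)
        dispatch_record(&recs[i]);
    return 0;
}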

Best regards
Japin Li.





Re: Parallelize stream replication process

From
Li Japin
Date:

> On Sep 15, 2020, at 3:41 PM, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> 
> 
> 
> On 2020/09/15 13:41, Bharath Rupireddy wrote:
>> On Tue, Sep 15, 2020 at 9:27 AM Li Japin <japinli@hotmail.com> wrote:
>>> 
>>> For now, postgres uses a single process to send, receive and replay the WAL when we use streaming replication;
>>> is there any point in parallelizing this process? If so, how do we start?
>>> 
>>> Any thoughts?
> 
> This is probably a different kind of parallelism from what you're thinking of,
> but I was thinking of starting up the walwriter process in the standby server
> and making it fsync the streamed WAL data. This means that we hand over
> part of the walreceiver process's tasks to walwriter: walreceiver
> performs WAL receive and write, and walwriter performs WAL flush,
> in parallel. I'm just expecting that this change would improve
> replication performance, e.g., reduce the time to wait for
> synchronous replication.
> 
> Without this change (i.e., currently), only walreceiver performs
> WAL receive, write and flush. So while walreceiver is fsyncing WAL data,
> it cannot receive newly-arrived WAL data. If the WAL flush takes a long time,
> the time to wait for synchronous replication on the primary grows accordingly.
> 
> Regards,
> 
> -- 
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION

Yeah, this might be a direction. 

Now I am thinking about how to parallelize WAL replay. If we can improve the efficiency
of replay, then we can shorten the database recovery time (for both streaming replication
and crash recovery).

For streaming replication, we may also need to improve the transmission of WAL to speed up
the entire recovery process.

I’m not sure if this is correct.

Regards,
Japin Li.


Re: Parallelize stream replication process

From
Paul Guo
Date:

> On Sep 16, 2020, at 11:15 AM, Li Japin <japinli@hotmail.com> wrote:
> 
> 
> 
>> On Sep 15, 2020, at 3:41 PM, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> 
>> 
>> 
>> On 2020/09/15 13:41, Bharath Rupireddy wrote:
>>> On Tue, Sep 15, 2020 at 9:27 AM Li Japin <japinli@hotmail.com> wrote:
>>>> 
>>>> For now, postgres uses a single process to send, receive and replay the WAL when we use streaming replication;
>>>> is there any point in parallelizing this process? If so, how do we start?
>>>> 
>>>> Any thoughts?
>> 
>> This is probably a different kind of parallelism from what you're thinking of,
>> but I was thinking of starting up the walwriter process in the standby server
>> and making it fsync the streamed WAL data. This means that we hand over
>> part of the walreceiver process's tasks to walwriter: walreceiver
>> performs WAL receive and write, and walwriter performs WAL flush,
>> in parallel. I'm just expecting that this change would improve
>> replication performance, e.g., reduce the time to wait for
>> synchronous replication.

Yes, this should be able to improve that in theory.

>> 
>> Without this change (i.e., currently), only walreceiver performs
>> WAL receive, write and flush. So while walreceiver is fsyncing WAL data,
>> it cannot receive newly-arrived WAL data. If the WAL flush takes a long time,
>> the time to wait for synchronous replication on the primary grows accordingly.
>> 
>> Regards,
>> 
>> -- 
>> Fujii Masao
>> Advanced Computing Technology Center
>> Research and Development Headquarters
>> NTT DATA CORPORATION
> 
> Yeah, this might be a direction. 
> 
> Now I am thinking about how to parallelize WAL replay. If we can improve the efficiency
> of replay, then we can shorten the database recovery time (for both streaming replication
> and crash recovery).

Yes, parallelization should be able to help when the CPU utilization is at 100%, or when it is below
100% but the process keeps blocking on operations that could be overlapped by running in
parallel. WAL replay involves many kinds of work (memory operations, system calls, locking, etc.).
I do not have experience with real production environments, so I am not sure whether or how much
recovery suffers, but I believe parallel recovery should help to accelerate failover,
which is quite important, especially in cloud environments. Doing this would require carefully
analyzing the dependencies between the various WAL records first. It won’t be a small effort. I’ve heard
that some databases have implemented this, though I’m not sure how much it helps.
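
To give a rough idea of what such a dependency analysis could feed into, here is a
simplified, hypothetical sketch (the struct and routing policy are placeholders, not
PostgreSQL's real decoded-record types): records that touch the same page are routed
to the same worker, so their relative order is preserved without extra locking.

/* Hypothetical sketch of one dependency rule for parallel replay: route records
 * by a hash of the block they modify.  NOT real PostgreSQL structures. */
#include <stdint.h>
#include <stdio.h>

#define NUM_WORKERS 4                   /* assumption: parallel replay workers */

typedef struct BlockRef
{
    uint32_t    relfilenode;            /* which relation the record touches */
    uint32_t    blockno;                /* which block within the relation   */
} BlockRef;

static int
pick_worker(const BlockRef *blk)
{
    /* all records touching the same page land on the same worker, so the
     * per-page replay order is preserved */
    return (int) ((blk->relfilenode ^ blk->blockno) % NUM_WORKERS);
}

int
main(void)
{
    BlockRef    refs[] = {{16384, 0}, {16384, 1}, {16385, 0}, {16384, 0}};
    int         i;

    for (i = 0; i < 4; i++)
        printf("block (%u,%u) -> worker %d\n",
               refs[i].relfilenode, refs[i].blockno, pick_worker(&refs[i]));

    /* records that touch multiple blocks, or none (e.g. commit records),
     * would have to act as barriers across the affected workers */
    return 0;
}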

As for parallelizing the WAL receiver/sender, their functionality is simple, so after decoupling the fsync
operation into a separate process I am not sure how beneficial further parallelization would be (though it
would surely be needed if the WAL receiver/sender is 100% CPU utilized).


> 
> For streaming replication, we may also need to improve the transmission of WAL to speed up
> the entire recovery process.
> 
> I’m not sure if this is correct.
> 
> Regards,
> Japin Li.
> 


Re: Parallelize stream replication process

From
Asim Praveen
Date:

> On 16-Sep-2020, at 8:32 AM, Li Japin <japinli@hotmail.com> wrote:
>
> Thanks for clarifying the questions!
>
>> On Sep 15, 2020, at 12:41 PM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>>
>> I think we must ask a few questions:
>>
>> 1. What's the major gain we get out of this? Is it that the time to
>> stream gets reduced or something else?
>
> I think that when the database fails over, parallel streaming replication might shorten the recovery time.
>
>> If the answer to the above point is something solid, then
>> 2. How do we distribute the work to multiple processes?
>> 3. Do we need all of the workers to maintain the order in which they
>> read the WAL files (on the publisher) and apply the changes (on the
>> subscriber)?
>> 4. Do we want to map the sender/publisher workers to
>> receiver/subscriber workers on a one-to-one basis? If not, how do we
>> do it?
>> 5. How do sender and receiver workers communicate?
>> 6. What if we have multiple subscribers/receivers?
>>
>> I'm no expert in replication, so I may be wrong as well. Others may have
>> better thoughts.
>>
>
> Maybe we can distribute the work to multiple processes according to the WAL record type.
>
> As a first step, I think we can parallelize the replay process. We can classify WAL records by record type or RmgrId,
> and then replay those WAL records in parallel where possible.
>

This is a rather hard problem to solve, mainly because the (partial)
order inherent in the WAL stream must be preserved when distributing
subsets of WAL records for parallel replay.  The order can be
characterised as follows:

(1) All records emitted by a transaction must be replayed before
replaying the commit record emitted by that transaction.

(2) Commit records emitted by different transactions must be replayed
in the order in which they appear in the WAL stream.
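
For example, given a stream like

    LSN 1: T1 inserts a row into table A
    LSN 2: T2 inserts a row into table B
    LSN 3: T1 commit
    LSN 4: T2 commit

records 1 and 2 could be replayed by different workers in parallel, but record 3
must wait for record 1, record 4 must wait for record 2, and record 4 must not be
replayed before record 3; otherwise a query on the standby could see T2's effects
without T1's, a state that never existed on the primary.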

Asim


Re: Parallelize stream replication process

From
Jakub Wartak
Date:
Li Japin wrote:

> If we can improve the efficiency of replay, then we can shorten the database recovery time (for both streaming
> replication and crash recovery).
(..)
> For streaming replication, we may also need to improve the transmission of WAL to speed up the entire recovery
> process.
> I’m not sure if this is correct.

Hi,

If you are interested in increasing the efficiency of WAL replay internals/startup performance, then you might be
interested in the following threads:

Cache relation sizes in recovery -
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BNPZeEdLXAcNr%2Bw0YOZVb0Un0_MwTBpgmmVDh7No2jbg%40mail.gmail.com#feace7ccbb8e3df8b086d0a2217df91f
Faster compactify_tuples() -
https://www.postgresql.org/message-id/flat/CA+hUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw@mail.gmail.com
Handing off SLRU fsyncs to the checkpointer -
https://www.postgresql.org/message-id/flat/CA%2BhUKGLJ%3D84YT%2BNvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ%40mail.gmail.com
Optimizing compactify_tuples() -
https://www.postgresql.org/message-id/flat/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com
Background bgwriter during crash recovery -
https://www.postgresql.org/message-id/flat/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q@mail.gmail.com
WIP: WAL prefetch (another approach) -
https://www.postgresql.org/message-id/flat/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
Division in dynahash.c due to HASH_FFACTOR -
https://www.postgresql.org/message-id/flat/VI1PR0701MB696044FC35013A96FECC7AC8F62D0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
[PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send -
https://www.postgresql.org/message-id/flat/CACJqAM2uAUnEAy0j2RRJOSM1UHPdGxCr%3DU-HbqEf0aAcdhUoEQ%40mail.gmail.com
Unnecessary delay in streaming replication due to replay lag -
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
WAL prefetching in future combined with AIO (IO_URING) - longer term future,
https://anarazel.de/talks/2020-05-28-pgcon-aio/2020-05-28-pgcon-aio.pdf

A good way to start is to profile the system to see what is taking time during your failover situation or your normal
hot-standby behavior, and then proceed to identifying and characterizing the main bottleneck. There can be many,
depending on the situation: inefficient single-process PostgreSQL code, a CPU-bound startup/recovery process,
IOPS/VFS/syscall/API limitations, single-TCP-stream throughput limitations, single-TCP-stream latency impact over a WAN,
contention on locks in the hot-standby case, and so on.

Some of the above are already committed for 14/master; some are not and require further discussion and testing.
Without real identification of the bottleneck and the WAL stream characteristics you are facing, it's hard to say how
parallel WAL recovery would improve the situation.

-J.



Re: Parallelize stream replication process

From
Li Japin
Date:
Thanks for your advice! This helps me a lot.

> On Sep 17, 2020, at 9:18 PM, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> 
> Li Japin wrote:
> 
>> If we can improve the efficiency of replay, then we can shorten the database recovery time (for both streaming
>> replication and crash recovery).
> (..)
>> For streaming replication, we may also need to improve the transmission of WAL to speed up the entire recovery
>> process.
>> I’m not sure if this is correct.
> 
> Hi, 
> 
> If you are interested in increasing the efficiency of WAL replay internals/startup performance, then you might be
> interested in the following threads:
> 
> Cache relation sizes in recovery -
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BNPZeEdLXAcNr%2Bw0YOZVb0Un0_MwTBpgmmVDh7No2jbg%40mail.gmail.com#feace7ccbb8e3df8b086d0a2217df91f
> Faster compactify_tuples() -
https://www.postgresql.org/message-id/flat/CA+hUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw@mail.gmail.com
> Handing off SLRU fsyncs to the checkpointer -
https://www.postgresql.org/message-id/flat/CA%2BhUKGLJ%3D84YT%2BNvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ%40mail.gmail.com
> Optimizing compactify_tuples() -
https://www.postgresql.org/message-id/flat/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com
> Background bgwriter during crash recovery -
https://www.postgresql.org/message-id/flat/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q@mail.gmail.com
> WIP: WAL prefetch (another approach) -
https://www.postgresql.org/message-id/flat/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
> Division in dynahash.c due to HASH_FFACTOR -
https://www.postgresql.org/message-id/flat/VI1PR0701MB696044FC35013A96FECC7AC8F62D0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
> [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send -
https://www.postgresql.org/message-id/flat/CACJqAM2uAUnEAy0j2RRJOSM1UHPdGxCr%3DU-HbqEf0aAcdhUoEQ%40mail.gmail.com
> Unnecessary delay in streaming replication due to replay lag -
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
> WAL prefetching in future combined with AIO (IO_URING) - longer term future,
https://anarazel.de/talks/2020-05-28-pgcon-aio/2020-05-28-pgcon-aio.pdf
> 
> A good way to start is to profile the system to see what is taking time during your failover situation or your normal
> hot-standby behavior, and then proceed to identifying and characterizing the main bottleneck. There can be many,
> depending on the situation: inefficient single-process PostgreSQL code, a CPU-bound startup/recovery process,
> IOPS/VFS/syscall/API limitations, single-TCP-stream throughput limitations, single-TCP-stream latency impact over a WAN,
> contention on locks in the hot-standby case, and so on.
> 
> Some of the above are already committed for 14/master; some are not and require further discussion and testing.
> Without real identification of the bottleneck and the WAL stream characteristics you are facing, it's hard to say how
> parallel WAL recovery would improve the situation.
> 
> -J.