Thread: Reduce the time required for a database recovery from archive.

Reduce the time required for a database recovery from archive.

From
Dmitry Shulga
Date:
Hello hackers,

Currently, database recovery from archive is performed sequentially,
by reading archived WAL files and applying their records to the database.

Overall, archive files are processed one by one, and this can
create a performance bottleneck if archived WAL files are delivered slowly,
because the database server has to wait for the arrival of the next
WAL segment before applying its records.

To address this issue, it is proposed to fetch archived WAL files in parallel,
so that when the next WAL segment is required for processing redo
records it is already available.

The implementation of this approach runs several background processes (bgworkers),
each of which executes the shell command specified by the parameter restore_command
to deliver one archived WAL file. The number of parallel processes is limited
by the new parameter max_restore_command_workers. If this parameter is set to 0,
WAL file delivery is performed using the original algorithm, that is, in a
one-by-one manner. If it is set to a value greater than 0, the database
server starts bgworker processes, up to the limit specified by
max_restore_command_workers, and passes each of them the name of a
WAL file to deliver. Active workers prefetch the specified
WAL files and store the received files in the directory pg_wal/pgsql_tmp. After
a bgworker process finishes receiving a file, it marks itself as free
and waits for a request to fetch the next WAL file. The main process
performing database recovery still handles WAL files one by one,
but instead of waiting for the next required WAL file to become available, it checks
whether that file is already present in the prefetch directory. If it is,
the main process starts processing it.
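
For illustration, a minimal recovery configuration using the proposed parameter might
look like this (the archive path and the worker count are only examples):

    restore_command = 'cp /mnt/server/archivedir/%f "%p"'
    max_restore_command_workers = 4    # 0 keeps the current one-by-one behaviour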

The patch implementing the described approach is attached to this email.
The patch contains a test in the file src/test/recovery/t/021_xlogrestore.pl.
Although the test result depends on actual execution time and could hardly be
approved for inclusion in the repository, it was added in order to show
the positive effect of applying the new algorithm. In my environment, restoring
from archive with parallel prefetching is twice as fast as in the original
mode.

Regards,
Dmitry.


Attachment

Re: Reduce the time required for a database recovery from archive.

From
Pavel Stehule
Date:


On Tue, 8 Sep 2020 at 06:51, Dmitry Shulga <d.shulga@postgrespro.ru> wrote:
Hello hackers,

Currently, database recovery from archive is performed sequentially,
by reading archived WAL files and applying their records to the database.

Overall, archive files are processed one by one, and this can
create a performance bottleneck if archived WAL files are delivered slowly,
because the database server has to wait for the arrival of the next
WAL segment before applying its records.

To address this issue, it is proposed to fetch archived WAL files in parallel,
so that when the next WAL segment is required for processing redo
records it is already available.

The implementation of this approach runs several background processes (bgworkers),
each of which executes the shell command specified by the parameter restore_command
to deliver one archived WAL file. The number of parallel processes is limited
by the new parameter max_restore_command_workers. If this parameter is set to 0,
WAL file delivery is performed using the original algorithm, that is, in a
one-by-one manner. If it is set to a value greater than 0, the database
server starts bgworker processes, up to the limit specified by
max_restore_command_workers, and passes each of them the name of a
WAL file to deliver. Active workers prefetch the specified
WAL files and store the received files in the directory pg_wal/pgsql_tmp. After
a bgworker process finishes receiving a file, it marks itself as free
and waits for a request to fetch the next WAL file. The main process
performing database recovery still handles WAL files one by one,
but instead of waiting for the next required WAL file to become available, it checks
whether that file is already present in the prefetch directory. If it is,
the main process starts processing it.

The patch implementing the described approach is attached to this email.
The patch contains a test in the file src/test/recovery/t/021_xlogrestore.pl.
Although the test result depends on actual execution time and could hardly be
approved for inclusion in the repository, it was added in order to show
the positive effect of applying the new algorithm. In my environment, restoring
from archive with parallel prefetching is twice as fast as in the original
mode.

+1

it is interesting feature

Regards

Pavel


Regards,
Dmitry.

Re: Reduce the time required for a database recovery from archive.

From
Stephen Frost
Date:
Greetings,

* Dmitry Shulga (d.shulga@postgrespro.ru) wrote:
> Overall archive file processing is done one by one, and this might
> create a performance bottleneck if archived WAL files are delivered slowly,
> because the database server has to wait for arrival of the next
> WAL segment before applying its records.
>
> To address this issue it is proposed to receive archived WAL files in parallel
> so that when the next WAL segment file is required for processing of redo log
> records it would be already available.

Yes, pgbackrest already does exactly this (if configured)- uses parallel
processes to fetch the WAL and have it be available ahead of time.

> Implementation of this approach assumes running several background processes (bgworkers)
> each of which runs a shell command specified by the parameter restore_command
> to deliver an archived WAL file. Number of running parallel processes is limited
> by the new parameter max_restore_command_workers. If this parameter has value 0
> then WAL files delivery is performed using the original algorithm, that is in
> one-by-one manner. If this parameter has value greater than 0 then the database
> server starts several bgworker processes up to the limit specified by
> the parameter max_restore_command_workers and passes to every process
> WAL file name to deliver. Active processes start prefetching of specified
> WAL files and store received files in the directory pg_wal/pgsql_tmp. After
> bgworker process finishes receiving a file it marks itself as a free process
> and waits for a new request to receive a next WAL file. The main process
> performing database recovery still handles WAL files in one-by-one manner,
> but instead of waiting for a next required WAL file's availability it checks for
> that file in the prefetched directory. If a new file is present there,
> the main process starts its processing.

I'm a bit confused about this description- surely it makes sense for the
parallel workers to continue to loop and fetch up to some specified
max..?  Then to monitor and to fetch more when the amount pre-fetched so
far drops below that level?  The description above makes it sound like
X WAL will be fetched ahead of time, and then the recovery process will
go through those until it runs out and then it'll have to wait for the
next X WAL to be fetched, which means it's still going to end up being
delayed even with these parallel processes, which isn't good.

Does this also properly handle timeline switches..?

Thanks,

Stephen

Attachment

Re: Reduce the time required for a database recovery from archive.

From
Dmitry Shulga
Date:
Hello Stephen

> On 9 Sep 2020, at 21:26, Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Dmitry Shulga (d.shulga@postgrespro.ru) wrote:
>> Overall archive file processing is done one by one, and this might
>> create a performance bottleneck if archived WAL files are delivered slowly,
>> because the database server has to wait for arrival of the next
>> WAL segment before applying its records.
>>
>> To address this issue it is proposed to receive archived WAL files in parallel
>> so that when the next WAL segment file is required for processing of redo log
>> records it would be already available.
>
> Yes, pgbackrest already does exactly this (if configured)- uses parallel
> processes to fetch the WAL and have it be available ahead of time.

pgbackrest is third-party software that has to be installed separately on the customer's premises.

On the other hand, built-in support for this optimization in PostgreSQL is a good argument for adding
this feature and providing it to customers out of the box.

>
>> Implementation of this approach assumes running several background processes (bgworkers)
>> each of which runs a shell command specified by the parameter restore_command
>> to deliver an archived WAL file. Number of running parallel processes is limited
>> by the new parameter max_restore_command_workers. If this parameter has value 0
>> then WAL files delivery is performed using the original algorithm, that is in
>> one-by-one manner. If this parameter has value greater than 0 then the database
>> server starts several bgworker processes up to the limit specified by
>> the parameter max_restore_command_workers and passes to every process
>> WAL file name to deliver. Active processes start prefetching of specified
>> WAL files and store received files in the directory pg_wal/pgsql_tmp. After
>> bgworker process finishes receiving a file it marks itself as a free process
>> and waits for a new request to receive a next WAL file. The main process
>> performing database recovery still handles WAL files in one-by-one manner,
>> but instead of waiting for a next required WAL file's availability it checks for
>> that file in the prefetched directory. If a new file is present there,
>> the main process starts its processing.
>
> I'm a bit confused about this description- surely it makes sense for the
OK. The description I originally provided was probably pretty misleading, so I will try to clarify it a bit.

So, as soon as a bgworker process finishes delivering a WAL file, it marks itself as free.

The WAL applier works in parallel, processing the WAL files sequentially.
Once it finishes handling the current WAL file, it checks whether extra bgworker processes can be started
to deliver WAL files that will be required a bit later. If there are free bgworker processes, the applier requests
the download of one or more extra WAL files. After that, the applier determines the name of the next WAL file to handle
and checks whether it exists in the prefetch directory. If it does, the applier starts handling it and
the processing loop repeats.
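
In rough pseudocode (the names here are illustrative, not the identifiers used in the patch), the applier's loop would be:

    while recovery is not finished:
        apply all records of the current WAL segment
        for each free bgworker (while more prefetching is allowed):
            hand it the name of a future segment to fetch into pg_wal/pgsql_tmp
        if the next segment is already present in pg_wal/pgsql_tmp:
            open it and continue
        else:
            fall back to the usual path (run restore_command and wait)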

> parallel workers to continue to loop and fetch up to some specified
> max..?  Then to monitor and to fetch more when the amount pre-fetched so
> far drops before that level?  The description above makes it sound like
> X WAL will be fetched ahead of time, and then the recovery process will
> go through those until it runs out and then it'll have to wait for the
> next X WAL to be fetched, which means it's still going to end up being
> delayed even with these parallel processes, which isn't good.
>

> Does this also properly handle timeline switches..?
>
> Thanks,
>
> Stephen

Regards,
Dmitry





Re: Reduce the time required for a database recovery from archive.

From
Stephen Frost
Date:
Greetings,

* Dmitry Shulga (d.shulga@postgrespro.ru) wrote:
> > On 9 Sep 2020, at 21:26, Stephen Frost <sfrost@snowman.net> wrote:
> > * Dmitry Shulga (d.shulga@postgrespro.ru) wrote:
> >> Overall archive file processing is done one by one, and this might
> >> create a performance bottleneck if archived WAL files are delivered slowly,
> >> because the database server has to wait for arrival of the next
> >> WAL segment before applying its records.
> >>
> >> To address this issue it is proposed to receive archived WAL files in parallel
> >> so that when the next WAL segment file is required for processing of redo log
> >> records it would be already available.
> >
> > Yes, pgbackrest already does exactly this (if configured)- uses parallel
> > processes to fetch the WAL and have it be available ahead of time.
>
> pgbackrest is a third-party software that should be additionally installed on customer's premises.
>
> On the other hand, built-in support of this optimization in PostgresSQL is a good argument to add
> this feature  and provide it to customers just out of the box.

Sure, having core do pre-fetching could be useful, though there's the
downside that it, unfortunately, can't know how much WAL is actually
going to be needed as we play forward since we don't know where we will
end up finding the target we've been asked for.  Unlikely that'll be too
much of an issue with the traditional 16 MB WAL files, but having a more
integrated backup/restore solution would be able to address that by
tracking the restore targets that are in each WAL (which is something
we've had on our todo list for pgbackrest for a while, and that would
also let someone ask "am I able to reach this restore target?").

> >> Implementation of this approach assumes running several background processes (bgworkers)
> >> each of which runs a shell command specified by the parameter restore_command
> >> to deliver an archived WAL file. Number of running parallel processes is limited
> >> by the new parameter max_restore_command_workers. If this parameter has value 0
> >> then WAL files delivery is performed using the original algorithm, that is in
> >> one-by-one manner. If this parameter has value greater than 0 then the database
> >> server starts several bgworker processes up to the limit specified by
> >> the parameter max_restore_command_workers and passes to every process
> >> WAL file name to deliver. Active processes start prefetching of specified
> >> WAL files and store received files in the directory pg_wal/pgsql_tmp. After
> >> bgworker process finishes receiving a file it marks itself as a free process
> >> and waits for a new request to receive a next WAL file. The main process
> >> performing database recovery still handles WAL files in one-by-one manner,
> >> but instead of waiting for a next required WAL file's availability it checks for
> >> that file in the prefetched directory. If a new file is present there,
> >> the main process starts its processing.
> >
> > I'm a bit confused about this description- surely it makes sense for the
> OK. The description I originally provided was probably pretty misleading so I will try to clarify it a bit.
>
> So, as soon as a bgworker process finishes delivering a WAL file it marks itself as a free.
>
> WAL records applier working in parallel and processing the WAL files in sequential manner.
> Once it finishes handling of the current WAL file, it checks whether it is possible to run extra bgworker processes
> to deliver WAL files which will be required a bit later. If there are free bgworker processes then applier requests
> to start downloading of one or more extra WAL files. After that applier determines a name of next WAL file to handle
> and checks whether it exist in the prefetching directory. If it does exist then applier starts handling it and
> processing loop is repeated.

Ok- so the idea is that each time the applying process finishes with a
WAL file then it'll see if there's an available worker and, if so, will
give it the next file to go get (which would presumably be some number
in the future and the actual next file the applying process needs is
already available).  That sounds better, at least, though I'm not sure
why we're making it the job of the applying process to push the workers
each time..?  Also, I'm not sure about the interface- wouldn't it make
more sense to have a "pre-fetch this amount of WAL" kind of parameter
directly instead of tying that to the number of background workers?  You
might only need one or two processes doing WAL fetching to be able to
fetch faster than the applying process is able to apply it, but you
probably want to pre-fetch more than just one or two 16 MB WAL files.

In other words, I would have thought we'd have:

wal_prefetch_amount = 1GB
max_restore_command_workers = 2

and then you'd have up to 2 worker processes running and they'd be
keeping 1GB of WAL pre-fetched at all times.  If we have just
'max_restore_command_workers' and you want to pre-fetch 1GB of WAL then
you'd have to have a pretty high value there and you'd end up with
a bunch of threads that all spike to go do work each time the applying
process finishes a WAL file but then just sit around doing nothing while
waiting for the applying process to finish another segment.
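
To put those numbers in terms of segments: with the traditional 16 MB segment size mentioned above,
wal_prefetch_amount = 1GB works out to 1024 MB / 16 MB = 64 segments kept pre-fetched ahead of the
applying process at any given time.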

Thanks,

Stephen

Attachment

Re: Reduce the time required for a database recovery from archive.

From
Dmitry Shulga
Date:
Hello Stephen,

> On 19 Oct 2020, at 23:25, Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
>>>> Implementation of this approach assumes running several background processes (bgworkers)
>>>> each of which runs a shell command specified by the parameter restore_command
>>>> to deliver an archived WAL file. Number of running parallel processes is limited
>>>> by the new parameter max_restore_command_workers. If this parameter has value 0
>>>> then WAL files delivery is performed using the original algorithm, that is in
>>>> one-by-one manner. If this parameter has value greater than 0 then the database
>>>> server starts several bgworker processes up to the limit specified by
>>>> the parameter max_restore_command_workers and passes to every process
>>>> WAL file name to deliver. Active processes start prefetching of specified
>>>> WAL files and store received files in the directory pg_wal/pgsql_tmp. After
>>>> bgworker process finishes receiving a file it marks itself as a free process
>>>> and waits for a new request to receive a next WAL file. The main process
>>>> performing database recovery still handles WAL files in one-by-one manner,
>>>> but instead of waiting for a next required WAL file's availability it checks for
>>>> that file in the prefetched directory. If a new file is present there,
>>>> the main process starts its processing.
>>>
>>> I'm a bit confused about this description- surely it makes sense for the
>> OK. The description I originally provided was probably pretty misleading so I will try to clarify it a bit.
>>
>> So, as soon as a bgworker process finishes delivering a WAL file it marks itself as a free.
>>
>> WAL records applier working in parallel and processing the WAL files in sequential manner.
>> Once it finishes handling of the current WAL file, it checks whether it is possible to run extra bgworker processes
>> to deliver WAL files which will be required a bit later. If there are free bgworker processes then applier requests
>> to start downloading of one or more extra WAL files. After that applier determines a name of next WAL file to handle
>> and checks whether it exist in the prefetching directory. If it does exist then applier starts handling it and
>> processing loop is repeated.
>
> Ok- so the idea is that each time the applying process finishes with a
> WAL file then it'll see if there's an available worker and, if so, will
> give it the next file to go get (which would presumably be some number
> in the future and the actual next file the applying process needs is
> already available).  That sounds better, at least, though I'm not sure
> why we're making it the job of the applying process to push the workers
> each time..?
Every bgworker serves as a task to deliver a WAL file. Treating a task as an active entity is a well-known approach in
software design, so I don't see any issues with such an implementation. Moreover, this approach is probably simpler to
implement than any of the alternatives and still provides a positive performance impact compared with the current
(non-optimized) implementation.

>  Also, I'm not sure about the interface- wouldn't it make
> more sense to have a "pre-fetch this amount of WAL" kind of parameter
> directly instead of tying that to the number of background workers?
This approach was originally considered and discussed at length.
In the end, it was decided that introducing an extra GUC parameter to control the prefetch limit is not practical, since
it shifts responsibility for tuning the prefetching mechanism from the Postgres server to the user.
From my point of view, the fewer parameters needed to set up a feature, the better.

>  You
> might only need one or two processes doing WAL fetching to be able to
> fetch faster than the applying process is able to apply it, but you
> probably want to pre-fetch more than just one or two 16 MB WAL files.


Every time prefetching is started, the number of files that may be prefetched is calculated by the expression
     PREFETCH_RATION * max_restore_command_workers - 'number of already prefetched files'
where PREFETCH_RATION is a compiled-in constant with the value 16.

After that, a task for delivering the next WAL file is handed to a free bgworker process, until no free
bgworker processes remain.
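
As a worked example (assuming the default 16 MB segment size): with max_restore_command_workers = 2 and,
say, 10 files already prefetched, the next round may request up to 16 * 2 - 10 = 22 more segments, so at
most 32 segments (about 512 MB) are staged in pg_wal/pgsql_tmp at any time.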


> In other words, I would have thought we'd have:
>
> wal_prefetch_amount = 1GB
> max_restore_command_workers = 2
>
> and then you'd have up to 2 worker processes running and they'd be
> keeping 1GB of WAL pre-fetched at all times.  If we have just
> 'max_restore_command_workers' and you want to pre-fetch 1GB of WAL then
> you'd have to have a pretty high value there and you'd end up with
> a bunch of threads that all spike to go do work each time the applying
Sorry, I don't see how we can end up with a bunch of threads.
max_restore_command_workers has the value 2 in your example, meaning that no more than 2 bgworkers can run concurrently
for the sake of WAL file prefetching.

> process finishes a WAL file but then just sit around doing nothing while
> waiting for the applying process to finish another segment.

I believe that for a typical setup the parameter max_restore_command_workers would have the value 2 or 3, in order to
supply each delivered WAL file on time, just before its processing starts.

This use case is for environments where the time required for delivering a WAL file from the archive is greater than
the time required for applying the records contained in that WAL file.
If the time required for WAL file delivery is less than the time required for handling the records contained in it,
then max_restore_command_workers shouldn't be specified at all.

>
> Thanks,
>
> Stephen




Re: Reduce the time required for a database recovery from archive.

From
Stephen Frost
Date:
Greetings,

* Dmitry Shulga (d.shulga@postgrespro.ru) wrote:
> > On 19 Oct 2020, at 23:25, Stephen Frost <sfrost@snowman.net> wrote:
> >>>> Implementation of this approach assumes running several background processes (bgworkers)
> >>>> each of which runs a shell command specified by the parameter restore_command
> >>>> to deliver an archived WAL file. Number of running parallel processes is limited
> >>>> by the new parameter max_restore_command_workers. If this parameter has value 0
> >>>> then WAL files delivery is performed using the original algorithm, that is in
> >>>> one-by-one manner. If this parameter has value greater than 0 then the database
> >>>> server starts several bgworker processes up to the limit specified by
> >>>> the parameter max_restore_command_workers and passes to every process
> >>>> WAL file name to deliver. Active processes start prefetching of specified
> >>>> WAL files and store received files in the directory pg_wal/pgsql_tmp. After
> >>>> bgworker process finishes receiving a file it marks itself as a free process
> >>>> and waits for a new request to receive a next WAL file. The main process
> >>>> performing database recovery still handles WAL files in one-by-one manner,
> >>>> but instead of waiting for a next required WAL file's availability it checks for
> >>>> that file in the prefetched directory. If a new file is present there,
> >>>> the main process starts its processing.
> >>>
> >>> I'm a bit confused about this description- surely it makes sense for the
> >> OK. The description I originally provided was probably pretty misleading so I will try to clarify it a bit.
> >>
> >> So, as soon as a bgworker process finishes delivering a WAL file it marks itself as a free.
> >>
> >> WAL records applier working in parallel and processing the WAL files in sequential manner.
> >> Once it finishes handling of the current WAL file, it checks whether it is possible to run extra bgworker processes
> >> to deliver WAL files which will be required a bit later. If there are free bgworker processes then applier requests
> >> to start downloading of one or more extra WAL files. After that applier determines a name of next WAL file to handle
> >> and checks whether it exist in the prefetching directory. If it does exist then applier starts handling it and
> >> processing loop is repeated.
> >
> > Ok- so the idea is that each time the applying process finishes with a
> > WAL file then it'll see if there's an available worker and, if so, will
> > give it the next file to go get (which would presumably be some number
> > in the future and the actual next file the applying process needs is
> > already available).  That sounds better, at least, though I'm not sure
> > why we're making it the job of the applying process to push the workers
> > each time..?
> Every bgwork serves as a task to deliver a WAL file. Considering a task as an active entity is well-known approach in
> software design.
> So I don't see any issues with such implementation. Moreover, implementation of this approach is probably simpler
> than any other alternatives
> and still providing positive performance impact in comparing with current (non optimized) implementation.

I don't think we look only at if something is an improvement or not over
the current situation when we consider changes.

The relatively simple approach I was thinking was that a couple of
workers would be started and they'd have some prefetch amount that needs
to be kept out ahead of the applying process, which they could
potentially calculate themselves without needing to be pushed forward by
the applying process.

> >  Also, I'm not sure about the interface- wouldn't it make
> > more sense to have a "pre-fetch this amount of WAL" kind of parameter
> > directly instead of tying that to the number of background workers?
> This approach was originally considered and closely discussed.
> Finally, it was decided that introducing an extra GUC parameter to control pre-fetch limit is not practical since it
> shifts responsibility for tuning prefetching
> mechanism from postgres server to a user.
> From my point of view the fewer parameters exist to set up some feature the better.

I agree in general that it's better to have fewer parameters, but I
disagree that this isn't an important option for users to be able to
tune- the rate of fetching WAL and of applying WAL varies quite a bit
from system to system.  Being able to tune the pre-fetch seems like it'd
actually be more important to a user than the number of processes
required to keep up with that amount of pre-fetching, which is something
we could actually figure out on our own...

> >  You
> > might only need one or two processes doing WAL fetching to be able to
> > fetch faster than the applying process is able to apply it, but you
> > probably want to pre-fetch more than just one or two 16 MB WAL files.
>
> Every time when prefetching is started a number of potentially prefetched files is calculated by expression
>      PREFETCH_RATION * max_restore_command_workers - 'number of already prefetched files'
> where PREFETCH_RATION is compiled-in constant and has value 16.
>
> After that a task for delivering a next WAL file is placed to a current free bgworker process up until no more free
> bgworker processes.

Ah, it wasn't mentioned that we've got a multiplier in here, but it
still ends up meaning that if a user actually wants to tune the amount
of pre-fetching being done, they're going to end up having to tune the,
pretty much entirely unrelated, value of max_restore_command_workers.
That really seems entirely backwards to me from what I would think the
user would actually want to tune.

> > In other words, I would have thought we'd have:
> >
> > wal_prefetch_amount = 1GB
> > max_restore_command_workers = 2
> >
> > and then you'd have up to 2 worker processes running and they'd be
> > keeping 1GB of WAL pre-fetched at all times.  If we have just
> > 'max_restore_command_workers' and you want to pre-fetch 1GB of WAL then
> > you'd have to have a pretty high value there and you'd end up with
> > a bunch of threads that all spike to go do work each time the applying
> Sorry, I don't see how we can end up with a bunch of threads?
> max_restore_command_workers has value 2 in your example meaning that no more than 2 bgworkers could be run
> concurrently for the sake of WAL files prefetching

If you don't give the user the option to configure the prefetch amount,
except indirectly by changing the number of max restore workers, then to
get a higher prefetch amount they have to increase the number of
workers.  That's what I'm referring to above, and previously, here.

> > process finishes a WAL file but then just sit around doing nothing while
> > waiting for the applying process to finish another segment.
>
> I believe that for typical set-up the parameter max_restore_command_workers would have value 2 or 3 in order to supply
> a delivered WAL file on time just before it be started processing.
>
> This use case is for environment where time required for delivering WAL file from archive is greater than time
> required for applying records contained in the WAL file.
> If time required for WAL file delivering lesser than than time required for handling records contained in it then
> max_restore_command_workers shouldn't be specified at all

That's certainly not correct at all- the two aren't really all that
related, because any time spent waiting for a WAL file to be delivered
is time that the applying process *could* be working to apply WAL
instead of waiting.  At a minimum, I'd expect us to want to have, by
default, at least one worker process running out in front of the
applying process to hopefully eliminate most, if not all, time where the
applying process is waiting for a WAL to show up.  In cases where the
applying process is faster than a single fetching process, a user might
want to have two or more restore workers, though ultimately I still
contend that what they really want is as many workers as needed to make
sure that the applying process doesn't ever need to wait- up to some
limit based on the amount of space that's available.

And back to the configuration side of this- have you considered the
challenge that a user who is using very large WAL files might run
into with the proposed approach that doesn't allow them to control the
amount of space used?  If I'm using 1G WAL files, then I need to have
16G available to have *any* pre-fetching done with this proposed
approach, right?  That doesn't seem great.

Thanks,

Stephen

Attachment

Re: Reduce the time required for a database recovery from archive.

From
Anastasia Lubennikova
Date:
On 09.11.2020 19:31, Stephen Frost wrote:
> Greetings,
>
> * Dmitry Shulga (d.shulga@postgrespro.ru) wrote:
>>> On 19 Oct 2020, at 23:25, Stephen Frost <sfrost@snowman.net> wrote:
>>>
>>> process finishes a WAL file but then just sit around doing nothing while
>>> waiting for the applying process to finish another segment.
>> I believe that for typical set-up the parameter max_restore_command_workers would have value 2 or 3 in order to supply
>> a delivered WAL file on time just before it be started processing.
>>
>> This use case is for environment where time required for delivering WAL file from archive is greater than time
>> required for applying records contained in the WAL file.
>> If time required for WAL file delivering lesser than than time required for handling records contained in it then
>> max_restore_command_workers shouldn't be specified at all
> That's certainly not correct at all- the two aren't really all that
> related, because any time spent waiting for a WAL file to be delivered
> is time that the applying process *could* be working to apply WAL
> instead of waiting.  At a minimum, I'd expect us to want to have, by
> default, at least one worker process running out in front of the
> applying process to hopefully eliminate most, if not all, time where the
> applying process is waiting for a WAL to show up.  In cases where the
> applying process is faster than a single fetching process, a user might
> want to have two or more restore workers, though ultimately I still
> contend that what they really want is as many workers as needed to make
> sure that the applying process doesn't ever need to wait- up to some
> limit based on the amount of space that's available.
>
> And back to the configuration side of this- have you considered the
> challenge that a user who is using very large WAL files might run
> into with the proposed approach that doesn't allow them to control the
> amount of space used?  If I'm using 1G WAL files, then I need to have
> 16G available to have *any* pre-fetching done with this proposed
> approach, right?  That doesn't seem great.
>
> Thanks,
>
> Stephen

Status update for a commitfest entry.

The commitfest is closed now. As this entry has been Waiting on Author 
for a while, I've marked it as returned with feedback. Dmitry, feel free 
to resubmit an updated version to a future commitfest.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Reduce the time required for a database recovery from archive.

From
Dmitry Shulga
Date:
Hi Stephen

Based on our last discussion I redesigned the implementation of the WAL archive recovery speed-up. The main idea of the new implementation was partly borrowed from your proposal, more precisely from the following one:

On 9 Nov 2020, at 23:31, Stephen Frost <sfrost@snowman.net> wrote:

The relatively simple approach I was thinking was that a couple of
workers would be started and they'd have some prefetch amount that needs
to be kept out ahead of the applying process, which they could
potentially calculate themselves without needing to be pushed forward by
the applying process.

In the new implementation, several workers are spawned on server startup to deliver WAL segments from the archive. The number of workers to spawn is specified by the GUC parameter wal_prefetch_workers; the maximum number of files to preload from the archive is determined by the GUC parameter wal_max_prefetch_amount. The applier of WAL records still handles WAL files one by one, but since several prefetching processes are loading files from the archive, there is a high probability that when the next WAL file is requested by the applier of WAL records, it has already been delivered from the archive.

Every time one of the running workers is about to preload the next WAL file, it checks whether the limit imposed by the parameter wal_max_prefetch_amount has been reached. If it has, the process suspends preloading until the WAL applier process handles some of the already preloaded WAL files and the number of loaded but not yet processed WAL files drops below this limit.

At the moment I have not implemented a mechanism for dynamically calculating the number of workers required to load the WAL files in time. We can consider the current (simplified) implementation a base for further discussion and turn to this matter in the next iteration if needed.
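
For illustration, a recovery setup using the redesigned parameters might look like the following (the archive path and the concrete values are only examples):

    restore_command = 'cp /mnt/server/archivedir/%f "%p"'
    wal_prefetch_workers = 4           # prefetching workers spawned at server startup
    wal_max_prefetch_amount = 64       # max number of preloaded but not yet applied WAL files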

Also, I would like to ask your opinion about an issue I'm thinking about:
  Parallel workers spawned for preloading WAL files from the archive use the original mechanism for delivering files from the archive - they run the command specified by the GUC parameter restore_command. One of the placeholders accepted by restore_command is %r, which specifies the filename of the last restart point. If several workers preload WAL files while another process is applying the preloaded WAL files, I'm not sure what the correct way is to determine the last-restart-point value that the WAL-preloading processes should use, because this value can be updated at any time by the process that applies the WAL.

Another issue that I would like your opinion on concerns choosing the correct value for the maximum size of the hash table stored in shared memory. Currently, wal_max_prefetch_amount is passed as the maximum hash table size, which I'm not sure is the best decision.

Thanks in advance for feedback.

Regards,
Dmitry


Attachment

Re: Reduce the time required for a database recovery from archive.

From
David Steele
Date:
Hi Dmitry,

On 1/11/21 2:51 AM, Dmitry Shulga wrote:
> Hi Stephen
> 
> Based on our last discussion I redesigned the implementation of WAL 
> archive recovery speed-up. 

Seems like there should have been a patch attached? In any case the 
current patch no longer applies so marked Waiting on Author.

Personally, I'm not too keen on this patch as implemented. Several 
third-party backup solutions support parallel archive get so it would be 
nice to support an interface that simply says to the restore_command, 
"go get 1gb of WAL and write the files here." This patch still assumes 
that the user has written their own restore command, which is 
third-party by definition, so I can't see how interfacing with 
third-party software is an issue here.
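
Purely as a hypothetical sketch of that kind of interface (neither the command nor any of these options exists today), the invocation might look more like:

    restore_command = 'my_archive_get --start-wal=%f --target-dir=<spool dir> --max-prefetch=1GB'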

Also, having multiple workers blindly asking for WAL can cause quite a 
bit of traffic and cost because PG knows what WAL it wants but it 
doesn't know what exists. On the other hand, a backup solution can 
cheaply determine what is available to prevent hammering the archive 
with requests for files that don't exist.

Regards,
-- 
-David
david@pgmasters.net



Re: Reduce the time required for a database recovery from archive.

From
Andrey Borodin
Date:

> On 18 Mar 2021, at 20:04, David Steele <david@pgmasters.net> wrote:
> it would be nice to support an interface that simply says to the restore_command, "go get 1gb of WAL and write the
> files here."

+1 to redesigning restore_command and archive_command.

Best regards, Andrey Borodin.




Re: Reduce the time required for a database recovery from archive.

From
David Steele
Date:
On 3/18/21 11:37 AM, Andrey Borodin wrote:
> 
> 
>> On 18 Mar 2021, at 20:04, David Steele <david@pgmasters.net> wrote:
>> it would be nice to support an interface that simply says to the restore_command, "go get 1gb of WAL and write the
>> files here."
> 
> +1 to redesigning restore_command and archive_command.

Indeed, archive_command would benefit from the same treatment. The need 
to call archive_command for each WAL segment even when parallel 
processing is going on behind the scenes is a bottleneck.

Larger WAL segments sizes can be used to mitigate this issue but an 
improvement would be welcome.

Regards,
-- 
-David
david@pgmasters.net



Re: Reduce the time required for a database recovery from archive.

From
Marina Polyakova
Date:
Hello everyone in this thread!

On 2021-03-18 18:04, David Steele wrote:
> Seems like there should have been a patch attached?

IMO there's a technical problem with sending, receiving (or displaying 
on the site) emails from the list pgsql-hackers. By subscribing to this 
list I received the attached patch from the email [1]. And my colleague 
Roman Zharkov said that the button 'Resend email' on that page helped 
him to receive the email with the attached patch. On the other hand, 
following this link in the browser I do not see the attached patch. Do 
you think it is worth writing about this issue to 
webmaster(dot)postgresql(dot)org?

Just in case I'm lucky this email contains the lost patch.

[1] 
https://www.postgresql.org/message-id/4047CC05-1AF5-454B-850B-ED37374A2AC0%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Reduce the time required for a database recovery from archive.

From
David Steele
Date:
On 3/19/21 4:32 AM, Marina Polyakova wrote:
> 
> On 2021-03-18 18:04, David Steele wrote:
>> Seems like there should have been a patch attached?
> 
> IMO there's a technical problem with sending, receiving (or displaying 
> on the site) emails from the list pgsql-hackers. By subsribing to this 
> list I received the attached patch from the email [1]. And my colleague 
> Roman Zharkov said that the button 'Resend email' from that link helped 
> him to receive the email with the attached patch. On the other hand 
> follwing this link in the browser I do not see the attached patch. Do 
> you think it is worth to write about this issue to 
> webmaster(dot)postgresql(dot)org?..
> 
> Just in case I'm lucky this email contains the lost patch.

You are correct -- if I send the email to myself I can see the patch (I 
can also see it on the website in raw form). I usually read the thread 
on the website before replying and I did not notice that the actual 
email had an attachment.

My guess about why it is not showing up on the website is that the 
encoding is a bit unusual (quoted-printable vs base64). Or it may have 
to do with the inclusion of an HTML version of the message. You are 
welcome to follow up with the webmaster and see if this is a known issue.

Regards,
-- 
-David
david@pgmasters.net