Thread: improve pg_receivewal code

improve pg_receivewal code

From
Bharath Rupireddy
Date:
Hi,

I see there's a scope to do following improvements to pg_receivewal.c:

1) Fetch the server system identifier in the first RunIdentifySystem
call and use it to identify(via pg_receivewal's ReceiveXlogStream) any
unexpected changes that may happen in the server while pg_receivewal
is connected to it. This can be helpful in scenarios when
pg_receivewal tries to reconnect to the server (see the code around
pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has
happnend in the server that changed the its system identifier. Once
the pg_receivewal establishes the conenction to server again, then the
ReceiveXlogStream has a code chunk to compare the system identifier
that we received in the initial connection.
2) Move the RunIdentifySystem to identify timeline id and start LSN
from the server only if the pg_receivewal failed to get them from
FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is
avoided.
3) Place the "replication connetion shouldn't have any database name
associated" error check right after RunIdentifySystem so that we can
avoid fetching wal segment size with RetrieveWalSegSize if at all we
were to fail with that error. This change is similar to what
pg_recvlogical.c does.
4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters
main loop to get the wal from the server. This avoids an unnecessary
query for pg_receivewal with "--create-slot" or "--drop-slot".
5) Have an assertion after the pg_receivewal done a good amount of
work to find start timeline and LSN might be helpful:
Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr);

Attaching a patch that does take care of above improvements. Thoughts?

Regards,
Bharath Rupireddy.

Attachment

Re: improve pg_receivewal code

From
Bharath Rupireddy
Date:
On Mon, Aug 30, 2021 at 1:02 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> I see there's a scope to do following improvements to pg_receivewal.c:
>
> 1) Fetch the server system identifier in the first RunIdentifySystem
> call and use it to identify(via pg_receivewal's ReceiveXlogStream) any
> unexpected changes that may happen in the server while pg_receivewal
> is connected to it. This can be helpful in scenarios when
> pg_receivewal tries to reconnect to the server (see the code around
> pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has
> happnend in the server that changed the its system identifier. Once
> the pg_receivewal establishes the conenction to server again, then the
> ReceiveXlogStream has a code chunk to compare the system identifier
> that we received in the initial connection.
> 2) Move the RunIdentifySystem to identify timeline id and start LSN
> from the server only if the pg_receivewal failed to get them from
> FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is
> avoided.
> 3) Place the "replication connetion shouldn't have any database name
> associated" error check right after RunIdentifySystem so that we can
> avoid fetching wal segment size with RetrieveWalSegSize if at all we
> were to fail with that error. This change is similar to what
> pg_recvlogical.c does.
> 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters
> main loop to get the wal from the server. This avoids an unnecessary
> query for pg_receivewal with "--create-slot" or "--drop-slot".
> 5) Have an assertion after the pg_receivewal done a good amount of
> work to find start timeline and LSN might be helpful:
> Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr);
>
> Attaching a patch that does take care of above improvements. Thoughts?

Here's the CF entry - https://commitfest.postgresql.org/34/3315/

Regards,
Bharath Rupireddy.



Re: improve pg_receivewal code

From
Ronan Dunklau
Date:
Le lundi 30 août 2021, 09:32:05 CEST Bharath Rupireddy a écrit :
> Hi,
>
> I see there's a scope to do following improvements to pg_receivewal.c:

Thank you Bharath for this patch.

>
> 1) Fetch the server system identifier in the first RunIdentifySystem
> call and use it to identify(via pg_receivewal's ReceiveXlogStream) any
> unexpected changes that may happen in the server while pg_receivewal
> is connected to it. This can be helpful in scenarios when
> pg_receivewal tries to reconnect to the server (see the code around
> pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has
> happnend in the server that changed the its system identifier. Once
> the pg_receivewal establishes the conenction to server again, then the
> ReceiveXlogStream has a code chunk to compare the system identifier
> that we received in the initial connection.

I'm not sure what kind of problem could be happening here: if you were
somewhat routed to another server ? Or if we "switched" the cluster listening
on that port ?

> 2) Move the RunIdentifySystem to identify timeline id and start LSN
> from the server only if the pg_receivewal failed to get them from
> FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is
> avoided.

That makes sense, even if we add another IDENTIFY_SYSTEM to check against the
one set in the first place.

> 3) Place the "replication connetion shouldn't have any database name
> associated" error check right after RunIdentifySystem so that we can
> avoid fetching wal segment size with RetrieveWalSegSize if at all we
> were to fail with that error. This change is similar to what
> pg_recvlogical.c does.

Makes sense.

> 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters
> main loop to get the wal from the server. This avoids an unnecessary
> query for pg_receivewal with "--create-slot" or "--drop-slot".
> 5) Have an assertion after the pg_receivewal done a good amount of
> work to find start timeline and LSN might be helpful:
> Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr);
>
> Attaching a patch that does take care of above improvements. Thoughts?

Overall I think it is good.

However, you have some formatting issues, here it mixes tabs and spaces:

+        /*
+          * No valid data can be found in the existing output
directory.
+         * Get start LSN position and current timeline ID from
the server.
+          */

And here it is not formatted properly:

+static char       *server_sysid = NULL;



>
> Regards,
> Bharath Rupireddy.


--
Ronan Dunklau





Re: improve pg_receivewal code

From
Bharath Rupireddy
Date:
On Thu, Sep 2, 2021 at 9:05 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > 1) Fetch the server system identifier in the first RunIdentifySystem
> > call and use it to identify(via pg_receivewal's ReceiveXlogStream) any
> > unexpected changes that may happen in the server while pg_receivewal
> > is connected to it. This can be helpful in scenarios when
> > pg_receivewal tries to reconnect to the server (see the code around
> > pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has
> > happnend in the server that changed the its system identifier. Once
> > the pg_receivewal establishes the conenction to server again, then the
> > ReceiveXlogStream has a code chunk to compare the system identifier
> > that we received in the initial connection.
>
> I'm not sure what kind of problem could be happening here: if you were
> somewhat routed to another server ? Or if we "switched" the cluster listening
> on that port ?

Yeah. Also, pg_control file on the server can get corrupted for
whatever reasons it may be. This sys identifier check is useful in
case if the pg_receivewal connects to the server again and again.
These are things that sound over cautious to me, however it's nothing
wrong to use what ReceiveXlogStream provides. pg_basebackup does make
use of this already.

> > 2) Move the RunIdentifySystem to identify timeline id and start LSN
> > from the server only if the pg_receivewal failed to get them from
> > FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is
> > avoided.
>
> That makes sense, even if we add another IDENTIFY_SYSTEM to check against the
> one set in the first place.
>
> > 3) Place the "replication connetion shouldn't have any database name
> > associated" error check right after RunIdentifySystem so that we can
> > avoid fetching wal segment size with RetrieveWalSegSize if at all we
> > were to fail with that error. This change is similar to what
> > pg_recvlogical.c does.
>
> Makes sense.
>
> > 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters
> > main loop to get the wal from the server. This avoids an unnecessary
> > query for pg_receivewal with "--create-slot" or "--drop-slot".
> > 5) Have an assertion after the pg_receivewal done a good amount of
> > work to find start timeline and LSN might be helpful:
> > Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr);
> >
> > Attaching a patch that does take care of above improvements. Thoughts?
>
> Overall I think it is good.

Thanks for your review.

> However, you have some formatting issues, here it mixes tabs and spaces:
>
> +               /*
> +                * No valid data can be found in the existing output
> directory.
> +                * Get start LSN position and current timeline ID from
> the server.
> +                */

My bad. I forgot to run "git diff --check" on the v1 patch.

> And here it is not formatted properly:
>
> +static char       *server_sysid = NULL;

Done.

Here's the v2 with above modifications.

Regards,
Bharath Rupireddy.

Attachment

Re: improve pg_receivewal code

From
Michael Paquier
Date:
> Here's the v2 with above modifications.

I was looking at this patch, and I agree that checking for the system
ID and the timeline by setting sysidentifier beforehand looks like an
improvement.

The extra IDENTIFY_SYSTEM done at the beginning of StreamLog() is not
a performance bottleneck as we run it only once for each loop.  I
don't really get the argument of a server replacing another one on the
same port requiring to rely only on the first system ID fetched before
starting the loops of StreamLog() calls.  So I would leave main()
alone, but fill in the system ID from RunIdentifySystem() in
StreamLog().
--
Michael

Attachment

Re: improve pg_receivewal code

From
Bharath Rupireddy
Date:
On Thu, Sep 16, 2021 at 9:31 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> > Here's the v2 with above modifications.
>
> I was looking at this patch, and I agree that checking for the system
> ID and the timeline by setting sysidentifier beforehand looks like an
> improvement.
>
> The extra IDENTIFY_SYSTEM done at the beginning of StreamLog() is not
> a performance bottleneck as we run it only once for each loop.  I
> don't really get the argument of a server replacing another one on the
> same port requiring to rely only on the first system ID fetched before
> starting the loops of StreamLog() calls.  So I would leave main()
> alone, but fill in the system ID from RunIdentifySystem() in
> StreamLog().

Thanks. I changed the code that way. PSA v3 patch.

Regards,
Bharath Rupireddy.

Attachment

Re: improve pg_receivewal code

From
Michael Paquier
Date:
On Fri, Sep 17, 2021 at 11:46:33AM +0530, Bharath Rupireddy wrote:
> Thanks. I changed the code that way. PSA v3 patch.

Thanks.  Done.
--
Michael

Attachment