Thread: improve pg_receivewal code
Hi, I see there's a scope to do following improvements to pg_receivewal.c: 1) Fetch the server system identifier in the first RunIdentifySystem call and use it to identify(via pg_receivewal's ReceiveXlogStream) any unexpected changes that may happen in the server while pg_receivewal is connected to it. This can be helpful in scenarios when pg_receivewal tries to reconnect to the server (see the code around pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has happnend in the server that changed the its system identifier. Once the pg_receivewal establishes the conenction to server again, then the ReceiveXlogStream has a code chunk to compare the system identifier that we received in the initial connection. 2) Move the RunIdentifySystem to identify timeline id and start LSN from the server only if the pg_receivewal failed to get them from FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is avoided. 3) Place the "replication connetion shouldn't have any database name associated" error check right after RunIdentifySystem so that we can avoid fetching wal segment size with RetrieveWalSegSize if at all we were to fail with that error. This change is similar to what pg_recvlogical.c does. 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters main loop to get the wal from the server. This avoids an unnecessary query for pg_receivewal with "--create-slot" or "--drop-slot". 5) Have an assertion after the pg_receivewal done a good amount of work to find start timeline and LSN might be helpful: Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr); Attaching a patch that does take care of above improvements. Thoughts? Regards, Bharath Rupireddy.
Attachment
On Mon, Aug 30, 2021 at 1:02 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi, > > I see there's a scope to do following improvements to pg_receivewal.c: > > 1) Fetch the server system identifier in the first RunIdentifySystem > call and use it to identify(via pg_receivewal's ReceiveXlogStream) any > unexpected changes that may happen in the server while pg_receivewal > is connected to it. This can be helpful in scenarios when > pg_receivewal tries to reconnect to the server (see the code around > pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has > happnend in the server that changed the its system identifier. Once > the pg_receivewal establishes the conenction to server again, then the > ReceiveXlogStream has a code chunk to compare the system identifier > that we received in the initial connection. > 2) Move the RunIdentifySystem to identify timeline id and start LSN > from the server only if the pg_receivewal failed to get them from > FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is > avoided. > 3) Place the "replication connetion shouldn't have any database name > associated" error check right after RunIdentifySystem so that we can > avoid fetching wal segment size with RetrieveWalSegSize if at all we > were to fail with that error. This change is similar to what > pg_recvlogical.c does. > 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters > main loop to get the wal from the server. This avoids an unnecessary > query for pg_receivewal with "--create-slot" or "--drop-slot". > 5) Have an assertion after the pg_receivewal done a good amount of > work to find start timeline and LSN might be helpful: > Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr); > > Attaching a patch that does take care of above improvements. Thoughts? Here's the CF entry - https://commitfest.postgresql.org/34/3315/ Regards, Bharath Rupireddy.
Le lundi 30 août 2021, 09:32:05 CEST Bharath Rupireddy a écrit : > Hi, > > I see there's a scope to do following improvements to pg_receivewal.c: Thank you Bharath for this patch. > > 1) Fetch the server system identifier in the first RunIdentifySystem > call and use it to identify(via pg_receivewal's ReceiveXlogStream) any > unexpected changes that may happen in the server while pg_receivewal > is connected to it. This can be helpful in scenarios when > pg_receivewal tries to reconnect to the server (see the code around > pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has > happnend in the server that changed the its system identifier. Once > the pg_receivewal establishes the conenction to server again, then the > ReceiveXlogStream has a code chunk to compare the system identifier > that we received in the initial connection. I'm not sure what kind of problem could be happening here: if you were somewhat routed to another server ? Or if we "switched" the cluster listening on that port ? > 2) Move the RunIdentifySystem to identify timeline id and start LSN > from the server only if the pg_receivewal failed to get them from > FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is > avoided. That makes sense, even if we add another IDENTIFY_SYSTEM to check against the one set in the first place. > 3) Place the "replication connetion shouldn't have any database name > associated" error check right after RunIdentifySystem so that we can > avoid fetching wal segment size with RetrieveWalSegSize if at all we > were to fail with that error. This change is similar to what > pg_recvlogical.c does. Makes sense. > 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters > main loop to get the wal from the server. This avoids an unnecessary > query for pg_receivewal with "--create-slot" or "--drop-slot". > 5) Have an assertion after the pg_receivewal done a good amount of > work to find start timeline and LSN might be helpful: > Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr); > > Attaching a patch that does take care of above improvements. Thoughts? Overall I think it is good. However, you have some formatting issues, here it mixes tabs and spaces: + /* + * No valid data can be found in the existing output directory. + * Get start LSN position and current timeline ID from the server. + */ And here it is not formatted properly: +static char *server_sysid = NULL; > > Regards, > Bharath Rupireddy. -- Ronan Dunklau
On Thu, Sep 2, 2021 at 9:05 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote: > > 1) Fetch the server system identifier in the first RunIdentifySystem > > call and use it to identify(via pg_receivewal's ReceiveXlogStream) any > > unexpected changes that may happen in the server while pg_receivewal > > is connected to it. This can be helpful in scenarios when > > pg_receivewal tries to reconnect to the server (see the code around > > pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has > > happnend in the server that changed the its system identifier. Once > > the pg_receivewal establishes the conenction to server again, then the > > ReceiveXlogStream has a code chunk to compare the system identifier > > that we received in the initial connection. > > I'm not sure what kind of problem could be happening here: if you were > somewhat routed to another server ? Or if we "switched" the cluster listening > on that port ? Yeah. Also, pg_control file on the server can get corrupted for whatever reasons it may be. This sys identifier check is useful in case if the pg_receivewal connects to the server again and again. These are things that sound over cautious to me, however it's nothing wrong to use what ReceiveXlogStream provides. pg_basebackup does make use of this already. > > 2) Move the RunIdentifySystem to identify timeline id and start LSN > > from the server only if the pg_receivewal failed to get them from > > FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is > > avoided. > > That makes sense, even if we add another IDENTIFY_SYSTEM to check against the > one set in the first place. > > > 3) Place the "replication connetion shouldn't have any database name > > associated" error check right after RunIdentifySystem so that we can > > avoid fetching wal segment size with RetrieveWalSegSize if at all we > > were to fail with that error. This change is similar to what > > pg_recvlogical.c does. > > Makes sense. > > > 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters > > main loop to get the wal from the server. This avoids an unnecessary > > query for pg_receivewal with "--create-slot" or "--drop-slot". > > 5) Have an assertion after the pg_receivewal done a good amount of > > work to find start timeline and LSN might be helpful: > > Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr); > > > > Attaching a patch that does take care of above improvements. Thoughts? > > Overall I think it is good. Thanks for your review. > However, you have some formatting issues, here it mixes tabs and spaces: > > + /* > + * No valid data can be found in the existing output > directory. > + * Get start LSN position and current timeline ID from > the server. > + */ My bad. I forgot to run "git diff --check" on the v1 patch. > And here it is not formatted properly: > > +static char *server_sysid = NULL; Done. Here's the v2 with above modifications. Regards, Bharath Rupireddy.
Attachment
> Here's the v2 with above modifications. I was looking at this patch, and I agree that checking for the system ID and the timeline by setting sysidentifier beforehand looks like an improvement. The extra IDENTIFY_SYSTEM done at the beginning of StreamLog() is not a performance bottleneck as we run it only once for each loop. I don't really get the argument of a server replacing another one on the same port requiring to rely only on the first system ID fetched before starting the loops of StreamLog() calls. So I would leave main() alone, but fill in the system ID from RunIdentifySystem() in StreamLog(). -- Michael
Attachment
On Thu, Sep 16, 2021 at 9:31 AM Michael Paquier <michael@paquier.xyz> wrote: > > > Here's the v2 with above modifications. > > I was looking at this patch, and I agree that checking for the system > ID and the timeline by setting sysidentifier beforehand looks like an > improvement. > > The extra IDENTIFY_SYSTEM done at the beginning of StreamLog() is not > a performance bottleneck as we run it only once for each loop. I > don't really get the argument of a server replacing another one on the > same port requiring to rely only on the first system ID fetched before > starting the loops of StreamLog() calls. So I would leave main() > alone, but fill in the system ID from RunIdentifySystem() in > StreamLog(). Thanks. I changed the code that way. PSA v3 patch. Regards, Bharath Rupireddy.
Attachment
On Fri, Sep 17, 2021 at 11:46:33AM +0530, Bharath Rupireddy wrote: > Thanks. I changed the code that way. PSA v3 patch. Thanks. Done. -- Michael