Thread: pg13: xlogreader API adjust
Hello Per discussion in thread [1], I propose the following patch to give another adjustment to the xlogreader API. This results in a small but not insignificat net reduction of lines of code. What this patch does is adjust the signature of these new xlogreader callbacks, making the API simpler. The changes are: * the segment_open callback installs the FD in xlogreader state itself, instead of passing the FD back. This was suggested by Kyotaro Horiguchi in that thread[2]. * We no longer pass segcxt to segment_open; it's in XLogReaderState, which is already an argument. * We no longer pass seg/segcxt to WALRead; instead, that function takes them from XLogReaderState, which is already an argument. (This means XLogSendPhysical has to drink more of the fake_xlogreader kool-aid.) I claim the reason to do it now instead of pg14 is to make it simpler for third-party xlogreader callers to adjust. (Some might be thinking that I do this to avoid an API change later, but my guts tell me that we'll adjust xlogreader again in pg14 for the encryption stuff and other reasons, so.) [1] https://postgr.es/m/20200406025651.fpzdb5yyb7qyhqko@alap3.anarazel.de [2] https://postgr.es/m/20200508.114228.963995144765118400.horikyota.ntt@gmail.com -- Álvaro Herrera Developer, https://www.PostgreSQL.org/
Attachment
At Mon, 11 May 2020 16:33:36 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > Hello > > Per discussion in thread [1], I propose the following patch to give > another adjustment to the xlogreader API. This results in a small but > not insignificat net reduction of lines of code. What this patch does > is adjust the signature of these new xlogreader callbacks, making the > API simpler. The changes are: > > * the segment_open callback installs the FD in xlogreader state itself, > instead of passing the FD back. This was suggested by Kyotaro > Horiguchi in that thread[2]. > > * We no longer pass segcxt to segment_open; it's in XLogReaderState, > which is already an argument. > > * We no longer pass seg/segcxt to WALRead; instead, that function takes > them from XLogReaderState, which is already an argument. > (This means XLogSendPhysical has to drink more of the fake_xlogreader > kool-aid.) > > I claim the reason to do it now instead of pg14 is to make it simpler > for third-party xlogreader callers to adjust. > > (Some might be thinking that I do this to avoid an API change later, but > my guts tell me that we'll adjust xlogreader again in pg14 for the > encryption stuff and other reasons, so.) > > > [1] https://postgr.es/m/20200406025651.fpzdb5yyb7qyhqko@alap3.anarazel.de > [2] https://postgr.es/m/20200508.114228.963995144765118400.horikyota.ntt@gmail.com The simplified interface of WALRead looks far better to me since it no longer has unreasonable duplicates of parameters. I agree to the discussion about third-party xlogreader callers but not sure about back-patching burden. I'm not sure the reason for wal_segment_open and WalSndSegmentOpen being modified different way about error handling of BasicOpenFile, I prefer the WalSndSegmentOpen way. However, that difference doesn't harm anything so I'm fine with the current patch. + fake_xlogreader.seg = *sendSeg; + fake_xlogreader.segcxt = *sendCxt; fake_xlogreader.seg is a different instance from *sendSeg. WALRead modifies fake_xlogreader.seg but does not modify *sendSeg. Thus the change doesn't persist. On the other hand WalSndSegmentOpen reads *sendSeg, which is not under control of WALRead. Maybe we had better to make fake_xlogreader be a global variable of walsender.c that covers the current sendSeg and sendCxt. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020-May-12, Kyotaro Horiguchi wrote: > I'm not sure the reason for wal_segment_open and WalSndSegmentOpen > being modified different way about error handling of BasicOpenFile, I > prefer the WalSndSegmentOpen way. However, that difference doesn't > harm anything so I'm fine with the current patch. Yeah, I couldn't decide which style I liked the most. I used the one you suggested. > + fake_xlogreader.seg = *sendSeg; > + fake_xlogreader.segcxt = *sendCxt; > > fake_xlogreader.seg is a different instance from *sendSeg. WALRead > modifies fake_xlogreader.seg but does not modify *sendSeg. Thus the > change doesn't persist. On the other hand WalSndSegmentOpen reads > *sendSeg, which is not under control of WALRead. > > Maybe we had better to make fake_xlogreader be a global variable of > walsender.c that covers the current sendSeg and sendCxt. I tried that. I was about to leave it at just modifying physical walsender (simple enough, and it passed tests), but I noticed that WalSndErrorCleanup() would be a problem because we don't know if it's physical or logical walsender. So in the end I added a global 'xlogreader' pointer in walsender.c -- logical walsender sets it to the true xlogreader it has inside the logical decoding context, and physical walsender sets it to its fake xlogreader. That seems to work nicely. sendSeg/sendCxt are gone entirely. Logical walsender was doing WALOpenSegmentInit() uselessly during InitWalSender(), since it was using the separate sendSeg/sendCxt structs instead of the ones in its xlogreader. (Some mysteries become clearer!) It's slightly disquieting that the segment_close call in WalSndErrorCleanup is not covered, but in any case this should work well AFAICS. I think this is simpler to understand than formerly. Now the only silliness remaining is the fact that different users of the xlogreader interface are doing different things about the TLI. Hopefully we can unify everything to something sensible one day .. but that's not going to happen in pg13. I'll get this pushed tomorrow, unless there are further objections. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Pushed. Thanks for the help! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Pushed. Thanks for the help! This seems to have fixed bowerbird. Were you expecting that? regards, tom lane
On 2020-May-13, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Pushed. Thanks for the help! > > This seems to have fixed bowerbird. Were you expecting that? Hm, not really. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I think I've discovered a problem with 850196b6. The following steps can be used to trigger a segfault: # wal_level = logical psql postgres -c "create database testdb;" psql testdb -c "select pg_create_logical_replication_slot('slot', 'test_decoding');" psql "dbname=postgres replication=database" -c "START_REPLICATION SLOT slot LOGICAL 0/0;" From a quick glance, I think the problem starts in StartLogicalReplication() in walsender.c. The call to CreateDecodingContext() may ERROR before xlogreader is initialized in the next line, so the subsequent call to WalSndErrorCleanup() segfaults when it attempts to access xlogreader. Nathan
At Thu, 14 May 2020 01:03:48 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > I think I've discovered a problem with 850196b6. The following steps > can be used to trigger a segfault: > > # wal_level = logical > psql postgres -c "create database testdb;" > psql testdb -c "select pg_create_logical_replication_slot('slot', 'test_decoding');" > psql "dbname=postgres replication=database" -c "START_REPLICATION SLOT slot LOGICAL 0/0;" > > From a quick glance, I think the problem starts in > StartLogicalReplication() in walsender.c. The call to > CreateDecodingContext() may ERROR before xlogreader is initialized in > the next line, so the subsequent call to WalSndErrorCleanup() > segfaults when it attempts to access xlogreader. Good catch! That's not only for CreateDecodingContet. That happens everywhere in the query loop in PostgresMain() until logreader is initialized. So that also happens, for example, by starting logical replication using invalidated slot. Checking xlogreader != NULL in WalSndErrorCleanup is sufficient. It doesn't make actual difference, but the attached explicitly initialize the pointer with NULL. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 5421f5b0e63abfe18b6f09d75c44697cdac699f3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Thu, 14 May 2020 13:37:50 +0900 Subject: [PATCH] Don't check uninitialized xlog reader state WanSndErrorCleanup may be called before initialization of the variable xlogreader and crashes in that case. Let the function not to examine the xlogreader state if not yet initialized. --- src/backend/replication/walsender.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 3367aa98f8..661fa12a94 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -136,7 +136,7 @@ bool wake_wal_senders = false; * routines. */ static XLogReaderState fake_xlogreader; -static XLogReaderState *xlogreader; +static XLogReaderState *xlogreader = NULL; /* * These variables keep track of the state of the timeline we're currently @@ -315,7 +315,7 @@ WalSndErrorCleanup(void) ConditionVariableCancelSleep(); pgstat_report_wait_end(); - if (xlogreader->seg.ws_file >= 0) + if (xlogreader != NULL && xlogreader->seg.ws_file >= 0) wal_segment_close(xlogreader); if (MyReplicationSlot != NULL) -- 2.18.2
On Thu, May 14, 2020 at 02:12:25PM +0900, Kyotaro Horiguchi wrote: > Good catch! That's not only for CreateDecodingContet. That happens > everywhere in the query loop in PostgresMain() until logreader is > initialized. So that also happens, for example, by starting logical > replication using invalidated slot. Checking xlogreader != NULL in > WalSndErrorCleanup is sufficient. It doesn't make actual difference, > but the attached explicitly initialize the pointer with NULL. Alvaro, are you planning to look at that? Should we have an open item for this matter? -- Michael
Attachment
On 2020-May-15, Michael Paquier wrote: > On Thu, May 14, 2020 at 02:12:25PM +0900, Kyotaro Horiguchi wrote: > > Good catch! That's not only for CreateDecodingContet. That happens > > everywhere in the query loop in PostgresMain() until logreader is > > initialized. So that also happens, for example, by starting logical > > replication using invalidated slot. Checking xlogreader != NULL in > > WalSndErrorCleanup is sufficient. It doesn't make actual difference, > > but the attached explicitly initialize the pointer with NULL. > > Alvaro, are you planning to look at that? Should we have an open item > for this matter? On it now. I'm trying to add a test for this (needs a small change to PostgresNode->psql), but I'm probably doing something stupid in the Perl side, because it doesn't detect things as well as I'd like. Still trying, but I may be asked to evict the office soon ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
At Fri, 15 May 2020 19:24:28 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-May-15, Michael Paquier wrote: > > > On Thu, May 14, 2020 at 02:12:25PM +0900, Kyotaro Horiguchi wrote: > > > Good catch! That's not only for CreateDecodingContet. That happens > > > everywhere in the query loop in PostgresMain() until logreader is > > > initialized. So that also happens, for example, by starting logical > > > replication using invalidated slot. Checking xlogreader != NULL in > > > WalSndErrorCleanup is sufficient. It doesn't make actual difference, > > > but the attached explicitly initialize the pointer with NULL. > > > > Alvaro, are you planning to look at that? Should we have an open item > > for this matter? > > On it now. I'm trying to add a test for this (needs a small change to > PostgresNode->psql), but I'm probably doing something stupid in the Perl > side, because it doesn't detect things as well as I'd like. Still > trying, but I may be asked to evict the office soon ... FWIW, and I'm not sure which of the mail and the commit 1d3743023e was earlier, but I confirmed that the committed test in 006_logical_decoding.pl causes a crash, and the crash is fixed by the change of code. regards. -- Kyotaro Horiguchi NTT Open Source Software Center