Thread: xlog.c: removing ReadRecPtr and EndRecPtr
I spent a lot of time trying to figure out why xlog.c has global variables ReadRecPtr and EndRecPtr instead of just relying on the eponymous structure members inside the XLogReaderState. I concluded that the values are the same at most points in the code, and thus that we could just use xlogreaderstate->{Read,End}RecPtr instead. There are two places where this wouldn't produce the same results we're getting today. Both of those appear to be bugs. The reason why it's generally the case that ReadRecPtr == xlogreaderstate->ReadRecPtr and likewise for EndRecPtr is that ReadRecord() calls XLogReadRecord(), which sets the values inside the XLogReaderState, and then immediately assigns the values stored there to the global variables. There's no other code that changes the other global variables, and the only other code that changes the structure members is XLogBeginRead(). So the values can be unequal from the time XLogBeginRead() is called up until the time that the XLogReadRecord() call inside ReadRecord() returns. In practice, StartupXLOG() always calls ReadRecord() right after it calls XLogBeginRead(), and ReadRecord() does not reference either global variable before calling XLogReadRecord(), so the problem surface is limited to code that runs underneath XLogReadRecord(). XLogReadRecord() is part of xlogreader.c, but it uses a callback interface: the callback is XLogPageRead(), which itself references EndRecPtr, and also calls WaitForWALToBecomeAvailable(), which in turn calls rescanLatestTimeLine(), which also references EndRecPtr. So these are the two problem cases: XLogPageRead(), and rescanLatestTimeLine(). In rescanLatestTimeLine(), the problem is IMHO probably serious enough to justify a separate commit with back-patching. The problem is that EndRecPtr is being used here to reject impermissible attempts to switch to a bad timeline, but if pg_wal starts out empty, EndRecPtr will be 0 here, which causes the code to fail to detect a case that should be prohibited. Consider the following test case: - create a primary - create standby #1 from the primary - start standby #1 and promote it - take a backup from the primary using -Xnone to create standby #2 - clear primary_conninfo on standby #2 and then start it - copy 00000002.history from standby #1 to standby #2 You get: 2021-11-17 15:34:26.213 EST [7474] LOG: selected new timeline ID: 2 But with the attached patch, you get: 2021-11-17 16:12:01.566 EST [20900] LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/A000060 Had the problem occurred at some later point in the WAL stream rather than before fetching the very first record, I think everything is fine; at that point, I think that the global variable EndRecPtr will be initialized. I'm not entirely sure that it contains exactly the right value, but it's someplace in the right ballpark, at least. In XLogPageRead(), the problem is just cosmetic. We're only using EndRecPtr as an argument to emode_for_corrupt_record(), which is all about suppressing duplicate complaints about the same LSN. But if the xlogreader has been repositioned using XLogBeginRead() since the last call to ReadRecord(), or if there are no preceding calls to ReadRecord(), then the value of EndRecPtr here is left over from the previous read position and is not particularly related to the record we're reading now. xlogreader->EndRecPtr, OTOH, is. This doesn't seem worth a separate commit to me, or a back-patch, but it seems worth fixing while I'm cleaning up these global variables. Review appreciated. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Thu, Nov 18, 2021 at 3:31 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I spent a lot of time trying to figure out why xlog.c has global > variables ReadRecPtr and EndRecPtr instead of just relying on the > eponymous structure members inside the XLogReaderState. I concluded > that the values are the same at most points in the code, and thus that > we could just use xlogreaderstate->{Read,End}RecPtr instead. There are > two places where this wouldn't produce the same results we're getting > today. Both of those appear to be bugs. > > The reason why it's generally the case that ReadRecPtr == > xlogreaderstate->ReadRecPtr and likewise for EndRecPtr is that > ReadRecord() calls XLogReadRecord(), which sets the values inside the > XLogReaderState, and then immediately assigns the values stored there > to the global variables. There's no other code that changes the other > global variables, and the only other code that changes the structure > members is XLogBeginRead(). So the values can be unequal from the time > XLogBeginRead() is called up until the time that the XLogReadRecord() > call inside ReadRecord() returns. In practice, StartupXLOG() always > calls ReadRecord() right after it calls XLogBeginRead(), and > ReadRecord() does not reference either global variable before calling > XLogReadRecord(), so the problem surface is limited to code that runs > underneath XLogReadRecord(). XLogReadRecord() is part of xlogreader.c, > but it uses a callback interface: the callback is XLogPageRead(), > which itself references EndRecPtr, and also calls > WaitForWALToBecomeAvailable(), which in turn calls > rescanLatestTimeLine(), which also references EndRecPtr. So these are > the two problem cases: XLogPageRead(), and rescanLatestTimeLine(). > > In rescanLatestTimeLine(), the problem is IMHO probably serious enough > to justify a separate commit with back-patching. The problem is that > EndRecPtr is being used here to reject impermissible attempts to > switch to a bad timeline, but if pg_wal starts out empty, EndRecPtr > will be 0 here, which causes the code to fail to detect a case that > should be prohibited. Consider the following test case: > > - create a primary > - create standby #1 from the primary > - start standby #1 and promote it > - take a backup from the primary using -Xnone to create standby #2 > - clear primary_conninfo on standby #2 and then start it > - copy 00000002.history from standby #1 to standby #2 > > You get: > > 2021-11-17 15:34:26.213 EST [7474] LOG: selected new timeline ID: 2 > > But with the attached patch, you get: > > 2021-11-17 16:12:01.566 EST [20900] LOG: new timeline 2 forked off > current database system timeline 1 before current recovery point > 0/A000060 > Somehow with and without patch I am getting the same log. > Had the problem occurred at some later point in the WAL stream rather > than before fetching the very first record, I think everything is > fine; at that point, I think that the global variable EndRecPtr will > be initialized. I'm not entirely sure that it contains exactly the > right value, but it's someplace in the right ballpark, at least. > Agree, change seems pretty much reasonable. > In XLogPageRead(), the problem is just cosmetic. We're only using > EndRecPtr as an argument to emode_for_corrupt_record(), which is all > about suppressing duplicate complaints about the same LSN. But if the > xlogreader has been repositioned using XLogBeginRead() since the last > call to ReadRecord(), or if there are no preceding calls to > ReadRecord(), then the value of EndRecPtr here is left over from the > previous read position and is not particularly related to the record > we're reading now. xlogreader->EndRecPtr, OTOH, is. This doesn't seem > worth a separate commit to me, or a back-patch, but it seems worth > fixing while I'm cleaning up these global variables. > LGTM. Regards, Amul
On Thu, Nov 18, 2021 at 7:30 AM Amul Sul <sulamul@gmail.com> wrote: > Somehow with and without patch I am getting the same log. Try applying the attached 0001-dubious-test-cast.patch for you and see if that fails. It does for me. If so, then try applying 0002-fix-the-bug.patch and see if that makes it pass. Unfortunately, this test case isn't remotely committable as-is, and I don't know how to make it so. The main problem is that, although you can start up a server with nothing in pg_wal, no restore_command, and no archive command, pg_ctl will not believe that it has started. I worked around that problem by telling pg_ctl to ignore failures, but it still waits for a long time before timing out, which sucks both because (1) hackers are impatient and (2) some hackers run extremely slow buildfarm machines where almost any timeout won't be long enough. There's a second place where the patch needs to wait for something also, and that one I've crudely kludged with sleep(10). If anybody around here who is good at figuring out how to write clever TAP tests can tell me how to fix this test to be non-stupid, I will happily do so. Otherwise, I think I will just need to commit and back-patch the actual bug fix without a test, and then rebase the rest of the patch I posted previously over top of those changes for master only. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On 2021-Nov-18, Robert Haas wrote: > Unfortunately, this test case isn't remotely committable as-is, and I > don't know how to make it so. The main problem is that, although you > can start up a server with nothing in pg_wal, no restore_command, and > no archive command, pg_ctl will not believe that it has started. Would it work to start postmaster directly instad of using pg_ctl, and then rely on (say) pg_isready? -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
On Thu, Nov 18, 2021 at 2:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Would it work to start postmaster directly instad of using pg_ctl, and > then rely on (say) pg_isready? I *think* that pg_isready would also fail, because the documentation says "pg_isready returns 0 to the shell if the server is accepting connections normally, 1 if the server is rejecting connections (for example during startup) ..." and I believe the "during startup" case would apply here. Starting postmaster directly is a thought. Is there any existing precedent for that approach? -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > There's a second place where the patch needs to wait for something > also, and that one I've crudely kludged with sleep(10). If anybody > around here who is good at figuring out how to write clever TAP tests > can tell me how to fix this test to be non-stupid, I will happily do > so. As far as that goes, if you conceptualize it as "wait for this text to appear in the log file", there's prior art in existing TAP tests. Basically, sleep for some reasonable short period and check the log file; if not there, repeat until timeout. I'm a little dubious that this test case is valuable enough to mess around with a nonstandard postmaster startup protocol, though. The main reason I dislike that idea is that any fixes we apply to the TAP tests' normal postmaster-startup code would almost inevitably miss fixing this test. IIRC there have been security-related fixes in that area (e.g. where do we put the postmaster's socket), so I find that prospect pretty scary. regards, tom lane
On Thu, Nov 18, 2021 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > There's a second place where the patch needs to wait for something > > also, and that one I've crudely kludged with sleep(10). If anybody > > around here who is good at figuring out how to write clever TAP tests > > can tell me how to fix this test to be non-stupid, I will happily do > > so. > > As far as that goes, if you conceptualize it as "wait for this text > to appear in the log file", there's prior art in existing TAP tests. > Basically, sleep for some reasonable short period and check the > log file; if not there, repeat until timeout. Yeah, there's something like that in the form of find_in_log in 019_replslot_init.pl. I thought about copying that, but that didn't seem great, and I also thought about trying to move into a common module, which seems maybe better but also more work, and thus not worth doing unless we have agreement that it's what we should do. > I'm a little dubious that this test case is valuable enough to > mess around with a nonstandard postmaster startup protocol, though. > The main reason I dislike that idea is that any fixes we apply to > the TAP tests' normal postmaster-startup code would almost inevitably > miss fixing this test. IIRC there have been security-related fixes in > that area (e.g. where do we put the postmaster's socket), so I find > that prospect pretty scary. The problem that I have with the present situation is that the test coverage of xlog.c is pretty abysmal. It actually doesn't look bad if you just run a coverage report, but there are a shazillion flag variables in that file and bugs like this make it quite clear that we don't come close to testing all the possible combinations. It's really borderline unmaintainable. I don't know whether there's a specific individual who wrote most of this code and didn't get the memo that global variables are best avoided, or whether this is sort of case where we started over 1 or 2 and then it just gradually ballooned into the giant mess that it now is, but the present situation is pretty outrageous. It's taking me weeks of time to figure out how to make changes that would normally take days, or maybe hours. We clearly need to try both to get the number of cases under control by eliminating stupid variables that are almost but not quite the same as something else, and also get proper test coverage for the things that remain so that it's possible to modify the code without excesive risk of shooting ourselves in the foot. That said, I'm not wedded to this particular test case, either. It's an extremely specific bug that is unlikely to reappear once squashed, and making the test case robust enough to avoid having the buildfarm complain seems fraught. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Nov 18, 2021 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm a little dubious that this test case is valuable enough to >> mess around with a nonstandard postmaster startup protocol, though. > The problem that I have with the present situation is that the test > coverage of xlog.c is pretty abysmal. Agreed, but this one test case isn't going to move the needle much. To get to reasonable coverage we're going to need more tests, and I definitely don't want multiple versions of ad-hoc postmaster startup code. If we need that, it'd be smarter to extend Cluster.pm so that the mainline code could do what's needful. Having said that, it wasn't entirely clear to me why you needed a weird startup method. Why couldn't you drop the bogus history file into place *before* starting the charlie postmaster? If you have to do it after, aren't there race/timing problems anyway? regards, tom lane
On Thu, Nov 18, 2021 at 4:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Thu, Nov 18, 2021 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I'm a little dubious that this test case is valuable enough to > >> mess around with a nonstandard postmaster startup protocol, though. > > > The problem that I have with the present situation is that the test > > coverage of xlog.c is pretty abysmal. > > Agreed, but this one test case isn't going to move the needle much. > To get to reasonable coverage we're going to need more tests, and > I definitely don't want multiple versions of ad-hoc postmaster startup > code. If we need that, it'd be smarter to extend Cluster.pm so that > the mainline code could do what's needful. Perhaps so. I don't have a clear view on what a full set of good tests would look like, so it's hard for me to guess which needs are general and which are not. > Having said that, it wasn't entirely clear to me why you needed a > weird startup method. Why couldn't you drop the bogus history file > into place *before* starting the charlie postmaster? If you have > to do it after, aren't there race/timing problems anyway? Well, I need rescanLatestTimeLine() to be called. I'm not sure that will happen if the file is there originally -- that sounds like more of a scan than a rescan, but I haven't poked at that angle. I also think it doesn't matter whether the file is dropped in or whether it is restored via restore_command, so having the server restore the file rather than discover that it is appeared might be another and more satisfying option, but I also have not tested whether that reproduces the issue. This has been extremely time-consuming to hunt down. -- Robert Haas EDB: http://www.enterprisedb.com
On 2021-Nov-18, Robert Haas wrote: > On Thu, Nov 18, 2021 at 2:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Would it work to start postmaster directly instad of using pg_ctl, and > > then rely on (say) pg_isready? > > I *think* that pg_isready would also fail, because the documentation > says "pg_isready returns 0 to the shell if the server is accepting > connections normally, 1 if the server is rejecting connections (for > example during startup) ..." and I believe the "during startup" case > would apply here. Hmm, right ... I suppose there are other ways to check, but I'm not sure that the value of adding this particular test is large enough to justify such hacks. I think one possibly useful technique might be Alexander Korotkov's stop events[1], except that it is designed around having working SQL access to the server in order to control it. You'd need some frosting on top in order to control the startup sequence without SQL access. > Starting postmaster directly is a thought. Is there any existing > precedent for that approach? Not as far as I can see. [1] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Entristecido, Wutra (canción de Las Barreras) echa a Freyr a rodar y a nosotros al mar"
On Fri, Nov 19, 2021 at 12:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Nov 18, 2021 at 7:30 AM Amul Sul <sulamul@gmail.com> wrote: > > Somehow with and without patch I am getting the same log. > > Try applying the attached 0001-dubious-test-cast.patch for you and see > if that fails. It does for me. If so, then try applying > 0002-fix-the-bug.patch and see if that makes it pass. > Thanks, I can see the reported behavior -- 0001 alone fails & 0002 corrects that. > Unfortunately, this test case isn't remotely committable as-is, and I > don't know how to make it so. The main problem is that, although you > can start up a server with nothing in pg_wal, no restore_command, and > no archive command, pg_ctl will not believe that it has started. I > worked around that problem by telling pg_ctl to ignore failures, but > it still waits for a long time before timing out, which sucks both > because (1) hackers are impatient and (2) some hackers run extremely > slow buildfarm machines where almost any timeout won't be long enough. Yeah :( > There's a second place where the patch needs to wait for something > also, and that one I've crudely kludged with sleep(10). If anybody > around here who is good at figuring out how to write clever TAP tests > can tell me how to fix this test to be non-stupid, I will happily do > so. > > Otherwise, I think I will just need to commit and back-patch the > actual bug fix without a test, and then rebase the rest of the patch I > posted previously over top of those changes for master only. > +1. Regards, Amul
On Wed, Nov 17, 2021 at 5:01 PM Robert Haas <robertmhaas@gmail.com> wrote: > In rescanLatestTimeLine(), the problem is IMHO probably serious enough > to justify a separate commit with back-patching. On second thought, I think it's better not to back-patch this fix. It turns out that, while it's easy enough to back-patch the proposed patch, it doesn't make the test case pass in pre-v14 versions. The test case itself requires some changes to work at all because of the perl module renaming, but that's not a big deal. The issue is that, in the back-branches, when starting up the server without any local WAL, rescanLatestTimeLine() is checking with not only the wrong LSN but also with the wrong TLI. That got fixed in master by commit 4a92a1c3d1c361ffb031ed05bf65b801241d7cdd even though, rather unfortunately, the commit message does not say so. So to back-patch this commit we would need to also back-patch much of that commit. But that commit depends on the other commits that reduced use of ThisTimeLineID. Untangling that seems like an undue amount of work and risk for a corner-case bug fix that was discovered in the lab rather than in the field and which won't matter anyway if you do things correctly. So now I'm intending to commit to just to master only. Attached please find the test case not for commit as v2-0001-dubious-test-case.patch; the fix, for commit and now with a proper commit message as v2-0002-Fix-corner-case-failure-to-detect-improper-timeli.patch; and the back-ported test case for v14 as v14-0001-dubious-test-case.patch in case anyone wants to play around with that. (with apologies for using the idea of a version number in two different and conflicting senses) -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Tue, Nov 23, 2021 at 1:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > Attached please find the test case not for commit as > v2-0001-dubious-test-case.patch; the fix, for commit and now with a > proper commit message as > v2-0002-Fix-corner-case-failure-to-detect-improper-timeli.patch; and > the back-ported test case for v14 as v14-0001-dubious-test-case.patch > in case anyone wants to play around with that. (with apologies for > using the idea of a version number in two different and conflicting > senses) OK, I have committed this patch, rebased the original combined patch over top of that, and committed that too. By my count that means I've now removed a total of 3 global variables from this file and Amul got rid of one as well so that makes 4 in total. If we continue at this brisk pace the code may become understandable and maintainable sometime prior to the heat death of the universe, which I think would be rather nice. -- Robert Haas EDB: http://www.enterprisedb.com