Thread: RecoveryInProgress() has critical side effects
Hi, I have noticed that $SUBJECT, and I think it's bad, and I'd like to propose some patches to fix it. 0001 is hopefully uncontroversial, but 0002 might need some discussion as I'm not quite sure what approach to take. The reason why RecoveryInProgress() has critical side effects is that it calls InitXLOGAccess(). It turns out that the only critical thing that InitXLOGAccess() does is call InitXLogInsert().[1] If a backend doesn't call InitXLogInsert(), max_rdatas won't be initialized, and the first attempt to insert WAL will probably fail with something like "ERROR: too much WAL data". But there's no real reason why InitXLogInsert() needs to happen only after recovery is finished. The work that it does is just setting up data structures in backend-local memory, and not much is lost if they are set up and not used. So, in 0001, I propose to move the call to InitXLogInsert() from InitXLOGAccess() to BaseInit(). That means every backend will do it the first time it starts up. It also means that CreateCheckPoint() will no longer need a special case call to InitXLogInsert(), which seems like a good thing. Now, what about the other stuff that InitXLOGAccess() does? As far as I can tell, "wal_segment_size = ControlFile->xlog_seg_size" is always redundant, because LocalProcessControlFile() has always been called, and it does the same thing. The postmaster does that call, and in non-EXEC_BACKEND builds, the global variable setting will be inherited by all child processes. In EXEC_BACKEND builds, every child process will go through SubPostmasterMain() which also calls LocalProcessControlFile() very early in the startup sequence. InitXLOGAccess() also sets RedoRecPtr and doPageWrites ... but these are non-critical values. If you set them to 0 and false respectively and then call XLogInsert(), GetFullPagesWritesInfo() will return 0 and false, XLogRecordAssemble() will use those values to make wrong decisions, and then XLogInsertRecord() will notice that the wrong values were used and will update them and cause XLogInsert() to loop around. Similarly, XLogCheckpointNeeded() relies on the caller to set RedoRecPtr appropriately, but all callers recheck after updating RedoRecPtr, so it's fine if the initial value is zero. The only other interesting case is XLogCheckBufferNeedsBackup(), which calls GetFullPageWriteInfo(), but if somehow it gets the wrong answer, the only thing that can happen is we might come the wrong conclusion about whether a heap-update WAL record should attempt to compress by looking for a common prefix and suffix between the old and new tuples. That's non-critical, and if it happens or doesn't happen at most once per backend, fine. Nevertheless, what I did in 0002 is remove InitXLOGAccess() but move the initialization of RedoRecPtr and doPageWrites into GetFullPageWritesInfo(), only in the case where RedoRecPtr hasn't been set yet. This has one property that I really really like, which is that makes the code more understandable. There is no suggestion that the correctness of WAL insertion somehow depends on a RecoveryInProgress() call which may or may not have occurred at some arbitrarily distant time in the past. Initializing the value at the point of first use is just way clearer than initializing it as a side-effect of some other function that's not even that tightly connected. However, it does have the effect of introducing a branch into GetFullPageWritesInfo(), which I'm not 100% sure is cheap enough not to matter. I think the obvious alternative is to just not initialize RedoRecPtr and doPageWrites at all and document in the comments that the first time each backend does something with them it's going to end up retrying once; perhaps that is preferable. Going the other way, we could get rid of the global variables altogether and have GetFullPageWrites() read the data from shared memory. However, unless 8-byte atomicity is guaranteed, that requires a spinlock, so it seems likely unappealing. (In case the explanation above is unclear to casual readers, here's a bit more detail: RedoRecPtr is the redo pointer from what we believe to be the most recent checkpoint record, and doFullPageWrites is whether full_page_writes are required at the moment. The latter can be true either because full_page_writes=on or because a base backup is in progress. When assembling an XLOG record, we have to decide whether to include full-page images in that or not, and we do that based on the value of these variables: if full page writes are required in general and if the page LSN precedes RedoRecPtr, we include a full page image. If it turns out that our RedoRecPtr was out of date - i.e. there's a newer checkpoint we didn't know about - then we go back and re-assemble the record to make sure all required full-page images are included.) Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com [1] Isn't it great that these two similarly-named functions capitalize "xlog" differently? Wa-hoo.
Attachment
On 11/11/21, 9:20 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > memory, and not much is lost if they are set up and not used. So, in > 0001, I propose to move the call to InitXLogInsert() from > InitXLOGAccess() to BaseInit(). That means every backend will do it > the first time it starts up. It also means that CreateCheckPoint() > will no longer need a special case call to InitXLogInsert(), which > seems like a good thing. 0001 looks reasonable to me. > connected. However, it does have the effect of introducing a branch > into GetFullPageWritesInfo(), which I'm not 100% sure is cheap enough > not to matter. I think the obvious alternative is to just not > initialize RedoRecPtr and doPageWrites at all and document in the > comments that the first time each backend does something with them > it's going to end up retrying once; perhaps that is preferable. Going Unless there's a demonstrable performance benefit from adding the branch, my preference is to leave it out. I could be off-base, but it seems possible that future changes might end up depending on any side effects from this new branch, which is exactly what you're trying to fix. Plus, always using the retry path is a good way to test that it works as expected. Nathan
On Fri, Nov 12, 2021 at 10:23 AM Bossart, Nathan <bossartn@amazon.com> wrote: > 0001 looks reasonable to me. Cool. > Unless there's a demonstrable performance benefit from adding the > branch, my preference is to leave it out. I could be off-base, but it > seems possible that future changes might end up depending on any side > effects from this new branch, which is exactly what you're trying to > fix. Plus, always using the retry path is a good way to test that it > works as expected. Here's a new version that does it that way. Any other opinions? The best thing I could come up with for a test case for this was to try repeatedly making a new connection and running "SELECT txid_current()", which will cause just one WAL record to be generated. Unfortunately that path has overhead from a lot of other causes so I'm not sure the results are very meaningful, but here they are: v1: 0.378 ms v2: 0.391 ms common base commit (10eae82b2): 0.376 ms methodology: for i in `seq 1 1000`; do psql -c '\timing' -c 'select txid_current();'; done | grep '^Time:' | awk '{total+=$2} END {print total/NR}' run twice, discarding the first result, since the very first connection attempt after starting a new server process is notably slower -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On 11/15/21, 1:30 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: > Here's a new version that does it that way. Any other opinions? LGTM > The best thing I could come up with for a test case for this was to > try repeatedly making a new connection and running "SELECT > txid_current()", which will cause just one WAL record to be generated. > Unfortunately that path has overhead from a lot of other causes so I'm > not sure the results are very meaningful, but here they are: > > v1: 0.378 ms > v2: 0.391 ms > common base commit (10eae82b2): 0.376 ms I'm personally not too worried about a ~4% regression in this particular benchmark... Nathan
On Mon, Nov 15, 2021 at 10:09:17PM +0000, Bossart, Nathan wrote: > On 11/15/21, 1:30 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: >> Here's a new version that does it that way. Any other opinions? > > LGTM Patch 0001 means that the startup process would set up the structures to be able to build WAL records before running any kind of recovery action rather than doing it when it really needs it. That's fine by me. Is patch 0002 actually right regarding the handling of doPageWrites? Once applied, we finish by setting it when the startup process starts and not anymore at the end of recovery based on its the state of Insert, but this could have changed while in recovery when replaying one or more XLOG_FPW_CHANGE records. > I'm personally not too worried about a ~4% regression in this > particular benchmark... This is not a hot code path, that should be fine. -- Michael
Attachment
On Mon, Nov 15, 2021 at 9:34 PM Michael Paquier <michael@paquier.xyz> wrote: > Patch 0001 means that the startup process would set up the structures > to be able to build WAL records before running any kind of recovery > action rather than doing it when it really needs it. That's fine by > me. Great, thanks. I think I'll go ahead and commit this one, then. > Is patch 0002 actually right regarding the handling of doPageWrites? > Once applied, we finish by setting it when the startup process starts > and not anymore at the end of recovery based on its the state of > Insert, but this could have changed while in recovery when replaying > one or more XLOG_FPW_CHANGE records. Maybe I'm not understanding you properly here, but it seems like you might be forgetting that this is a local variable and thus every backend is going to have something different. In the startup process, it will be initialized by StartupXLOG(); in other processes, it's currently initialized by RecoveryInProgress(), but with this patch it wouldn't be. Either way, it's then updated by future calls to XLogInsertRecord() as required. XLOG_FPW_CHANGE records might affect the new value that gets set the next time XLogInsertRecord(), but they don't directly affect doPageWrites. > > I'm personally not too worried about a ~4% regression in this > > particular benchmark... > > This is not a hot code path, that should be fine. OK. I'll wait a while and see if anyone else wants to weigh in. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-11-11 12:15:03 -0500, Robert Haas wrote: > The reason why RecoveryInProgress() has critical side effects is that > it calls InitXLOGAccess(). It turns out that the only critical thing > that InitXLOGAccess() does is call InitXLogInsert().[1] If a backend > doesn't call InitXLogInsert(), max_rdatas won't be initialized, and > the first attempt to insert WAL will probably fail with something like > "ERROR: too much WAL data". But there's no real reason why > InitXLogInsert() needs to happen only after recovery is finished. I think we should consider allocating that memory in postmaster, on !windows at least. Or, perhaps even better, to initially use some static variable, and only use a separate memory context when XLogEnsureRecordSpace(). Reserving all that memory in every process just needlessly increases our per-connection memory usage, for no good reason. > The > work that it does is just setting up data structures in backend-local > memory, and not much is lost if they are set up and not used. So, in > 0001, I propose to move the call to InitXLogInsert() from > InitXLOGAccess() to BaseInit(). That means every backend will do it > the first time it starts up. It also means that CreateCheckPoint() > will no longer need a special case call to InitXLogInsert(), which > seems like a good thing. Yes. Seems making BaseInit() being executed at a halfway consistent point is starting to have some benefits at least ;) > Nevertheless, what I did in 0002 is remove InitXLOGAccess() but move > the initialization of RedoRecPtr and doPageWrites into > GetFullPageWritesInfo(), only in the case where RedoRecPtr hasn't been > set yet. This has one property that I really really like, which is > that makes the code more understandable. There is no suggestion that > the correctness of WAL insertion somehow depends on a > RecoveryInProgress() call which may or may not have occurred at some > arbitrarily distant time in the past. Initializing the value at the > point of first use is just way clearer than initializing it as a > side-effect of some other function that's not even that tightly > connected. However, it does have the effect of introducing a branch > into GetFullPageWritesInfo(), which I'm not 100% sure is cheap enough > not to matter. Hm. Compared to the other costs of the XLogInsert do/while loop it probably doesn't matter. One nearly-always-false branch is vastly cheaper than going through the loop unnecessarily. Sure, the unnecessarily iteration saved will only be the first (for now, it might be worth to refresh the values more often), but there's enough branches in the body of the loop that I can't really see it mattering. Maybe kinda conceivably the cost of that branch could be an argument if GetFullPageWritesInfo() were inline in XLogInsert(). But it isn't. > I think the obvious alternative is to just not > initialize RedoRecPtr and doPageWrites at all and document in the > comments that the first time each backend does something with them > it's going to end up retrying once; perhaps that is preferable. Going > the other way, we could get rid of the global variables altogether and > have GetFullPageWrites() read the data from shared memory. However, > unless 8-byte atomicity is guaranteed, that requires a spinlock, so it > seems likely unappealing. I think it'd be fine to just make them 8byte atomics, which'd be lock-free on relevant platforms (at least once the aarch64 thing is fixed, there's a patch). XLogCtlInsert already takes care to ensure that RedoRecPtr/fullPageWrites are on a cacheline that's not constantly modified. If we really wanted to optimize the no-8-byte-single-copy-atomicity path, we could introduce a counter counting how many times RedoRecPtr/fullPageWrites have changed. But it seems unlikely to be worth it. Greetings, Andres Freund
Hi, On 2021-11-15 16:29:28 -0500, Robert Haas wrote: > v1: 0.378 ms > v2: 0.391 ms > common base commit (10eae82b2): 0.376 ms Is this with assertions enabled? Because on my workstation (which likely is slower on a single-core basis than your laptop) I get around half of this with an optimized build (and a non-optimized build rarely is worth using as a measuring stick). It could also be that your psqlrc does something heavyweight? > methodology: > for i in `seq 1 1000`; do psql -c '\timing' -c 'select > txid_current();'; done | grep '^Time:' | awk '{total+=$2} END {print > total/NR}' > run twice, discarding the first result, since the very first > connection attempt after starting a new server process is notably > slower Hm. I think this might included a bunch of convoluting factors that make it hard to judge the actual size of the performance difference. It'll e.g. include a fair bit of syscache initialization overhead that won't change between the two approaches. This could be addressed by doing something like -c "SELECT 'txid_current'::regproc;" first. Also, the psql invocations itself use up a fair bit of time, even if you ignored them from the result with the \timing trick. A pgbench -C doing 1k SELECT 1;s is about ~1.5s for me, whereas psql is ~5.7s. Just to size up the order of magnitude of overhead of the connection establishment and syscache initialization, I ran a pgbench with a script of SELECT 1; SELECT 1; SELECT 'txid_current()'::regprocedure; SELECT 'txid_current()'::regprocedure; SELECT txid_current(); SELECT txid_current(); and ran that with pgbench -n -f /tmp/txid.sql -C -t1000 -P1 --report-latencies and got statement latencies in milliseconds: 0.125 SELECT 1; 0.024 SELECT 1; 0.095 SELECT 'txid_current()'::regprocedure; 0.025 SELECT 'txid_current()'::regprocedure; 0.033 SELECT txid_current(); 0.024 SELECT txid_current(); which nicely shows how much of the overhead is not related to the actual txid_current() invocation. Greetings, Andres Freund
On Tue, Nov 16, 2021 at 2:27 PM Andres Freund <andres@anarazel.de> wrote: > Is this with assertions enabled? Because on my workstation (which likely is > slower on a single-core basis than your laptop) I get around half of this with > an optimized build (and a non-optimized build rarely is worth using as a > measuring stick). It could also be that your psqlrc does something > heavyweight? Assertions disabled. ~/.psqlrc does not exist. > Hm. I think this might included a bunch of convoluting factors that make it > hard to judge the actual size of the performance difference. Yes, I think so, too. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-11-16 15:19:19 -0500, Robert Haas wrote: > > Hm. I think this might included a bunch of convoluting factors that make it > > hard to judge the actual size of the performance difference. > > Yes, I think so, too. FWIW I ran that pgench thing I presented upthread, and I didn't see any meaningful and repeatable performance difference 354a1f8d220, ad26ee28250 and 0002 applied ontop of ad26ee28250. The run-to-run variance is way higher than the difference between the changes. Greetings, Andres Freund
On Tue, Nov 16, 2021 at 3:42 PM Andres Freund <andres@anarazel.de> wrote: > On 2021-11-16 15:19:19 -0500, Robert Haas wrote: > > > Hm. I think this might included a bunch of convoluting factors that make it > > > hard to judge the actual size of the performance difference. > > > > Yes, I think so, too. > > FWIW I ran that pgench thing I presented upthread, and I didn't see any > meaningful and repeatable performance difference 354a1f8d220, ad26ee28250 and > 0002 applied ontop of ad26ee28250. The run-to-run variance is way higher than > the difference between the changes. Thanks. I suspected that the results I was seeing were not meaningful, but it's hard to be sure when the results seem to be repeatable locally. I'm still not entirely clear on whether you prefer v1-0002, v2-0002, or something else. -- Robert Haas EDB: http://www.enterprisedb.com
On 2021-11-16 16:30:27 -0500, Robert Haas wrote: > I'm still not entirely clear on whether you prefer v1-0002, v2-0002, > or something else. I think it basically doesn't matter much. It's such a small piece of the cost compared to either the cost of a single insert or the ratio between RedoRecPtr/FPW changes and the number of inserted records. I guess v2-0002 is mildly simpler, so I very weakly lean towards that.
On 17/11/2021 00:04, Andres Freund wrote: > On 2021-11-16 16:30:27 -0500, Robert Haas wrote: >> I'm still not entirely clear on whether you prefer v1-0002, v2-0002, >> or something else. > > I think it basically doesn't matter much. It's such a small piece of the cost > compared to either the cost of a single insert or the ratio between > RedoRecPtr/FPW changes and the number of inserted records. > > I guess v2-0002 is mildly simpler, so I very weakly lean towards that. I was going to argue for v1, because I was worried about generating full-page writes unnecessarily. But looking closer, it doesn't do that. It will initially think that no full page images are needed, and perform the extra loop in XLogInsert() to try again with full page images. Similarly in heap_update(), it will fail in the direction of sometimes doing the "prefix-suffix compression" unnecessarily, which means a little bit of extra CPU work, but it won't affect the WAL record that gets written. So either way is fine, really. That's a bit subtle, though, a comment somewhere would be good. But here's yet another idea: We could initialize RedoRecPtr and doPageWrites in InitXLogInsert(). It would seem like a pretty natural place for it. - Heikki PS. typo in patch v2: s/prepard/prepared/
On Sat, Nov 20, 2021 at 3:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > But here's yet another idea: We could initialize RedoRecPtr and > doPageWrites in InitXLogInsert(). It would seem like a pretty natural > place for it. I considered that, but it seemed like an abstraction violation to me, because that code is part of xloginsert.c, which is mostly concerned with assembly of write-ahead log records, not the associated shared-memory state. Also, I don't think there's any guarantee that the state in shared memory is initialized at the time we call that function, so we might just be copying uninitialized memory into other uninitialized memory. > PS. typo in patch v2: s/prepard/prepared/ Thanks, fixed. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
There are a couple typos. "uninitalized," in one place. Also, + * If this is either a bootstrap process nor a standalone backend, start Is "either .. nor" correct? It looks very wrong to me. I think you meant "either .. or". Overall, this patch seems mostly about documenting some undesirable existing behavior. Is part of your plan to get such things fixed/ replaced entirely? -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "We're here to devour each other alive" (Hobbes)
On Tue, Nov 16, 2021 at 09:41:58AM -0500, Robert Haas wrote: > Maybe I'm not understanding you properly here, but it seems like you > might be forgetting that this is a local variable and thus every > backend is going to have something different. In the startup process, > it will be initialized by StartupXLOG(); in other processes, it's > currently initialized by RecoveryInProgress(), but with this patch it > wouldn't be. Either way, it's then updated by future calls to > XLogInsertRecord() as required. XLOG_FPW_CHANGE records might affect > the new value that gets set the next time XLogInsertRecord(), but they > don't directly affect doPageWrites. My main worry here is that this changes slightly the definition of doPageWrites across stable branches at the end of recovery as there could be records generated there. Note that GetFullPageWriteInfo() gets called in XLogInsert(), while Insert->fullPageWrites gets updated before CleanupAfterArchiveRecovery(). And it may influence the value of doPageWrites in the startup process. -- Michael
Attachment
On Sat, Dec 4, 2021 at 7:44 PM Michael Paquier <michael@paquier.xyz> wrote: > My main worry here is that this changes slightly the definition of > doPageWrites across stable branches at the end of recovery as there > could be records generated there. Note that GetFullPageWriteInfo() > gets called in XLogInsert(), while Insert->fullPageWrites gets updated > before CleanupAfterArchiveRecovery(). And it may influence > the value of doPageWrites in the startup process. But ... so what? All the code that uses it retries if the value that was tentatively used turns out to be wrong. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Dec 7, 2021 at 5:55 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Dec 4, 2021 at 7:44 PM Michael Paquier <michael@paquier.xyz> wrote: > > My main worry here is that this changes slightly the definition of > > doPageWrites across stable branches at the end of recovery as there > > could be records generated there. Note that GetFullPageWriteInfo() > > gets called in XLogInsert(), while Insert->fullPageWrites gets updated > > before CleanupAfterArchiveRecovery(). And it may influence > > the value of doPageWrites in the startup process. > > But ... so what? All the code that uses it retries if the value that > was tentatively used turns out to be wrong. Despite the fact that this question didn't get further discussion, and the fact that nobody seems quite sure what the best way of proceeding here is, my interpretation of the comments made so far is that nobody thinks that what we have now is superior to either of the proposed alternatives, and there's a weak preference for v2 over v1. So I committed that one. -- Robert Haas EDB: http://www.enterprisedb.com