Thread: RecoveryInProgress() has critical side effects

RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
Hi,

I have noticed that $SUBJECT, and I think it's bad, and I'd like to
propose some patches to fix it. 0001 is hopefully uncontroversial, but
0002 might need some discussion as I'm not quite sure what approach to
take.

The reason why RecoveryInProgress() has critical side effects is that
it calls InitXLOGAccess(). It turns out that the only critical thing
that InitXLOGAccess() does is call InitXLogInsert().[1] If a backend
doesn't call InitXLogInsert(), max_rdatas won't be initialized, and
the first attempt to insert WAL will probably fail with something like
"ERROR: too much WAL data". But there's no real reason why
InitXLogInsert() needs to happen only after recovery is finished. The
work that it does is just setting up data structures in backend-local
memory, and not much is lost if they are set up and not used. So, in
0001, I propose to move the call to InitXLogInsert() from
InitXLOGAccess() to BaseInit(). That means every backend will do it
the first time it starts up. It also means that CreateCheckPoint()
will no longer need a special case call to InitXLogInsert(), which
seems like a good thing.

Now, what about the other stuff that InitXLOGAccess() does? As far as
I can tell, "wal_segment_size = ControlFile->xlog_seg_size" is always
redundant, because LocalProcessControlFile() has always been called,
and it does the same thing. The postmaster does that call, and in
non-EXEC_BACKEND builds, the global variable setting will be inherited
by all child processes. In EXEC_BACKEND builds, every child process
will go through SubPostmasterMain() which also calls
LocalProcessControlFile() very early in the startup sequence.
InitXLOGAccess() also sets RedoRecPtr and doPageWrites ... but these
are non-critical values. If you set them to 0 and false respectively
and then call XLogInsert(), GetFullPagesWritesInfo() will return 0 and
false, XLogRecordAssemble() will use those values to make wrong
decisions, and then XLogInsertRecord() will notice that the wrong
values were used and will update them and cause XLogInsert() to loop
around. Similarly, XLogCheckpointNeeded() relies on the caller to set
RedoRecPtr appropriately, but all callers recheck after updating
RedoRecPtr, so it's fine if the initial value is zero. The only other
interesting case is XLogCheckBufferNeedsBackup(), which calls
GetFullPageWriteInfo(), but if somehow it gets the wrong answer, the
only thing that can happen is we might come the wrong conclusion about
whether a heap-update WAL record should attempt to compress by looking
for a common prefix and suffix between the old and new tuples. That's
non-critical, and if it happens or doesn't happen at most once per
backend, fine.

Nevertheless, what I did in 0002 is remove InitXLOGAccess() but move
the initialization of RedoRecPtr and doPageWrites into
GetFullPageWritesInfo(), only in the case where RedoRecPtr hasn't been
set yet. This has one property that I really really like, which is
that makes the code more understandable. There is no suggestion that
the correctness of WAL insertion somehow depends on a
RecoveryInProgress() call which may or may not have occurred at some
arbitrarily distant time in the past. Initializing the value at the
point of first use is just way clearer than initializing it as a
side-effect of some other function that's not even that tightly
connected. However, it does have the effect of introducing a branch
into GetFullPageWritesInfo(), which I'm not 100% sure is cheap enough
not to matter. I think the obvious alternative is to just not
initialize RedoRecPtr and doPageWrites at all and document in the
comments that the first time each backend does something with them
it's going to end up retrying once; perhaps that is preferable. Going
the other way, we could get rid of the global variables altogether and
have GetFullPageWrites() read the data from shared memory. However,
unless 8-byte atomicity is guaranteed, that requires a spinlock, so it
seems likely unappealing.

(In case the explanation above is unclear to casual readers, here's a
bit more detail: RedoRecPtr is the redo pointer from what we believe
to be the most recent checkpoint record, and doFullPageWrites is
whether full_page_writes are required at the moment. The latter can be
true either because full_page_writes=on or because a base backup is in
progress. When assembling an XLOG record, we have to decide whether to
include full-page images in that or not, and we do that based on the
value of these variables: if full page writes are required in general
and if the page LSN precedes RedoRecPtr, we include a full page image.
If it turns out that our RedoRecPtr was out of date - i.e. there's a
newer checkpoint we didn't know about - then we go back and
re-assemble the record to make sure all required full-page images are
included.)

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] Isn't it great that these two similarly-named functions capitalize
"xlog" differently? Wa-hoo.

Attachment

Re: RecoveryInProgress() has critical side effects

From
"Bossart, Nathan"
Date:
On 11/11/21, 9:20 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> memory, and not much is lost if they are set up and not used. So, in
> 0001, I propose to move the call to InitXLogInsert() from
> InitXLOGAccess() to BaseInit(). That means every backend will do it
> the first time it starts up. It also means that CreateCheckPoint()
> will no longer need a special case call to InitXLogInsert(), which
> seems like a good thing.

0001 looks reasonable to me.

> connected. However, it does have the effect of introducing a branch
> into GetFullPageWritesInfo(), which I'm not 100% sure is cheap enough
> not to matter. I think the obvious alternative is to just not
> initialize RedoRecPtr and doPageWrites at all and document in the
> comments that the first time each backend does something with them
> it's going to end up retrying once; perhaps that is preferable. Going

Unless there's a demonstrable performance benefit from adding the
branch, my preference is to leave it out.  I could be off-base, but it
seems possible that future changes might end up depending on any side
effects from this new branch, which is exactly what you're trying to
fix.  Plus, always using the retry path is a good way to test that it
works as expected.

Nathan


Re: RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
On Fri, Nov 12, 2021 at 10:23 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> 0001 looks reasonable to me.

Cool.

> Unless there's a demonstrable performance benefit from adding the
> branch, my preference is to leave it out.  I could be off-base, but it
> seems possible that future changes might end up depending on any side
> effects from this new branch, which is exactly what you're trying to
> fix.  Plus, always using the retry path is a good way to test that it
> works as expected.

Here's a new version that does it that way. Any other opinions?

The best thing I could come up with for a test case for this was to
try repeatedly making a new connection and running "SELECT
txid_current()", which will cause just one WAL record to be generated.
Unfortunately that path has overhead from a lot of other causes so I'm
not sure the results are very meaningful, but here they are:

v1: 0.378 ms
v2: 0.391 ms
common base commit (10eae82b2): 0.376 ms

methodology:
for i in `seq 1 1000`; do psql -c '\timing' -c 'select
txid_current();'; done | grep '^Time:' | awk '{total+=$2} END {print
total/NR}'
run twice, discarding the first result, since the very first
connection attempt after starting a new server process is notably
slower

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: RecoveryInProgress() has critical side effects

From
"Bossart, Nathan"
Date:
On 11/15/21, 1:30 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> Here's a new version that does it that way. Any other opinions?

LGTM

> The best thing I could come up with for a test case for this was to
> try repeatedly making a new connection and running "SELECT
> txid_current()", which will cause just one WAL record to be generated.
> Unfortunately that path has overhead from a lot of other causes so I'm
> not sure the results are very meaningful, but here they are:
>
> v1: 0.378 ms
> v2: 0.391 ms
> common base commit (10eae82b2): 0.376 ms

I'm personally not too worried about a ~4% regression in this
particular benchmark...

Nathan


Re: RecoveryInProgress() has critical side effects

From
Michael Paquier
Date:
On Mon, Nov 15, 2021 at 10:09:17PM +0000, Bossart, Nathan wrote:
> On 11/15/21, 1:30 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>> Here's a new version that does it that way. Any other opinions?
>
> LGTM

Patch 0001 means that the startup process would set up the structures
to be able to build WAL records before running any kind of recovery
action rather than doing it when it really needs it.  That's fine by
me.

Is patch 0002 actually right regarding the handling of doPageWrites?
Once applied, we finish by setting it when the startup process starts
and not anymore at the end of recovery based on its the state of
Insert, but this could have changed while in recovery when replaying
one or more XLOG_FPW_CHANGE records.

> I'm personally not too worried about a ~4% regression in this
> particular benchmark...

This is not a hot code path, that should be fine.
--
Michael

Attachment

Re: RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
On Mon, Nov 15, 2021 at 9:34 PM Michael Paquier <michael@paquier.xyz> wrote:
> Patch 0001 means that the startup process would set up the structures
> to be able to build WAL records before running any kind of recovery
> action rather than doing it when it really needs it.  That's fine by
> me.

Great, thanks. I think I'll go ahead and commit this one, then.

> Is patch 0002 actually right regarding the handling of doPageWrites?
> Once applied, we finish by setting it when the startup process starts
> and not anymore at the end of recovery based on its the state of
> Insert, but this could have changed while in recovery when replaying
> one or more XLOG_FPW_CHANGE records.

Maybe I'm not understanding you properly here, but it seems like you
might be forgetting that this is a local variable and thus every
backend is going to have something different. In the startup process,
it will be initialized by StartupXLOG(); in other processes, it's
currently initialized by RecoveryInProgress(), but with this patch it
wouldn't be. Either way, it's then updated by future calls to
XLogInsertRecord() as required. XLOG_FPW_CHANGE records might affect
the new value that gets set the next time XLogInsertRecord(), but they
don't directly affect doPageWrites.

> > I'm personally not too worried about a ~4% regression in this
> > particular benchmark...
>
> This is not a hot code path, that should be fine.

OK. I'll wait a while and see if anyone else wants to weigh in.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: RecoveryInProgress() has critical side effects

From
Andres Freund
Date:
Hi,

On 2021-11-11 12:15:03 -0500, Robert Haas wrote:
> The reason why RecoveryInProgress() has critical side effects is that
> it calls InitXLOGAccess(). It turns out that the only critical thing
> that InitXLOGAccess() does is call InitXLogInsert().[1] If a backend
> doesn't call InitXLogInsert(), max_rdatas won't be initialized, and
> the first attempt to insert WAL will probably fail with something like
> "ERROR: too much WAL data". But there's no real reason why
> InitXLogInsert() needs to happen only after recovery is finished.

I think we should consider allocating that memory in postmaster, on !windows
at least. Or, perhaps even better, to initially use some static variable, and
only use a separate memory context when XLogEnsureRecordSpace().  Reserving
all that memory in every process just needlessly increases our per-connection
memory usage, for no good reason.


> The
> work that it does is just setting up data structures in backend-local
> memory, and not much is lost if they are set up and not used. So, in
> 0001, I propose to move the call to InitXLogInsert() from
> InitXLOGAccess() to BaseInit(). That means every backend will do it
> the first time it starts up. It also means that CreateCheckPoint()
> will no longer need a special case call to InitXLogInsert(), which
> seems like a good thing.

Yes. Seems making BaseInit() being executed at a halfway consistent point is
starting to have some benefits at least ;)


> Nevertheless, what I did in 0002 is remove InitXLOGAccess() but move
> the initialization of RedoRecPtr and doPageWrites into
> GetFullPageWritesInfo(), only in the case where RedoRecPtr hasn't been
> set yet. This has one property that I really really like, which is
> that makes the code more understandable. There is no suggestion that
> the correctness of WAL insertion somehow depends on a
> RecoveryInProgress() call which may or may not have occurred at some
> arbitrarily distant time in the past. Initializing the value at the
> point of first use is just way clearer than initializing it as a
> side-effect of some other function that's not even that tightly
> connected. However, it does have the effect of introducing a branch
> into GetFullPageWritesInfo(), which I'm not 100% sure is cheap enough
> not to matter.

Hm. Compared to the other costs of the XLogInsert do/while loop it probably
doesn't matter. One nearly-always-false branch is vastly cheaper than going
through the loop unnecessarily. Sure, the unnecessarily iteration saved will
only be the first (for now, it might be worth to refresh the values more
often), but there's enough branches in the body of the loop that I can't
really see it mattering.

Maybe kinda conceivably the cost of that branch could be an argument if
GetFullPageWritesInfo() were inline in XLogInsert(). But it isn't.


> I think the obvious alternative is to just not
> initialize RedoRecPtr and doPageWrites at all and document in the
> comments that the first time each backend does something with them
> it's going to end up retrying once; perhaps that is preferable. Going
> the other way, we could get rid of the global variables altogether and
> have GetFullPageWrites() read the data from shared memory. However,
> unless 8-byte atomicity is guaranteed, that requires a spinlock, so it
> seems likely unappealing.

I think it'd be fine to just make them 8byte atomics, which'd be lock-free on
relevant platforms (at least once the aarch64 thing is fixed, there's a
patch).

XLogCtlInsert already takes care to ensure that RedoRecPtr/fullPageWrites are
on a cacheline that's not constantly modified.


If we really wanted to optimize the no-8-byte-single-copy-atomicity path, we
could introduce a counter counting how many times RedoRecPtr/fullPageWrites
have changed. But it seems unlikely to be worth it.


Greetings,

Andres Freund



Re: RecoveryInProgress() has critical side effects

From
Andres Freund
Date:
Hi,

On 2021-11-15 16:29:28 -0500, Robert Haas wrote:
> v1: 0.378 ms
> v2: 0.391 ms
> common base commit (10eae82b2): 0.376 ms

Is this with assertions enabled? Because on my workstation (which likely is
slower on a single-core basis than your laptop) I get around half of this with
an optimized build (and a non-optimized build rarely is worth using as a
measuring stick). It could also be that your psqlrc does something
heavyweight?


> methodology:
> for i in `seq 1 1000`; do psql -c '\timing' -c 'select
> txid_current();'; done | grep '^Time:' | awk '{total+=$2} END {print
> total/NR}'
> run twice, discarding the first result, since the very first
> connection attempt after starting a new server process is notably
> slower

Hm. I think this might included a bunch of convoluting factors that make it
hard to judge the actual size of the performance difference. It'll
e.g. include a fair bit of syscache initialization overhead that won't change
between the two approaches. This could be addressed by doing something like -c
"SELECT 'txid_current'::regproc;" first.

Also, the psql invocations itself use up a fair bit of time, even if you
ignored them from the result with the \timing trick. A pgbench -C doing 1k
SELECT 1;s is about ~1.5s for me, whereas psql is ~5.7s.

Just to size up the order of magnitude of overhead of the connection
establishment and syscache initialization, I ran a pgbench with a script of

SELECT 1;
SELECT 1;
SELECT 'txid_current()'::regprocedure;
SELECT 'txid_current()'::regprocedure;
SELECT txid_current();
SELECT txid_current();

and ran that with pgbench -n -f /tmp/txid.sql -C -t1000 -P1 --report-latencies
and got
statement latencies in milliseconds:
         0.125  SELECT 1;
         0.024  SELECT 1;
         0.095  SELECT 'txid_current()'::regprocedure;
         0.025  SELECT 'txid_current()'::regprocedure;
         0.033  SELECT txid_current();
         0.024  SELECT txid_current();

which nicely shows how much of the overhead is not related to the actual
txid_current() invocation.

Greetings,

Andres Freund



Re: RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
On Tue, Nov 16, 2021 at 2:27 PM Andres Freund <andres@anarazel.de> wrote:
> Is this with assertions enabled? Because on my workstation (which likely is
> slower on a single-core basis than your laptop) I get around half of this with
> an optimized build (and a non-optimized build rarely is worth using as a
> measuring stick). It could also be that your psqlrc does something
> heavyweight?

Assertions disabled. ~/.psqlrc does not exist.

> Hm. I think this might included a bunch of convoluting factors that make it
> hard to judge the actual size of the performance difference.

Yes, I think so, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: RecoveryInProgress() has critical side effects

From
Andres Freund
Date:
Hi,

On 2021-11-16 15:19:19 -0500, Robert Haas wrote:
> > Hm. I think this might included a bunch of convoluting factors that make it
> > hard to judge the actual size of the performance difference.
> 
> Yes, I think so, too.

FWIW I ran that pgench thing I presented upthread, and I didn't see any
meaningful and repeatable performance difference 354a1f8d220, ad26ee28250 and
0002 applied ontop of ad26ee28250. The run-to-run variance is way higher than
the difference between the changes.

Greetings,

Andres Freund



Re: RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
On Tue, Nov 16, 2021 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-11-16 15:19:19 -0500, Robert Haas wrote:
> > > Hm. I think this might included a bunch of convoluting factors that make it
> > > hard to judge the actual size of the performance difference.
> >
> > Yes, I think so, too.
>
> FWIW I ran that pgench thing I presented upthread, and I didn't see any
> meaningful and repeatable performance difference 354a1f8d220, ad26ee28250 and
> 0002 applied ontop of ad26ee28250. The run-to-run variance is way higher than
> the difference between the changes.

Thanks. I suspected that the results I was seeing were not meaningful,
but it's hard to be sure when the results seem to be repeatable
locally.

I'm still not entirely clear on whether you prefer v1-0002, v2-0002,
or something else.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: RecoveryInProgress() has critical side effects

From
Andres Freund
Date:
On 2021-11-16 16:30:27 -0500, Robert Haas wrote:
> I'm still not entirely clear on whether you prefer v1-0002, v2-0002,
> or something else.

I think it basically doesn't matter much. It's such a small piece of the cost
compared to either the cost of a single insert or the ratio between
RedoRecPtr/FPW changes and the number of inserted records.

I guess v2-0002 is mildly simpler, so I very weakly lean towards that.



Re: RecoveryInProgress() has critical side effects

From
Heikki Linnakangas
Date:
On 17/11/2021 00:04, Andres Freund wrote:
> On 2021-11-16 16:30:27 -0500, Robert Haas wrote:
>> I'm still not entirely clear on whether you prefer v1-0002, v2-0002,
>> or something else.
> 
> I think it basically doesn't matter much. It's such a small piece of the cost
> compared to either the cost of a single insert or the ratio between
> RedoRecPtr/FPW changes and the number of inserted records.
> 
> I guess v2-0002 is mildly simpler, so I very weakly lean towards that.

I was going to argue for v1, because I was worried about generating 
full-page writes unnecessarily. But looking closer, it doesn't do that. 
It will initially think that no full page images are needed, and perform 
the extra loop in XLogInsert() to try again with full page images. 
Similarly in heap_update(), it will fail in the direction of sometimes 
doing the "prefix-suffix compression" unnecessarily, which means a 
little bit of extra CPU work, but it won't affect the WAL record that 
gets written.

So either way is fine, really. That's a bit subtle, though, a comment 
somewhere would be good.

But here's yet another idea: We could initialize RedoRecPtr and 
doPageWrites in InitXLogInsert(). It would seem like a pretty natural 
place for it.

- Heikki

PS. typo in patch v2: s/prepard/prepared/



Re: RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
On Sat, Nov 20, 2021 at 3:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> But here's yet another idea: We could initialize RedoRecPtr and
> doPageWrites in InitXLogInsert(). It would seem like a pretty natural
> place for it.

I considered that, but it seemed like an abstraction violation to me,
because that code is part of xloginsert.c, which is mostly concerned
with assembly of write-ahead log records, not the associated
shared-memory state. Also, I don't think there's any guarantee that
the state in shared memory is initialized at the time we call that
function, so we might just be copying uninitialized memory into other
uninitialized memory.

> PS. typo in patch v2: s/prepard/prepared/

Thanks, fixed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: RecoveryInProgress() has critical side effects

From
Alvaro Herrera
Date:
There are a couple typos.  "uninitalized," in one place.

Also,

+        * If this is either a bootstrap process nor a standalone backend, start

Is "either .. nor" correct?  It looks very wrong to me.  I think you
meant "either .. or".

Overall, this patch seems mostly about documenting some undesirable
existing behavior.  Is part of your plan to get such things fixed/
replaced entirely?

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"We're here to devour each other alive"            (Hobbes)



Re: RecoveryInProgress() has critical side effects

From
Michael Paquier
Date:
On Tue, Nov 16, 2021 at 09:41:58AM -0500, Robert Haas wrote:
> Maybe I'm not understanding you properly here, but it seems like you
> might be forgetting that this is a local variable and thus every
> backend is going to have something different. In the startup process,
> it will be initialized by StartupXLOG(); in other processes, it's
> currently initialized by RecoveryInProgress(), but with this patch it
> wouldn't be. Either way, it's then updated by future calls to
> XLogInsertRecord() as required. XLOG_FPW_CHANGE records might affect
> the new value that gets set the next time XLogInsertRecord(), but they
> don't directly affect doPageWrites.

My main worry here is that this changes slightly the definition of
doPageWrites across stable branches at the end of recovery as there
could be records generated there.  Note that GetFullPageWriteInfo()
gets called in XLogInsert(), while Insert->fullPageWrites gets updated
before CleanupAfterArchiveRecovery().  And it may influence
the value of doPageWrites in the startup process.
--
Michael

Attachment

Re: RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
On Sat, Dec 4, 2021 at 7:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> My main worry here is that this changes slightly the definition of
> doPageWrites across stable branches at the end of recovery as there
> could be records generated there.  Note that GetFullPageWriteInfo()
> gets called in XLogInsert(), while Insert->fullPageWrites gets updated
> before CleanupAfterArchiveRecovery().  And it may influence
> the value of doPageWrites in the startup process.

But ... so what? All the code that uses it retries if the value that
was tentatively used turns out to be wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: RecoveryInProgress() has critical side effects

From
Robert Haas
Date:
On Tue, Dec 7, 2021 at 5:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Dec 4, 2021 at 7:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> > My main worry here is that this changes slightly the definition of
> > doPageWrites across stable branches at the end of recovery as there
> > could be records generated there.  Note that GetFullPageWriteInfo()
> > gets called in XLogInsert(), while Insert->fullPageWrites gets updated
> > before CleanupAfterArchiveRecovery().  And it may influence
> > the value of doPageWrites in the startup process.
>
> But ... so what? All the code that uses it retries if the value that
> was tentatively used turns out to be wrong.

Despite the fact that this question didn't get further discussion, and
the fact that nobody seems quite sure what the best way of proceeding
here is, my interpretation of the comments made so far is that nobody
thinks that what we have now is superior to either of the proposed
alternatives, and there's a weak preference for v2 over v1. So I
committed that one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com