Thread: Basebackups reported as idle

Basebackups reported as idle

From
Magnus Hagander
Date:
AFAICT, base backups running on the replication protocol are always reported as "idle" in pg_stat_activity. This seems to have been an oversight in the "include walsender backends in pg_stat_activity" in 10, which does include it for walsenders in general, just not for the ones sending base backups. (and was then improved on later with the "include all non-standard backends" patch).

Unlike the regular walsender it also has to set it back to IDLE, since you can actually finish a base backup without disconnecting.

PFA a patch that fixes this. I think this is bugfix-for-backpatch, I don't think it has a large risk of breaking things. Thoughts?

Also, in setting this, there is no real way to differentiate between a regular walsender and a basebackup walsender, other than looking at the wait events. They're both listed as walsenders. Should there be?  (That might not be as easily backpatchable, so keeping that as a separate one)

--
Attachment

Re: Basebackups reported as idle

From
David Steele
Date:
Hi Magnus,

On 12/19/17 4:56 AM, Magnus Hagander wrote:
> AFAICT, base backups running on the replication protocol are always
> reported as "idle" in pg_stat_activity. This seems to have been an
> oversight in the "include walsender backends in pg_stat_activity" in 10,
> which does include it for walsenders in general, just not for the ones
> sending base backups. (and was then improved on later with the "include
> all non-standard backends" patch).
> 
> Unlike the regular walsender it also has to set it back to IDLE, since
> you can actually finish a base backup without disconnecting.
> 
> PFA a patch that fixes this. I think this is bugfix-for-backpatch, I
> don't think it has a large risk of breaking things. Thoughts?

+1 for this being a bug, albeit a minor one.

> Also, in setting this, there is no real way to differentiate between a
> regular walsender and a basebackup walsender, other than looking at the
> wait events. They're both listed as walsenders. Should there be?  (That
> might not be as easily backpatchable, so keeping that as a separate one)

Maybe something like "walsender [backup]" or just "basebackup" since
walsender is pretty misleading?  It think it would be nice to be able to
tell them apart, though I don't think it should be back-patched.  People
might be relying on the name in the current versions.

Thanks!
-- 
-David
david@pgmasters.net


Re: Basebackups reported as idle

From
Michael Paquier
Date:
On Tue, Dec 19, 2017 at 11:50 PM, David Steele <david@pgmasters.net> wrote:
> On 12/19/17 4:56 AM, Magnus Hagander wrote:
>> AFAICT, base backups running on the replication protocol are always
>> reported as "idle" in pg_stat_activity. This seems to have been an
>> oversight in the "include walsender backends in pg_stat_activity" in 10,
>> which does include it for walsenders in general, just not for the ones
>> sending base backups. (and was then improved on later with the "include
>> all non-standard backends" patch).
>>
>> Unlike the regular walsender it also has to set it back to IDLE, since
>> you can actually finish a base backup without disconnecting.
>>
>> PFA a patch that fixes this. I think this is bugfix-for-backpatch, I
>> don't think it has a large risk of breaking things. Thoughts?
>
> +1 for this being a bug, albeit a minor one.

+1 for adding calls to pgstat_report_activity() in WAL senders and
tracking the activity of those processes.  Now I don't think that this
is the correct location as what matters is tracking if replication
commands are running or not, and not only BASE_BACKUP. So those calls
should be instead in exec_replication_command().

>> Also, in setting this, there is no real way to differentiate between a
>> regular walsender and a basebackup walsender, other than looking at the
>> wait events. They're both listed as walsenders. Should there be?  (That
>> might not be as easily backpatchable, so keeping that as a separate one)
>
> Maybe something like "walsender [backup]" or just "basebackup" since
> walsender is pretty misleading?  It think it would be nice to be able to
> tell them apart, though I don't think it should be back-patched.  People
> might be relying on the name in the current versions.

You can already do a join with pg_stat_replication.pid and look for
the WAL sender state which is marked as "backup" in this case. So I am
-1 for making the current code more complicated. Please note as well
that pg_stat_activity.backend_type is set depending on the process
type, not based on what the process is doing so you would need to
invent a new logic.
-- 
Michael


Re: Basebackups reported as idle

From
Magnus Hagander
Date:
On Wed, Dec 20, 2017 at 6:11 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Dec 19, 2017 at 11:50 PM, David Steele <david@pgmasters.net> wrote:
> On 12/19/17 4:56 AM, Magnus Hagander wrote:
>> AFAICT, base backups running on the replication protocol are always
>> reported as "idle" in pg_stat_activity. This seems to have been an
>> oversight in the "include walsender backends in pg_stat_activity" in 10,
>> which does include it for walsenders in general, just not for the ones
>> sending base backups. (and was then improved on later with the "include
>> all non-standard backends" patch).
>>
>> Unlike the regular walsender it also has to set it back to IDLE, since
>> you can actually finish a base backup without disconnecting.
>>
>> PFA a patch that fixes this. I think this is bugfix-for-backpatch, I
>> don't think it has a large risk of breaking things. Thoughts?
>
> +1 for this being a bug, albeit a minor one.

+1 for adding calls to pgstat_report_activity() in WAL senders and
tracking the activity of those processes.  Now I don't think that this
is the correct location as what matters is tracking if replication
commands are running or not, and not only BASE_BACKUP. So those calls
should be instead in exec_replication_command().

Hmm. That does make sense. I guess I got stuck looking at the old code, which is then *also* in the wrong place.

What about the attached?

Also, I noticed that the docs for exec_replication_command() says " * Returns true if the cmd_string was recognized as WalSender command, false
 * if not.". But it doesn't actually do that, it ereport(ERROR):s if it's not. There is one "return false", but it's after an ERROR.

So all other exit points of the procedure will be ERROR, in which case we should set back to idle automatically.

Independent of this change, that function should perhaps be made to return void.


>> Also, in setting this, there is no real way to differentiate between a
>> regular walsender and a basebackup walsender, other than looking at the
>> wait events. They're both listed as walsenders. Should there be?  (That
>> might not be as easily backpatchable, so keeping that as a separate one)
>
> Maybe something like "walsender [backup]" or just "basebackup" since
> walsender is pretty misleading?  It think it would be nice to be able to
> tell them apart, though I don't think it should be back-patched.  People
> might be relying on the name in the current versions.

You can already do a join with pg_stat_replication.pid and look for
the WAL sender state which is marked as "backup" in this case. So I am
-1 for making the current code more complicated. Please note as well
that pg_stat_activity.backend_type is set depending on the process
type, not based on what the process is doing so you would need to
invent a new logic.

Good point. I'm not sure why I didn't think of that.

It's still quite a bit weird that we call this process "walsender" when it does other things as well. But the ship sailed on that many years ago, changing that completely now would not be worth the breakage. 

--
Attachment

Re: Basebackups reported as idle

From
Michael Paquier
Date:
On Wed, Dec 20, 2017 at 6:07 PM, Magnus Hagander <magnus@hagander.net> wrote:
> What about the attached?

The new positions look good to me, still aren't you missing the case
where a SQL command is found and exec_replication_command returns
false? This should be switched to idle as well.

+   /* Report to pgstat that this process is running */
+   pgstat_report_activity(STATE_RUNNING, NULL);
Bonus points if cmd_string is used instead of string? This way, you
can know what is the replication command running ;)

> Also, I noticed that the docs for exec_replication_command() says " *
> Returns true if the cmd_string was recognized as WalSender command, false
>  * if not.". But it doesn't actually do that, it ereport(ERROR):s if it's
> not. There is one "return false", but it's after an ERROR.

WAL senders can connect to a specific database, and if a WAL sender
command is not used in this case this goes through the normal query
processing.

> It's still quite a bit weird that we call this process "walsender" when it
> does other things as well. But the ship sailed on that many years ago,
> changing that completely now would not be worth the breakage.

ps shows walsender as well, that's a reason why "walsender" has been decided.
-- 
Michael


Re: Basebackups reported as idle

From
Magnus Hagander
Date:
On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Dec 20, 2017 at 6:07 PM, Magnus Hagander <magnus@hagander.net> wrote:
> What about the attached?

The new positions look good to me, still aren't you missing the case
where a SQL command is found and exec_replication_command returns
false? This should be switched to idle as well.

Yes. Of course. I can't read. That's the same as the notice below about it not returning false -- I managed to miss the extra if() there, and thought it always exited with ERROR.

:O

 
+   /* Report to pgstat that this process is running */
+   pgstat_report_activity(STATE_RUNNING, NULL);
Bonus points if cmd_string is used instead of string? This way, you
can know what is the replication command running ;)

Do we want that though? That would be a compat break at least, wouldn't it?


> It's still quite a bit weird that we call this process "walsender" when it
> does other things as well. But the ship sailed on that many years ago,
> changing that completely now would not be worth the breakage.

ps shows walsender as well, that's a reason why "walsender" has been decided.

Right. It's just a weird term for what it is. But it's the term that we've always used, so we can't go change it now. 

--

Re: Basebackups reported as idle

From
Alvaro Herrera
Date:
Magnus Hagander wrote:

> PFA a patch that fixes this. I think this is bugfix-for-backpatch, I don't
> think it has a large risk of breaking things. Thoughts?

Agreed.  As long as it doesn't show up as idle-in-transaction afterwards
or something odd like that, it should be okay to backpatch.  (I suppose
there's no way for it to end as being in-transaction anyway.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Basebackups reported as idle

From
Michael Paquier
Date:
On Wed, Dec 20, 2017 at 9:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> Yes. Of course. I can't read. That's the same as the notice below about it
> not returning false -- I managed to miss the extra if() there, and thought
> it always exited with ERROR.

I think that the call to pgstat_report_activity in WalSndLoop() should
be kept as well. There is a small gap between the moment the process
is started and the first replication command is run.

>> +   /* Report to pgstat that this process is running */
>> +   pgstat_report_activity(STATE_RUNNING, NULL);
>> Bonus points if cmd_string is used instead of string? This way, you
>> can know what is the replication command running ;)
>
> Do we want that though? That would be a compat break at least, wouldn't it?

Perhaps not, I found the idea funky but you actually don't want to
show a string in exec_replication_command for a T_SQLCmd node. That's
not complicated to check either. So let's discard such a thing for
now.
-- 
Michael


Re: Basebackups reported as idle

From
Magnus Hagander
Date:


On Thu, Dec 21, 2017 at 1:38 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Dec 20, 2017 at 9:02 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> Yes. Of course. I can't read. That's the same as the notice below about it
> not returning false -- I managed to miss the extra if() there, and thought
> it always exited with ERROR.

I think that the call to pgstat_report_activity in WalSndLoop() should
be kept as well. There is a small gap between the moment the process
is started and the first replication command is run.

Eh. But WalSndLoop() is called *after* exec_replication_command(), isn't it? exec_replication_command() is called from PostgresMain(), and then calls WalSndLoop().

So I agree there is a small gap, but actually moving it to exec_replication_command() makes that gap smaller than it was before, no?

 

>> +   /* Report to pgstat that this process is running */
>> +   pgstat_report_activity(STATE_RUNNING, NULL);
>> Bonus points if cmd_string is used instead of string? This way, you
>> can know what is the replication command running ;)
>
> Do we want that though? That would be a compat break at least, wouldn't it?

Perhaps not, I found the idea funky but you actually don't want to
show a string in exec_replication_command for a T_SQLCmd node. That's
not complicated to check either. So let's discard such a thing for
now.

Agreed :) 


--

Re: Basebackups reported as idle

From
Michael Paquier
Date:
On Thu, Dec 21, 2017 at 02:46:15PM +0100, Magnus Hagander wrote:
> On Thu, Dec 21, 2017 at 1:38 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> I think that the call to pgstat_report_activity in WalSndLoop() should
>> be kept as well. There is a small gap between the moment the process
>> is started and the first replication command is run.
>>
>
> Eh. But WalSndLoop() is called *after* exec_replication_command(), isn't
> it? exec_replication_command() is called from PostgresMain(), and then
> calls WalSndLoop().
>
> So I agree there is a small gap, but actually moving it to
> exec_replication_command() makes that gap smaller than it was before, no?

My turn to read things wrong then, thinking that WalSndLoop() was the
main routine used for starting the WAL sender process. You are right removing
the call there is adapted.

Could you update the patch?
--
Michael

Attachment

Re: Basebackups reported as idle

From
Magnus Hagander
Date:


On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Dec 21, 2017 at 02:46:15PM +0100, Magnus Hagander wrote:
> On Thu, Dec 21, 2017 at 1:38 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> I think that the call to pgstat_report_activity in WalSndLoop() should
>> be kept as well. There is a small gap between the moment the process
>> is started and the first replication command is run.
>>
>
> Eh. But WalSndLoop() is called *after* exec_replication_command(), isn't
> it? exec_replication_command() is called from PostgresMain(), and then
> calls WalSndLoop().
>
> So I agree there is a small gap, but actually moving it to
> exec_replication_command() makes that gap smaller than it was before, no?

My turn to read things wrong then, thinking that WalSndLoop() was the
main routine used for starting the WAL sender process. You are right removing
the call there is adapted.

Could you update the patch?

I thought I had, but I can see now that email was a figment of my imagination :)

PFA an actual patch.
 
--
Attachment

Re: Basebackups reported as idle

From
Michael Paquier
Date:
On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:
> On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> Could you update the patch?
>
> I thought I had, but I can see now that email was a figment of my
> imagination :)

I'll take that as a fragment instead. The patch as proposed looks good
to me. Thanks for sending an updated version.
--
Michael

Attachment

Re: Basebackups reported as idle

From
David Steele
Date:
On 12/29/17 6:49 AM, Michael Paquier wrote:
> On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:
>> On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>> Could you update the patch?
>>
>> I thought I had, but I can see now that email was a figment of my
>> imagination :)
>
> I'll take that as a fragment instead. The patch as proposed looks good
> to me. Thanks for sending an updated version.

Looks good to me as well.

--
-David
david@pgmasters.net


Attachment

Re: Basebackups reported as idle

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:

> > I thought I had, but I can see now that email was a figment of my
> > imagination :)
> 
> I'll take that as a fragment instead.

Not at all ... https://www.merriam-webster.com/dictionary/figment
"something made up or contrived"

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Basebackups reported as idle

From
Magnus Hagander
Date:


On Fri, Dec 29, 2017 at 3:11 PM, David Steele <david@pgmasters.net> wrote:
On 12/29/17 6:49 AM, Michael Paquier wrote:
> On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:
>> On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>> Could you update the patch?
>>
>> I thought I had, but I can see now that email was a figment of my
>> imagination :)
>
> I'll take that as a fragment instead. The patch as proposed looks good
> to me. Thanks for sending an updated version.

Looks good to me as well.

Thanks to both of you for the reviews. Backpatched and pushed! 

--