Thread: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

[HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Michael Banck

Date:

11 February 2017, 15:38:09

Hi,

one take-away from the Gitlab Post-Mortem[1] appears to be that after
their secondary lost replication, they were confused about what
pg_basebackup was doing when they tried to rebuild it. It just sat there
and did nothing (even with --verbose), so they assumed something was
wrong with either the primary or the connection, and restarted it
several times.

AFAICT, it turns out the checkpoint was written on the master (they
probably did not use -c fast), but this wasn't obvious to them:

"One of the engineers went to the secondary and wiped the data
directory, then ran pg_basebackup. Unfortunately pg_basebackup would
hang, producing no meaningful output, despite the --verbose option being
set."

[...]

"Unfortunately this did not resolve the problem of pg_basebackup not
starting replication immediately. One of the engineers decided to run it
with strace to see what it was blocking on. strace showed that
pg_basebackup was hanging in a poll call, but that did not provide any
other meaningful information that might have explained why."

[...]

"It would later be revealed by another engineer (who wasn't around at
the time) that this is normal behavior: pg_basebackup will wait for the
primary to start sending over replication data and it will sit and wait
silently until that time. Unfortunately this was not clearly documented
in our engineering runbooks nor in the official pg_basebackup document."

ISTM that even with WAL streaming, nothing would be written on the
client server until the checkpoint is complete, as do_pg_start_backup()
runs the checkpoint and only returns the starting WAL location
afterwards.

The attached (untested) patch is to kick of a discussion on how to
improve the situation, it is supposed to mention the checkpoint when
--verbose is used and adds a paragraph about the checkpoint being run to
the Notes section of the documentation.


Michael

[1]https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pg_basebackup.patch

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

11 February 2017, 16:07:59

On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck <michael.banck@credativ.de> wrote:

Hi,

one take-away from the Gitlab Post-Mortem[1] appears to be that after
their secondary lost replication, they were confused about what
pg_basebackup was doing when they tried to rebuild it. It just sat there
and did nothing (even with --verbose), so they assumed something was
wrong with either the primary or the connection, and restarted it
several times.

AFAICT, it turns out the checkpoint was written on the master (they
probably did not use -c fast), but this wasn't obvious to them:

Yeah, I've seen this happen to a number of people. I think that sounds like what's happened here as well. I've considered things in the line of the patch you posted, but never got around to actually doing anything about it.

ISTM that even with WAL streaming, nothing would be written on the
client server until the checkpoint is complete, as do_pg_start_backup()
runs the checkpoint and only returns the starting WAL location
afterwards.

The attached (untested) patch is to kick of a discussion on how to
improve the situation, it is supposed to mention the checkpoint when
--verbose is used and adds a paragraph about the checkpoint being run to
the Notes section of the documentation.

Docs look good to me, other than claiming that pg_basebackup runs on a server (it can run anywhere). I would just say "during which pg_basebackup will appear idle". How does that sound to you?

As for the code, while I haven't tested it, isn't the "checkpoint completed" message in the wrong place? Doesn't PQsendQuery() complete immediately, and the check needs to be put *after* the PQgetResult() call?

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Michael Banck

Date:

11 February 2017, 16:25:20

Hi,


Am Samstag, den 11.02.2017, 11:07 +0100 schrieb Magnus Hagander:


> As for the code, while I haven't tested it, isn't the "checkpoint
> completed" message in the wrong place? Doesn't PQsendQuery() complete
> immediately, and the check needs to be put *after* the PQgetResult()
> call?

I guess you're right, I've moved it further down. There is in fact a
message about the xlog location (unless you switch off wal entirely),
but having another one right before that mentioning the completed
checkpoint sounds ok to me.

There's also some inconsistencies around which messages are prepended
with "pg_basebackup: " and which are translatable; I guess all messages
printed on --verbose should be translatable? Also, as almost all
messages have a "pg_basebackup: " prefix, I've added it to the rest.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pg_basebackup_v2.patch

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Michael Banck

Date:

11 February 2017, 16:36:24

Hi,

Am Samstag, den 11.02.2017, 11:25 +0100 schrieb Michael Banck:
> Am Samstag, den 11.02.2017, 11:07 +0100 schrieb Magnus Hagander:
> > As for the code, while I haven't tested it, isn't the "checkpoint
> > completed" message in the wrong place? Doesn't PQsendQuery() complete
> > immediately, and the check needs to be put *after* the PQgetResult()
> > call?
> 
> I guess you're right, I've moved it further down. There is in fact a
> message about the xlog location (unless you switch off wal entirely),
> but having another one right before that mentioning the completed
> checkpoint sounds ok to me.
> 
> There's also some inconsistencies around which messages are prepended
> with "pg_basebackup: " and which are translatable; I guess all messages
> printed on --verbose should be translatable? Also, as almost all
> messages have a "pg_basebackup: " prefix, I've added it to the rest.

Sorry, there were two typoes in the last patch, I've attached a fixed
one.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pg_basebackup_v3.patch

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Jim Nasby

Date:

13 February 2017, 08:29:25

On 2/11/17 4:36 AM, Michael Banck wrote:
> I guess you're right, I've moved it further down. There is in fact a
> message about the xlog location (unless you switch off wal entirely),
> but having another one right before that mentioning the completed
> checkpoint sounds ok to me.

1) I don't think this should be verbose output. Having a program sit 
there "doing nothing" for no apparent reason is just horrible UI design.

2) I think it'd be useful to have a way to get the status of a running 
checkpoint. The checkpointer already has that info, and I think it might 
even be in shared memory already. If there was a function that reported 
checkpoint status pg_basebackup could poll that to provide users with 
live status. That should be a separate patch though.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

13 February 2017, 14:31:03

On Mon, Feb 13, 2017 at 3:29 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 2/11/17 4:36 AM, Michael Banck wrote:
I guess you're right, I've moved it further down. There is in fact a
message about the xlog location (unless you switch off wal entirely),
but having another one right before that mentioning the completed
checkpoint sounds ok to me.

1) I don't think this should be verbose output. Having a program sit there "doing nothing" for no apparent reason is just horrible UI design.

That would include much of Unix then.. For example if I run "cp" on a large file it sits around "doing nothing". Same if I do "tar". No?

2) I think it'd be useful to have a way to get the status of a running checkpoint. The checkpointer already has that info, and I think it might even be in shared memory already. If there was a function that reported checkpoint status pg_basebackup could poll that to provide users with live status. That should be a separate patch though.

I agree that this would definitely be useful. But it might be something that's better exposed as a server-side view?

(and if pg_basebackup could poll it it would probably still not be included by default -- only if -P was given).

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Michael Banck

Date:

13 February 2017, 15:33:47

Hi,

Am Montag, den 13.02.2017, 09:31 +0100 schrieb Magnus Hagander:
> On Mon, Feb 13, 2017 at 3:29 AM, Jim Nasby <Jim.Nasby@bluetreble.com>
> wrote:
>         On 2/11/17 4:36 AM, Michael Banck wrote:
>                 I guess you're right, I've moved it further down.
>                 There is in fact a
>                 message about the xlog location (unless you switch off
>                 wal entirely),
>                 but having another one right before that mentioning
>                 the completed
>                 checkpoint sounds ok to me.
>         
>         1) I don't think this should be verbose output. Having a
>         program sit there "doing nothing" for no apparent reason is
>         just horrible UI design.
> 
> 
> That would include much of Unix then.. For example if I run "cp" on a
> large file it sits around "doing nothing". Same if I do "tar". No?

The expectation for all three commands is that, even if there is no
output on stdout, they will write data to the local machine. So you can
easily monitor the progress of cp and tar by running du or something in
a different terminal.

With pg_basebackup, nothing is happening on the local machine until the
checkpoint on the remote is finished; while this is obvious to somebody
familiar with how basebackups work internally, it appears to be not
clear at all to some users.

So I think notifying the user that something is happening remotely while
the local process waits would be useful, but on the other hand,
pg_basebackup does not print anything unless (i) --verbose is set or
(ii) there is an error, so I think having it mention the checkpoint in
--verbose mode only is acceptable.

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

14 February 2017, 23:06:48

On Mon, Feb 13, 2017 at 10:33 AM, Michael Banck <michael.banck@credativ.de> wrote:

Hi,

Am Montag, den 13.02.2017, 09:31 +0100 schrieb Magnus Hagander:
> On Mon, Feb 13, 2017 at 3:29 AM, Jim Nasby <Jim.Nasby@bluetreble.com>
> wrote:
> On 2/11/17 4:36 AM, Michael Banck wrote:
> I guess you're right, I've moved it further down.
> There is in fact a
> message about the xlog location (unless you switch off
> wal entirely),
> but having another one right before that mentioning
> the completed
> checkpoint sounds ok to me.
>
> 1) I don't think this should be verbose output. Having a
> program sit there "doing nothing" for no apparent reason is
> just horrible UI design.
>
>
> That would include much of Unix then.. For example if I run "cp" on a
> large file it sits around "doing nothing". Same if I do "tar". No?

The expectation for all three commands is that, even if there is no
output on stdout, they will write data to the local machine. So you can
easily monitor the progress of cp and tar by running du or something in
a different terminal.

With pg_basebackup, nothing is happening on the local machine until the
checkpoint on the remote is finished; while this is obvious to somebody
familiar with how basebackups work internally, it appears to be not
clear at all to some users.

True.

However, outputing this info by default will make it show up in things like everybodys cronjobs by default. Right now a successful pg_basebackup run will come out with no output at all, which is how most Unix commands work, and brings it's own advantages. If we change that people will have to send all the output to /dev/null, resulting in missing the things that are actually important in any regard.

So I think notifying the user that something is happening remotely while
the local process waits would be useful, but on the other hand,
pg_basebackup does not print anything unless (i) --verbose is set or
(ii) there is an error, so I think having it mention the checkpoint in
--verbose mode only is acceptable.

Yeah, that's my view as well. I'm all for including it in verbose mode.

*Iff* we can get a progress indicator through the checkpoint we could include that in --progress mode. But that's a different patch, of course, but it shouldn't be included in the default output even if we find it.

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Robert Haas

Date:

15 February 2017, 00:41:33

On Tue, Feb 14, 2017 at 12:06 PM, Magnus Hagander <magnus@hagander.net> wrote:
> However, outputing this info by default will make it show up in things like
> everybodys cronjobs by default. Right now a successful pg_basebackup run
> will come out with no output at all, which is how most Unix commands work,
> and brings it's own advantages. If we change that people will have to send
> all the output to /dev/null, resulting in missing the things that are
> actually important in any regard.

I agree with that.  I think having this show up in verbose mode is a
really good idea - when something just hangs, users don't know what's
going on, and that's bad.  But showing it all the time seems like a
bridge too far.  As the postmortem linked above shows, people will
think of things like "hey, let's try --verbose mode" when the obvious
thing doesn't work.  What is really irritating to them is when
--verbose mode fails to be, uh, verbose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Alvaro Herrera

Date:

15 February 2017, 03:06:51

Robert Haas wrote:
> On Tue, Feb 14, 2017 at 12:06 PM, Magnus Hagander <magnus@hagander.net> wrote:
> > However, outputing this info by default will make it show up in things like
> > everybodys cronjobs by default. Right now a successful pg_basebackup run
> > will come out with no output at all, which is how most Unix commands work,
> > and brings it's own advantages. If we change that people will have to send
> > all the output to /dev/null, resulting in missing the things that are
> > actually important in any regard.
> 
> I agree with that.  I think having this show up in verbose mode is a
> really good idea - when something just hangs, users don't know what's
> going on, and that's bad.  But showing it all the time seems like a
> bridge too far.  As the postmortem linked above shows, people will
> think of things like "hey, let's try --verbose mode" when the obvious
> thing doesn't work.  What is really irritating to them is when
> --verbose mode fails to be, uh, verbose.

I'd rather have a --quiet mode instead.  If you're running it by hand,
you're likely to omit the switch, whereas when writing the cron job
you're going to notice lack of switch even before you let the job run
once. 

I think progress reporting ought to go to stderr anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Robert Haas

Date:

15 February 2017, 05:18:24

On Tue, Feb 14, 2017 at 4:06 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I'd rather have a --quiet mode instead.  If you're running it by hand,
> you're likely to omit the switch, whereas when writing the cron job
> you're going to notice lack of switch even before you let the job run
> once.

Well, that might've been a better way to design it, but changing it
now would break backward compatibility and I'm not really sure that's
a good idea.  Even if it is, it's a separate concern from whether or
not in the less-quiet mode we should point out that we're waiting for
a checkpoint on the server side.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Jeff Janes

Date:

16 February 2017, 05:05:28

On Tue, Feb 14, 2017 at 9:06 AM, Magnus Hagander <magnus@hagander.net> wrote:

Yeah, that's my view as well. I'm all for including it in verbose mode.

*Iff* we can get a progress indicator through the checkpoint we could include that in --progress mode. But that's a different patch, of course, but it shouldn't be included in the default output even if we find it.

I think it should show up in --progress mode. It would be great if we could show fine-grained progress reports on the checkpoint, but if we can't do that we should still report as fine as we are able to, which is that a checkpoint is in progress. Otherwise we are setting the perfect as the enemy of the good.

Cheers,

Jeff

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Jim Nasby

Date:

18 February 2017, 01:17:02

On 2/14/17 5:18 PM, Robert Haas wrote:
> On Tue, Feb 14, 2017 at 4:06 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> I'd rather have a --quiet mode instead.  If you're running it by hand,
>> you're likely to omit the switch, whereas when writing the cron job
>> you're going to notice lack of switch even before you let the job run
>> once.
>
> Well, that might've been a better way to design it, but changing it
> now would break backward compatibility and I'm not really sure that's

Meh... it's really only going to affect cronjobs or scripts, which are 
easy enough to fix, and you're not going to have that many of them (or 
if you do you certainly have an automated way to push the update).

> a good idea.  Even if it is, it's a separate concern from whether or
> not in the less-quiet mode we should point out that we're waiting for
> a checkpoint on the server side.

Well, --quite was suggested because of confusion from pg_basebackup 
twiddling it's thumbs...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Tomas Vondra

Date:

18 February 2017, 05:22:20

On 02/17/2017 08:17 PM, Jim Nasby wrote:> On 2/14/17 5:18 PM, Robert Haas wrote:>> On Tue, Feb 14, 2017 at 4:06 PM,
AlvaroHerrera>> <alvherre@2ndquadrant.com> wrote:>>> I'd rather have a --quiet mode instead.  If you're running it by
hand,>>>you're likely to omit the switch, whereas when writing the cron job>>> you're going to notice lack of switch
evenbefore you let the job run>>> once.>>>> Well, that might've been a better way to design it, but changing it>> now
wouldbreak backward compatibility and I'm not really sure that's>> Meh... it's really only going to affect cronjobs or
scripts,which are> easy enough to fix, and you're not going to have that many of them (or> if you do you certainly have
anautomated way to push the update).>
 

I think you're underestimating the breakage and overestimating how easy 
it's going to be to it. It's true we'd only change this in a major 
version, so people should assume possible breakage and test.
>> a good idea.  Even if it is, it's a separate concern from whether or>> not in the less-quiet mode we should point
outthat we're waiting for>> a checkpoint on the server side.>> Well, --quite was suggested because of confusion from
pg_basebackup>twiddling it's thumbs...
 

I'm in favor of the '--verbose' route. People are used to that when 
investigating issues, and it does not break existing cron jobs. I can 
live with --quiet though, as long as we don't resort to some craziness 
along the lines "if there's tty be verbose, otherwise be quiet".

I have my doubts about this actually addressing gitlab-like mistakes, 
though, because it's a helluva jump from "It's waiting and not doing 
anything," to "We need to remove the datadir." (One of the reasons being 
that non-empty directory is a local issue, and there's no reason why the 
tool should wait instead of just reporting an error.)

FWIW before messing with the pg_basebackup code, perhaps we should 
improve the documentation and explain clearly the meaning of 'fast' and 
'spread' checkpoint modes. Right now, pg_basebackup docs only say this:
  Sets checkpoint mode to fast or spread (default) (see Section 24.3.3).

which is pretty damn useless, when you're investigating an issue. And 
the referenced section (Making a Base Backup Using the Low Level API) 
does not clearly explain how this maps to pg_start_backup(_,?).

What about adding a paragraph into pg_basebackup docs, explaining that 
with 'fast' it does immediate checkpoint, while with 'spread' it'll wait 
for a spread checkpoint.

regards

-- Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

"David G. Johnston"

Date:

18 February 2017, 06:43:27

On Fri, Feb 17, 2017 at 4:22 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

What about adding a paragraph into pg_basebackup docs, explaining that with 'fast' it does immediate checkpoint, while with 'spread' it'll wait for a spread checkpoint.

I agree that a better, and self-contained, explanation of the behaviors that fast and spread invoke on the server should be included directly in the pg_basebackup docs.

Additionally, a primary benefit of pg_basebackup is hiding the low-level details from the user and in that spirit the cross-reference link to Section 25.3.3 "Making a Base Backup Using the Low Level API" should be removed. If there is specific information there that a user of pg_basebackup needs it should be presented properly in the application documentation.

The top of pg_basebackup points to the entire 25.3 chapter but the flow from there is solid - coverage of pg_basebackup occurs and points out the low level API for those whose needs are not fully served by the bundled application. If one uses pg_basebackup they should be able to stop at that point, go back to the app page, and continue reading and skip all of 25.3.3

The term "spread checkpoint" isn't actually a defined term in our docs...and aside from the word spread itself describing out a checkpoint works, it isn't used outside of pg_basebackup docs. So "it will wait for a spread checkpoint" doesn't really work - "it will start and then wait for a normal checkpoint to complete" does.

More holistically (i.e., feel free to skip)

This paragraph from 25.3.3:

"""

This is because it performs a checkpoint, and the I/O required for the checkpoint will be spread out over a significant period of time, by default half your inter-checkpoint interval (see the configuration parameter checkpoint_completion_target). This is usually what you want, because it minimizes the impact on query processing. If you want to start the backup as soon as possible, change the second parameter to true.

"""

is good but buried and seems like it would be more visible in Chapter 30. Reliability and the Write-Ahead Log. To there both the internals and backbackup pages could point the reader. There isn't a chapter dedicated to checkpoints - nor does there need to be - but a section in 30 seems warranted as being the official reference. Right now you have to skim the configuration variables and "WAL Configuration" and "CHECKPOINT" and "base backup API and pg_basebackup" to cover everything. A checkpoint chapter with that paragraph as a focus would allow the other items to simply say "immediate or normal checkpoint" as needed and redirect the reader for additional context as to the trade-offs of each - whether done manually or during some form of backup script.

David J.

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Robert Haas

Date:

19 February 2017, 16:21:06

On Sat, Feb 18, 2017 at 4:52 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I have my doubts about this actually addressing gitlab-like mistakes,
> though, because it's a helluva jump from "It's waiting and not doing
> anything," to "We need to remove the datadir." (One of the reasons being
> that non-empty directory is a local issue, and there's no reason why the
> tool should wait instead of just reporting an error.)

It's pretty clear that the gitlab postmortem involves multiple people
making multiple serious errors, including failing to test that the
ostensible backups could actually be restored.  I was taught that rule
#1 as far as backups are concerned is to test that you can restore
them, so that seems like a big miss.  However, I don't think the fact
they made other mistakes is a reason not to improve the things we can
improve and, certainly, having some way for pg_basebackup to tell you
that it's waiting for the master to checkpoint will help the next
person who is confused by that particular thing.  That person may go
on to be confused by something else, but then again maybe not.
Improving the reporting in this case stands on its own merits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Michael Banck

Date:

27 February 2017, 01:27:03

Hi,

Am Dienstag, den 14.02.2017, 18:18 -0500 schrieb Robert Haas:
> On Tue, Feb 14, 2017 at 4:06 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > I'd rather have a --quiet mode instead.  If you're running it by hand,
> > you're likely to omit the switch, whereas when writing the cron job
> > you're going to notice lack of switch even before you let the job run
> > once.
> 
> Well, that might've been a better way to design it, but changing it
> now would break backward compatibility and I'm not really sure that's
> a good idea.  Even if it is, it's a separate concern from whether or
> not in the less-quiet mode we should point out that we're waiting for
> a checkpoint on the server side.

ISTM the consensus is that there should be no output in regular mode,
but a message should be displayed in verbose and progress mode.

So I went forth and also added a message in progress mode (unless
verbose messages are requested anyway).

Regarding the documentation, I tried to clarify the difference between
the checkpoint types a bit more, but I think any further action is
probably a larger rewrite of the documentation on this topic.

So attached are two patches, I've split it up in the documentation and
the code output part. I'll add it as one commitfest entry in the
"Clients" section though, as it's not really a big patch, unless
somebody thinks it should have a secon entry in "Documentation"?

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

27 February 2017, 02:32:30

On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck <michael.banck@credativ.de> wrote:

Hi,

Am Dienstag, den 14.02.2017, 18:18 -0500 schrieb Robert Haas:
> On Tue, Feb 14, 2017 at 4:06 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > I'd rather have a --quiet mode instead. If you're running it by hand,
> > you're likely to omit the switch, whereas when writing the cron job
> > you're going to notice lack of switch even before you let the job run
> > once.
>
> Well, that might've been a better way to design it, but changing it
> now would break backward compatibility and I'm not really sure that's
> a good idea. Even if it is, it's a separate concern from whether or
> not in the less-quiet mode we should point out that we're waiting for
> a checkpoint on the server side.

ISTM the consensus is that there should be no output in regular mode,
but a message should be displayed in verbose and progress mode.

So I went forth and also added a message in progress mode (unless
verbose messages are requested anyway).

Regarding the documentation, I tried to clarify the difference between
the checkpoint types a bit more, but I think any further action is
probably a larger rewrite of the documentation on this topic.

So attached are two patches, I've split it up in the documentation and
the code output part. I'll add it as one commitfest entry in the
"Clients" section though, as it's not really a big patch, unless
somebody thinks it should have a secon entry in "Documentation"?

Agreed, and applied as one patch. Except I noticed you also fixed a couple of entries which were missing the progname in the messages -- I broke those out to a separate patch instead.

Made a small change to "using as much I/O as available" rather than "as possible", which I think is a better wording, along with the change of the idle wording I suggested before. (but feel free to point it out to me if that's wrong).

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Michael Banck

Date:

27 February 2017, 02:53:46

Hi,

Am Sonntag, den 26.02.2017, 21:32 +0100 schrieb Magnus Hagander:

> On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck
> <michael.banck@credativ.de> wrote:

> Agreed, and applied as one patch. Except I noticed you also fixed a
> couple of entries which were missing the progname in the messages -- I
> broke those out to a separate patch instead.

Thanks!

> Made a small change to "using as much I/O as available" rather than
> "as possible", which I think is a better wording, along with the
> change of the idle wording I suggested before. (but feel free to point
> it out to me if that's wrong). 

LGTM, I apparently missed your suggestion when I re-read the thread.

I am just wondering whether this could/should be back-patched, maybe? It
is not a bug fix, of course, but OTOH is rather small and probably
helpful to some users on current releases.

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

27 February 2017, 02:55:18

On Sun, Feb 26, 2017 at 9:53 PM, Michael Banck <michael.banck@credativ.de> wrote:

Hi,

Am Sonntag, den 26.02.2017, 21:32 +0100 schrieb Magnus Hagander:

> On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck
> <michael.banck@credativ.de> wrote:

> Agreed, and applied as one patch. Except I noticed you also fixed a
> couple of entries which were missing the progname in the messages -- I
> broke those out to a separate patch instead.

Thanks!

> Made a small change to "using as much I/O as available" rather than
> "as possible", which I think is a better wording, along with the
> change of the idle wording I suggested before. (but feel free to point
> it out to me if that's wrong).

LGTM, I apparently missed your suggestion when I re-read the thread.

I am just wondering whether this could/should be back-patched, maybe? It
is not a bug fix, of course, but OTOH is rather small and probably
helpful to some users on current releases.

Good point. We should definitely back-patch the documentation updates.

Not 100% sure about the others, as it's a small behaviour change. But since it's only in verbose mode, I doubt it is very likely to break anybodys scripts relying on certain output or so.

What do others think?

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Tom Lane

Date:

27 February 2017, 02:59:24

Magnus Hagander <magnus@hagander.net> writes:
> On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck <michael.banck@credativ.de>
> wrote:
>> ISTM the consensus is that there should be no output in regular mode,
>> but a message should be displayed in verbose and progress mode.

> Agreed, and applied as one patch.

Is there an argument for back-patching this?
        regards, tom lane

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

27 February 2017, 21:20:47

On Sun, Feb 26, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:
> On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck <michael.banck@credativ.de>
> wrote:
>> ISTM the consensus is that there should be no output in regular mode,
>> but a message should be displayed in verbose and progress mode.

> Agreed, and applied as one patch.

Is there an argument for back-patching this?

Seems you were typing that at the same time as we did.

I'm considering it, but not swayed in either direction. Should I take your comment as a vote that we should back-patch it?

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Tom Lane

Date:

27 February 2017, 21:22:43

Magnus Hagander <magnus@hagander.net> writes:
> On Sun, Feb 26, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Is there an argument for back-patching this?

> I'm considering it, but not swayed in either direction. Should I take your
> comment as a vote that we should back-patch it?

Yeah, I'd vote for it.
        regards, tom lane

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Simon Riggs

Date:

27 February 2017, 23:51:18

On 26 February 2017 at 20:55, Magnus Hagander <magnus@hagander.net> wrote:

> What do others think?

Changing the output behaviour of a command isn't something we usually
do as a backpatch.

This change doesn't affect the default behaviour so probably wouldn't
make a difference to the outcome of the situation that generated this
thread.

Having said that, if it helps others to avoid mistakes in the future
then its worth doing, so +1 to backpatch.

I've looked into changing the actual underlying behaviour and I don't
think its feasible, so making this change will at least allow some
responsiveness from us. Thanks Michael, Magnus.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Jeff Janes

Date:

28 February 2017, 00:46:18

On Sun, Feb 26, 2017 at 12:32 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck <michael.banck@credativ.de> wrote:
Hi,

Am Dienstag, den 14.02.2017, 18:18 -0500 schrieb Robert Haas:
> On Tue, Feb 14, 2017 at 4:06 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > I'd rather have a --quiet mode instead. If you're running it by hand,
> > you're likely to omit the switch, whereas when writing the cron job
> > you're going to notice lack of switch even before you let the job run
> > once.
>
> Well, that might've been a better way to design it, but changing it
> now would break backward compatibility and I'm not really sure that's
> a good idea. Even if it is, it's a separate concern from whether or
> not in the less-quiet mode we should point out that we're waiting for
> a checkpoint on the server side.

ISTM the consensus is that there should be no output in regular mode,
but a message should be displayed in verbose and progress mode.

So I went forth and also added a message in progress mode (unless
verbose messages are requested anyway).

Regarding the documentation, I tried to clarify the difference between
the checkpoint types a bit more, but I think any further action is
probably a larger rewrite of the documentation on this topic.

So attached are two patches, I've split it up in the documentation and
the code output part. I'll add it as one commitfest entry in the
"Clients" section though, as it's not really a big patch, unless
somebody thinks it should have a secon entry in "Documentation"?

Agreed, and applied as one patch. Except I noticed you also fixed a couple of entries which were missing the progname in the messages -- I broke those out to a separate patch instead.

Made a small change to "using as much I/O as available" rather than "as possible", which I think is a better wording, along with the change of the idle wording I suggested before. (but feel free to point it out to me if that's wrong).

Should the below fprintf end in a \r rather than a \n, so that the the progress message gets over-written once the checkpoint is done and we have moved on?

if (showprogress && !verbose)

fprintf(stderr, "waiting for checkpoint\n");

That would seem more in keeping with how the other progress messages operate.

Cheers,

Jeff

Re: gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Michael Banck

Date:

29 March 2017, 17:05:54

Hi,

Am Montag, den 27.02.2017, 16:20 +0100 schrieb Magnus Hagander:
> On Sun, Feb 26, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>         Is there an argument for back-patching this?
> 
> 
> Seems you were typing that at the same time as we did.
> 
> 
> I'm considering it, but not swayed in either direction. Should I take
> your comment as a vote that we should back-patch it? 

I've checked back into this thread, and there seems to be a +1 from Tom
and a +(0.5-1) from Simon for backpatching, and no obvious -1s. Did you
decide against it in the end, or is this still an open item?

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Re: gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

31 March 2017, 12:59:18

On Wed, Mar 29, 2017 at 1:05 PM, Michael Banck <michael.banck@credativ.de> wrote:

Hi,

Am Montag, den 27.02.2017, 16:20 +0100 schrieb Magnus Hagander:
> On Sun, Feb 26, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Is there an argument for back-patching this?
>
>
> Seems you were typing that at the same time as we did.
>
>
> I'm considering it, but not swayed in either direction. Should I take
> your comment as a vote that we should back-patch it?

I've checked back into this thread, and there seems to be a +1 from Tom
and a +(0.5-1) from Simon for backpatching, and no obvious -1s. Did you
decide against it in the end, or is this still an open item?

No, I plan to work on it, so it's still an open item. I've been backlogged with other things, but I will try to get too it soon.

(This also includes considering Jeff's note)

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

01 April 2017, 21:12:11

On Mon, Feb 27, 2017 at 7:46 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sun, Feb 26, 2017 at 12:32 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Sun, Feb 26, 2017 at 8:27 PM, Michael Banck <michael.banck@credativ.de> wrote:
Hi,

Am Dienstag, den 14.02.2017, 18:18 -0500 schrieb Robert Haas:
> On Tue, Feb 14, 2017 at 4:06 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > I'd rather have a --quiet mode instead. If you're running it by hand,
> > you're likely to omit the switch, whereas when writing the cron job
> > you're going to notice lack of switch even before you let the job run
> > once.
>
> Well, that might've been a better way to design it, but changing it
> now would break backward compatibility and I'm not really sure that's
> a good idea. Even if it is, it's a separate concern from whether or
> not in the less-quiet mode we should point out that we're waiting for
> a checkpoint on the server side.

ISTM the consensus is that there should be no output in regular mode,
but a message should be displayed in verbose and progress mode.

So I went forth and also added a message in progress mode (unless
verbose messages are requested anyway).

Regarding the documentation, I tried to clarify the difference between
the checkpoint types a bit more, but I think any further action is
probably a larger rewrite of the documentation on this topic.

So attached are two patches, I've split it up in the documentation and
the code output part. I'll add it as one commitfest entry in the
"Clients" section though, as it's not really a big patch, unless
somebody thinks it should have a secon entry in "Documentation"?

Agreed, and applied as one patch. Except I noticed you also fixed a couple of entries which were missing the progname in the messages -- I broke those out to a separate patch instead.

Made a small change to "using as much I/O as available" rather than "as possible", which I think is a better wording, along with the change of the idle wording I suggested before. (but feel free to point it out to me if that's wrong).

Should the below fprintf end in a \r rather than a \n, so that the the progress message gets over-written once the checkpoint is done and we have moved on?

if (showprogress && !verbose)
fprintf(stderr, "waiting for checkpoint\n");

That would seem more in keeping with how the other progress messages operate.

Agreed, that makes more sense. I've pushed a patch that does this.

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: gitlab post-mortem: pg_basebackup waiting for checkpoint

From

Magnus Hagander

Date:

01 April 2017, 21:29:42

On Fri, Mar 31, 2017 at 8:59 AM, Magnus Hagander <magnus@hagander.net> wrote:

On Wed, Mar 29, 2017 at 1:05 PM, Michael Banck <michael.banck@credativ.de> wrote:
Hi,

Am Montag, den 27.02.2017, 16:20 +0100 schrieb Magnus Hagander:
> On Sun, Feb 26, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Is there an argument for back-patching this?
>
>
> Seems you were typing that at the same time as we did.
>
>
> I'm considering it, but not swayed in either direction. Should I take
> your comment as a vote that we should back-patch it?

I've checked back into this thread, and there seems to be a +1 from Tom
and a +(0.5-1) from Simon for backpatching, and no obvious -1s. Did you
decide against it in the end, or is this still an open item?

No, I plan to work on it, so it's still an open item. I've been backlogged with other things, but I will try to get too it soon.

(This also includes considering Jeff's note)

I've applied a backpatch to 9.4. Prior to that pretty much the entire patch is a conflict, so it would need a full rewrite.

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: gitlab post-mortem: pg_basebackup waiting forcheckpoint

From

Michael Banck

Date:

03 April 2017, 13:17:44

Am Samstag, den 01.04.2017, 17:29 +0200 schrieb Magnus Hagander:

> I've applied a backpatch to 9.4. Prior to that pretty much the entire
> patch is a conflict, so it would need a full rewrite.

Thanks!


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer