Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint
Date
Msg-id CABUevExpVYuLUgoNgYNNHxFmZqo3PuuaKgcVwYE5B5wCGScZkQ@mail.gmail.com
In response to [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint  (Michael Banck <michael.banck@credativ.de>)
Responses Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint
List pgsql-hackers


On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck <michael.banck@credativ.de> wrote:
Hi,

one take-away from the Gitlab Post-Mortem[1] appears to be that after
their secondary lost replication, they were confused about what
pg_basebackup was doing when they tried to rebuild it. It just sat there
and did nothing (even with --verbose), so they assumed something was
wrong with either the primary or the connection, and restarted it
several times.

AFAICT, it turns out the checkpoint was written on the master (they
probably did not use -c fast), but this wasn't obvious to them:


Yeah, I've seen this happen to a number of people, and it sounds like that's what happened here as well. I've considered something along the lines of the patch you posted, but never got around to actually doing anything about it.



ISTM that even with WAL streaming, nothing would be written on the
client server until the checkpoint is complete, as do_pg_start_backup()
runs the checkpoint and only returns the starting WAL location
afterwards.

The attached (untested) patch is to kick off a discussion on how to
improve the situation. It is supposed to mention the checkpoint when
--verbose is used, and it adds a paragraph about the checkpoint being run
to the Notes section of the documentation.


Docs look good to me, other than claiming that pg_basebackup runs on a server (it can run anywhere). I would just say "during which pg_basebackup will appear idle". How does that sound to you?

As for the code, while I haven't tested it, isn't the "checkpoint completed" message in the wrong place? Doesn't PQsendQuery() return immediately, so that the message needs to go *after* the PQgetResult() call?
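
To make the ordering concern concrete, here is a minimal Python sketch. It is a simulation, not the patch and not libpq: FakeConnection, send_query(), get_result(), checkpoint_finished() and start_backup() are hypothetical stand-ins, and the wording of the first log message is only illustrative. The assumption it models is the one raised above: the send call returns at once, and only the call that fetches the result waits for the server-side checkpoint.

```python
import threading
import time


class FakeConnection:
    """Hypothetical stand-in for a server connection; not a real libpq API."""

    def __init__(self, checkpoint_seconds):
        self._checkpoint_seconds = checkpoint_seconds
        self._done = threading.Event()

    def send_query(self, query):
        # Modeled on PQsendQuery(): dispatch the command and return
        # immediately, without waiting for any result.
        worker = threading.Thread(target=self._run_checkpoint, daemon=True)
        worker.start()
        return True

    def _run_checkpoint(self):
        # Simulated server-side checkpoint that takes a while.
        time.sleep(self._checkpoint_seconds)
        self._done.set()

    def checkpoint_finished(self):
        return self._done.is_set()

    def get_result(self):
        # Modeled on PQgetResult(): block until the server has answered.
        self._done.wait()
        return "ok"


def start_backup(conn, log):
    log.append("initiating base backup, waiting for checkpoint to complete")
    conn.send_query("BASE_BACKUP")
    # Printing "checkpoint completed" here would be premature: the
    # command has only been sent, the checkpoint is still running.
    finished_after_send = conn.checkpoint_finished()
    conn.get_result()
    # Only after the blocking call returns is the message accurate.
    finished_after_result = conn.checkpoint_finished()
    log.append("checkpoint completed")
    return finished_after_send, finished_after_result
```

In this model, start_backup(FakeConnection(0.2), []) reports the checkpoint as unfinished right after the send and as finished only after the result has been fetched, which is why the message belongs after the second call.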
