Thread: [HACKERS] pg_basebackup behavior on non-existent slot
If I tell pg_basebackup to use a non-existent slot, it immediately reports an error. And then it exits with an error, but only after streaming the entire database contents.
If you are doing this interactively and are on the ball, of course, you can hit ctrl-C when you see the error message.
I don't know if this is exactly a bug, but it seems rather unfortunate.
Should the parent process of pg_basebackup be made to respond to SIGCHLD? Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
$ /usr/local/pgsql9_6/bin/pg_basebackup -D data_replica -P --slot foobar -Xs
pg_basebackup: could not send replication command "START_REPLICATION": ERROR: replication slot "foobar" does not exist
22384213/22384213 kB (100%), 1/1 tablespace
pg_basebackup: child process exited with error 1
pg_basebackup: removing data directory "data_replica"
Cheers,
Jeff
On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
If I tell pg_basebackup to use a non-existent slot, it immediately reports an error. And then it exits with an error, but only after streaming the entire database contents.
If you are doing this interactively and are on the ball, of course, you can hit ctrl-C when you see the error message.
I don't know if this is exactly a bug, but it seems rather unfortunate.
I think that should qualify as a bug.
In version 10 it will automatically create a transient slot in this case, but there might still be a case where you can provoke this.
Should the parent process of pg_basebackup be made to respond to SIGCHLD? Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
I think it's ok to just call waitpid() -- we don't need to react super quickly, but we should react. And we should then exit the main process with an error before actually streaming everything.
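For illustration only, the waitpid() polling idea discussed above could look roughly like the sketch below. It is not code from pg_basebackup: the helper names (child_has_exited, spawn_failing_child, copy_until_child_fails) and the 10 ms sleep standing in for receiving one chunk of data are assumptions made for the example. Only waitpid() with WNOHANG comes from the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from pg_basebackup): returns true if the child
 * has already terminated, without blocking. On true, *status holds the
 * wait status filled in by waitpid().
 */
static bool
child_has_exited(pid_t pid, int *status)
{
    pid_t r;

    assert(pid > 0);
    r = waitpid(pid, status, WNOHANG);
    /* 0 means the child is still running; pid means it was reaped. */
    return r == pid;
}

/*
 * Fork a child that exits at once with exit_code, standing in for a WAL
 * streamer that fails at startup. Returns the child's pid (or -1).
 */
static pid_t
spawn_failing_child(int exit_code)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(exit_code);
    return pid;
}

/*
 * Simulated copy loop: polls the child between chunks and stops early if
 * it has died. Returns the child's exit code, or -1 if the child did not
 * exit normally within max_chunks iterations.
 */
static int
copy_until_child_fails(pid_t child, int max_chunks)
{
    struct timespec chunk = {0, 10 * 1000 * 1000};  /* 10 ms per "chunk" */
    int status = 0;

    for (int i = 0; i < max_chunks; i++)
    {
        if (child_has_exited(child, &status))
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        nanosleep(&chunk, NULL);    /* stand-in for receiving data */
    }
    return -1;
}
```

In this sketch every iteration costs one system call, which is the overhead raised in the follow-up message.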
Magnus Hagander wrote:
> On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
>
> I think it's ok to just call waitpid() -- we don't need to react super
> quickly, but we should react.
Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.
> And we should then exit the main process with an error before actually
> streaming everything.
Right.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
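As a rough sketch of the SIGCHLD-flag idea described in this message (again not pg_basebackup code), the handler below only sets a flag, and the loop tests that flag in user space, calling waitpid() once after the flag is seen. The names child_exited, install_sigchld_handler and copy_loop_with_flag are hypothetical, and the sketch assumes a single child process and a POSIX platform.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Set by the handler; the copy loop reads it without any system call. */
static volatile sig_atomic_t child_exited = 0;

static void
sigchld_handler(int signo)
{
    (void) signo;
    child_exited = 1;       /* only async-signal-safe work here */
}

/* Install the handler. Returns 0 on success, -1 on failure. */
static int
install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Simulated copy loop. Returns the child's exit code once the flag has
 * been seen, or -1 if the child did not exit normally within max_chunks
 * iterations. Assumes "child" is the only child of this process.
 */
static int
copy_loop_with_flag(pid_t child, int max_chunks)
{
    struct timespec chunk = {0, 10 * 1000 * 1000};  /* 10 ms per "chunk" */
    int status = 0;

    assert(child > 0);
    for (int i = 0; i < max_chunks; i++)
    {
        if (child_exited)
        {
            /* The single waitpid() call happens only after the signal. */
            if (waitpid(child, &status, 0) == child && WIFEXITED(status))
                return WEXITSTATUS(status);
            return -1;
        }
        nanosleep(&chunk, NULL);    /* stand-in for receiving data */
    }
    return -1;
}
```

The handler must be installed before the child is forked, otherwise an early exit could be missed.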
On Wed, Sep 6, 2017 at 11:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Magnus Hagander wrote:
> On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
>
> I think it's ok to just call waitpid() -- we don't need to react super
> quickly, but we should react.
Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.
Good point.
So the question is what to do for Windows. I'd rather not have to bring the whole extra thread and socket emulation stuff into pg_basebackup if it can be avoided. But I guess we could code up something Windows-specific in just that one (since the streamer runs as a thread rather than a separate process on Windows, it's easier than in the backend). I think that means we'd have to rewrite it to use the async libpq APIs, don't you?
The other option would be to just kill the process from the child thread. Since they're threads we can do that. However, that would leave us unable to clean up after the error (as in removing files/dirs); not sure that's good?
On Wed, Sep 6, 2017 at 2:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Magnus Hagander wrote:
> On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
>
> I think it's ok to just call waitpid() -- we don't need to react super
> quickly, but we should react.
Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.
If we don't want polling by waitpid, then my next thought would be to move the data copy into another process, then have the main process do nothing but wait for the first child to exit. If the first to exit is the WAL receiver, then we must have an error and the data receiver can be killed. I don't know how to translate that to Windows, however.
Cheers,
Jeff
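The two-child arrangement proposed in this message might be sketched as follows on a POSIX system. This is an assumption-laden illustration, not pg_basebackup code: spawn_worker and supervise are hypothetical names, the workers merely sleep and exit, and SIGTERM is used as a stand-in for however the surviving process would really be told to stop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Fork a stand-in worker that runs for run_ms milliseconds and then
 * exits with exit_code. Returns the child's pid (or -1 on failure).
 */
static pid_t
spawn_worker(int run_ms, int exit_code)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        struct timespec ts = {run_ms / 1000, (run_ms % 1000) * 1000000L};

        nanosleep(&ts, NULL);
        _exit(exit_code);
    }
    return pid;
}

/*
 * Block until the first child exits, then stop and reap the other one.
 * Returns 1 if the WAL receiver exited first (treated as an error),
 * 0 if the data receiver finished first with exit code 0, and -1 for
 * anything else.
 */
static int
supervise(pid_t wal_receiver, pid_t data_receiver)
{
    int status = 0;
    pid_t first;

    assert(wal_receiver > 0 && data_receiver > 0);
    first = waitpid(-1, &status, 0);    /* sleeps until any child exits */

    if (first == wal_receiver)
    {
        /* WAL receiver died early: abandon the data copy. */
        kill(data_receiver, SIGTERM);
        waitpid(data_receiver, NULL, 0);
        return 1;
    }
    if (first == data_receiver)
    {
        /* Simplification: just stop the WAL receiver once the copy ends. */
        kill(wal_receiver, SIGTERM);
        waitpid(wal_receiver, NULL, 0);
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
    }
    return -1;
}
```

The main process makes no polling calls here; it sleeps in a single blocking waitpid() until one of the children exits.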
On Tue, Sep 12, 2017 at 7:35 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Sep 6, 2017 at 2:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Magnus Hagander wrote:
> On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
>
> I think it's ok to just call waitpid() -- we don't need to react super
> quickly, but we should react.
Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no? I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.
If we don't want polling by waitpid, then my next thought would be to move the data copy into another process, then have the main process do nothing but wait for the first child to exit. If the first to exit is the WAL receiver, then we must have an error and the data receiver can be killed. I don't know how to translate that to Windows, however.
Well, we could do something similar: run the main work and the streamer in separate threads on Windows and have a main thread wait on both. The main thread would have to be in charge of cleanup as well, of course. But I think that's likely going to be more complicated than using the non-blocking libpq APIs.