Thread: pg_basebackup behavior on non-existent slot
Hallo, I was just confronted with this behaviour described back in 2017, and the participants in the thread seem to consider it a bug. But I'm running on version 13. So, how did that discussion end? My case: @ pg_basebackup -h 192.168.1.131 -D $PWD/13/main -U repuser -v -P -S stream Password: pg_basebackup: initiating base backup, waiting for checkpoint to complete pg_basebackup: checkpoint completed pg_basebackup: write-ahead log start point: 23/4E000028 on timeline 1 pg_basebackup: starting background WAL receiver pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: replication slot "stream" does not exist 53790996/53790996 kB (100%), 2/2 tablespaces pg_basebackup: write-ahead log end point: 23/4E000138 pg_basebackup: waiting for background process to finish streaming ... pg_basebackup: error: child process exited with exit code 1 pg_basebackup: removing data directory "/var/lib/postgresql/13/main" pg_basebackup: changes to tablespace directories will not be undone The old thread: https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com Kind regards, Gerard
Hi, On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote: > > Hallo, > > I was just confronted with this behaviour described back in 2017, and > the participants in the thread seem to consider it a bug. But I'm > running on version 13. So, how did that discussion end? > > My case: > @ pg_basebackup -h 192.168.1.131 -D $PWD/13/main -U repuser -v -P -S stream > Password: > pg_basebackup: initiating base backup, waiting for checkpoint to complete > pg_basebackup: checkpoint completed > pg_basebackup: write-ahead log start point: 23/4E000028 on timeline 1 > pg_basebackup: starting background WAL receiver > pg_basebackup: error: could not send replication command > "START_REPLICATION": ERROR: replication slot "stream" does not exist > 53790996/53790996 kB (100%), 2/2 tablespaces > > pg_basebackup: write-ahead log end point: 23/4E000138 > pg_basebackup: waiting for background process to finish streaming ... > pg_basebackup: error: child process exited with exit code 1 > pg_basebackup: removing data directory "/var/lib/postgresql/13/main" > pg_basebackup: changes to tablespace directories will not be undone > > The old thread: > https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com It seems it's not fixed yet even in HEAD as far as I tested. There were some ideas to fix that on that thread but the main point was how to fix it on Windows. I guess that since it creates a transient slot it’s not a common case to specify a non-existence slot in pg_baseback but what is your use case? This might help motivate to fix this issue. BTW in that thread, there was a discussion on how to detect the streamer process failure in the main process but probably we can fix this by just doing an existence check for the specified name replication slot before starting the streamer process? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Aug 23, 2021 at 12:54:58PM +0900, Masahiko Sawada wrote: > On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote: >> The old thread: >> https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com A link to postgresql.org does the same work: https://www.postgresql.org/message-id/CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com No need to redirect that elsewhere. > It seems it's not fixed yet even in HEAD as far as I tested. There > were some ideas to fix that on that thread but the main point was how > to fix it on Windows. I guess that since it creates a transient slot > it’s not a common case to specify a non-existence slot in pg_baseback > but what is your use case? This might help motivate to fix this issue. > > BTW in that thread, there was a discussion on how to detect the > streamer process failure in the main process but probably we can fix > this by just doing an existence check for the specified name > replication slot before starting the streamer process? Yeah. Honestly, I am not really excited to redesign this part of base backups just to take care of a side problem that is mitigated for most users with the use of a temporary slot. It would be simpler, as you say, to find a way to detect if the slot wanted exists before even launching START_REPLICATION with the specified slot on the second thread copying the WAL, but that would be a new thing as one cannot do catalog lookups with a physical replication session. Some ideas here may be to extend the result of IDENTIFY_SYSTEM to return the slots available in a text array, invent a new command called SLOT, use SHOW or perhaps do an extra CREATE_REPLICATION_SLOT TEMPORARY from the main thread. -- Michael
Attachment
Hallo, my "use case" was forgetting the "-C" option whilst creating a rather large standby database over a slow network connection. I didn't interrupt it since it seemed to continue without problem until al the work was done, half a day later. Switching to a gigabit connection allowed me to repeat this much faster, and find a solution. I'm just learning about replicationon Postgres. If this is too complex to fix, perhaps adding a warning to the error message: "pg_basebackup: error: could not send replication command, replication will fail" ? Thanks! Gerard ________________________________________ Van: Masahiko Sawada <sawada.mshk@gmail.com> Verzonden: maandag 23 augustus 2021 5:54 Aan: Gerard H. Pille CC: PostgreSQL mailing lists Onderwerp: Re: pg_basebackup behavior on non-existent slot Hi, On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote: > > Hallo, > > I was just confronted with this behaviour described back in 2017, and > the participants in the thread seem to consider it a bug. But I'm > running on version 13. So, how did that discussion end? > > My case: > @ pg_basebackup -h 192.168.1.131 -D $PWD/13/main -U repuser -v -P -S stream > Password: > pg_basebackup: initiating base backup, waiting for checkpoint to complete > pg_basebackup: checkpoint completed > pg_basebackup: write-ahead log start point: 23/4E000028 on timeline 1 > pg_basebackup: starting background WAL receiver > pg_basebackup: error: could not send replication command > "START_REPLICATION": ERROR: replication slot "stream" does not exist > 53790996/53790996 kB (100%), 2/2 tablespaces > > pg_basebackup: write-ahead log end point: 23/4E000138 > pg_basebackup: waiting for background process to finish streaming ... > pg_basebackup: error: child process exited with exit code 1 > pg_basebackup: removing data directory "/var/lib/postgresql/13/main" > pg_basebackup: changes to tablespace directories will not be undone > > The old thread: > https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com It seems it's not fixed yet even in HEAD as far as I tested. There were some ideas to fix that on that thread but the main point was how to fix it on Windows. I guess that since it creates a transient slot it’s not a common case to specify a non-existence slot in pg_baseback but what is your use case? This might help motivate to fix this issue. BTW in that thread, there was a discussion on how to detect the streamer process failure in the main process but probably we can fix this by just doing an existence check for the specified name replication slot before starting the streamer process? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Aug 23, 2021 at 4:11 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Aug 23, 2021 at 12:54:58PM +0900, Masahiko Sawada wrote: > > On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote: > >> The old thread: > >> https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com > > A link to postgresql.org does the same work: > https://www.postgresql.org/message-id/CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com > No need to redirect that elsewhere. > > > It seems it's not fixed yet even in HEAD as far as I tested. There > > were some ideas to fix that on that thread but the main point was how > > to fix it on Windows. I guess that since it creates a transient slot > > it’s not a common case to specify a non-existence slot in pg_baseback > > but what is your use case? This might help motivate to fix this issue. > > > > BTW in that thread, there was a discussion on how to detect the > > streamer process failure in the main process but probably we can fix > > this by just doing an existence check for the specified name > > replication slot before starting the streamer process? > > Yeah. Honestly, I am not really excited to redesign this part of base > backups just to take care of a side problem that is mitigated for most > users with the use of a temporary slot. I think that this problem can happen also when using a temporary slot. If the slot loses WAL file due to max_slot_wal_keep_size, the streamer process could fail even while streaming WAL. I guess that the main process taking a base backup won't stop in this case. If it's true, pre-checking the existence of the slot might not be enough. > It would be simpler, as you > say, to find a way to detect if the slot wanted exists before even > launching START_REPLICATION with the specified slot on the second > thread copying the WAL, but that would be a new thing as one cannot do > catalog lookups with a physical replication session. Good point. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Le lundi 23 août 2021, 09:11:12 CEST Michael Paquier a écrit : Some ideas here > may be to extend the result of IDENTIFY_SYSTEM to return the slots > available in a text array, invent a new command called SLOT, use SHOW > or perhaps do an extra CREATE_REPLICATION_SLOT TEMPORARY from the main > thread. This new command (albeit named differently, READ_REPLICATION_SLOT) has been proposed here: https://www.postgresql.org/message-id/ 1642037.dBafQRxpgh@aivenronan for a different purpose, although it would be useful to have that in that case too. I guess I can add a patch to use that in pg_basebackup itself in the same patchset. -- Ronan Dunklau
On Thu, Aug 26, 2021 at 2:31 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote: > > Le lundi 23 août 2021, 09:11:12 CEST Michael Paquier a écrit : > Some ideas here > > may be to extend the result of IDENTIFY_SYSTEM to return the slots > > available in a text array, invent a new command called SLOT, use SHOW > > or perhaps do an extra CREATE_REPLICATION_SLOT TEMPORARY from the main > > thread. > > This new command (albeit named differently, READ_REPLICATION_SLOT) has been > proposed here: https://www.postgresql.org/message-id/ > 1642037.dBafQRxpgh@aivenronan for a different purpose, although it would be > useful to have that in that case too. Cool! I was considering other use cases of those new command ideas proposed on this thread. > I guess I can add a patch to use that in pg_basebackup itself in the same > patchset. +1 I'll look at the patches. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Aug 26, 2021 at 04:26:41PM +0900, Masahiko Sawada wrote: > Cool! I was considering other use cases of those new command ideas > proposed on this thread. So that makes two use cases for that. That sounds convincing. > I'll look at the patches. So will I. At quick glance, there are some issues in them. Let's move the discussion there, then. -- Michael