Thread: pg_basebackup behavior on non-existent slot

pg_basebackup behavior on non-existent slot

From
"Gerard H. Pille"
Date:
Hallo,

I was just confronted with this behaviour described back in 2017, and 
the participants in the thread seem to consider it a bug.  But I'm 
running on version 13.  So, how did that discussion end?

My case:
@ pg_basebackup -h 192.168.1.131 -D $PWD/13/main -U repuser -v -P -S stream
Password:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 23/4E000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: error: could not send replication command 
"START_REPLICATION": ERROR:  replication slot "stream" does not exist
53790996/53790996 kB (100%), 2/2 tablespaces 

pg_basebackup: write-ahead log end point: 23/4E000138
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: error: child process exited with exit code 1
pg_basebackup: removing data directory "/var/lib/postgresql/13/main"
pg_basebackup: changes to tablespace directories will not be undone

The old thread:
https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com

Kind regards,

Gerard



Re: pg_basebackup behavior on non-existent slot

From
Masahiko Sawada
Date:
Hi,

On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote:
>
> Hallo,
>
> I was just confronted with this behaviour described back in 2017, and
> the participants in the thread seem to consider it a bug.  But I'm
> running on version 13.  So, how did that discussion end?
>
> My case:
> @ pg_basebackup -h 192.168.1.131 -D $PWD/13/main -U repuser -v -P -S stream
> Password:
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 23/4E000028 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: error: could not send replication command
> "START_REPLICATION": ERROR:  replication slot "stream" does not exist
> 53790996/53790996 kB (100%), 2/2 tablespaces
>
> pg_basebackup: write-ahead log end point: 23/4E000138
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: error: child process exited with exit code 1
> pg_basebackup: removing data directory "/var/lib/postgresql/13/main"
> pg_basebackup: changes to tablespace directories will not be undone
>
> The old thread:
> https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com

It seems it's not fixed yet even in HEAD as far as I tested. There
were some ideas to fix that on that thread but the main point was how
to fix it on Windows. I guess that since it creates a transient slot
it’s not a common case to specify a non-existence slot in pg_baseback
but what is your use case? This might help motivate to fix this issue.

BTW in that thread, there was a discussion on how to detect the
streamer process failure in the main process but probably we can fix
this by just doing an existence check for the specified name
replication slot before starting the streamer process?

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pg_basebackup behavior on non-existent slot

From
Michael Paquier
Date:
On Mon, Aug 23, 2021 at 12:54:58PM +0900, Masahiko Sawada wrote:
> On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote:
>> The old thread:
>> https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com

A link to postgresql.org does the same work:
https://www.postgresql.org/message-id/CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com
No need to redirect that elsewhere.

> It seems it's not fixed yet even in HEAD as far as I tested. There
> were some ideas to fix that on that thread but the main point was how
> to fix it on Windows. I guess that since it creates a transient slot
> it’s not a common case to specify a non-existence slot in pg_baseback
> but what is your use case? This might help motivate to fix this issue.
>
> BTW in that thread, there was a discussion on how to detect the
> streamer process failure in the main process but probably we can fix
> this by just doing an existence check for the specified name
> replication slot before starting the streamer process?

Yeah.  Honestly, I am not really excited to redesign this part of base
backups just to take care of a side problem that is mitigated for most
users with the use of a temporary slot.  It would be simpler, as you
say, to find a way to detect if the slot wanted exists before even
launching START_REPLICATION with the specified slot on the second
thread copying the WAL, but that would be a new thing as one cannot do
catalog lookups with a physical replication session.  Some ideas here
may be to extend the result of IDENTIFY_SYSTEM to return the slots
available in a text array, invent a new command called SLOT, use SHOW
or perhaps do an extra CREATE_REPLICATION_SLOT TEMPORARY from the main
thread.
--
Michael

Attachment

Re: pg_basebackup behavior on non-existent slot

From
"Gerard H. Pille"
Date:
Hallo,

my "use case" was forgetting the "-C" option whilst creating a rather large standby database over a slow network
connection. I didn't interrupt it since it seemed to continue without problem until al the work was done, half a day
later.

Switching to a gigabit connection allowed me to repeat this much faster, and find a solution.  I'm just learning about
replicationon Postgres. 

If this is too complex to fix, perhaps adding a warning to the error message:

"pg_basebackup: error: could not send replication command, replication will fail" ?

Thanks!

Gerard

________________________________________
Van: Masahiko Sawada <sawada.mshk@gmail.com>
Verzonden: maandag 23 augustus 2021 5:54
Aan: Gerard H. Pille
CC: PostgreSQL mailing lists
Onderwerp: Re: pg_basebackup behavior on non-existent slot

Hi,

On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote:
>
> Hallo,
>
> I was just confronted with this behaviour described back in 2017, and
> the participants in the thread seem to consider it a bug.  But I'm
> running on version 13.  So, how did that discussion end?
>
> My case:
> @ pg_basebackup -h 192.168.1.131 -D $PWD/13/main -U repuser -v -P -S stream
> Password:
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 23/4E000028 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: error: could not send replication command
> "START_REPLICATION": ERROR:  replication slot "stream" does not exist
> 53790996/53790996 kB (100%), 2/2 tablespaces
>
> pg_basebackup: write-ahead log end point: 23/4E000138
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: error: child process exited with exit code 1
> pg_basebackup: removing data directory "/var/lib/postgresql/13/main"
> pg_basebackup: changes to tablespace directories will not be undone
>
> The old thread:
> https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com

It seems it's not fixed yet even in HEAD as far as I tested. There
were some ideas to fix that on that thread but the main point was how
to fix it on Windows. I guess that since it creates a transient slot
it’s not a common case to specify a non-existence slot in pg_baseback
but what is your use case? This might help motivate to fix this issue.

BTW in that thread, there was a discussion on how to detect the
streamer process failure in the main process but probably we can fix
this by just doing an existence check for the specified name
replication slot before starting the streamer process?

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pg_basebackup behavior on non-existent slot

From
Masahiko Sawada
Date:
On Mon, Aug 23, 2021 at 4:11 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Aug 23, 2021 at 12:54:58PM +0900, Masahiko Sawada wrote:
> > On Sat, Aug 21, 2021 at 10:15 PM Gerard H. Pille <ghpille@hotmail.com> wrote:
> >> The old thread:
> >> https://postgrespro.com/list/thread-id/2337189#CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com
>
> A link to postgresql.org does the same work:
> https://www.postgresql.org/message-id/CAMkU=1wSxYBNFY9TzuVh3=mDLr4BBsMct6wcViNMH+-6Xon4Uw@mail.gmail.com
> No need to redirect that elsewhere.
>
> > It seems it's not fixed yet even in HEAD as far as I tested. There
> > were some ideas to fix that on that thread but the main point was how
> > to fix it on Windows. I guess that since it creates a transient slot
> > it’s not a common case to specify a non-existence slot in pg_baseback
> > but what is your use case? This might help motivate to fix this issue.
> >
> > BTW in that thread, there was a discussion on how to detect the
> > streamer process failure in the main process but probably we can fix
> > this by just doing an existence check for the specified name
> > replication slot before starting the streamer process?
>
> Yeah.  Honestly, I am not really excited to redesign this part of base
> backups just to take care of a side problem that is mitigated for most
> users with the use of a temporary slot.

I think that this problem can happen also when using a temporary slot.
If the slot loses WAL file due to max_slot_wal_keep_size, the streamer
process could fail even while streaming WAL. I guess that the main
process taking a base backup won't stop in this case. If it's true,
pre-checking the existence of the slot might not be enough.

> It would be simpler, as you
> say, to find a way to detect if the slot wanted exists before even
> launching START_REPLICATION with the specified slot on the second
> thread copying the WAL, but that would be a new thing as one cannot do
> catalog lookups with a physical replication session.

Good point.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pg_basebackup behavior on non-existent slot

From
Ronan Dunklau
Date:
Le lundi 23 août 2021, 09:11:12 CEST Michael Paquier a écrit :
Some ideas here
> may be to extend the result of IDENTIFY_SYSTEM to return the slots
> available in a text array, invent a new command called SLOT, use SHOW
> or perhaps do an extra CREATE_REPLICATION_SLOT TEMPORARY from the main
> thread.

This new command (albeit named differently, READ_REPLICATION_SLOT) has been
proposed here: https://www.postgresql.org/message-id/
1642037.dBafQRxpgh@aivenronan for a different purpose, although it would be
useful to have that in that case too.

I guess I can add a patch to use that in pg_basebackup itself in the same
patchset.

--
Ronan Dunklau





Re: pg_basebackup behavior on non-existent slot

From
Masahiko Sawada
Date:
On Thu, Aug 26, 2021 at 2:31 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
>
> Le lundi 23 août 2021, 09:11:12 CEST Michael Paquier a écrit :
> Some ideas here
> > may be to extend the result of IDENTIFY_SYSTEM to return the slots
> > available in a text array, invent a new command called SLOT, use SHOW
> > or perhaps do an extra CREATE_REPLICATION_SLOT TEMPORARY from the main
> > thread.
>
> This new command (albeit named differently, READ_REPLICATION_SLOT) has been
> proposed here: https://www.postgresql.org/message-id/
> 1642037.dBafQRxpgh@aivenronan for a different purpose, although it would be
> useful to have that in that case too.

Cool! I was considering other use cases of those new command ideas
proposed on this thread.

> I guess I can add a patch to use that in pg_basebackup itself in the same
> patchset.

+1

I'll look at the patches.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pg_basebackup behavior on non-existent slot

From
Michael Paquier
Date:
On Thu, Aug 26, 2021 at 04:26:41PM +0900, Masahiko Sawada wrote:
> Cool! I was considering other use cases of those new command ideas
> proposed on this thread.

So that makes two use cases for that.  That sounds convincing.

> I'll look at the patches.

So will I.  At quick glance, there are some issues in them.  Let's
move the discussion there, then.
--
Michael

Attachment