Thread: pg_basebackup behavior on non-existent slot

pg_basebackup behavior on non-existent slot

From
Dimitri Fontaine
Date:

Hi,

 

When developing pg_auto_failover we found a bug: if the target replication slot given to pg_basebackup does not exist, a full copy of the source PGDATA is completed before pg_basebackup errors out. You could easily end up copying 100GB of data over the network just to see pg_basebackup remove it all at the end. And when using the --progress option, you have to scroll up to the very start of the output to see the error message.

 

Please find attached a patch that shows a way to fix the issue. The patch is missing Windows compatibility; I don’t know how to cast the WNOHANG spell on this platform. Please use the patch as you see fit, either as inspiration, or maybe as something you would like to commit to fix the bug.
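
For readers unfamiliar with WNOHANG, here is a minimal sketch of the general technique on POSIX systems. It is not the attached patch: the helper names (spawn_failing_child, child_exited, copy_with_early_abort) are hypothetical, and the sleeping loop only stands in for the base backup copy. The idea is to poll the background child with waitpid(..., WNOHANG) between chunks of work, so that a child that failed at startup is noticed right away instead of after the whole copy.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a WAL receiver child that fails at startup:
 * the child exits immediately with the given code.
 * Returns the child's pid, or -1 if fork() failed.
 */
static pid_t
spawn_failing_child(int exit_code)
{
	pid_t		pid = fork();

	if (pid == 0)
		_exit(exit_code);
	return pid;
}

/*
 * Non-blocking check of the child's state.
 * Returns 1 if the child has terminated (*exit_code is set to its exit
 * code, or to -1 if it was killed by a signal), 0 if it is still
 * running, and -1 if waitpid() itself failed.
 */
static int
child_exited(pid_t pid, int *exit_code)
{
	int			status;
	pid_t		r;

	assert(pid > 0);
	r = waitpid(pid, &status, WNOHANG);
	if (r == 0)
		return 0;				/* child still running */
	if (r < 0)
		return -1;				/* waitpid() error */
	*exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
	return 1;
}

/*
 * Hypothetical copy loop: check the child before each chunk and stop
 * as soon as it is gone. Returns the number of chunks "copied".
 */
static int
copy_with_early_abort(pid_t child, int total_chunks, int *exit_code)
{
	struct timespec one_chunk = {0, 10 * 1000 * 1000};	/* 10 ms */
	int			done = 0;

	while (done < total_chunks)
	{
		if (child_exited(child, exit_code) != 0)
			break;				/* child exited or waitpid() failed */
		nanosleep(&one_chunk, NULL);	/* stand-in for sending one chunk */
		done++;
	}
	return done;
}
```

A caller would start the child, run the loop, and report an error if the loop stopped before total_chunks. Windows has no waitpid(), which is why this approach alone does not port there.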

 

Here is what it looks like without the patch:

 

$ ./src/bin/pg_basebackup/pg_basebackup -p 5501 -D /tmp/bb -X stream -S SlotDoesNotExists -P

pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR:  replication slot "SlotDoesNotExists" does not exist

32971/32971 kB (100%), 1/1 tablespace

pg_basebackup: error: child process exited with exit code 1

pg_basebackup: removing data directory "/tmp/bb"

 

 

Here’s what it looks like with the patch applied locally:

 

$ ./src/bin/pg_basebackup/pg_basebackup -p 5501 -D /tmp/bb -X stream -S SlotDoesNotExists -P

pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR:  replication slot "SlotDoesNotExists" does not exist

pg_basebackup: error: child process exited with exit code 1

pg_basebackup: removing data directory "/tmp/bb"

 

Regards,

-- 

Dimitri Fontaine

PostgreSQL Major Contributor, Citus Data, Microsoft

Author of “The Art of PostgreSQL”

Attachment

Re: pg_basebackup behavior on non-existent slot

From
Magnus Hagander
Date:
On Tue, Sep 28, 2021 at 12:03 PM Dimitri Fontaine
<Dimitri.Fontaine@microsoft.com> wrote:
>
> Hi,
>
>
>
> When developing pg_auto_failover we found a bug where if the target replication slot given to pg_basebackup does not exist, then a full copy of the source PGDATA is completed before erroring out. You could easily end up copying 100GB of data over the network just to see pg_basebackup remove them all at the end, and then when using the --progress option, you have to scroll up to the very start of the output to see the error message.
>
>
>
> Please find attached a patch that shows a way to fix the issue. The patch is missing Windows compatibility, I don’t know how to cast the WNOHANG spell on this platform. Please use the patch as you see fit, either inspiration, or maybe something you would like to commit to fix the bug.
>
>
>
> Here is what it looks like without the patch:
>
>
>
> $ ./src/bin/pg_basebackup/pg_basebackup -p 5501 -D /tmp/bb -X stream -S SlotDoesNotExists -P
>
> pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR:  replication slot "SlotDoesNotExists" does not exist
>
> 32971/32971 kB (100%), 1/1 tablespace
>
> pg_basebackup: error: child process exited with exit code 1
>
> pg_basebackup: removing data directory "/tmp/bb"
>
>
>
>
>
> Here’s what it looks like with the patch applied locally:
>
>
>
> $ ./src/bin/pg_basebackup/pg_basebackup -p 5501 -D /tmp/bb -X stream -S SlotDoesNotExists -P
>
> pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR:  replication slot "SlotDoesNotExists" does not exist
>
> pg_basebackup: error: child process exited with exit code 1
>
> pg_basebackup: removing data directory "/tmp/bb"

Isn't this solving basically the same thing as
https://commitfest.postgresql.org/34/3302/ (which does have Windows
support)? Does this implementation have some advantages over the one
Daniel posted, or should we just focus on that one since it does have
Windows support in it?

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: [EXTERNAL] Re: pg_basebackup behavior on non-existent slot

From
Dimitri Fontaine
Date:

Thanks Magnus for linking the two approaches together, I missed Daniel’s effort entirely. I think it’s safe to focus on Daniel’s entry in the commit fest, and use this pgsql-bugs email as a reminder that it should very probably be backported as a bug-fix to all the maintained branches.

 

Regards,

-- 

Dimitri Fontaine

PostgreSQL Major Contributor, Citus Data, Microsoft

Author of “The Art of PostgreSQL”

 

On 28/09/2021 12:25, "Magnus Hagander" <magnus@hagander.net> wrote:



On Tue, Sep 28, 2021 at 12:03 PM Dimitri Fontaine
<Dimitri.Fontaine@microsoft.com> wrote:
>
> Hi,
>
>
>
> When developing pg_auto_failover we found a bug where if the target replication slot given to pg_basebackup does not exist, then a full copy of the source PGDATA is completed before erroring out. You could easily end up copying 100GB of data over the network just to see pg_basebackup remove them all at the end, and then when using the --progress option, you have to scroll up to the very start of the output to see the error message.
>
>
>
> Please find attached a patch that shows a way to fix the issue. The patch is missing Windows compatibility, I don’t know how to cast the WNOHANG spell on this platform. Please use the patch as you see fit, either inspiration, or maybe something you would like to commit to fix the bug.
>
>
>
> Here is what it looks like without the patch:
>
>
>
> $ ./src/bin/pg_basebackup/pg_basebackup -p 5501 -D /tmp/bb -X stream -S SlotDoesNotExists -P
>
> pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR:  replication slot "SlotDoesNotExists" does not exist
>
> 32971/32971 kB (100%), 1/1 tablespace
>
> pg_basebackup: error: child process exited with exit code 1
>
> pg_basebackup: removing data directory "/tmp/bb"
>
>
>
>
>
> Here’s what it looks like with the patch applied locally:
>
>
>
> $ ./src/bin/pg_basebackup/pg_basebackup -p 5501 -D /tmp/bb -X stream -S SlotDoesNotExists -P
>
> pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR:  replication slot "SlotDoesNotExists" does not exist
>
> pg_basebackup: error: child process exited with exit code 1
>
> pg_basebackup: removing data directory "/tmp/bb"

Isn't this solving basically the same thing as
https://commitfest.postgresql.org/34/3302/ (which does have Windows
support)? Does this implementation have some advantages over the one
Daniel posted, or should we just focus on that one since it does have
Windows support in it?

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: [EXTERNAL] pg_basebackup behavior on non-existent slot

From
Daniel Gustafsson
Date:
> On 28 Sep 2021, at 13:12, Dimitri Fontaine <Dimitri.Fontaine@microsoft.com> wrote:

> I think it’s safe to focus on Daniel’s entry in the commit fest,

For reference, here is the output from your testcase running with the
referenced patch:

$ ./bin/pg_basebackup -p 5432 -D /tmp/bb -X stream -S SlotDoesNotExists -P
pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR:  replication slot "SlotDoesNotExists" does not exist
pg_basebackup: error: background WAL receiver terminated unexpectedly
pg_basebackup: removing data directory "/tmp/bb"

So this patch does seem to cover that case as well.

> and use this pgsql-bugs email as a reminder that it should very probably be backported as a bug-fix to all the maintained branches.

It probably should; I can't see anyone relying on the current behavior.  It is,
however, awfully intrusive for a backport, as it effectively adds functionality
to solve what a devil's advocate could argue is an inconvenience and not a
bug.  Our conservative stance on backports makes this not a clear-cut case, but
I'm interested in what others think here.

--
Daniel Gustafsson        https://vmware.com/