Re: pg_replication_origin_drop API potential race condition - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: pg_replication_origin_drop API potential race condition
Date
Msg-id CAA4eK1+D=M2TZUFd06rW1HSChpV6Qim6d_2+-7arC-ewMb3HAQ@mail.gmail.com
Whole thread Raw
In response to Re: pg_replication_origin_drop API potential race condition  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: pg_replication_origin_drop API potential race condition
Re: pg_replication_origin_drop API potential race condition
List pgsql-hackers
On Fri, Feb 5, 2021 at 9:46 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> PSA patch updated per above suggestions.
>

Thanks, I have tested your patch and before the patch, I was getting
errors like "tuple concurrently deleted" or "cache lookup failed for
replication origin with oid 1" and after the patch, I am getting
"replication origin "origin-1" does not exist" which is clearly better
and user-friendly.

Before Patch
postgres=# select pg_replication_origin_drop('origin-1');
ERROR:  tuple concurrently deleted
postgres=# select pg_replication_origin_drop('origin-1');
ERROR:  cache lookup failed for replication origin with oid 1

After Patch
postgres=# select pg_replication_origin_drop('origin-1');
ERROR:  replication origin "origin-1" does not exist

I wonder why you haven't changed the usage of the existing
replorigin_drop in the code? I have changed the same, added few
comments, ran pgindent, and updated the commit message in the
attached.

I am not completely whether we should retire replorigin_drop or just
keep it for backward compatibility? What do you think? Anybody else
has any opinion?

For others, the purpose of this patch is to "make
pg_replication_origin_drop safe against concurrent drops.". Currently,
we get the origin id from the name and then drop the origin by taking
ExclusiveLock on ReplicationOriginRelationId. So, two concurrent
sessions can get the id from the name at the same time, and then when
they try to drop the origin, one of the sessions will get either
"tuple concurrently deleted" or "cache lookup failed for replication
origin ..".

To prevent this race condition we do the entire operation under lock.
This obviates the need for replorigin_drop() API but we have kept it
for backward compatibility.

-- 
With Regards,
Amit Kapila.

Attachment

pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: logical replication worker accesses catalogs in error context callback
Next
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: RE: Single transaction in the tablesync worker?