Re: Logical decoding on standby - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Logical decoding on standby
Msg-id 20170328152238.h3ikwwsl5kbqq6nk@alap3.anarazel.de
In response to Re: Logical decoding on standby  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: Logical decoding on standby  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
Hi,

On 2017-03-27 16:03:48 +0800, Craig Ringer wrote:
> On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote:
> 
> > So this patch makes ReplicationSlotAcquire check that the slot
> > database matches the current database and refuse to acquire the slot
> > if it does not.
> 
> New patch attached that drops above requirement, so slots can still be
> dropped from any DB.
> 
> This introduces a narrow race window where DROP DATABASE may ERROR if
> somebody connects to a different database and runs a
> pg_drop_replication_slot(...) for one of the slots being dropped by
> DROP DATABASE after we check for active slots but before we've dropped
> the slot. But it's hard to hit and it's pretty harmless; the worst
> possible result is dropping one or more of the slots before we ERROR
> out of the DROP. But you clearly didn't want them anyway, since you
> were dropping the DB and dropping some slots at the same time.
> 
> I think this one's ready to go.
> 
> -- 
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

> From 99d5313d3a265bcc57ca6845230b9ec49d188710 Mon Sep 17 00:00:00 2001
> From: Craig Ringer <craig@2ndquadrant.com>
> Date: Wed, 22 Mar 2017 13:21:09 +0800
> Subject: [PATCH] Make DROP DATABASE drop logical slots for the DB
> 
> Automatically drop all logical replication slots associated with a
> database when the database is dropped.
> ---
>  doc/src/sgml/func.sgml                             |  3 +-
>  doc/src/sgml/protocol.sgml                         |  2 +
>  src/backend/commands/dbcommands.c                  | 32 +++++---
>  src/backend/replication/slot.c                     | 88 ++++++++++++++++++++++
>  src/include/replication/slot.h                     |  1 +
>  src/test/recovery/t/006_logical_decoding.pl        | 40 +++++++++-
>  .../recovery/t/010_logical_decoding_timelines.pl   | 30 +++++++-
>  7 files changed, 182 insertions(+), 14 deletions(-)
> 
> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> index ba6f8dd..78508d7 100644
> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -18876,7 +18876,8 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
>         <entry>
>          Drops the physical or logical replication slot
>          named <parameter>slot_name</parameter>. Same as replication protocol
> -        command <literal>DROP_REPLICATION_SLOT</>.
> +        command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must
> +        be called when connected to the same database the slot was created on.
>         </entry>
>        </row>
>  
> diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
> index b3a5026..5f97141 100644
> --- a/doc/src/sgml/protocol.sgml
> +++ b/doc/src/sgml/protocol.sgml
> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>       <para>
>        Drops a replication slot, freeing any reserved server-side resources. If
>        the slot is currently in use by an active connection, this command fails.
> +      If the slot is a logical slot that was created in a database other than
> +      the database the walsender is connected to, this command fails.
>       </para>
>       <variablelist>
>        <varlistentry>

Shouldn't the docs in the DROP DATABASE section mention this as well?
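Something along these lines, say - wording and placement in
doc/src/sgml/ref/drop_database.sgml just a sketch:

```
  <para>
   <command>DROP DATABASE</command> also drops any logical replication
   slots that belong to the target database.  If such a slot is
   currently active, the command fails.
  </para>
```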


> +void
> +ReplicationSlotsDropDBSlots(Oid dboid)
> +{
> +    int            i;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +restart:
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData slotname;
> +        int active_pid;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotControlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* only logical slots are database specific, skip */
> +        if (!SlotIsLogical(s))
> +            continue;
> +
> +        /* not our database, skip */
> +        if (s->data.database != dboid)
> +            continue;
> +
> +        /* Claim the slot, as if ReplicationSlotAcquire()ing. */
> +        SpinLockAcquire(&s->mutex);
> +        strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
> +        NameStr(slotname)[NAMEDATALEN-1] = '\0';
> +        active_pid = s->active_pid;
> +        if (active_pid == 0)
> +        {
> +            MyReplicationSlot = s;
> +            s->active_pid = MyProcPid;
> +        }
> +        SpinLockRelease(&s->mutex);
> +
> +        /*
> +         * We might fail here if the slot was active. Even though we hold an
> +         * exclusive lock on the database object a logical slot for that DB can
> +         * still be active if it's being dropped by a backend connected to
> +         * another DB or is otherwise acquired.
> +         *
> +         * It's an unlikely race that'll only arise from concurrent user action,
> +         * so we'll just bail out.
> +         */
> +        if (active_pid)
> +            elog(ERROR, "replication slot %s is in use by pid %d",
> +                  NameStr(slotname), active_pid);
> +
> +        /*
> +         * To avoid largely duplicating ReplicationSlotDropAcquired() or
> +         * complicating it with already_locked flags for ProcArrayLock,
> +         * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
> +         * just release our ReplicationSlotControlLock to drop the slot.
> +         *
> +         * For safety we'll restart our scan from the beginning each
> +         * time we release the lock.
> +         */
> +        LWLockRelease(ReplicationSlotControlLock);
> +        ReplicationSlotDropAcquired();
> +        goto restart;
> +    }
> +    LWLockRelease(ReplicationSlotControlLock);
> +
> +    /* recompute limits once after all slots are dropped */
> +    ReplicationSlotsComputeRequiredXmin(false);
> +    ReplicationSlotsComputeRequiredLSN();

I was concerned for a second that we'd skip doing the
ReplicationSlotsComputeRequired* calls if we ERROR out above - but
ReplicationSlotDropAcquired already calls these as necessary. I.e. the
two calls here are redundant and should be dropped.
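I.e. the tail of ReplicationSlotsDropDBSlots() could just end up as
(sketch):

```
 	LWLockRelease(ReplicationSlotControlLock);
-
-	/* recompute limits once after all slots are dropped */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
 }
```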


- Andres


