Re: pg_receivewal starting position - Mailing list pgsql-hackers

From Ronan Dunklau
Subject Re: pg_receivewal starting position
Date
Msg-id 2834449.HKl90sH2d3@aivenronan
Whole thread Raw
In response to Re: pg_receivewal starting position  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
Le vendredi 3 septembre 2021 17:49:34 CEST, vous avez écrit :
> On Fri, Sep 3, 2021 at 3:28 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> > There is still the concern raised by Bharath about being able to select
> > the
> > way to fetch the replication slot information for the user, and what
> > should or should not be included in the documentation. I am in favor of
> > documenting the process of selecting the wal start, and maybe include a
> > --start-lsn option for the user to override it, but that's maybe for
> > another patch.
>
> Let's hear from others.
>
> Thanks for the patches. I have some quick comments on the v5 patch-set:
>
> 0001:
> 1) Do you also want to MemSet values too in ReadReplicationSlot?
>
> 2) When if clause has single statement we don't generally use "{" and "}"
> + if (slot == NULL || !slot->in_use)
> + {
> + LWLockRelease(ReplicationSlotControlLock);
> + }
> you can just have:
> + if (slot == NULL || !slot->in_use)
> + LWLockRelease(ReplicationSlotControlLock);
>
> 3) This is still not copying the slot contents, as I said earlier you
> can just take the required info into some local variables instead of
> taking the slot pointer to slot_contents pointer.
> + /* Copy slot contents while holding spinlock */
> + SpinLockAcquire(&slot->mutex);
> + slot_contents = *slot;
> + SpinLockRelease(&slot->mutex);
> + LWLockRelease(ReplicationSlotControlLock);
>
> 4) As I said earlier, why can't we have separate tli variables for
> more readability instead of just one slots_position_timeline and
> timeline_history variable? And you are not even resetting those 2
> variables after if
> (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)), you might end
> up sending the restart_lsn timelineid for confirmed_flush. So, better
> use two separate variables. In fact you can use block local variables:
> + if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
> + {
> + List    *tl_history= NIL;
> + TimeLineID tli;
> + tl_history= readTimeLineHistory(ThisTimeLineID);
> + tli = tliOfPointInHistory(slot_contents.data.restart_lsn,
> +   tl_history);
> + values[i] = Int32GetDatum(tli);
> + nulls[i] = false;
> + }
> similarly for confirmed_flush.
>
> 5) I still don't see the need for below test case:
> + "READ_REPLICATION_SLOT does not produce an error with non existent slot");
> when we have
> + "READ_REPLICATION_SLOT does not produce an error with dropped slot");
> Because for a user, dropped or non existent slot is same, it's just
> that for dropped slot we internally don't delete its entry from the
> shared memory.
>

Thank you for reiterating those concerns. As I said, I haven't touched
Michael's version of the patch. I was incorporating those changes in my branch
before he sent this, so I'll probably merge both before sending an updated
patch.

> 0002:
> 1) Imagine GetSlotInformation always returns
> READ_REPLICATION_SLOT_ERROR, don't you think StreamLog enters an
> infinite loop there? Instead, why don't you just exit(1); instead of
> return; and retry? Similarly for READ_REPLICATION_SLOT_NONEXISTENT? I
> think, you can just do exit(1), no need to retry.
> + case READ_REPLICATION_SLOT_ERROR:
> +
> + /*
> + * Error has been logged by GetSlotInformation, return and
> + * maybe retry
> + */
> + return;

This is the same behaviour we had before: if there is an error with
pg_receivewal we retry the command. There was no special case for the
replication slot not existing before, I don't see why we should change it now
?

Eg:

2021-09-06 09:11:07.774 CEST [95853] ERROR:  replication slot "nonexistent"
does not exist
2021-09-06 09:11:07.774 CEST [95853] STATEMENT:  START_REPLICATION SLOT
"nonexistent" 0/1000000 TIMELINE 1
pg_receivewal: error: could not send replication command "START_REPLICATION":
ERROR:  replication slot "nonexistent" does not exist
pg_receivewal: disconnected; waiting 5 seconds to try again

Users may rely on it to keep retrying in the background until the slot has
been created for example.

>
> 2) Why is it returning READ_REPLICATION_SLOT_OK when slot_info isn't
> passed? And why are you having this check after you connect to the
> server and fetch the data?
> + /* If no slotinformation has been passed, we can return immediately */
> + if (slot_info == NULL)
> + {
> + PQclear(res);
> + return READ_REPLICATION_SLOT_OK;
> + }
> Instead you can just have a single assert:
>
> + Assert(slot_name && slot_info );

At first it was so that we didn't have to fill in all required information if we
don't need too, but it turns out pg_basebackup also has to check for the
slot's type. I agree we should not support the null slot_info case anymore.

>
> 3) How about
> pg_log_error("could not read replication slot:
> instead of
> pg_log_error("could not fetch replication slot:

Ok.
>
> 4) Why are you having the READ_REPLICATION_SLOT_OK case in default?
> + default:
> + if (slot_info.is_logical)
> + {
> + /*
> + * If the slot is not physical we can't expect to
> + * recover from that
> + */
> + pg_log_error("cannot use the slot provided, physical slot expected.");
> + exit(1);
> + }
> + stream.startpos = slot_info.restart_lsn;
> + stream.timeline = slot_info.restart_tli;
> + }
> You can just have another case statement for READ_REPLICATION_SLOT_OK
> and in the default you can throw an error "unknown read replication
> slot status" or some other better working and exit(1);

Ok.

>
> 5) Do you want initialize slot_info to 0?
> + if (replication_slot)
> + {
> + SlotInformation slot_info;
> + MemSet(slot_info, 0, sizeof(SlotInformation));
>
> 6) This isn't readable:
> + slot_info->is_logical = strcmp(PQgetvalue(res, 0, 0), "logical") == 0;
> How about:
> if (strcmp(PQgetvalue(res, 0, 0), "logical") == 0)
>    slot_info->is_logical = true;
> You don't need to set it false, because you would have
> MemSet(slot_info) in the caller.
>

Isn't it preferable to fill in the whole struct without any regards to it's
prior state ? I will modify the is_logical assignment to make it more readable
though.


> 7) How about
> uint32 hi;
> uint32 lo;
> instead of
> + uint32 hi,
> + lo;
>

Ok.

> 8) Move  SlotInformation * slot_info) to the next line as it crosses
> the 80 char limit.
> +GetSlotInformation(PGconn *conn, const char *slot_name,
> SlotInformation * slot_info)

Ok.

>
> 9) Instead of a boolean is_logical, I would rather suggest to use an
> enum or #define macros the slot types correctly, because someday we
> might introduce new type slots and somebody wants is_physical kind of
> variable name?
> +typedef struct SlotInformation {
> + bool    is_logical;

That's the reason why the READ_REPLICATION_SLOT command returns a slot type as
a string, but the intermediate representation on the client doesn't really
need to be concerned about this: it's the client after all and it will be very
easy to change it locally if we need to. Thinking about it, for our use case
we should really use is_physical instead of is_logical. But I guess all of
this is moot if we end up deciding not to support the logical case anymore ?

Best regards,

--
Ronan Dunklau





pgsql-hackers by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead
Next
From: Michael Paquier
Date:
Subject: Re: pg_receivewal starting position