Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CAA4eK1K_6wQPd=f4JeeWeaJcKU-YopVd90WUouuKVodGRKFOAQ@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Amit Khandekar <amitdkhan.pg@gmail.com>)
List pgsql-hackers
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> --------------------
>
> Some further review comments for undoworker.c :
>
>
> +/* Sets the worker's lingering status. */
> +static void
> +UndoWorkerIsLingering(bool sleep)
> The function name sounds like "is the worker lingering ?". Can we
> rename it to something like "UndoWorkerSetLingering" ?
>

makes sense, changed as per suggestion.

> -------------
>
> + errmsg("undo worker slot %d is empty, cannot attach",
> + slot)));
>
> + }
> +
> + if (MyUndoWorker->proc)
> + {
> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is already used by "
> + "another worker, cannot attach", slot)));
>
> These two error messages can have a common error message "could not
> attach to worker slot", with errdetail separate for each of them :
> slot %d is empty.
> slot %d is already used by another worker.
>
> --------------
>

Changed as per suggestion.

> +static int
> +IsUndoWorkerAvailable(void)
> +{
> + int i;
> + int alive_workers = 0;
> +
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
>
> Should have bool return value.
>
> Also, why is it keeping track of number of alive workers ? Sounds like
> earlier it used to return number of alive workers ? If it indeed needs
> to just return true/false, we can do away with alive_workers.
>
> Also, *exclusive* lock is unnecessary.
>
> --------------
>

Changed as per suggestion.  Additionally, I changed the name of the
function to UndoWorkerIsAvailable(), so that it is similar to other
functions in the file.

> +if (UndoGetWork(false, false, &urinfo, NULL) &&
> +    IsUndoWorkerAvailable())
> +    UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does
> not find any worker slot with in_use false, causing assertion failure
> for (worker != NULL).
> --------------
>

I have removed the assert and instead added a warning.  I have also
added a comment from the place where we call UndoWorkerLaunch to
mention the race condition.

> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> + /* Block concurrent access. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> *Exclusive* lock is unnecessary.
> -------------
>

Right, changed to share lock.

> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is empty",
> + slot)));
> I believe there is no need to explicitly release an lwlock before
> raising an error, since the lwlocks get released during error
> recovery. Please check all other places where this is done.
> -------------
>

Fixed.

> + * Start new undo apply background worker, if possible otherwise return false.
> worker, if possible otherwise => worker if possible, otherwise
> -------------
>
> +static bool
> +UndoWorkerLaunch(UndoRequestInfo urinfo)
> We don't check UndoWorkerLaunch() return value. Can't we make it's
> return value type void ?
>

Now, the function returns void and accordingly I have adjusted the
comment which should address both the above comments.

> Also, it would be better to have urinfo as pointer to UndoRequestInfo
> rather than UndoRequestInfo, so as to avoid structure copy.
> -------------
>

Okay, changed as per suggestion.

> +{
> + BackgroundWorker bgw;
> + BackgroundWorkerHandle *bgw_handle;
> + uint16 generation;
> + int i;
> + int slot = 0;
> We can remove variable i, and use slot variable in place of i.
> -----------
>
> + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker");
> I think it would be trivial to also append the worker->generation in
> the bgw_name.
> -------------
>

I am not sure if adding 'generation' is any useful. It might be better
to add database id as each worker can work on a particular database,
so that could be useful information.

>
> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker ?
> -------------
>

Added a warning in that code path.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: default_table_access_method is not in sample config file
Next
From: Amit Kapila
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs