Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CAA4eK1+082pj9krF57_xE5KVXAFQGwoDyoG4t_3VrYSLd0uROA@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
Re: POC: Cleaning up orphaned files using undo logs  (Thomas Munro <thomas.munro@gmail.com>)
Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I happened to open up 0001 from this series, which is from Thomas, and
> > > I do not think that the pg_buffercache changes are correct.  The idea
> > > here is that the customer might install version 1.3 or any prior
> > > version on an old release, then upgrade to PostgreSQL 13. When they
> > > do, they will be running with the old SQL definitions and the new
> > > binaries.  At that point, it sure looks to me like the code in
> > > pg_buffercache_pages.c is going to do the Wrong Thing.  [...]
> >
> > Yep, that was completely wrong.  Here's a new version.
> >
>
> One comment/question related to
> 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch.
>

I have done some more review of undolog patch series and here are my comments:
0003-Add-undo-log-manager.patch
1.
allocate_empty_undo_segment()
{
..
..
if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+ {
+ char    *parentdir;
+
+ if (errno != ENOENT || !InRecovery)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ undo_path)));
+
+ /*
+ * In recovery, it's possible that the tablespace directory
+ * doesn't exist because a later WAL record removed the whole
+ * tablespace.  In that case we create a regular directory to
+ * stand in for it.  This is similar to the logic in
+ * TablespaceCreateDbspace().
+ */
+
+ /* create two parents up if not exist */
+ parentdir = pstrdup(undo_path);
+ get_parent_directory(parentdir);
+ get_parent_directory(parentdir);
+ /* Can't create parent and it doesn't already exist? */
+ if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)


All of this code is almost same as we have code in
TablespaceCreateDbspace still we have small differences like here you
are using mkdir instead of MakePGDirectory which as far as I can see
use similar permissions for creating directory.  Also, it checks
whether the directory exists before trying to create it.  Is there a
reason why we need to do a few things differently here if not, they
can both the places use one common function?

2.
allocate_empty_undo_segment()
{
..
..
/* Flush the contents of the file to disk before the next checkpoint. */
+ undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
..
}

+void
+undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
+{
+ char path[MAXPGPATH];
+ FileTag tag;
+
+ INIT_UNDOFILETAG(tag, logno, tablespace, segno);
+
+ /* Try to send to the checkpointer, but if out of space, do it here. */
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))


The comment in allocate_empty_undo_segment indicates that the code
wants to flush before checkpoint, but the actual function tries to
register the request with checkpointer.  Shouldn't this be similar to
XLogFileInit where we use pg_fsync to flush the contents immediately?
I guess that will avoid what you have written in comments in the same
function (we just want to make sure that the filesystem has allocated
physical blocks for it so that non-COW filesystems will report ENOSPC
now rather than later when space is needed).  OTOH, I think it is
performance-wise better to postpone the work to checkpointer.  If we
want to push this work to checkpointer, then we might need to change
comments or alternatively, we might want to use bigger segment sizes
to mitigate the performance effect.

If my above understanding is correct and the reason to fsync
immediately is to reserve space now, then we also need to think
whether we are always safe in postponing the work?  Basically, if this
means that it can fail when we are actually trying to write undo, then
it could be risky because we could be in the critical section at that
time.  I am not sure about this point, rather it is just to discuss if
there are any impacts of postponing the fsync work.

Another thing is that recently in commit 475861b261 (commit by you),
we have introduced a mechanism to not fill the files with zero's for
certain filesystems like ZFS.  Do we want similar behavior for undo
files?

3.
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+ UndoLogSlot *slot;
+ size_t end;
+
+ slot = find_undo_log_slot(logno, false);
+
+ /* TODO review interlocking */
+
+ Assert(slot != NULL);
+ Assert(slot->meta.end % UndoLogSegmentSize == 0);
+ Assert(new_end % UndoLogSegmentSize == 0);
+ Assert(InRecovery ||
+    CurrentSession->attached_undo_slots[slot->meta.category] == slot);

Can you write some comments explaining the above Asserts?  Also, can
you explain what interlocking issues are you worried about here?

4.
while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, slot->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
+
+ /* Flush the directory entries before next checkpoint. */
+ undofile_request_sync_dir(slot->meta.tablespace);

I see that at two places after allocating empty undo segment, the
patch performs undofile_request_sync_dir whereas it doesn't perform
the same in UndoLogNewSegment? Is there a reason for the same or is it
missed from one of the places?

5.
+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
/*
+ * We didn't need to acquire the mutex to read 'end' above because only
+ * we write to it.  But we need the mutex to update it, because the
+ * checkpointer might read it concurrently.

Is this assumption correct?  It seems patch also modified
slot->meta.end during discard in function UndoLogDiscard.  I am
referring below code:

+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
{
..
+ /* Update shmem to show the new discard and end pointers. */
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ slot->meta.discard = discard;
+ slot->meta.end = end;
+ LWLockRelease(&slot->mutex);
..
}

6.
extend_undo_log()
{
..
..
if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }

It is not obvious to me why we need to perform XLogFlush here, can you explain?

7.
+attach_undo_log(UndoLogCategory category, Oid tablespace)
{
..
if (candidate->meta.tablespace == tablespace)
+ {
+ logno = *place;
+ slot = candidate;
+ *place = candidate->next_free;
+ break;
+ }

Here, the code is breaking from the loop, so why do we need to set
*place?  Am I missing something obvious?

8.
+ /* WAL-log the creation of this new undo log. */
+ {
+ xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = slot->meta.tablespace;
+ xlrec.category = slot->meta.category;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));

Here and in most other places in this patch you are using
sizeof(xlrec) for xlog static data.  However, as far as I know in
other places in the code we define the size using offset of the last
parameter of corresponding structure to avoid any inconsistency in WAL
record size across different platforms.   Is there a reason to go
differently with this patch?  See below one for example:

typedef struct xl_hash_add_ovfl_page
{
uint16 bmsize;
bool bmpage_found;
} xl_hash_add_ovfl_page;

#define SizeOfHashAddOvflPage
\
(offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool))

9.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+ UndoLogSlot *slot;
+
+ /* Create meta-data space in shared memory. */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+ /* TODO: assert that it doesn't exist already? */
+
+ slot = allocate_undo_log_slot();
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);

Why do we need to acquire locks during recovery?

10.
I think UndoLogAllocate can leak allocation of slots.  It first
allocates the slot for a new log from the free pool in there is no
existing slot/log, writes a WAL record and then at a later point of
time it actually creates the required physical space in the log via
extend_undo_log which also writes a separate WAL.  Now, if there is a
error between these two operations, then we will have a redundant slot
allocated.  What if there are repeated errors for similar thing from
multiple backends after which system crashes.  Now, after restart, we
will allocate multiple slots for different lognos which don't have any
actual (physical) logs.  This might not be a big problem in practice
because the chances of error between two operations are less, but
can't we delay the WAL logging for allocation of a slot for a new log.

11.
+UndoLogAllocate()
{
..
..
+ /*
+ * Maintain our tracking of the and the previous transaction start
+ * locations.
+ */
+ if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
+ {
+ slot->meta.unlogged.last_xact_start =
+ slot->meta.unlogged.this_xact_start;
+ slot->meta.unlogged.this_xact_start = slot->meta.unlogged.insert;
+ }

".. of the and the ..", after first the, something is missing.

12.
UndoLogAllocate()
{
..
..
+ /*
+ * We don't need to acquire log->mutex to read log->meta.insert and
+ * log->meta.end, because this backend is the only one that can
+ * modify them.
+ */
+ if (unlikely(new_insert > slot->meta.end))

I might be confused but slot->meta.end is modified by discard process
also, so how is it safe?  If so, may be adding a comment to explain
the same would be good.  Also, I think in the comments log should be
replaced with the slot.

13.
UndoLogAllocate()
{
..
+ /* This undo log is entirely full.  Get a new one. */
+ if (logxid == GetTopTransactionId())
+ {
+ /*
+ * If the same transaction is split over two undo logs then
+ * store the previous log number in new log.  See detailed
+ * comments in undorecord.c file header.
+ */
..
}

The undorecord.c should be renamed to undoaccess.c

14.
UndoLogAllocate()
{
..
+ if (logxid != GetTopTransactionId())
+ {
+ /*
+ * While we have the lock, check if we have been forcibly detached by
+ * DROP TABLESPACE.  That can only happen between transactions (see
+ * DropUndoLogsInsTablespace()).
+ */

/DropUndoLogsInsTablespace/DropUndoLogsInTablespace

15.
UndoLogSegmentPath()
{
..
/*
+ * Build the path from log number and offset.  The pathname is the
+ * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+ * period inserted between the components.
+ */
+ snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+ segno * UndoLogSegmentSize);
..
}

a. It is not very clear from the above code why are we multiplying
segno with UndoLogSegmentSize?  I see that many of the callers pass
segno as segno/UndoLogSegmentSize.  Won't it be better if the caller
take care of passing correct value of segno?
b. In the comment above, instead of offset, shouldn't there be segment number.

16. UndoLogGetLastXactStartPoint is not used any where.  I think this
was required in previous version of patchset, now, we can remove it.

17.
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com

This discussion link seems to be from old discussion/thread, not this one.

0019-Add-developer-documentation-for-the-undo-log-storage
18.
+each undo log, a set of meta-data properties is tracked:
+tracked, including:
+
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged or temporary)

Here, don't we want to refer to UndoLogCategory rather than
persistence level?  "tracked, including:"  seems bit confusing.

0020-Add-user-facing-documentation-for-undo-logs
19.
<row>
+     <entry><structfield>persistence</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Persistence level of data stored in this undo log; one of
+      <literal>permanent</literal>, <literal>unlogged</literal> or
+      <literal>temporary</literal>.</entry>
+    </row>

Don't we want to cover the new (shared) undolog category here?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: On the stability of TAP tests for LDAP
Next
From: Amit Kapila
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs