Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CAA4eK1JgfJzvyiCAph6w7QAdVw6TTdZcUoA1J6D259QyDxUWLg@mail.gmail.com
In response to Re: POC: Cleaning up orphaned files using undo logs  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Thu, Jul 25, 2019 at 11:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have done some more review of undolog patch series and here are my comments:
>
> Hi Amit,
>
> Thanks!  There are a number of actionable changes in your review.  I'll be
> posting a new patch set soon that will address most of your complaints
> individually.  In this message I want to respond to one topic area,
> because the answer is long enough already:
>
> > 2.
> > allocate_empty_undo_segment()
> > {
> > ..
> > ..
> > /* Flush the contents of the file to disk before the next checkpoint. */
> > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> > ..
> > }
> >
> > +void
> > +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> > +{
> > + char path[MAXPGPATH];
> > + FileTag tag;
> > +
> > + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> > +
> > + /* Try to send to the checkpointer, but if out of space, do it here. */
> > + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
> >
> >
> > The comment in allocate_empty_undo_segment indicates that the code
> > wants to flush before checkpoint, but the actual function tries to
> > register the request with the checkpointer.  Shouldn't this be similar to
> > XLogFileInit where we use pg_fsync to flush the contents immediately?
> > I guess that will avoid what you have written in comments in the same
> > function (we just want to make sure that the filesystem has allocated
> > physical blocks for it so that non-COW filesystems will report ENOSPC
> > now rather than later when space is needed).  OTOH, I think it is
> > performance-wise better to postpone the work to the checkpointer.  If we
> > want to push this work to the checkpointer, then we might need to change
> > comments or alternatively, we might want to use bigger segment sizes
> > to mitigate the performance effect.
>
> In an early version I was doing the fsync() immediately.  While
> testing zheap, Mithun CY reported that whenever segments couldn't be
> recycled in the background, such as during a somewhat long-running
> transaction, he could measure ~6% of the time spent waiting for
> fsync(), and throughput increased with bigger segments (and thus fewer
> files to fsync()).  Passing the work off to the checkpointer is better
> not only because it's done in the background but also because there is
> a chance that the work can be consolidated with other sync requests,
> and perhaps even avoided completely if the file is discarded and
> unlinked before the next checkpoint.
>
> I'll update the comment to make it clearer.
>

Okay, that makes sense.
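
Just for the archives, here is how I read that arrangement after your
explanation.  The registration part is taken from your quoted snippet;
the fallback body and the path lookup (UndoLogSegmentPath) below are
only my sketch of what I assume happens when the checkpointer's request
queue is full, not the patch's actual code:

void
undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
{
	FileTag		tag;

	INIT_UNDOFILETAG(tag, logno, tablespace, segno);

	/*
	 * Ask the checkpointer to fsync this segment before the next
	 * checkpoint completes.  Requests can be consolidated, or even
	 * cancelled if the segment is discarded and unlinked first.
	 */
	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
	{
		/* Queue full: do the sync ourselves so the guarantee still holds. */
		char		path[MAXPGPATH];
		int			fd;

		UndoLogSegmentPath(logno, segno, tablespace, path); /* hypothetical helper */

		fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
		if (fd < 0)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not open file \"%s\": %m", path)));
		if (pg_fsync(fd) != 0)
			ereport(data_sync_elevel(ERROR),
					(errcode_for_file_access(),
					 errmsg("could not fsync file \"%s\": %m", path)));
		CloseTransientFile(fd);
	}
}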

> > If my above understanding is correct and the reason to fsync
> > immediately is to reserve space now, then we also need to think
> > whether we are always safe in postponing the work?  Basically, if this
> > means that it can fail when we are actually trying to write undo, then
> > it could be risky because we could be in the critical section at that
> > time.  I am not sure about this point, rather it is just to discuss if
> > there are any impacts of postponing the fsync work.
>
> Here is my theory for why this arrangement is safe, and why it differs
> from what we're doing with WAL segments and regular relation files.
> First, let's review why those things work the way they do (as I
> understand it):
>
> 1.  WAL's use of fdatasync():
>

I was referring to the function XLogFileInit, which doesn't appear to
use fdatasync directly.
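To spell out what I mean, the tail of XLogFileInit flushes the freshly
zero-filled segment with pg_fsync() right away rather than deferring
the sync (quoting from memory, slightly abridged):

	if (pg_fsync(fd) != 0)
	{
		int			save_errno = errno;

		close(fd);
		errno = save_errno;
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m", tmppath)));
	}

	if (close(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close file \"%s\": %m", tmppath)));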

>
> 3.  Separate size tracking:  Another reason that regular relations
> write out zeroes at relation-extension time is that that's the only
..
>
> To summarise, we write zeroes so we can report ENOSPC errors as early
> as possible, but we defer and consolidate fsync() calls because the
> files' contents and names don't actually have to survive power loss
> until a checkpoint says they existed at that point in the WAL stream.
>
> Does this make sense?
>

Yes, this makes sense.  However, I wonder if we need to do some
special handling for ENOSPC while writing to the file in this function
(allocate_empty_undo_segment).  Basically, unlink/remove the file if we
fail to write because the disk is full, similar to what we do in
XLogFileInit.
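
Something along these lines is what I have in mind, just XLogFileInit's
write loop transplanted into allocate_empty_undo_segment() as a sketch.
The locals (zeroes, written, fd, path) are my assumptions, not the
patch's code, and I'm assuming the segment was opened with
OpenTransientFile:

	/* Zero-fill the segment so non-COW filesystems reserve the blocks now. */
	memset(zeroes, 0, sizeof(zeroes));
	for (written = 0; written < UndoLogSegmentSize; written += sizeof(zeroes))
	{
		errno = 0;
		if (write(fd, zeroes, sizeof(zeroes)) != sizeof(zeroes))
		{
			int			save_errno = errno;

			/* Remove the partially written segment to release the space. */
			unlink(path);
			CloseTransientFile(fd);

			/* if write didn't set errno, assume the problem is disk full */
			errno = save_errno ? save_errno : ENOSPC;
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not write to file \"%s\": %m", path)));
		}
	}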

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


