Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: POC: Cleaning up orphaned files using undo logs
Msg-id CAFiTN-uUeV=p3x3KApZcBakh6HK3xdnVTKHWom2FXg_Zvj+aAA@mail.gmail.com
In response to Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
On Fri, Apr 19, 2019 at 10:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Currently, undo branch[1] contain an older version of the (undo
> > interface + some fixup).  Now, I have merged the latest changes from
> > the zheap branch[2] to the undo branch[1]
> > which can be applied on top of the undo storage commit[3].  For
> > merging those changes, I need to add some changes to the undo log
> > storage patch as well for handling the multi log transaction.  So I
> > have attached two patches, 1) improvement to undo log storage 2)
> > complete undo interface patch which include 0006+0007 from undo
> > branch[1] + new changes on the zheap branch.

Thanks for the review, Robert.  Please find my replies inline.
>
> Some review comments:
>
> +#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()
>
> I don't think this really belongs in xact.c.  Seems like we ought to
> declare it in the appropriate header file.  Perhaps we also ought to
> consider using a static inline function rather than a macro, although
> I guess it doesn't really matter.

Moved to undoinsert.h

>
> +void
> +SetCurrentUndoLocation(UndoRecPtr urec_ptr)
> +{
> + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
> + UndoPersistence upersistence = log->meta.persistence;
>
Right.  This should not be part of this patch, so I have removed it.

>
> + * When the undorecord for a transaction gets inserted in the next log then we
>
> undo record
Changed
>
> + * insert a transaction header for the first record in the new log and update
> + * the transaction header with this new logs location.  We will also keep
>
> This appears to be nonsensical.  You're saying that you add a
> transaction header to the new log and then update it with its own
> location.  That can't be what you mean.
Actually, what I meant is that we update the transaction header which is
in the old log.  I have changed the comment to make that clear.

>
> + * Incase, this is not the first record in new log (aka new log already
>
> "Incase," -> "If"
> "aka" -> "i.e."
>
Done

> Overall this whole paragraph is a bit hard to understand.
I have tried to improve it in the newer version.

> + * same transaction spans across multiple logs depending on which log is
>
> delete "across"
Fixed
>
> + * processed first by the discard worker.  If it processes the first log which
> + * contains the transactions first record, then it can get the last record
> + * of that transaction even if it is in different log and then processes all
> + * the undo records from last to first.  OTOH, if the next log get processed
>
> Not sure how that would work if the number of logs is >2.
> This whole paragraph is also hard to understand.
Actually, what I meant is that if the transaction is spread over multiple
logs, for example three logs (1, 2, 3), and the discard worker checks log
1 first, then for an aborted transaction it will follow the chain of undo
headers, register a complete rollback request, and apply all the undo
actions in logs 1, 2 and 3 together.  Whereas if it encounters log 2
first, it will register a request for the undo actions in logs 2 and 3
only, and similarly if it encounters log 3 first, it will only process
that log.  We have done it this way so that we can maintain the order in
which undo is applied.  However, there is a possibility that we could
always collect all the undo and apply it together, but for that we would
need to add one more pointer to the transaction header (pointing to the
transaction's undo header in the previous log).  Maybe we could keep the
next-log pointer in a separate header instead of in the transaction
header, so that it only occupies space on a log switch.

I think this comment doesn't belong here; it is more related to undo
discard processing, so I have removed it.

>
> +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> +static int buffer_idx;
>
> This is a file-global variable with a very generic name that is very
> similar to a local variable name used by multiple functions within the
> file (bufidx) and no comments.  Ugh.
>
Comment added.

> +UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
>
> The locking regime for this function is really confusing.  It requires
> that the caller hold discard_lock on entry, and on exit the lock will
> still be held if the return value is true but will no longer be held
> if the return value is false.  Yikes!  Anybody who reads code that
> uses this function is not going to guess that it has such strange
> behavior.  I'm not exactly sure how to redesign this, but I think it's
> not OK the way you have it. One option might be to inline the logic
> into each call site.

I think the simple solution is that inside the UndoRecordIsValid function
we can directly check UndoLogIsDiscarded when oldest_data is not yet
initialized; I think we don't need to release the discard lock for that.
So I have changed it like that.
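Roughly, the revised shape looks like the sketch below (a simplified,
self-contained model, not the patch code: UndoLogControl is reduced to
two fields and UndoLogIsDiscarded is mocked).  The point is that the
caller can hold discard_lock across the whole call, whatever the result:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;

typedef struct UndoLogControl
{
    UndoRecPtr oldest_data;   /* 0 means "not yet initialized" */
    UndoRecPtr discard_ptr;   /* everything below this is discarded */
} UndoLogControl;

/* Mock of UndoLogIsDiscarded(): consult the authoritative discard pointer. */
static bool
undo_log_is_discarded(const UndoLogControl *log, UndoRecPtr urp)
{
    return urp < log->discard_ptr;
}

/*
 * Revised validity check: if oldest_data is not yet initialized, ask the
 * undo log layer directly instead of dropping and reacquiring the lock.
 * The caller keeps discard_lock held for the entire call.
 */
static bool
undo_record_is_valid(UndoLogControl *log, UndoRecPtr urp)
{
    if (log->oldest_data == 0)
    {
        if (undo_log_is_discarded(log, urp))
            return false;
        log->oldest_data = log->discard_ptr;
    }
    return urp >= log->oldest_data;
}
```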

>
> +/*
> + * Overwrite the first undo record of the previous transaction to update its
> + * next pointer.  This will just insert the already prepared record by
> + * UndoRecordPrepareTransInfo.  This must be called under the critical section.
> + * This will just overwrite the undo header not the data.
> + */
> +static void
> +UndoRecordUpdateTransInfo(int idx)
>
> It bugs me that this function goes back in to reacquire the discard
> lock for the purpose of preventing a concurrent undo discard.
> Apparently, if the other transaction's undo has been discarded between
> the prepare phase and where we are now, we're OK with that and just
> exit without doing anything; otherwise, we update the previous
> transaction header.  But that seems wrong.  When we enter a critical
> section, I think we should aim to know exactly what modifications we
> are going to make within that critical section.
>
> I also wonder how the concurrent discard could really happen.  We must
> surely be holding exclusive locks on the relevant buffers -- can undo
> discard really discard undo when the relevant buffers are x-locked?
>
> It seems to me that remaining_bytes is a crock that should be ripped
> out entirely, both here and in InsertUndoRecord.  It seems to me that
> UndoRecordUpdateTransInfo can just contrive to set remaining_bytes
> correctly.  e.g.
>
> do
> {
>     // stuff
>     if (!BufferIsValid(buffer))
>     {
>         Assert(InRecovery);
>         already_written += (BLCKSZ - starting_byte);
>         done = (already_written >= undo_len);
>     }
>     else
>     {
>         page = BufferGetPage(buffer);
>         done = InsertUndoRecord(...);
>         MarkBufferDirty(buffer);
>     }
> } while (!done);
>
> InsertPreparedUndo needs similar treatment.
>
> To make this work, I guess the long string of assignments in
> InsertUndoRecord will need to be made unconditional, but that's
> probably pretty cheap.  As a fringe benefit, all of those work_blah
> variables that are currently referencing file-level globals can be
> replaced with something local to this function, which will surely
> please the coding style police.

This got fixed as part of the last comment fix, where we introduced
SkipInsertingUndoData and moved the globals into the context.

>
> + * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
> + * it refers to the top transaction id because undo log only stores mapping
> + * for the top most transactions.
> + */
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,
>
> xid vs fxid
>
> + urec->uur_xidepoch = EpochFromFullTransactionId(fxid);
>
> We need to make some decisions about how we're going to handle 64-bit
> XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered
> scheme.  In general, PrepareUndoInsert expects the caller to have
> populated the UnpackedUndoRecord, but here, uur_xidepoch is getting
> overwritten with the high bits of the caller-specified XID.  The
> low-order bits aren't stored anywhere by this function, but the caller
> is presumably expected to have placed them inside urec->uur_xid.  And
> it looks like the low-order bits (urec->uur_xid) get stored for every
> undo record, but the high-order bits (urec->xidepoch) only get stored
> when we emit a transaction header.  This all seems very confusing.

Yeah, it seems a bit confusing.  Actually, the discard worker processes
the transaction chain from one transaction header to the next, so we need
the epoch only for the first record of the transaction, and currently we
set all the header information inside PrepareUndoInsert.  The xid is
stored by the caller, as the caller needs it for MVCC purposes.  I think
the caller can always set it, and then if a transaction header gets added
the epoch will be stored, otherwise not.  So I think we can remove
setting it here.
>
> I would really like to see us replace uur_xid and uur_xidepoch with a
> FullTransactionId; now that we have that concept, it seems like bad
> practice to break what is really a FullTransactionId into two halves
> and store them separately.  However, it would be a bit unfortunate to
> store an additional 4 bytes of mostly-static data in every undo
> record.  What if we went the other way?   That is, remove urec_xid
> from UndoRecordHeader and from UnpackedUndoRecord.  Make it the
> responsibility of whoever is looking up an undo record to know which
> transaction's undo they are searching.  zheap, at least, generally
> does know this: if it's starting from a page, then it has the XID +
> epoch available from the transaction slot, and if it's starting from
> an undo record, you need to know the XID for which you are searching,
> I guess from uur_prevxid.

Right, from uur_prevxid we would know which xid's undo we are looking
for, but without having uur_xid in the undo record itself, how would we
know which undo record was inserted by that xid?  In zheap, while
following the undo chain, if the slot gets switched then there is a
possibility (because of slot reuse) that we might get some other
transaction's undo record for the same zheap tuple, but we want to
traverse back to find the record inserted by uur_prevxid.  So don't we
need uur_xid as well, to tell who the inserter of this undo record is?

>
> I also think that we need to remove uur_prevxid.  That field does not
> seem to be properly a general-purpose part of the undo machinery, but
> a zheap-specific consideration.  I think it's job is to tell you which
> transaction last modified the current tuple, but zheap can put that
> data in the payload if it likes.  It is a waste to store it in every
> undo record, because it's not needed if the older undo has already
> been discarded or if the operation is an insert.
Done
>
> + * Insert a previously-prepared undo record.  This will write the actual undo
>
> Looks like this now inserts all previously-prepared undo records
> (rather than just a single one).
Fixed
>
> + * in page.  We start writting immediately after the block header.
>
> Spelling.
Done
>
> + * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
> + * by urp and unpack the record into urec.  This function will not release the
> + * pin on the buffer if complete record is fetched from one buffer, so caller
> + * can reuse the same urec to fetch the another undo record which is on the
> + * same block.  Caller will be responsible to release the buffer inside urec
> + * and set it to invalid if it wishes to fetch the record from another block.
> + */
> +UnpackedUndoRecord *
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> + UndoPersistence persistence)
>
> I don't really understand why uur_buffer is part of an
> UnpackedUndoRecord.  It doesn't look like the other fields, which tell
> you about the contents of an undo record that you have created or that
> you have parsed.  Instead, it's some kind of scratch space for keeping
> track of a buffer that we're using in the process of reading an undo
> record.  It looks like it should be an argument to UndoGetOneRecord()
> and ResetUndoRecord().
>
> I also wonder whether it's really a good design to make the caller
> responsible for invalidating the buffer before accessing another
> block.  Maybe it would be simpler to have this function just check
> whether the buffer it's been given is the right one; if not, unpin it
> and pin the new one instead.  But I'm not really sure...

I am not sure what would be better here, but I thought the caller has to
release the last buffer anyway, so why not make it responsible for
keeping track of the first buffer of the undo record?  The caller
understands better that it needs to hold on to the first buffer of the
undo record, because it hopes that the previous undo record in the chain
might fall in the same buffer.

Maybe we can make the caller completely responsible for reading the
buffer for the first block of the undo record, so that it always passes
a valid buffer, and UndoGetOneRecord only needs to read additional
buffers if the undo record is split, releasing them right there.  The
caller would then always keep track of the first buffer where the undo
record starts, and whenever the undo record pointer changes it would be
responsible for changing the buffer.
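The ownership split being proposed can be sketched like this (a toy
model, not patch code: real buffer pins are mocked as counters, and the
function names are hypothetical).  The caller pins the block where the
record starts; the callee only pins continuation blocks for a split
record and releases them itself:

```c
#include <stdbool.h>

typedef struct MockBuffer
{
    int pinned;                 /* stand-in for a real buffer pin count */
} MockBuffer;

static void pin(MockBuffer *b)   { b->pinned++; }
static void unpin(MockBuffer *b) { b->pinned--; }

/*
 * Caller owns 'first' (already pinned).  The callee pins and releases any
 * continuation buffer on its own, leaving 'first' pinned so the caller
 * can reuse it for another record on the same block.
 */
static void
get_one_record(MockBuffer *first, MockBuffer *overflow, bool is_split)
{
    /* ... read the record header and data from 'first' ... */
    (void) first;
    if (is_split)
    {
        pin(overflow);          /* record continues on the next block */
        /* ... read the remainder ... */
        unpin(overflow);        /* released right here, not by the caller */
    }
}
```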

> + /* If we already have a buffer pin then no need to allocate a new one. */
> + if (!BufferIsValid(buffer))
> + {
> + buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
> +    rnode, UndoLogForkNum, cur_blk,
> +    RBM_NORMAL, NULL,
> +    RelPersistenceForUndoPersistence(persistence));
> +
> + urec->uur_buffer = buffer;
> + }
>
> I think you should move this code inside the loop that follows.  Then
> at the bottom of that loop, instead of making a similar call, just set
> buffer = InvalidBuffer.  Then when you loop around it'll do the right
> thing and you'll need less code.
Done

>
> Notice that having both the local variable buffer and the structure
> member urec->uur_buffer is actually making this code more complex. You
> are already setting urec->uur_buffer = InvalidBuffer when you do
> UnlockReleaseBuffer().  If you didn't have a separate 'buffer'
> variable you wouldn't need to clear them both here.  In fact I think
> what you should have is an argument Buffer *curbuf, or something like
> that, and no uur_buffer at all.
Done
>
> + /*
> + * If we have copied the data then release the buffer, otherwise, just
> + * unlock it.
> + */
> + if (is_undo_rec_split)
> + UnlockReleaseBuffer(buffer);
> + else
> + LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>
> Ugh.  I think what's going on here is: UnpackUndoRecord copies the
> data if it switches buffers, but not otherwise.  So if the record is
> split, we can release the lock and pin, but otherwise we have to keep
> the pin to avoid having the data get clobbered.  But having this code
> know about what UnpackUndoRecord does internally seems like an
> abstraction violation.  It's also probably not right if we ever want
> to fetch undo records in bulk, as I see that the latest code in zheap
> master does.  I think we should switch UnpackUndoRecord over to always
> copying the data and just avoid all this.

Done
>
> (To make that cheaper, we may want to teach UnpackUndoRecord to store
> data into scratch space provided by the caller rather than using
> palloc to get its own space, but I'm not actually sure that's (a)
> worth it or (b) actually better.)
>
> [ Skipping over some things ]
>
> +bool
> +UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
> + int *already_decoded, bool header_only)
>
> I think we should split this function into three functions that use a
> context object, call it say UnpackUndoContext.  The caller will do:
>
> BeginUnpackUndo(&ucontext);  // just once
> UnpackUndoData(&ucontext, page, starting_byte);  // any number of times
> FinishUnpackUndo(&uur, &ucontext);  // just once
>
> The undo context will store an enum value that tells us the "stage" of decoding:
>
> - UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record
> header; we need to do that next.
> - UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is
> the relation details, if present.
> - UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block
> details, if present.
> - UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the
> transaction details, if present.
> - UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the
> payload details, if present.
> - UNDO_DECODE_STAGE_DONE: Decoding is complete.
>
> It will also store the number of bytes that have been already been
> copied as part of whichever stage is current.  A caller who wants only
> part of the record can stop when ucontext.stage > desired_stage; e.g.
> the current header_only flag corresponds to stopping when
> ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential
> optimization mentioned in UndoGetOneRecord could be done by stopping
> when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know
> if that's worth doing).
>
> In this scheme, BeginUnpackUndo just needs to set the stage to
> UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0.  The
> context object contains all the work_blah things (no more global
> variables!), but BeginUnpackUndo does not need to initialize them,
> since they will be overwritten before they are examined.  And
> FinishUnpackUndo just needs to copy all of the fields from the
> work_blah things into the UnpackedUndoRecord.  The tricky part is
> UnpackUndoData itself, which I propose should look like a big switch
> where all branches fall through.  Roughly:
>
> switch (ucontext->stage)
> {
> case UNDO_DECODE_STAGE_HEADER:
> if (!ReadUndoBytes(...))
>   return;
> stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
> /* FALLTHROUGH */
> case UNDO_DECODE_STAGE_RELATION_DETAILS:
> if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
> {
>   if (!ReadUndoBytes(...))
>     return;
> }
> stage = UNDO_DECODE_STAGE_BLOCK;
> /* FALLTHROUGH */
> etc.
>
> ReadUndoBytes would need some adjustments in this scheme; it wouldn't
> need my_bytes_read any more since it would only get called for
> structures that are not yet completely read.
Yeah, so we can directly jump to the header which is not yet completely
read, but if any header is partially read then we need to maintain some
kind of partial-read variable; otherwise, from 'already read' alone we
wouldn't know how many bytes of the header were read in the last call,
unless we calculate that from uur_info or maintain partial_read in the
context, as I have done in the new version.
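The partial_read idea can be sketched as below (a simplified standalone
model, not the patch code: the context holds only the partial_read
counter, and the function name mimics ReadUndoBytes).  A structure split
across a page boundary is resumed mid-struct on the next call:

```c
#include <stdbool.h>
#include <string.h>

typedef struct UnpackContext
{
    int partial_read;           /* bytes of the current struct already copied */
} UnpackContext;

/*
 * Copy the remainder of a 'size'-byte structure into *dest from the page,
 * starting at *cur.  Returns true once the structure is complete; returns
 * false when the page ran out, remembering progress in partial_read so the
 * next call can resume from the next page.
 */
static bool
read_undo_bytes(UnpackContext *ctx, char *dest, int size,
                const char *page, int page_len, int *cur)
{
    int avail = page_len - *cur;
    int want = size - ctx->partial_read;
    int n = (want < avail) ? want : avail;

    memcpy(dest + ctx->partial_read, page + *cur, n);
    *cur += n;
    ctx->partial_read += n;

    if (ctx->partial_read == size)
    {
        ctx->partial_read = 0;  /* ready for the next structure */
        return true;
    }
    return false;               /* resume on the next page */
}
```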

> (Regardless of whether
> we adopt this idea, the nocopy flag to ReadUndoBytes appears to be
> unused and can be removed.)

Yup.

>
> We could use a similar context object for InsertUndoRecord.
> BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the
> work_blah structures within the context object.  InsertUndoData will
> be a big switch.  Maybe no need for a "finish" function here.  There
> can also be a SkipInsertingUndoData function that can be called
> instead of InsertUndoData if the page is discarded.  I think this
> would be more elegant than what we've got now.

Done.

Notes:
- I think the ucontext->stage values are the same for insert and decode.
Can we declare just one enum and give it some generic name, e.g.
UNDO_PROCESS_STAGE_HEADER?
- In SkipInsertingUndoData I also have to go through all the stages, so
that if we find some valid block then the stage is right for inserting
the partial record.  Do you think I could have avoided that?

Apart from these changes, I have also included UndoRecordBulkFetch in
undoinsert.c.

I have tested this patch with my local test modules, which basically
insert, fetch, and bulk-fetch multiple records and compare the contents.
My test patch is still not in good shape, so I will post the test module
later.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
