From: Amit Kapila
Subject: Re: Undo logs
Msg-id: CAA4eK1LfrHqO9yAA=43_XRhVkWUN8GDfU0PBLOnUO4tukuWTWg@mail.gmail.com
In response to: Re: Undo logs (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 10.
> > + if (!UndoRecPtrIsValid(multi_prep_urp))
> > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
> > + else
> > + urecptr = multi_prep_urp;
> > +
> > + size = UndoRecordExpectedSize(urec);
> > ..
> > ..
> > + if (UndoRecPtrIsValid(multi_prep_urp))
> > + {
> > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
> > + insert = UndoLogOffsetPlusUsableBytes(insert, size);
> > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
> > + }
> >
> > Can't we use urecptr instead of multi_prep_urp in above code after
> > urecptr is initialized?
>
> I think we can't because urecptr is just the current pointer we are
> going to return but multi_prep_urp is the static variable we need to
> update so that
> the next prepare can calculate urecptr from this location.
> >

Okay, but that was not apparent from the code, so I have added a
comment in the attached delta patch.  BTW, wouldn't it be better to
move this code to the end of the function, once the prepare for the
current record is complete?
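Something like this is what I have in mind (an untested sketch, just
moving the code quoted above):

    /* At the end of PrepareUndoInsert, once the record is prepared. */
    if (UndoRecPtrIsValid(multi_prep_urp))
    {
        /* Advance the static pointer so the next prepare starts from here. */
        UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);

        insert = UndoLogOffsetPlusUsableBytes(insert, size);
        multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
    }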

More comments
----------------------------
1.
* We can consider that the log as switched if

/that/ needs to be removed.

2.
+ if (prepare_idx == max_prepared_undo)
+ elog(ERROR, "Already reached the maximum prepared limit.");

a. /Already/already
b. we don't use a full-stop (.) at the end of error messages
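With both fixed, the message would read:

    if (prepare_idx == max_prepared_undo)
        elog(ERROR, "already reached the maximum prepared limit");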

3.
+ * also stores the dbid and the  progress of the undo apply during rollback.

/the  progress/ - there is an extra space.

4.
+UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
+    TransactionId xid, UndoPersistence upersistence)
+{

nrecords should be the second parameter.
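IOW, the signature would become (just reordering the parameters):

    UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
                       TransactionId xid, UndoPersistence upersistence)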

5.
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+   TransactionId xid)

It seems better to have the xid parameter before UndoPersistence.
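That would also keep it consistent with the order suggested for
UndoSetPrepareSize above:

    UndoRecPtr
    PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
                      UndoPersistence upersistence)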

6.
+ /* FIXME: Should we just report error ? */
+ Assert(index < MAX_BUFFER_PER_UNDO);

There is no need for this FIXME.

7.
PrepareUndoInsert()
{
..
do
{
..
+ /* Undo record can not fit into this block so go to the next block. */
+ cur_blk++;
..
} while (cur_size < size);
..
}

This comment was making me uneasy, so I have slightly adjusted the
code.  Basically, at that point it has not yet been decided whether
the undo record can fit in the current buffer or not.

8.
+ /*
+ * Save referenced of undo record pointer as well as undo record.
+ * InsertPreparedUndo will use these to insert the prepared record.
+ */
+ prepared_undo[prepare_idx].urec = urec;
+ prepared_undo[prepare_idx].urp = urecptr;

I have slightly adjusted the above comment in the attached delta patch.

9.
+InsertPreparedUndo(void)
{
..
+ Assert(prepare_idx > 0);
+
+ /* This must be called under a critical section. */
+ Assert(InRecovery || CritSectionCount > 0);
..
}

I have added a few more comments for the above assertions; see if those are correct.
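For reference, the comments I have in mind are along these lines
(illustrative only; see the delta patch for the exact wording):

    /* There must be at least one prepared undo record. */
    Assert(prepare_idx > 0);

    /*
     * This must be called under a critical section (or in recovery), as
     * the insertion into shared buffers must not fail midway through.
     */
    Assert(InRecovery || CritSectionCount > 0);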

10.
+InsertPreparedUndo(void)
{
..
+ prev_undolen = undo_len;
+
+ UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
..
}

There is no need for the additional variable prev_undolen in the
above code.  I have modified the code to remove its usage; check if
that is correct.
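IOW, we can simply pass undo_len directly:

    UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);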

11.
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501.  In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp.  Caller need to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+ TransactionId xid, UndoRecPtr * urec_ptr_out,
+ SatisfyUndoRecordCallback callback)


The comment above UndoFetchRecord is very zheap-specific, so I have
tried to simplify it.  I think we can give such detailed examples
only when we introduce the zheap code.

Apart from the above, there are miscellaneous comments and minor code
edits in the attached delta patch.

12.
PrepareUndoInsert()
{
..
+ /*
+ * If this is the first undo record for this top transaction add the
+ * transaction information to the undo record.
+ *
+ * XXX there is also an option that instead of adding the information to
+ * this record we can prepare a new record which only contain transaction
+ * informations.
+ */
+ if (xid == InvalidTransactionId)

The above comment seems to be out of place; we are doing nothing like
that here.  This work is done in UndoRecordAllocate, so maybe you can
move the 'XXX ..' part of the comment into that function.

13.
PrepareUndoInsert()
{
..
if (!UndoRecPtrIsValid(prepared_urec_ptr))
+ urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+ else
+ urecptr = prepared_urec_ptr;
+
+ size = UndoRecordExpectedSize(urec);
..

I think we should make the above code a bit more bulletproof.  As it
is written, there is no guarantee that the size we have allocated is
the same as the size we use in this function.  How about taking
'size' as an output parameter from UndoRecordAllocate and then using
it in this function?  Additionally, we can have an Assert that the
size returned by UndoRecordAllocate is the same as
UndoRecordExpectedSize.
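To illustrate the idea (an untested sketch; the extra 'Size *' output
parameter for UndoRecordAllocate is an assumed change, not something
in the current patch):

    Size    size;

    if (!UndoRecPtrIsValid(prepared_urec_ptr))
    {
        urecptr = UndoRecordAllocate(urec, 1, upersistence, txid, &size);
        /* The allocated size must match what we are about to use. */
        Assert(size == UndoRecordExpectedSize(urec));
    }
    else
    {
        urecptr = prepared_urec_ptr;
        size = UndoRecordExpectedSize(urec);
    }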

14.
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+ int idx;
+ int flags;
+
+ for (idx = 0; idx < buffer_idx; idx++)
+ {
+ flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+ XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+ }
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+ int idx;
+
+ for (idx = 0; idx < buffer_idx; idx++)
+ PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}

A one-line comment atop each of these functions would be good.  It
would also be better to place these functions at the end of the file
or someplace else; right now they sit between the prepare* and
insert* functions, which makes the code flow a bit awkward.
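For instance, something like (adjust the wording as you see fit):

    /* Register all the undo buffers modified by the current WAL record. */
    void
    RegisterUndoLogBuffers(uint8 first_block_id)

    /* Set the LSN on all the undo buffers touched by this WAL record. */
    void
    UndoLogBuffersSetLSN(XLogRecPtr recptr)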

15.
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{

Why should this step be performed inside a critical section only for
persistent undo?  I think as this function operates on shared
buffers, it should be done in a critical section even for unlogged
undo.

16.
+InsertPreparedUndo(void)
{
..
/* if starting a new log then there is no prevlen to store */
+ if (offset == UndoLogBlockHeaderSize)
+ {
+ if (log->meta.prevlogno != InvalidUndoLogNumber)
+ {
+ UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+ uur->uur_prevlen = prevlog->meta.prevlen;
+ }
..
}

The comment here suggests that for new logs we don't need prevlen,
but still, in one case you are maintaining the length; can you add a
few comments to explain why?

17.
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+ UndoPersistence persistence)
{
..
+ /*
+ * FIXME: This can be optimized to just fetch header first and only if
+ * matches with block number and offset then fetch the complete
+ * record.
+ */
+ if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+ break;
..
}

I don't know how much it matters whether we fetch the complete record
or just its header, unless the record is big or spans two pages.  I
think both are boundary cases, and I couldn't see this part much in
perf profiles.  There is nothing to fix here; if you want, you can
add an XXX comment or maybe suggest it as a future optimization.
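IOW, the comment could simply become:

    /*
     * XXX: We could fetch the header first and fetch the complete record
     * only if the block number and offset match, but this doesn't show
     * up in profiles, so leave it as a possible future optimization.
     */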

18.
+UndoFetchRecord()
{
...
+ /*
+ * Prevent UndoDiscardOneLog() from discarding data while we try to
+ * read it.  Usually we would acquire log->mutex to read log->meta
+ * members, but in this case we know that discard can't move without
+ * also holding log->discard_lock.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+ if (!UndoRecPtrIsValid(log->oldest_data))
+ {
+ /*
+ * UndoDiscardInfo is not yet initialized. Hence, we've to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(urp))
+ {
+ if (BufferIsValid(urec->uur_buffer))
+ ReleaseBuffer(urec->uur_buffer);
+ return NULL;
+ }
+
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+ }
+
+ /* Check if it's already discarded. */
+ if (urp < log->oldest_data)
+ {
+ LWLockRelease(&log->discard_lock);
+ if (BufferIsValid(urec->uur_buffer))
+ ReleaseBuffer(urec->uur_buffer);
+ return NULL;
+ }
..
}

Can't we replace this logic with UndoRecordIsValid?
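Something like this is what I have in mind (a sketch; I am assuming
UndoRecordIsValid(log, urp) encapsulates the discard_lock/oldest_data
checks, and I haven't verified that it also covers the
not-yet-initialized oldest_data case):

    if (!UndoRecordIsValid(log, urp))
    {
        if (BufferIsValid(urec->uur_buffer))
            ReleaseBuffer(urec->uur_buffer);
        return NULL;
    }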

19.
UndoFetchRecord()
{
..
while(true)
{
..
/*
+ * If we have a valid buffer pinned then just ensure that we want to
+ * find the next tuple from the same block.  Otherwise release the
+ * buffer and set it invalid
+ */
+ if (BufferIsValid(urec->uur_buffer))
+ {
+ /*
+ * Undo buffer will be changed if the next undo record belongs to
+ * a different block or undo log.
+ */
+ if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+ (prevrnode.relNode != rnode.relNode))
+ {
+ ReleaseBuffer(urec->uur_buffer);
+ urec->uur_buffer = InvalidBuffer;
+ }
+ }
+ else
+ {
+ /*
+ * If there is not a valid buffer in urec->uur_buffer that means
+ * we had copied the payload data and tuple data so free them.
+ */
+ if (urec->uur_payload.data)
+ pfree(urec->uur_payload.data);
+ if (urec->uur_tuple.data)
+ pfree(urec->uur_tuple.data);
+ }
+
+ /* Reset the urec before fetching the tuple */
+ urec->uur_tuple.data = NULL;
+ urec->uur_tuple.len = 0;
+ urec->uur_payload.data = NULL;
+ urec->uur_payload.len = 0;
+ prevrnode = rnode;
..
}

Can't we move this logic to after we get a record with
UndoGetOneRecord and match it with the callback?  This is certainly
required only after the first record, so it looks a bit odd here.
Also, if possible, can we move it into a separate function, as this
is not the main logic and it makes the main logic difficult to
follow?
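For example, a helper along these lines (the name is just a
placeholder; the body is the logic quoted above, moved as-is):

    /*
     * Release the pinned undo buffer if the next record lives in a
     * different block or undo log, free any copied payload/tuple data,
     * and reset the unpacked record for the next fetch.
     */
    static void
    ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp,
                    RelFileNode rnode, RelFileNode prevrnode)
    {
        if (BufferIsValid(urec->uur_buffer))
        {
            if (UndoRecPtrGetBlockNum(urp) !=
                BufferGetBlockNumber(urec->uur_buffer) ||
                prevrnode.relNode != rnode.relNode)
            {
                ReleaseBuffer(urec->uur_buffer);
                urec->uur_buffer = InvalidBuffer;
            }
        }
        else
        {
            /* No pinned buffer, so the payload and tuple data were copied. */
            if (urec->uur_payload.data)
                pfree(urec->uur_payload.data);
            if (urec->uur_tuple.data)
                pfree(urec->uur_tuple.data);
        }

        urec->uur_payload.data = NULL;
        urec->uur_payload.len = 0;
        urec->uur_tuple.data = NULL;
        urec->uur_tuple.len = 0;
    }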

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
